
Euredit Platform Data Cleaning Components

Algorithmic descriptions and references

Software is available which implements the following six data cleaning methods:

  1. Expectation Maximization
  2. Blocked Adaptive Computationally-efficient Outlier Nominators (BACON)
  3. Donor Imputation System
  4. Simple Imputation
  5. Tree-Structured Self-Organizing Map
  6. Weighted Automatic Interaction Detection

A brief description of the software for each method is now given.

1. Expectation Maximization

The EM software uses a multivariate normal EM algorithm to impute values.
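As a rough illustration of how such an algorithm can work (this is a sketch, not the Euredit implementation; the function name and defaults are made up), missing entries are repeatedly replaced by their conditional expectations under the current multivariate normal fit, and the mean and covariance are then re-estimated from the completed data:

```python
import numpy as np

def em_mvn_impute(X, n_iter=50):
    """EM imputation under a multivariate normal model (illustrative sketch).

    X is an (n, p) array with np.nan marking missing entries; the
    completed matrix is returned."""
    X = np.array(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Initialise with column means and the covariance of the mean-filled data.
    mu = np.nanmean(X, axis=0)
    Xc = np.where(miss, mu, X)
    Sigma = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        C = np.zeros((p, p))  # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            Soo = Sigma[np.ix_(o, o)]
            Smo = Sigma[np.ix_(m, o)]
            B = np.linalg.solve(Soo, Smo.T).T  # regression of missing on observed
            # E-step: conditional mean of the missing block given the observed.
            Xc[i, m] = mu[m] + B @ (Xc[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Smo.T
        # M-step: re-estimate the mean and covariance from the completed data.
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False, bias=True) + C / n
    return Xc
```

Because the imputed value for each record is a regression on its observed variables, correlated variables improve the imputations over a simple mean fill.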

More information concerning this technique can be found in Kokic.

2. Blocked Adaptive Computationally-efficient Outlier Nominators

The Blocked Adaptive Computationally-efficient Outlier Nominators method, BACON, is described in Beguin and Hulliger (b).

The method used to find an initial subset of "good" data points depends on the value of the parameter method in the BACON software. Setting method equal to one uses a Mahalanobis distance criterion, whereas method equal to zero uses the co-ordinate-wise Euclidean distances from the median values of the available data. The EM algorithm is used to impute missing values.
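The two starting criteria could be sketched as follows (the function name and parameters are illustrative, not the Euredit interface; this covers only the initial-subset step, not the full BACON iteration):

```python
import numpy as np

def bacon_initial_subset(X, m, method=1):
    """Return the indices of the m points with the smallest starting
    distance (sketch of the two initial-subset criteria)."""
    X = np.asarray(X, dtype=float)
    if method == 1:
        # Mahalanobis distance from the mean, using the full-sample covariance.
        mu = X.mean(axis=0)
        Sinv = np.linalg.inv(np.cov(X, rowvar=False))
        d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu))
    else:
        # Co-ordinate-wise Euclidean distance from the column medians.
        med = np.median(X, axis=0)
        d = np.sqrt(((X - med) ** 2).sum(axis=1))
    return np.argsort(d)[:m]
```

Either criterion should exclude gross outliers from the initial "good" subset; the median-based version is cheaper and more robust when the covariance estimate itself is contaminated.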

3. Donor Imputation System

gen_imp is an implementation of a general algorithm for finding the best match of a target record with another record in the data matrix. It serves as the computational centre of the Donor Imputation System (DIS), as described in Rahman.

For each record in a search range, gen_imp computes a score between the record and the target record as

Σ wi f (Ri di) / Σ wi,

where di is a measure of the distance for the ith variable between the target observation and the record, wi is the weight given to the variable and Ri is an optional scaling factor/coefficient.
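As an illustration of this formula (the function name and signature below are made up for the sketch), the score for one candidate record could be computed as:

```python
def donor_score(distances, weights, scale, f=abs):
    """Weighted score between a target and a candidate donor (sketch).

    distances[i] = d_i, weights[i] = w_i, scale[i] = R_i, and f is the
    per-variable distance function applied to the scaled distance."""
    num = sum(w * f(r * d) for d, w, r in zip(distances, weights, scale))
    den = sum(weights)
    return num / den
```

For example, two variables with distances 1 and 2, equal weights, and scaling factors 1 and 0.5 give the score (1 + 1) / 2 = 1.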

Seven types of distance are currently included. For continuous variables:

  1. Euclidean distance
  2. Manhattan distance
  3. Regression distance, i.e., the absolute value of the sum of the distances
  4. Simple matching using threshold distance

In each case the function of the absolute difference and a scaling factor (or regression coefficient) is multiplied by the variable weight and summed. An appropriate scaling factor would be the reciprocal of the range. It is envisaged that all continuous variables using the same distance measure will have the same weights. For Euclidean distance, the square root of the sum of squared values for all variables using this distance is added to the score. For Manhattan distance, the sum of absolute distances is used. For the regression distance, the absolute value of the sum of distances is added to the score. For thresholding, a score of one is added if the distance is greater than Ri.
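The four continuous contributions described above could be sketched as follows (the kind names and signature are illustrative, not the gen_imp interface; diffs are the raw differences and scales the per-variable Ri, e.g. reciprocals of ranges):

```python
import math

def continuous_contribution(diffs, scales, kind):
    """Contribution of one group of continuous variables to the score (sketch)."""
    if kind == "euclidean":
        # Square root of the sum of squared scaled distances.
        return math.sqrt(sum((r * d) ** 2 for d, r in zip(diffs, scales)))
    if kind == "manhattan":
        # Sum of absolute scaled distances.
        return sum(abs(r * d) for d, r in zip(diffs, scales))
    if kind == "regression":
        # Absolute value of the *signed* sum of scaled distances.
        return abs(sum(r * d for d, r in zip(diffs, scales)))
    if kind == "threshold":
        # One point per variable whose distance exceeds its threshold R_i.
        return sum(1.0 for d, r in zip(diffs, scales) if abs(d) > r)
    raise ValueError(kind)
```

Note how the regression variant lets distances of opposite sign cancel, whereas Manhattan accumulates their magnitudes.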

For categorical variables the distances are:

  1. Simple matching
  2. Scaled rank difference
  3. User defined

In the case of user defined distance for a variable with r categories, an r by r matrix (usually symmetric) containing the distances for any possible pair of values has to be supplied.
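The three categorical distances might be sketched as follows (function names are illustrative, and the 3 x 3 matrix D is made-up example data for a variable with three categories):

```python
def simple_matching(a, b):
    """0 if the categories agree, 1 otherwise."""
    return 0.0 if a == b else 1.0

def scaled_rank_difference(a, b, n_categories):
    """Difference in category ranks, scaled to lie in [0, 1]."""
    return abs(a - b) / (n_categories - 1)

# User-defined: an r-by-r (usually symmetric) matrix of pairwise distances.
D = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]

def categorical_distance(a, b, D):
    """Look up the user-supplied distance between category indices a and b."""
    return D[a][b]
```

The user-defined matrix is the most flexible option: it can encode, for instance, that adjacent categories of an ordered classification are closer than distant ones.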

The weights can be used to select and give importance to variables when matching. Weights are specified for each variable; if a weight is set to zero then no matching for imputation is carried out for that variable.

The program looks for the records that have the smallest score, provided that the values of interest are present. In the case of tied scores, up to the first s records encountered are stored, where s is user-defined. The records being searched can be specified in two ways: either all records in a range are searched, or all records with the same value of a categorical variable as the target record are searched, i.e. regional matching.

The target records can be specified in three ways: all individual values to be imputed can be listed, all missing values can be targeted, or all missing values for a selected variable can be targeted.

If, for a target case, there are missing values in the matching variables but the sum of weights of the available matching variables is greater than a user-supplied minimum, then the search continues and a warning is given. If the sum of weights is below the minimum, no imputation takes place.
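The donor search with tie handling might be sketched as follows, with score_fn standing in for the weighted distance described above (all names are illustrative, and the value-presence and minimum-weight checks are assumed to happen inside score_fn or beforehand):

```python
def find_donors(target, records, score_fn, s=3):
    """Return up to s records tied on the smallest score (sketch)."""
    best = float("inf")
    donors = []
    for rec in records:
        sc = score_fn(target, rec)
        if sc < best:
            # A strictly better score resets the list of candidate donors.
            best, donors = sc, [rec]
        elif sc == best and len(donors) < s:
            # Keep ties, but only up to the first s records encountered.
            donors.append(rec)
    return donors
```

For example, with score_fn as plain absolute difference, a target of 5 against the records [1, 4, 6, 9, 6] yields the tied best matches [4, 6, 6].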

4. Simple Imputation

A simple imputation method is used in which missing variate values are replaced by the variate mean or mode.
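A minimal sketch of this rule, assuming missing values are coded as None (the function name and flag are made up for the illustration):

```python
from statistics import mean, mode

def simple_impute(values, categorical=False):
    """Replace None entries by the mode (categorical) or mean (continuous)."""
    present = [v for v in values if v is not None]
    fill = mode(present) if categorical else mean(present)
    return [fill if v is None else v for v in values]
```

So [1.0, None, 3.0] becomes [1.0, 2.0, 3.0], and a missing category is filled with the most frequent observed category.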

5. Tree-Structured Self-Organizing Map (TS-SOM)

Imputes missing data using the TS-SOM algorithm, as described in Pilea.

This software has kindly been provided by the University of Jyväskylä, and is described in a user guide written by Horppu and Koikkalainen.

6. Weighted Automatic Interaction Detection (WAID)

Imputes missing data using the univariate WAID algorithm, see Xinqiang and Chambers. In addition to the WAID software, the following utility components are available:

List of components

Click on the component name in the table below for detailed documentation.

Component name   Description
anova_tree       Computes a binary decision tree using an ANOVA condition
bacon            Detects outliers by using the BACON-EM algorithm
em               Imputes by using the multivariate normal expectation-maximisation algorithm
gen_imp          A general imputation by using distance-based methods
gini_tree        Computes a binary decision tree using the Gini criterion
simp_imp         A simple imputation function that replaces missing values with the mean value for continuous variables and the mode for categorical variables
tree_score       Scores a binary decision tree
waid_tree        Computes a binary decision tree using a robust ANOVA criterion
