![]() |
Software is available which implements the following six data cleaning methods:
A brief description of the software for each method is now given.
The EM software uses a multivariate normal EM algorithm to impute values.
More information concerning this technique can be found in: Kokic
The Blocked Adaptive Computationally-efficient Outlier Nominators method, BACON, is decribed in Beguin and Hulliger (b).
The method used to find an initial subset of "good" data points depends on the value of the parameter method in the BACON software. Setting method equal to one uses the Mahalanobis distance to criterion, whereas method equal to zero uses the co-ordinate-wise Euclidean distances from the median values of available data. The EM algorithm is used to impute missing values.
gen_imp is the implementation of a general algorithm for finding the best match of a target record with another record in the data matrix. This can be used as the computational centre of the donor imputation software (DIS) system, as described in Rahman.
For each record in a search range, gen_imp computes a score from the record to the target record as
where di is a measure of the distance for the ith variable between the target observation and the record, wi is the weight given to the variable and Ri is an optional scaling factor/coefficient.
Seven types of distance are currently included. For continuous variables:
In each case the function of the absolute difference and a scaling factor (or regression coefficient) is multiplied by variable weight and summed. An appropriate scaling factor would by the reciprocal of the range. It is envisaged that all continuous variables using the same distance measure will have the same weights. For Euclidean distance, the square root of the sum of squared values for all variables using this distance is the added to the score. For Manhattan distance, the sum of absolute distances is used. For the regression distance, the absolute value of the sum of distances is added to the score. For thresholding, if the distance is greater than Ri then a score of one is computed.
For categorical variables the distances are:
In the case of user defined distance for a variable with r categories, an r by r matrix (usually symmetric) containing the distances for any possible pair of values has to be supplied.
The weights can be used to select and give importance to variables when matching. Weights are specified for each variable; if a weight is set to zero then no matching for imputation is carried out for that variable.
The program looks for the records that have the smallest score provided that the values of interest are present. In the case of tied scores up to the first s encountered records are stored where s is user defined. The records being searched can be specified in two ways. Either all records in a range are searched or all records with the same value of a categorical variable as the target record will be searched, i.e. regional matching.
The target records can be specified in three ways. All individual values to be imputed can be listed and either all missing values or all missing values for a selected variable can be targeted.
If for a target case there are missing values in the matching variables and the sum of weights of available matching variables is greater than a user-supplied minimum then the search will continue and a warning given. If the sum of weights is below the minimum, no imputation will take place.
A simple imputation method is used in which missing variate values by the variate mean or mode.
Imputes missing data using the TS-SOM algorithm, as described in Pilea.
This software has kindly been provided by the University of Jyväakylä, and is described in a user-guide written by Horppu and Koikkalainen
Imputes missing data using the univariate WAID algorithm, see Xinqiang and Chambers. In addition to the WAID software, the following utility components are available:
Click on the component name in the below table for detailed documentation.
Component name | Description |
---|---|
anova_tree | Computes a binary decision tree using an ANOVA condition |
bacon | Detects outliers by using the BACON-EM algorithm |
em | Imputes by using the multivariate normal expectation-maximisation algorithm |
gen_imp | A general imputation by using distance-based methods |
gini_tree | Computes a binary decision tree using the Gini criterion |
simp_imp | A simple imputation function that replaces missing values with the mean value for continuous variables and the mode for categorical variables |
tree_score | Scores a binary decision tree |
waid_tree | Computes a binary decision tree using a robust ANOVA criterion |