Euredit Evaluation Software Guide

Introduction

Welcome to the guide to the Euredit Evaluation software. This software is provided under the terms of the Euredit Consortium Agreement.

The Euredit Evaluation software consists of three basic operations.

  1. Computing a small set of summary statistics from the true data. These are then used in computing the evaluation statistics.
  2. Comparing two files and reporting the differences in a standard way. In practice, two pairs of files will be compared:
    1. The true values file and the file with errors (perturbed data)
    2. The file with errors and the corrected file (imputed data)
  3. Computing the evaluation statistics from the two sets of differences and the summary statistics.

The Euredit Evaluation System consists of four Windows programs (Esummary.exe, Ediff.exe, Eval.exe and Euredit_evaluation.exe, each described below) and an underlying DLL.

In addition to descriptions of how to use the above applications, this guide includes test data and results.

Running the programs

1. Esummary.exe

This function computes summary statistics for each variable in a given (true) data set. These summary statistics are written to a file for use with Eval.exe.

To use this program the user has to specify information on the (true) data and results files to be used and the nature of each variable present in a data set. This information is contained in an options file.

1.1 Generating an options file

The options file for a data set can be generated manually or automatically. In both cases the resulting options files should look the same, i.e. similar to the skeleton options file:

// Options file
// Input file
Full path to data file, directories starting with upper case
//Number of observations
 Integer
//Number of variables
 Integer
// Input delineator 
 Integer  {0 - space, 1 - tab, 2 - comma}
// Not applicable bound 
 Value
// Header 
 Integer  {0 - no, 1 - yes}
// Use weights
 Integer {0 - no, 1 - yes}
// Results file
Full path to results file, directories starting with upper case
// Variable information
// Variable <var>{name of variable}<\var>
//Variable type
Integer {1 - continuous, >1 categorical}
{if categorical
//Categories
Integer  {number of categories}
Space separated list of integers {list of category values}
}
//End of variable
//End of file
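
For users who prefer to script the manual route, the skeleton above can be filled in programmatically. The following is a minimal Python sketch, not part of the Euredit software: the names `Variable` and `build_options_text` are hypothetical, and the sketch assumes that Esummary.exe accepts exactly the line layout shown in the skeleton. The skeleton does not show how a weights file or weights variable is recorded, so only the 0/1 weights flag is written, and the integer used for a categorical variable (any value greater than 1) is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Delimiter codes as listed in the skeleton: 0 - space, 1 - tab, 2 - comma.
DELIMITER_CODES = {"space": 0, "tab": 1, "comma": 2}


@dataclass
class Variable:
    """Hypothetical container for the options of one variable."""
    name: str
    type_code: int = 1                      # 1 - continuous, >1 - categorical
    categories: Optional[List[int]] = None  # required when type_code > 1


def build_options_text(data_file: str, n_obs: int, variables: Sequence[Variable],
                       delimiter: str, na_bound: float, header: bool,
                       use_weights: bool, results_file: str) -> str:
    """Return options-file text laid out like the skeleton (an assumed layout)."""
    lines = [
        "// Options file",
        "// Input file",
        data_file,
        "//Number of observations",
        f" {n_obs}",
        "//Number of variables",
        f" {len(variables)}",
        "// Input delineator",
        f" {DELIMITER_CODES[delimiter]}",
        "// Not applicable bound",
        f" {na_bound}",
        "// Header",
        f" {int(header)}",
        "// Use weights",
        f" {int(use_weights)}",
        "// Results file",
        results_file,
        "// Variable information",
    ]
    for var in variables:
        lines.append(f"// Variable <var>{var.name}<\\var>")
        lines.append("//Variable type")
        lines.append(str(var.type_code))
        if var.type_code > 1:
            if not var.categories:
                raise ValueError(f"{var.name}: a categorical variable needs categories")
            lines.append("//Categories")
            lines.append(str(len(var.categories)))
            lines.append(" ".join(str(c) for c in var.categories))
        lines.append("//End of variable")
    lines.append("//End of file")
    return "\n".join(lines) + "\n"
```

The returned text can be written to disk with any file API and compared against example_ss_opt.txt before it is loaded into Esummary.exe.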

In the manual case, an options file can be written by using any text editor. Alternatively, Esummary.exe will generate the information needed if the user fills in the form which pops up when the button Input\Edit Variable information is selected. This form should be completed in the following way.
  1. Data File Description
    1. If your analysis includes weights on each case, define the location of the weight values. For weight values in a separate file, select the Weights file button. In this case a dialogue box appears for the user to type in the full path of the weights file. Alternatively, if weight values are given by a variable in your data, select the Weights Variable button. In this case a dialogue box appears in which the user must type the appropriate variable number (not name). Numbering starts from one for the extreme left variable in the data file. The default value is No Weights
    2. In the selection box below Delimiter highlight how the data in the ASCII data file are delimited. If you are in doubt here, open the data file in a text editor and look for spaces, commas or tabs (usually four spaces) between each data value on a row.
    3. The first line (or row) of some data files may contain information distinct from the raw data values. This header information, e.g., date of file creation, needs to be skipped by the data reading software which otherwise will give an error message. The box below Header present should be ticked if the data file contains a header. If you are in doubt here, open the data file in a text editor. The default setting assumes that there is no header in data files.
  2. Variable information
    1. Enter the number of variables in the dialogue box below Number of variables
    2. Enter the number of observations on each variable in the dialogue box below Number of observations.
    3. For the ith variable in the data set perform the steps:
      • Highlight Var i in the list below Variables
      • If so desired, change the name of the variable as it appears in the dialogue box below Variable Name
      • Select the type for this variable: continuous, ordinal or nominal. If this variable is categorical (i.e. ordinal or nominal), a dialogue box will appear under the caption Categories. This box should be used to input a space-separated list of categories for the variable.
  3. Enter a value in the dialogue box underneath Valid value lower bound. Any data value below this value will be treated as not applicable. The default value for this parameter is zero.
  4. Enter the name (including path) of the file into which the summary statistics will be written into the dialogue box below Results file. There is no default value for this parameter.
  5. Enter the name (including path) of the file into which the options information can be written. Saving the file will enable the user to efficiently repeat experiments. There is no default value for this parameter.
  6. Once all the information has been entered on the form click the Save Options file button.
  7. Exit the form by clicking on the Close button.
For an example options file, open the file named example_ss_opt.txt in a text editor.
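
The Data File Description steps above suggest opening the data file in a text editor to find its delimiter and to see whether it has a header. The same inspection can be sketched in Python. This is only an illustrative heuristic under stated assumptions: the function names are hypothetical, the data values are assumed to be numeric, and a first line containing any non-numeric token is taken to be a header. It does not reproduce any check made by the Euredit software itself.

```python
from typing import Optional

# Delimiter codes used in the options file: 0 - space, 1 - tab, 2 - comma.
SEPARATORS = {0: None, 1: "\t", 2: ","}  # None splits on any run of whitespace


def guess_delimiter_code(line: str) -> int:
    """Guess the delimiter code from one data line (simple heuristic)."""
    if "\t" in line:
        return 1
    if "," in line:
        return 2
    return 0


def looks_like_header(first_line: str, delimiter_code: int) -> bool:
    """Return True if any non-empty token on the first line is not a number."""
    sep: Optional[str] = SEPARATORS[delimiter_code]
    for token in first_line.strip().split(sep):
        token = token.strip()
        if not token:
            continue  # ignore empty fields
        try:
            float(token)
        except ValueError:
            return True
    return False
```

A guess of this kind is worth confirming by eye, since a file that mixes tabs and commas would be classified as tab-delimited by the order of the tests.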

Note that loading the options file will fail if the software is unable to open the results file for writing, which will occur if the results file is being used by another application.

1.2 Calculating the summary statistics

Once the options information has been loaded or input, click on the button Calculate to create the results file (as named in the options).

2. Ediff.exe

This program is used to create two files which summarise the differences between (1) the true data and the perturbed data, and (2) the perturbed data and the imputed data.

The generated files are used as input to Eval.exe.

2.1 Computing difference files

The interface can be used in the following way to compute a difference file for true and perturbed data.

  1. At the top of the window select True/Perturbed
  2. For the true data file:
    1. Underneath True data file select from the list the data file that contains the true values. If necessary, the left hand pane may be used to change the directory and drive partition
    2. If the first line (or row) of the file contains information distinct from actual values, tick the Header box.
    3. Select from the list beneath Delimiter the character(s) used to separate values in a row of data.
  3. Repeat steps 2.1 to 2.3 for the perturbed data file
  4. Identify the name and location of the results file in which the difference information will be written, using either of the following methods. If necessary, the left hand pane may be used to change the directory and drive partition
    1. Select the name of the file from the list
    2. Type the name of the file in the dialogue box beneath the list
  5. Enter the number of records in each data file in the dialogue box beneath Number of records
  6. Enter the number of variables in each data file in the dialogue box beneath Number of Variables
  7. If the box beneath Match first column is ticked, the software will check that the value of the first variable (usually an identification number) is the same in each file for each data record. If there is a mismatch, the software will stop
  8. Click on the Compare button to compute the differences and write the output to the nominated results file
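
The comparison described in the steps above can be illustrated with a short Python sketch. It is not the Ediff.exe implementation, and the format of the real difference file is not reproduced here. The function names are hypothetical, values are compared as text (so "1" and "1.0" count as different), and the result is simply a list of (record, variable, first value, second value) tuples with numbering starting from one.

```python
from typing import List, Tuple

# Delimiter codes: 0 - space, 1 - tab, 2 - comma (None splits on whitespace).
SEPARATORS = {0: None, 1: "\t", 2: ","}


def parse_rows(text: str, delimiter_code: int, header: bool = False) -> List[List[str]]:
    """Split file contents into rows of text values, skipping a header if present."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if header:
        lines = lines[1:]
    sep = SEPARATORS[delimiter_code]
    return [[tok.strip() for tok in ln.split(sep)] for ln in lines]


def diff_records(first: List[List[str]], second: List[List[str]],
                 match_first_column: bool = True) -> List[Tuple[int, int, str, str]]:
    """List every position at which the two data sets differ."""
    if len(first) != len(second):
        raise ValueError("the files contain different numbers of records")
    diffs = []
    for rec, (a, b) in enumerate(zip(first, second), start=1):
        if len(a) != len(b):
            raise ValueError(f"record {rec}: different numbers of variables")
        # Mirrors the Match first column option: stop on an identifier mismatch.
        if match_first_column and a[0] != b[0]:
            raise ValueError(f"record {rec}: identifiers do not match")
        for var, (x, y) in enumerate(zip(a, b), start=1):
            if x != y:
                diffs.append((rec, var, x, y))
    return diffs
```

For example, parsing a true file and a perturbed file with `parse_rows` and passing both to `diff_records` returns one tuple per changed value, which is a convenient way to sanity-check a perturbed data set before running Ediff.exe.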

3. Eval.exe

Eval.exe uses the summary statistics file generated by Esummary.exe and the two difference files generated by Ediff.exe to compute the statistics as described in the Euredit document entitled Evaluation criteria for statistical editing and imputation.

The links between formulae given in the above paper and the output from the evaluation software are:

Output label   Formula   Notes
alpha          1
beta           2
delta          3
RAE            7
RRASE          8
RER            9
Dcat           10
tj             11
AREm1          12        Set: K=1
AREm2          12        Set: K=2
W              14
D              15
Eps            16
Dgen           17
Slope                    See page 19 in paper; Huber M-estimate used
t-val
mse
R^2
dL1            19
dL2            20
dLinf          21
K-S            25
K-S_1          26        Set: alpha = 1
K-S_2          26        Set: alpha = 2
m_1            28        Set: k = 1
m_2            28        Set: k = 2
MSE            30
G              13        Calculated only if probabilities are supplied
Case
A              4
B              5
C              6

3.1 Computing evaluation statistics

As explained in Evaluation criteria for statistical editing and imputation, not every statistic is calculated for all data since:

Furthermore, the robust calculation of the regression statistics Slope, t-val, mse and R^2 is not possible if the data values are identical or the median absolute deviation of the data is zero. For more detail on the implementation used to calculate these statistics, please refer to the pseudo-code.
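
The condition above can be checked for a single variable before expecting regression statistics in the output. The following Python sketch is illustrative only: the function names are hypothetical, the data are assumed to be a plain list of numbers, and the test made inside the Euredit software may differ in detail (see the pseudo-code).

```python
from statistics import median
from typing import Sequence


def median_absolute_deviation(values: Sequence[float]) -> float:
    """Median of the absolute deviations from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def robust_regression_possible(values: Sequence[float]) -> bool:
    """False if the values are all identical or their MAD is zero."""
    if not values:
        return False
    if len(set(values)) == 1:
        return False
    return median_absolute_deviation(values) > 0
```

Note that the second condition is the stronger one: a variable in which more than half of the values are equal has a median absolute deviation of zero even though its values are not all identical.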

The software writes the values of the above statistics into a file which can be viewed in Excel. The cells of any statistics that are either not relevant or cannot be computed are left blank. The final row of the Excel spreadsheet contains information on the number of errors detected by the evaluation software:

Column number   Description
1               The number of invalid values for categorical variables that are invalid
2               The number of undetected perturbed values for categorical variables that are invalid
3               The number of undetected missing values
4               The number of true values for categorical variables that are invalid

Complete the following steps to use the interface.

  1. In the list boxes underneath the appropriate headings, select the true/perturbed difference file and the perturbed/imputed difference file as computed by Ediff.exe
  2. In the list box underneath the heading Summary statistics, select the summary statistics file as computed by Esummary.exe.
  3. Input the value of the valid lower bound into the dialogue box beneath Valid value lower bound
  4. If there are probabilities associated with the error location, the Provide probabilities option should be ticked and the file of probabilities specified. This file should be in a similar format to the data files: the first column should be an observation identifier, followed by the probabilities for all variables for that observation. The space, comma or tab separator may be used. Only rows with non-zero entries need to be included in the file. Any value that is not specified will be assumed to be zero. In order to relate the included probabilities to the observations, one of the data files with a complete set of observations needs to be specified. Only the observation identifiers will be read from this file
  5. The Imputations supplied option means that all the statistics are computed. If this is not ticked, only the editing statistics are computed and the imputed values are ignored
  6. The name of the results file is generated automatically from the name of the summary statistics file. This name can, however, be changed manually in the dialogue box beneath Results File
  7. Click on the Calculate button to compute the statistics
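
The sparse probabilities file described in step 4 can be pictured with the following Python sketch, which expands such a file into one row of probabilities per observation. It is an illustration under stated assumptions rather than the Eval.exe reader: the function name is hypothetical, the identifiers are assumed to have been read already from the first column of a complete data file, and the number of probabilities expected on each row (`n_vars`) is supplied by the caller.

```python
from typing import Dict, List, Sequence

# Delimiter codes: 0 - space, 1 - tab, 2 - comma (None splits on whitespace).
SEPARATORS = {0: None, 1: "\t", 2: ","}


def load_probabilities(prob_text: str, ids: Sequence[str], n_vars: int,
                       delimiter_code: int) -> Dict[str, List[float]]:
    """Map every observation identifier to its list of probabilities."""
    # Any observation not listed in the file keeps the assumed value of zero.
    probs = {obs_id: [0.0] * n_vars for obs_id in ids}
    sep = SEPARATORS[delimiter_code]
    for line in prob_text.splitlines():
        if not line.strip():
            continue
        tokens = [tok.strip() for tok in line.split(sep)]
        obs_id, values = tokens[0], tokens[1:]
        if obs_id not in probs:
            raise ValueError(f"identifier {obs_id!r} is not in the data file")
        if len(values) != n_vars:
            raise ValueError(f"identifier {obs_id!r}: expected {n_vars} probabilities")
        probs[obs_id] = [float(v) for v in values]
    return probs
```

With identifiers 1, 2 and 3 and a file containing only a row for observation 2, observations 1 and 3 receive probabilities of zero for every variable.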

4. Euredit_evaluation.exe

This program allows the user to select the three programs described above. It retains information from one program to another, setting appropriate defaults.

A test example

A test example is included with the software. This example uses the five data files:

File name            Description
example.dat          True data values
example_ss_opt.txt   Options file for example.dat
example.pert         Perturbed data values for example.dat
example.imp          Example of imputed values for the perturbed data
example_prob.dat     A probabilities file for data in example.dat

The data values in the files in the above table are in tab-separated format, so Tab should be highlighted as the delimiter in the interfaces (where appropriate). Users should check their output against the results files:

File name                   Description
test_example_sst.txt        Summary statistics file for example.dat
test_example_pert_res.txt   Difference file for true/perturbed data
test_example_imp_res.txt    Difference file for perturbed/imputed data
test_example_res.xls        Evaluation statistics without the use of probabilities
test_example_res_prob.xls   Evaluation statistics with the use of example_prob.dat

Error codes

The evaluation software contains a number of error checks and resulting error codes. Many of these either should not occur because of checks in the GUI or are unlikely to occur in normal use. The common error numbers are trapped by the GUI, which returns simple error messages. Any other error number that occurs is simply returned as:

error_input_number (error_info_number),

where error_input_number refers to an input on the GUI and error_info_number refers to the problem within the input. If these errors occur, please check the correctness of the inputs to the program. To give an indication of the type of codes used, the following table is a guide to the input associated with each error_input_number.

GUI application   Integer code   Explanation
Esummary.exe      1              Error in data file
Esummary.exe      2              Error in results file
Esummary.exe      12             Error in weights
Ediff.exe         1              Error comparing true/perturbed values
Ediff.exe         2              Error comparing perturbed/imputed values
Ediff.exe         3              Error in results file
Eval.exe          1              Error in perturbed differences file
Eval.exe          2              Error in imputed differences file
Eval.exe          3              Error in summary statistics file
Eval.exe          4              Error in results file
Eval.exe          6              Error in probabilities file
Eval.exe          7              Error when matching record identifiers
Eval.exe          14             Error in weights
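
When an untrapped error is reported, the table above can be consulted by hand or through a small lookup such as the Python sketch below. The function name is hypothetical, and the sketch assumes that the message is displayed as two integers in the form `6 (3)`. The meaning of error_info_number is not documented in this guide, so it is returned unchanged.

```python
import re
from typing import Tuple

# (GUI application, error_input_number) -> explanation, copied from the table.
ERROR_INPUTS = {
    ("Esummary.exe", 1): "Error in data file",
    ("Esummary.exe", 2): "Error in results file",
    ("Esummary.exe", 12): "Error in weights",
    ("Ediff.exe", 1): "Error comparing true/perturbed values",
    ("Ediff.exe", 2): "Error comparing perturbed/imputed values",
    ("Ediff.exe", 3): "Error in results file",
    ("Eval.exe", 1): "Error in perturbed differences file",
    ("Eval.exe", 2): "Error in imputed differences file",
    ("Eval.exe", 3): "Error in summary statistics file",
    ("Eval.exe", 4): "Error in results file",
    ("Eval.exe", 6): "Error in probabilities file",
    ("Eval.exe", 7): "Error when matching record identifiers",
    ("Eval.exe", 14): "Error in weights",
}

# Assumed display format: "error_input_number (error_info_number)".
_MESSAGE = re.compile(r"\s*(\d+)\s*\(\s*(-?\d+)\s*\)\s*")


def explain_error(application: str, message: str) -> Tuple[str, int]:
    """Return the table explanation and the raw error_info_number."""
    match = _MESSAGE.fullmatch(message)
    if match is None:
        raise ValueError(f"unrecognised error message: {message!r}")
    input_number = int(match.group(1))
    info_number = int(match.group(2))
    explanation = ERROR_INPUTS.get((application, input_number), "Unknown input code")
    return explanation, info_number
```

For instance, `explain_error("Eval.exe", "6 (3)")` points to the probabilities file as the input to check.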