Euredit Evaluation Software Guide
Introduction
Welcome to the guide to the Euredit Evaluation software.
This software is provided under the terms of the Euredit Consortium Agreement.
The Euredit Evaluation software consists of three basic operations.
-
Computing a small set of summary statistics from the true data.
These are then used in the computing of the evaluation statistics
-
Comparing two files and reporting the differences in a standard way.
In practice two pairs of files are compared:
-
The true values files and the file with errors (perturbed data)
-
The file with errors and the corrected file (imputed data)
-
Given the two sets of differences and the summary statistics, the
evaluation statistics are computed
The Euredit Evaluation System consists of four Windows programs and an underlying DLL:
-
Esummary.exe
that computes the summary statistics (cf. 1)
-
Ediff.exe
that compares the data files (cf. 2)
-
Eval.exe
that computes the evaluation statistics (cf. 3)
-
Euredit_evaluation.exe
that provides a menu for using the above three programs together.
-
ecomp_dll.dll, the underlying DLL. This must be copied to a directory on the search path.
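Whether the DLL is visible can be checked before launching the programs. The following is a minimal sketch and not part of the Euredit software. It only looks through the directories listed in the PATH environment variable, whereas Windows also searches other locations (for example the directory containing the executable), so a negative result is not conclusive. The function name find_on_path is hypothetical.

```python
import os
from pathlib import Path


def find_on_path(filename, path_value=None):
    """Return the full path of `filename` in every PATH directory that holds it."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        if not entry:
            continue  # skip empty PATH entries
        candidate = Path(entry) / filename
        if candidate.is_file():
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    found = find_on_path("ecomp_dll.dll")
    if found:
        print("ecomp_dll.dll found at:", found[0])
    else:
        print("ecomp_dll.dll is not in any PATH directory")
```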
In addition to descriptions of how to use the above applications, this guide includes
test data and results.
Running the programs
Esummary.exe computes summary statistics for each variable in a given (true) data set.
These summary statistics are written to a file for use with Eval.exe.
To use this program the user has to specify information on the (true) data and results files
to be used and the nature of each variable present in a data set.
This information is contained in an options file.
1.1 Generating an options file
The options file for a data set can be generated manually or automatically.
In both cases the resulting options files should look the same, i.e. similar to
the skeleton options file:
// Options file
// Input file
Full path to data file, directories starting with upper case
//Number of observations
Integer
//Number of variables
Integer
// Input delineator
Integer {0 - space, 1 - tab, 2 - comma}
// Not applicable bound
Value
// Header
Integer {0 - no, 1 - yes}
// Use weights
Integer {0 - no, 1 - yes}
// Results file
Full path to results file, directories starting with upper case
// Variable information
// Variable <var>{name of variable}<\var>
//Variable type
Integer {1 - continuous, >1 categorical}
{if categorical
//Categories
Integer {number of categories}
Space separated list of integers {list of category values}
}
//End of variable
//End of file
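For users who prefer to script the manual route, the skeleton can be assembled programmatically. The sketch below is an illustration only: the function build_options_text and its arguments are hypothetical, the file paths are placeholders, and the exact spacing that Esummary.exe expects has not been confirmed here, so the output should be compared with example_ss_opt.txt before use. The guide does not say which integer above 1 denotes ordinal and which denotes nominal, so the type code is passed in by the caller.

```python
def build_options_text(data_file, n_obs, delimiter, na_bound,
                       header, use_weights, results_file, variables):
    """Assemble options-file text following the skeleton.

    `variables` is a list of (name, type_code, categories) triples.
    type_code is 1 for continuous and >1 for categorical; categories is
    a list of integer category values (ignored when type_code is 1).
    """
    lines = [
        "// Options file",
        "// Input file", data_file,
        "//Number of observations", str(n_obs),
        "//Number of variables", str(len(variables)),
        "// Input delineator", str(delimiter),  # 0 space, 1 tab, 2 comma
        "// Not applicable bound", str(na_bound),
        "// Header", str(int(header)),            # 0 no, 1 yes
        "// Use weights", str(int(use_weights)),  # 0 no, 1 yes
        "// Results file", results_file,
        "// Variable information",
    ]
    for name, type_code, categories in variables:
        lines.append("// Variable <var>" + name + "<\\var>")
        lines.append("//Variable type")
        lines.append(str(type_code))
        if type_code > 1:
            lines.append("//Categories")
            lines.append(str(len(categories)))
            lines.append(" ".join(str(c) for c in categories))
        lines.append("//End of variable")
    lines.append("//End of file")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder paths and variables, for illustration only.
    print(build_options_text(
        "C:\\Data\\example.dat", 100, 1, 0, False, False,
        "C:\\Data\\example_sst.txt",
        [("income", 1, None), ("region", 2, [1, 2, 3])],
    ))
```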
In the manual case, an options file can be written by using any text editor.
Alternatively, Esummary.exe will generate the information needed if the
user fills in the form which pops up when the button
Input\Edit Variable information is selected. This form should be completed in
the following way.
-
Data File Description
-
If your analysis includes weights on each case, define the location of the
weight values. For weight values in a separate file, select the
Weights file button. In this case a dialogue box appears for the user to
type in the full path of the weights file. Alternatively, if weight values are
given by a variable in your data, select the Weights Variable button.
In this case a dialogue box appears in which the user must type the appropriate
variable number (not name). Numbering starts from one for the extreme left variable
in the data file.
The default value is No Weights
-
In the selection box below Delimiter, highlight how the data in the ASCII
data file are delimited. If you are in doubt here, open the data file in a text editor
and look for spaces, commas or tabs (a tab usually appears as a wide gap, often the
width of four spaces) between each data value on a row.
-
The first line (or row) of some data files may contain information distinct
from the raw data values. This header information,
e.g., date of file creation, needs to be skipped by the data reading software
which otherwise will give an error message. The box below Header present
should be ticked if the data file contains a header. If you are in doubt here,
open the data file in a text editor.
The default setting assumes that there is no header in data files.
-
Variable information
-
Enter the number of variables in the dialogue box below
Number of variables
-
Enter the number of observations on each variable in
the dialogue box below Number of observations.
-
For the ith variable in the data set perform the steps:
-
Highlight Var i in the list below Variables
-
If so desired, change the name of the variable as it appears in the
dialogue box below Variable Name
-
Select the type for this variable: continuous, ordinal or nominal. If this
variable is categorical (i.e. ordinal or nominal), a dialogue box will appear
under the caption Categories. This box should be used to input a
space-separated list of categories for the variable.
-
Enter a value in the dialogue box underneath Valid value lower bound.
Any data value below this value will be treated as not applicable.
The default value for this parameter is zero.
-
Enter the name (including path) of the file to which the summary statistics
will be written in the dialogue box below Results file.
There is no default value for this parameter.
-
Enter the name (including path) of the file into which the options information can be
written. Saving the file will enable the user to efficiently repeat experiments.
There is no default value for this parameter.
-
Once all the information has been entered on the form click the
Save Options file button.
-
Exit the form by clicking on the Close button.
For an example options file, open the file named example_ss_opt.txt in a
text editor.
Note that loading the options file will fail if the software is unable to open
the results file for writing, which will occur if the results file is being used
by another application.
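When it is unclear which delimiter and header settings apply to a data file, a quick scripted check can supplement opening the file in a text editor. The sketch below is a heuristic under stated assumptions: it assumes that all data values are numeric, so a first line containing a non-numeric field is treated as a header, and it picks the first of tab, comma and space that occurs on every non-blank line. The names guess_delimiter, looks_like_header and DELIMITER_CODES are hypothetical; the integer codes are those listed in the skeleton options file.

```python
# Delimiter codes as listed in the skeleton options file.
DELIMITER_CODES = {" ": 0, "\t": 1, ",": 2}


def guess_delimiter(lines):
    """Return the first of tab, comma, space found on every non-blank line."""
    sample = [ln for ln in lines if ln.strip()]
    for char in ("\t", ",", " "):
        if sample and all(char in ln for ln in sample):
            return char
    return None  # single-column file or no consistent delimiter


def looks_like_header(first_line, delimiter):
    """Return True if any field on the first line is not a number."""
    text = first_line.rstrip("\r\n")
    fields = text.split() if delimiter == " " else text.split(delimiter)
    for field in fields:
        try:
            float(field)
        except ValueError:
            return True
    return False


if __name__ == "__main__":
    rows = ["id\tincome\tregion", "1\t2500.0\t3", "2\t1800.5\t1"]
    delim = guess_delimiter(rows)
    print("delimiter code:", DELIMITER_CODES[delim])
    print("header present:", looks_like_header(rows[0], delim))
```

A header consisting only of numbers (for example a date written as digits) would not be detected by this check, so the result should be confirmed by eye.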
1.2 Calculating the summary statistics
Once the options information has been loaded or input, click on the button
Calculate to create the results file (as named in the options).
Ediff.exe is used to create two files which summarise the differences between:
-
True data and perturbed data
-
Perturbed data and imputed data
The generated files are used as input to Eval.exe.
2.1 Computing difference files
The interface can be used in the following way to compute a difference file
for true and perturbed data.
-
At the top of the window select True/Perturbed
-
For the true data file:
-
Underneath True data file select from the list the data file that
contains the true values. If necessary, the left hand pane may be used to
change the directory and drive partition
-
If the first line (or row) of the file contains information distinct from
actual values, tick the Header box.
-
Select from the list beneath Delimiter the character(s) used to
separate values in a row of data.
-
Repeat steps 2(i) to 2(iii) for the perturbed data file
-
Identify the name and location of the results file in which the difference
information will be written:
-
Select the name of the file from the list. If necessary, the left hand pane may be used to
change the directory and drive partition
-
Type the name of the file in the dialogue box beneath the list. If necessary, the left hand pane may be used to
change the directory and drive partition
-
Enter the number of records in each data file in the dialogue box beneath
Number of records
-
Enter the number of variables in each data file in the dialogue box beneath
Number of Variables
-
If the box beneath Match first column is ticked, the software checks that
the value of the first variable (usually an identification number) is the same
in each file for each data record. If there is a mismatch, the software
stops.
-
Click on the Compare button to compute the differences and write the output
to the nominated results file
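The comparison carried out in the steps above can be pictured with a short sketch. This is not the Ediff.exe implementation, and the format of the real difference file is not described in this guide, so the hypothetical function compare_files simply returns a list of (record, variable, value in first file, value in second file) tuples. Values are compared as text, so 1 and 1.0 would count as different here.

```python
import csv


def compare_files(path_a, path_b, delimiter="\t", header=False,
                  match_first_column=True):
    """Return (record, variable, value_a, value_b) for every differing cell."""
    differences = []
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = csv.reader(fa, delimiter=delimiter)
        rows_b = csv.reader(fb, delimiter=delimiter)
        if header:
            # Skip the header line in both files.
            next(rows_a, None)
            next(rows_b, None)
        # zip stops at the end of the shorter file.
        for record, (row_a, row_b) in enumerate(zip(rows_a, rows_b), start=1):
            if match_first_column and row_a[:1] != row_b[:1]:
                raise ValueError(
                    "identifier mismatch at record %d: %r vs %r"
                    % (record, row_a[:1], row_b[:1])
                )
            for variable, (a, b) in enumerate(zip(row_a, row_b), start=1):
                if a != b:
                    differences.append((record, variable, a, b))
    return differences
```

Raising an error on an identifier mismatch mirrors the behaviour described for the Match first column option, where the software stops at the first mismatch.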
Eval.exe uses the summary statistics file generated by Esummary.exe
and the two difference files generated by Ediff.exe to compute the
statistics as described in the Euredit document entitled
Evaluation criteria for statistical editing and imputation.
The links between formulae given in the above paper and the output from the evaluation
software are:
Output label | Formula | Notes |
alpha | 1 | |
beta | 2 | |
delta | 3 | |
RAE | 7 | |
RRASE | 8 | |
RER | 9 | |
Dcat | 10 | |
tj | 11 | |
AREm1 | 12 | Set: K=1 |
AREm2 | 12 | Set: K=2 |
W | 14 | |
D | 15 | |
Eps | 16 | |
Dgen | 17 | |
Slope | | See page 19 in paper; Huber M-estimate used |
t-val | | |
mse | | |
R^2 | | |
dL1 | 19 | |
dL2 | 20 | |
dLinf | 21 | |
K-S | 25 | |
K-S_1 | 26 | Set: alpha = 1 |
K-S_2 | 26 | Set: alpha = 2 |
m_1 | 28 | Set: k = 1 |
m_2 | 28 | Set: k = 2 |
MSE | 30 | |
G | 13 | Calculated only if probabilities are supplied |
Case | | |
A | 4 | |
B | 5 | |
C | 6 | |
3.1 Computing evaluation statistics
As explained in Evaluation criteria for statistical editing and imputation,
not every statistic is calculated for all data since:
-
some statistics, e.g., W, are suitable only for categorical variables
-
some statistics, e.g., Dgen, are suitable only for ordinal variables
-
some statistics, e.g., dL1, are suitable only for continuous variables
Furthermore, the robust calculation of the regression statistics Slope,
t-val, mse and R^2 is not possible if the data values
are identical or the median absolute deviation of the data is zero.
For more detail on the implementation used to calculate these statistics,
please refer to the pseudo-code.
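The condition on the regression statistics can be checked in advance. The sketch below uses the usual unscaled definition of the median absolute deviation (the median of the absolute deviations from the median); whether the software applies a scaling constant, and to exactly which values it applies the check, is not stated in this guide, so both are assumptions. The function names are hypothetical.

```python
from statistics import median


def median_absolute_deviation(values):
    """Unscaled MAD: median of absolute deviations from the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def robust_regression_possible(values):
    """False when all values are identical or the MAD is zero."""
    if len(set(values)) <= 1:
        return False
    return median_absolute_deviation(values) > 0


if __name__ == "__main__":
    print(robust_regression_possible([1, 2, 3, 4, 5]))
    # More than half the values are equal, so the MAD is zero
    # even though the values are not all identical.
    print(robust_regression_possible([1, 1, 1, 1, 5]))
```

The second case shows why the two conditions are listed separately: a variable dominated by a single repeated value has a zero median absolute deviation without being constant.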
The software writes the values of the above statistics into a file which can be viewed in
Excel. The cells of any statistics that are either not relevant or cannot be computed
are left blank. The final row of the Excel spreadsheet contains information on the
number of errors detected by the evaluation software:
Column number | Description |
1 | The number of invalid values for categorical variables |
2 | The number of undetected perturbed values for categorical variables that are invalid |
3 | The number of undetected missing values |
4 | The number of true values for categorical variables that are invalid |
Complete the following steps to use the interface.
-
In the list boxes underneath the appropriate headings, select the true/perturbed
difference file and the perturbed/imputed difference file as computed by
Ediff.exe
-
In the list box underneath the heading Summary statistics,
select the summary statistics file as computed by Esummary.exe.
-
Input the value of the valid lower bound into the dialogue box beneath
Valid value lower bound
-
If there are probabilities associated with the error location, the
Provide probabilities option should be ticked and the file of probabilities
specified. These should be in a similar format to the data files.
The first column should be an observation identifier, followed by the probabilities
for all variables for that observation. The space, comma or tab separator may be used.
Only rows with non-zero entries need to be included in the file. Any value that is
not specified will be assumed to be zero. In order to relate the included probabilities
to the observations one of the data files with a complete set of observations needs to
be specified. Only the observation identifiers will be read from this file.
-
The Imputations supplied option means that all the statistics are computed.
If this is not ticked, only the editing statistics are computed and the imputed values
are ignored
-
The name of the results file is generated automatically from the name of the
summary statistics file. This name can, however, be changed manually in the
dialogue box beneath Results File
-
Click on the Calculate button to compute the statistics
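The sparse layout of the probabilities file described in the steps above can be illustrated as follows. This is a sketch of how such a file could be expanded to one row per observation, not the method used by Eval.exe. It assumes that every listed row carries one probability per variable, and the function name expand_probabilities is hypothetical.

```python
def expand_probabilities(all_ids, sparse_rows, n_vars):
    """Return {identifier: [p_1, ..., p_n_vars]}, zeros for unlisted rows.

    all_ids     -- identifiers read from a complete data file
    sparse_rows -- rows of the probabilities file, already split into
                   fields: [identifier, p_1, ..., p_n_vars]
    """
    full = {obs_id: [0.0] * n_vars for obs_id in all_ids}
    for row in sparse_rows:
        obs_id = row[0]
        probs = [float(p) for p in row[1:]]
        if obs_id not in full:
            raise KeyError("identifier %r not in the data file" % obs_id)
        if len(probs) != n_vars:
            raise ValueError("expected %d probabilities for %r"
                             % (n_vars, obs_id))
        full[obs_id] = probs
    return full


if __name__ == "__main__":
    ids = ["1", "2", "3"]
    rows = [["2", "0.5", "0"]]  # only observation 2 has a non-zero entry
    print(expand_probabilities(ids, rows, n_vars=2))
```

This also shows why a complete data file must be specified: without the full list of identifiers, the observations omitted from the probabilities file could not be given their zero rows.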
Euredit_evaluation.exe allows the user to select the three programs described above.
It retains information from one program to another, setting appropriate defaults.
A test example is included with the software. This example uses the five data files:
File name | Description |
example.dat | True data values |
example_ss_opt.txt | Options file for example.dat |
example.pert | Perturbed data values for example.dat |
example.imp | Example of imputed values for the perturbed data |
example_prob.dat | A probabilities file for data in example.dat |
The data values in the files in the above table are in tab-separated format, therefore
Tab should be highlighted in the interfaces as the delimiter (where appropriate).
Users should check their output against the results files:
File name | Description |
test_example_sst.txt | Summary statistics file for example.dat |
test_example_pert_res.txt | Difference file for true/perturbed data |
test_example_imp_res.txt | Difference file for perturbed/imputed data |
test_example_res.xls | Evaluation statistics without the use of probabilities |
test_example_res_prob.xls | Evaluation statistics with the use of example_prob.dat |
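The check against the reference text files can be scripted. The sketch below compares two text files line by line using Python's difflib module. It is intended for the .txt results files only, since the internal format of the .xls files is not described in this guide, and small differences in number formatting would be reported as differences. The function name report_differences and the name of the user's own output file are hypothetical.

```python
import difflib


def report_differences(produced_path, reference_path):
    """Return unified-diff lines; an empty list means the files agree."""
    with open(produced_path) as f:
        produced = f.read().splitlines()
    with open(reference_path) as f:
        reference = f.read().splitlines()
    return list(difflib.unified_diff(
        reference, produced,
        fromfile=reference_path, tofile=produced_path, lineterm="",
    ))


if __name__ == "__main__":
    # "my_example_sst.txt" is a placeholder for the user's own output.
    diff = report_differences("my_example_sst.txt", "test_example_sst.txt")
    print("\n".join(diff) if diff else "files are identical")
```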
Error codes
The evaluation software contains a number of error checks and resulting error codes.
Many of these either should not occur because of checks in the GUI or are unlikely to
occur in normal use. The common error numbers have been trapped by the GUI and simple
error messages returned. Any other error number that occurs is simply returned as:
error_input_number (error_info_number),
where error_input_number refers to an input on the GUI and error_info_number refers to
the problem within the input. If these errors occur please check the correctness of the
inputs to the program. To give an indication of the type of codes used, the following
gives a guide to the input for each error_input_number.
GUI application | Integer code | Explanation |
Esummary.exe | 1 | Error in data file |
Esummary.exe | 2 | Error in results file |
Esummary.exe | 12 | Error in weights |
Ediff.exe | 1 | Error comparing true/perturbed values |
Ediff.exe | 2 | Error comparing perturbed/imputed values |
Ediff.exe | 3 | Error in results file |
Eval.exe | 1 | Error in perturbed differences file |
Eval.exe | 2 | Error in imputed differences file |
Eval.exe | 3 | Error in summary statistics file |
Eval.exe | 4 | Error in results file |
Eval.exe | 6 | Error in probabilities file |
Eval.exe | 7 | Error when matching record identifiers |
Eval.exe | 14 | Error in weights |
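The table above can be turned into a small lookup for interpreting untrapped errors. The sketch below assumes that the message appears literally as two integers in the form 3 (12), which is an assumption about the on-screen format. The meaning of error_info_number is not documented here, so it is passed through unchanged. The name explain_error is hypothetical.

```python
import re

# (application, error_input_number) -> explanation, copied from the table.
ERROR_INPUTS = {
    ("Esummary.exe", 1): "Error in data file",
    ("Esummary.exe", 2): "Error in results file",
    ("Esummary.exe", 12): "Error in weights",
    ("Ediff.exe", 1): "Error comparing true/perturbed values",
    ("Ediff.exe", 2): "Error comparing perturbed/imputed values",
    ("Ediff.exe", 3): "Error in results file",
    ("Eval.exe", 1): "Error in perturbed differences file",
    ("Eval.exe", 2): "Error in imputed differences file",
    ("Eval.exe", 3): "Error in summary statistics file",
    ("Eval.exe", 4): "Error in results file",
    ("Eval.exe", 6): "Error in probabilities file",
    ("Eval.exe", 7): "Error when matching record identifiers",
    ("Eval.exe", 14): "Error in weights",
}


def explain_error(application, message):
    """Translate 'input (info)' into a readable hint, or None if unparsable."""
    match = re.fullmatch(r"\s*(\d+)\s*\(\s*(-?\d+)\s*\)\s*", message)
    if match is None:
        return None
    input_number = int(match.group(1))
    info_number = int(match.group(2))
    description = ERROR_INPUTS.get((application, input_number),
                                   "Unknown input")
    return "%s (detail code %d)" % (description, info_number)


if __name__ == "__main__":
    print(explain_error("Eval.exe", "6 (2)"))
```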