Scientific and Technical Approach
In EUREDIT the fundamental approach involves identifying sound scientific, technical and user-oriented criteria to enable a meaningful comparison of current and promising new methods for data editing and imputation.
Representative data sets arising in household surveys, business surveys, censuses, panel surveys, time series and business registers will be selected for use throughout the lifetime of the project. The first step is to choose, from those available from the project's end-users, data sets that provide sufficiently broad coverage of the range of error attributes. To guide this selection, the project will refine the basic classification of error attributes shown below, which is based on those known to occur in real data sets.
Some important attributes known to affect the quality of edited and imputed data are:

|Attribute|Some possible instances of attribute|
|Type of error|inconsistencies, missingness and amount of missingness, "outlyingness" and amount of "outlyingness"|
|Nature of error| |
|Type of variable|nominal, ordinal and continuous variables|
|Degree of non-response|item non-response, unit non-response|
|Type of data set|social surveys, business surveys, censuses, panel data, administrative registers|
Approximately four data sets covering the above factors, in isolation or in combination, will be selected, providing a mix of real data sets reflecting the heterogeneity of European sources as well as simulated ones with known types of errors. The final selection of data sets will be stored on CD-ROM for distribution across the project partners for evaluation of editing and imputation methods, and ultimately for wider dissemination (subject to confidentiality constraints).
It is vital to ensure the integrity and validity of the experimental procedures used in evaluating the diverse range of methods proposed here, and this will be achieved through the development of a methodological framework early in the project. This framework will prescribe a set of common procedures to be followed in the evaluation of all methods in the EUREDIT project. The framework will provide both guidelines for training in the case of adaptive methods such as neural networks, and criteria for evaluating the actual performance of such a system. The quality of an editing and imputation method can be measured in terms of: (a) extent of real errors identified; (b) extent of incorrect values imputed; (c) number of changes to the data; (d) preservation of the structure of the data in terms of marginal and joint distributions as well as aggregate characteristics; (e) plausibility of imputed values; (f) sensitivity to assumptions about the nature of the underlying data; and (g) operational efficiency.
The first criterion (a) relates to editing and reflects the fact that an ideal error localisation procedure ought to identify the presence of all errors. The second criterion (b) relates to imputation and reflects the fact that an ideal imputation procedure ought to recover the correct values exactly, though in practice this is extremely rare. Criterion (c) refers to the desirability that the combined edit and imputation procedures lead to a minimum number of changes in the data for given levels of the other criteria. Criterion (d) represents what might be considered a minimal "statistical" requirement, in that the distributions of the correct and imputed data are the same, or at least not significantly different. Note that the distributions in (d) can be interpreted in a very wide sense. For time-series data this may well be the conditional distribution of a missing value at a point in time, given all past realisations of that value. For cross-sectional data these distributions may be the individual marginal distribution of each missing variable, right through to the joint distribution of all the missing variables. A particular feature of joint distributions is that some combinations of values of variables may be impossible or at least implausible. Criterion (e) refers to the desirability that imputation does not lead to implausible imputed values. All edit and imputation methods rely either explicitly or implicitly on assumptions about how the observed data relates to the underlying ‘correct’ data. Criterion (f) refers to the desirability that methods are robust to departures from the assumptions upon which they are based. Finally, methods may be compared according to operational criteria, such as their costs, either computational costs or the savings achieved through automatic rather than clerical procedures. This is criterion (g).
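To make criteria (a) and (d) concrete, the sketch below (our own illustration, not an EUREDIT deliverable; function names are hypothetical) scores an editing run by the fraction of true errors it flags, and an imputation run by the two-sample Kolmogorov-Smirnov distance between the distributions of true and imputed values:

```python
def detection_rate(flagged, true_error_idx):
    """Criterion (a): fraction of the true errors that the editing step flagged."""
    return len(set(flagged) & set(true_error_idx)) / len(true_error_idx)

def ks_statistic(xs, ys):
    """Criterion (d): two-sample Kolmogorov-Smirnov distance between the
    empirical distributions of true values xs and imputed values ys."""
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in points)
```

A small KS distance indicates that imputation preserved the marginal distribution, even when individual imputed values differ from the truth, which is exactly the distinction criteria (b) and (d) draw.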
The methodological framework underlying these criteria will form the basis for development of a set of rules that will ensure fair and valid comparisons between the different approaches. These evaluation criteria will take account of the different objectives of editing and imputation experiments, the presence of mixed-mode data sets, and the need for operational viability. The resulting valid, statistically sound criteria for measuring attributes (a) to (g) above will be translated into appropriate performance criteria to guide evaluation of methods in the project.
These performance criteria will need to be defined in terms of the outputs of the edit and imputation (E&I) methods. This is a difficult problem since there is no "correct" data set that can act as a reference, and in order to proceed it is necessary to make assumptions about the nature of the underlying population and the error generating process. EUREDIT will develop and evaluate indices for the purpose of indicating the extent to which an imputed data value can be trusted.
The development of performance criteria will involve comparing: (i) the estimates obtained before E&I; (ii) the estimates obtained after E&I; and (iii) the estimates that would have been obtained if there were no errors. We shall operationalise this comparison using the datasets produced under WP2. For real 'error-prone' datasets only the estimates under (i) and (ii) can be calculated. To operationalise the estimates under (iii), we shall simulate datasets with known error characteristics. This simulation will first involve analyses of the error-prone datasets and the use of the results of other methodological research to derive realistic models generating the patterns of missing data and potential measurement error. 'True' datasets will then be constructed from complete records in the real datasets. These true datasets will be assumed error-free and will provide the source of the estimates under (iii) above. Errors will then be simulated in the true datasets, based on the models fitted to the error patterns, and the resulting simulated datasets will be used to obtain comparable estimates under (i) and (ii). Since it is impossible to be certain about the nature of error on the basis of observed datasets, it will be necessary to consider alternative sets of assumptions about the error-generation process when constructing the simulated datasets. The evaluation will then include a study of the impact of these alternative assumptions.
Different E&I methods will be evaluated using both real datasets, comparing (i) and (ii) to assess whether the methods make more than a negligible difference (compared, for example, to the standard error of the estimates), and simulated datasets, comparing (i), (ii) and (iii). In the latter case the methods will be judged to be better, the closer the estimates in (ii) are to those in (iii), with the quality of the estimates in (i) treated as a benchmark.
Such numerical comparisons will feed into a consideration of the impact of the E&I methods on the overall bias and variance of estimates. The general aim will be to reduce bias without creating an offsetting increase in variance.
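A minimal sketch of this simulation-based comparison, assuming a single numeric variable, mean estimation, missing-completely-at-random errors and mean imputation as a stand-in baseline method (all names and defaults are ours):

```python
import random
import statistics

def simulate_evaluation(true_data, impute, error_rate=0.2, reps=100, seed=1):
    """Monte Carlo sketch of the (i)/(ii)/(iii) comparison: repeatedly inject
    missingness into an assumed error-free 'true' dataset, impute, and record
    the gap between the post-imputation estimate (ii) and the true-data
    estimate (iii). Returns the bias and variance of that gap."""
    rng = random.Random(seed)
    target = statistics.mean(true_data)  # estimate (iii), from the true data
    gaps = []
    for _ in range(reps):
        observed = [x if rng.random() > error_rate else None for x in true_data]
        completed = impute(observed)     # estimate (ii), after E&I
        gaps.append(statistics.mean(completed) - target)
    return statistics.mean(gaps), statistics.pvariance(gaps)

def mean_impute(obs):
    """Hypothetical baseline: replace each missing value by the observed mean."""
    m = statistics.mean([x for x in obs if x is not None])
    return [m if x is None else x for x in obs]
```

The same harness can wrap any candidate E&I method, so the bias/variance trade-off discussed above becomes a direct numerical comparison across methods.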
In evaluating the different methods, it will be important to consider a wide range of types of estimates. Some imputation methods can be successful for estimating univariate means or proportions but can lead to severe bias when estimating distributional quantities or multivariate characteristics. Thus, it will be important to study the latter kinds of quantities, which arise frequently in government statistical contexts.
Once data sets have been selected and a methodological framework for fair and meaningful evaluation is in place, the actual evaluation of the various selected methods can proceed. The work in this part of the project follows two main themes. One theme concerns the establishment of a baseline, against which new methods can ultimately be compared, using the best of current practice among methods already in use for editing and imputation. The second theme concerns the development and adaptation of selected new methods considered to show potential benefits when applied to editing and/or imputation problems.
A review of the currently used methods for data editing will be carried out. The most widely used editing methods will be considered, such as those based on the Fellegi-Holt approach and NIM. EUREDIT will also adapt and evaluate currently used methods for data imputation, including: class imputation, hot-deck imputation, random imputation, nearest neighbour donor imputation, data augmentation and regression imputation. The current best practice methods will provide a benchmark for new methods.
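As an illustration of one of these benchmark methods, a minimal nearest-neighbour donor imputation sketch (the record layout and function names are our own):

```python
def nn_donor_impute(records, target_field, match_fields):
    """Nearest-neighbour donor imputation sketch: for each record with a
    missing target field, copy that field from the complete record ('donor')
    closest on the matching fields (squared Euclidean distance)."""
    donors = [r for r in records if r[target_field] is not None]

    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in match_fields)

    out = []
    for r in records:
        if r[target_field] is None:
            donor = min(donors, key=lambda d: dist(r, d))
            r = dict(r, **{target_field: donor[target_field]})
        out.append(r)
    return out
```

Because every imputed value is copied from a real respondent, donor methods automatically satisfy criterion (e), plausibility, although they may still distort joint distributions, criterion (d).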
The new methods selected for detailed investigation in EUREDIT are described below.
1. Outlier-robust methods for editing and imputation.
A very important aspect of statistical data editing is outlier detection. Besides graphical tools, robust mathematical algorithms can be used to detect outliers. Robust multivariate outlier detection methods will be developed that can be applied to the high-dimensional mixed-mode data sets that occur in much of official statistics (as well as in business and financial applications). In this context the research will distinguish between methods for identifying non-representative outliers and representative outliers in these data (Chambers, 1986; Chambers and Kokic, 1993). Non-representative outliers are either sample units with data values that are incorrect or sample elements with data values that are unique in the population. Representative outliers are sample units with data values that have been correctly observed and that are not unique in the population. Dealing with outliers is considered an essential part of the edit and imputation process. In EUREDIT, outlier detection theory will be developed and evaluated in both editing and imputation across the data sets. Most model-based imputation methods are univariate in nature, can handle only continuous data and are typically based on outlier-sensitive statistical criteria (e.g. least squares), see (GSS, 1996). However, real errors in data are usually multivariate and consist of a mix of categorical and highly skewed continuous variables. For imputation, an outlier-robust multivariate model-based imputation method will be developed. The aim will be to develop procedures that preserve distributional structure and incorporate internal consistency constraints, while also remaining robust to representative outliers in the data. This will require robust modelling of the population data generating process based on the information in the observed data as well as that provided by application of outlier-robust error localisation methods.
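A greatly simplified sketch of robust outlier scoring along these lines, using coordinate-wise medians and MADs in place of a full robust covariance estimate such as MCD (names are ours; a production method would model the covariance between variables as well):

```python
import statistics

def robust_scores(rows):
    """Score each row by its distance from the coordinate-wise median, with
    each column scaled by its median absolute deviation (MAD). Because
    medians and MADs are themselves insensitive to outliers, extreme rows
    cannot mask themselves by inflating the location and scale estimates."""
    cols = list(zip(*rows))
    med = [statistics.median(c) for c in cols]
    # Guard against zero MAD in a constant column (fall back to scale 1.0).
    mad = [statistics.median([abs(v - m) for v in c]) or 1.0
           for c, m in zip(cols, med)]
    return [sum(((v - m) / s) ** 2 for v, m, s in zip(row, med, mad)) ** 0.5
            for row in rows]
```

Rows with the largest scores are the error-localisation candidates; whether a flagged row is treated as a non-representative outlier (to be corrected) or a representative one (to be kept but downweighted) is a separate decision.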
2. Multi-Layer Perceptron (MLP) neural networks for editing and imputation.
Multi-Layer Perceptrons (MLP) are possibly the most widely known and successful type of neural network. They use a feedforward architecture and are typically trained using a procedure called error-backpropagation. After training with a representative data set, the result is a pattern classifier or recogniser. The input data space is effectively partitioned by a set of (possibly intersecting) hyperplanes, parameterised by the network "weights" learned during training. The response to a new (unseen) pattern is determined by the position of the input vector in the input space and the way the output neurone weights have been trained. MLPs have been used successfully in a broad range of applications involving the recognition or classification of patterns in data. In EUREDIT, MLP-based prototypes will be developed and evaluated for both editing and imputation across the datasets. The basis for this approach has already been established by some smaller-scale studies, for example (Nordbotten, 1995; Nordbotten, 1996).
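The following toy sketch, assuming a single continuous target variable and complete covariates, shows the shape of such an MLP-based imputer: a one-hidden-layer network trained by batch gradient descent (error-backpropagation), whose fitted predictor then supplies imputed values. All names and hyperparameters are illustrative:

```python
import numpy as np

def train_mlp_imputer(X, y, hidden=8, lr=0.05, epochs=3000, seed=0):
    """Train a tiny tanh MLP on complete records (X, y); the returned
    function predicts y for records where it is missing."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.5, (X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)              # forward pass
        pred = h @ W2 + b2
        err = pred - y                        # gradient of squared error
        gW2 = h.T @ err / n
        gb2 = err.mean()
        dh = np.outer(err, W2) * (1 - h ** 2) # backpropagate through tanh
        gW1 = X.T @ dh / n
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Xnew: np.tanh(Xnew @ W1 + b1) @ W2 + b2
```

In an E&I setting the training set would be the clean, complete records, and the fitted network would be applied to records failing edits or missing the target item.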
3. Correlation Matrix Memory (CMM) neural networks for editing and imputation.
Correlation Matrix Memory (CMM) forms the basis of a family of powerful data search and match techniques collectively known as AURA. Although less well-known than the other neural networks considered here, the AURA system has particular advantages over most other neural networks in applications involving larger data sets and/or missing data fields within a record. The AURA system (Austin, 1996; Austin and Lees, 1999) has been developed to allow fast search on data that contains errors and may be incomplete, by drawing upon neural network based methods. Development of AURA has involved over 20 man-years of effort supported through UK national R&D funding as well as commercial collaborations. A powerful feature provided by CMM is a highly flexible search ability that, besides retrieving exact match records, can readily retrieve records that only partially match a query as part of the same operation. In the EUREDIT project CMM neural networks will be studied as potential techniques for both data editing and imputation. Briefly, for error localisation, the CMM network will be trained to correctly recognise records drawn from the main population. Records with erroneous combinations of data points will not match closely with the training data and may be classed as outliers, the closeness of match being a tuneable CMM parameter. For data imputation, a similar CMM network can be used to identify one or more candidate values to replace erroneous fields. The actual choice of replacement value will be based on a replacement policy that could take one of several different forms, including a form based on a statistical model of the underlying distributions represented by the training set. Members of the consortium have many years’ experience of applying these methods in a range of domains.
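A minimal sketch of the store/recall cycle described above, using a binary correlation matrix memory with one-hot record labels (class and method names are ours, not the AURA API):

```python
import numpy as np

class BinaryCMM:
    """Minimal correlation-matrix-memory sketch: binary input patterns are
    Hebbian-stored against one-hot record labels. Recall scores every stored
    record by its overlap with the query, so a partially matching query still
    retrieves the closest stored records in a single operation."""
    def __init__(self, input_bits, capacity):
        self.M = np.zeros((input_bits, capacity), dtype=int)
        self.n = 0
    def store(self, pattern):
        label = np.zeros(self.M.shape[1], dtype=int)
        label[self.n] = 1
        self.M |= np.outer(pattern, label)  # Hebbian (binary OR) update
        self.n += 1
    def recall(self, query):
        scores = query @ self.M             # overlap with each stored record
        return int(np.argmax(scores[: self.n])), scores[: self.n]
```

Thresholding the overlap scores gives the "closeness of match" tuning described above: records whose best score falls below the threshold do not resemble the training population and can be flagged for editing.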
4. Self-Organising Map (SOM) neural networks for editing and imputation.
Self-Organising Map (SOM) neural networks have been widely applied and studied during the past 10-15 years, and were first devised and studied by (Kohonen, 1989). SOMs use an unsupervised learning procedure in which the network self-adapts to model the distribution of a data set presented repeatedly to the network inputs (in contrast with the other neural networks described here, which require the specification of target values for the network's output states). SOMs can successfully detect data attributes such as: (a) variables that are out of limits for data belonging to a specific class; (b) data that does not belong to any previously seen class; and (c) a distribution of collected data that differs from the "usual case". The underlying principle of abnormality detection is that normal, clean data is used as training material to build a mixture density model of "normal" situations; anything that does not fit this model is potentially erroneous and must be examined more carefully. This abnormality detection will be developed using the Neural Data Analysis (NDA) software (Häkkinen and Koikkalainen, 1997), which provides an interactive environment for development. This will involve the development of SOM-type neural networks for data imputation. Estimators of the missing values will be based on knowledge about the whole population. As a result, a density difference between the collected data and the data used to build the estimator is obtained, and this is used as prior information for the estimator. A set of estimators will be trained for differently behaving data partitions. These estimators can be traditional methods, or neural networks such as the Multi-Layer Perceptron (MLP). The estimator outputs are combined into a single output using the prior information and Bayesian inference. Related work is described in (Lensu and Koikkalainen, 1998).
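A toy sketch of SOM-based imputation along these lines, assuming numeric data with missing values coded as NaN (all names and parameters are illustrative):

```python
import numpy as np

def train_som(data, n_units=8, epochs=30, seed=0):
    """1-D SOM sketch: prototype vectors self-organise to the training data
    under a shrinking learning rate and neighbourhood width."""
    rng = np.random.default_rng(seed)
    protos = data[rng.integers(0, len(data), n_units)].astype(float).copy()
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)
        sigma = max(1.0, (n_units / 2) * (1 - t / epochs))
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))
            dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(dist ** 2) / (2 * sigma ** 2))  # neighbourhood
            protos += lr * h[:, None] * (x - protos)
    return protos

def som_impute(record, protos):
    """Complete a record from its best-matching prototype, with the match
    computed on the observed fields only."""
    obs = ~np.isnan(record)
    bmu = np.argmin(((protos[:, obs] - record[obs]) ** 2).sum(axis=1))
    out = record.copy()
    out[~obs] = protos[bmu, ~obs]
    return out
```

Because the match uses only the observed fields, the same trained map handles arbitrary missingness patterns, which is the property that makes SOMs attractive for imputation.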
5. Support Vector Machines (SVM) for imputation.
Over the last few years, methods for constructing a new type of universal learning machine, based on statistical learning theory, have been developed. In contrast to traditional learning techniques, these novel methods can control the generalisation ability of different types of machines that use different sets of decision functions (polynomials, radial basis functions, neural networks etc.). The Support Vector Machine (Vapnik, 1996) makes it possible to avoid the "curse of dimensionality" and therefore to process very high-dimensional data sets. The SVM implements the Structural Risk Minimisation (SRM) principle for a special type of structure that guarantees the smallest probability of error on the test set, given a set of decision functions and a structure on this set. These types of method have been shown to be important for applications in computer vision, biological databases, data mining, etc., which often require the processing of millions or even billions of attributes. EUREDIT will investigate the use of SVM in the context of imputation. Given its high generalisation ability, the SVM has the potential to deal with a variety of problems including imputation. Members of the consortium have considerable experience with SVM methods and Vapnik will be involved in the project.
6. Fuzzy logic and nonparametric regression methods for imputation.
Time series data are an important component of the outputs of virtually all National Statistical Institutes (NSIs), as well as the basis for a substantial part of the analytic work carried out in the financial sector of most EU economies. Many of these data are obtained via panel-type designs, that is, where the same group of respondents is measured over a period of time. In this situation, not all respondents are measured at the same time, and many do not respond completely during "their" measurement interval. The result is a data set with time and cross-sectional characteristics that contains many "holes", all of which need to be "filled" prior to analysis. This is especially the case when the analysis is focussed on the development of models that explain both time series and cross-sectional behaviour. Such models are being increasingly used in microsimulation of economic performance for different sectors of national economies (Kokic, Chambers and Beare, 1999), as well as in financial management of investment portfolios (Kokic, Breckling and Eberlein, 1999). EUREDIT will evaluate the combined use of fuzzy logic and neural networks in imputation, as well as the development of nonparametric regression based alternatives to parametric methods for imputation. Fuzzy logic allows an alternative representation of imprecision in data values and rules, which has proved useful in a number of domains, such as control engineering and market forecasting (Insiders, 1999). Nonparametric regression (Härdle, 1990; Breckling and Sassin, 1995) allows highly nonlinear relationships between incompletely observed variables and fully observed ones to be efficiently exploited in imputation. EUREDIT will develop and evaluate a new imputation approach combining elements of neural networks, fuzzy logic, nonparametric regression and various parametric techniques.
Although this approach will be applied to other data sets in EUREDIT, particular emphasis will be placed on panel time series data and applications to financial data sets.
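As a minimal illustration of nonparametric regression imputation, a Nadaraya-Watson kernel smoother (our own sketch, assuming a Gaussian kernel and a single covariate):

```python
import math

def nw_impute(x_missing, pairs, bandwidth=1.0):
    """Nadaraya-Watson kernel regression sketch: impute y at x_missing as a
    Gaussian-kernel weighted average of the observed (x, y) pairs, so a
    nonlinear x-y relationship is exploited without a parametric model."""
    weights = [math.exp(-((x_missing - x) ** 2) / (2 * bandwidth ** 2))
               for x, _ in pairs]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, pairs)) / total
```

The bandwidth plays the role of a smoothing parameter: small values track local structure (here, the curvature of y against x), while large values shrink the imputation towards the overall mean.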
All the experimental evaluations of methods described above will finally be collated for an overall evaluation, to identify the strategic selections of methods (new or existing) that provide optimal-quality data editing and imputation results for each particular combination of error attributes in a particular dataset.
On the basis of this assessment, the better-performing methods will be selected for wider dissemination and use. A common-platform software prototype will be developed to allow users within EUREDIT access to all the best-performing methods for immediate dissemination and use. The software will use a modular architecture into which individual modules implementing each selected method can be easily added (or removed). This software will be developed using industry standard methods (e.g. C/C++ under Unix/NT as appropriate).
Austin J, 1996. Distributed Associative memories for High Speed Symbolic Reasoning, Fuzzy Sets and Systems, No 82, pp. 223-233, Elsevier, 1996.
Austin J, Lees K, 1999. A Novel Search Engine Based on Correlation Matrix Memories, Neurocomputing - Special Issue, Elsevier Science, (awaiting publication).
Breckling J, Sassin O, 1995. A Non-parametric Approach to Time Series Forecasting, In ZEW-Wirtschaftsanalysen Band 5: Quantitative Verfahren im Finanzmarktbereich, ed. M. Schroder. Nomos Verlagsgesellschaft, Baden-Baden, 1995.
Chambers RL, 1986. Outlier Robust Finite Population Estimation, Journal of the American Statistical Association 81, 1063-1069, 1986.
Chambers RL, Kokic PN, 1993. Outlier Robust Sample Survey Inference, Invited Paper, Proceedings of the 49th Session of the International Statistical Institute, Firenze, August 25 - September 2, 1993.
Fellegi IP, Holt D, 1976. A systematic approach to automatic edit and imputation, Journal of the American Statistical Association, Vol 71, No 353, March 1976.
Granquist 1998. Efficient editing – improving data quality by modern editing. Proceedings NTTS Conference, Sorrento, 1998.
GSS, 1996. Report of the Task Force on Imputation, Government Statistical Service Methodology Series No. 3, UK, 1996.
Häkkinen E, Koikkalainen P, 1997. The Neural Data Analysis environment, In proc. WSOM'97: Workshop on Self-Organizing Maps, June 4-6, 1997, Helsinki University of Technology, Neural Networks Research Centre. Espoo, Finland.
Härdle W, 1990. Applied Nonparametric Regression, Cambridge University Press: Cambridge UK, 1990.
Hulliger B, 1995. Outlier robust Horvitz-Thompson estimators, Survey Methodology, Vol 21 No 1: 79-87.
Insiders, 1999. Financial Market Navigator: A run-time system for interpretation, predictions and simulation of trading strategies including analysis based on neural networks and fuzzy logic technologies, Insiders GmbH, Wissensbasierte Systeme, Mainz, Germany, 1999.
Kohonen T, 1989. Self-Organisation and Associative Memory, (Third Edition), Springer Series in Information Sciences, Springer-Verlag, 1989.
Kokic P, Chambers R, Beare S, 1999. Microsimulating Farm Business Performance, Proceedings on Social Science Microsimulation, 5-9 May 1997, Dagstuhl, Germany (to appear).
Lensu A, Koikkalainen P, 1998. Analysis of Multi-Choice Questionnaires through Self-Organizing Maps, In proc. ICANN'98: 8th International Conference on Artificial Neural Networks, September 2-4, 1998, Skövde, Sweden. Springer-Verlag, London, 1998, pp. 305-310.
Lomas D, 1999. PhD (in preparation), University of York, Dept. of Computer Science, UK.
Nordbotten S, 1995. Editing statistical records by neural networks, Journal of Official Statistics, Vol 11, No. 4, 1995.
Nordbotten S, 1996. Neural network imputation applied to the Norwegian 1990 Population Census data, Journal of Official Statistics, Vol 12, No. 4, 1996.
Vapnik VN, 1996. Structure of Statistical Learning Theory, In: Computational Learning and Probabilistic Reasoning, Ed. by A.Gammerman, Wiley, 1996.