The major objectives of the FEDAURA Project are:
- Establish a baseline by evaluating currently used detection methods.
- Identify standards and produce an evaluation framework for fraud detection systems.
- Develop and evaluate a selected range of new techniques using AURA technology.
- Show that the AURA methods can scale to process large-scale data sets within the
demanding processing time limits exemplified by the DWP fraud application.
- Evaluate other Neural Network, Statistical and Data Mining methods for fraud
detection and compare these with AURA-based methods for similar tasks.
- Develop a framework for using the DWP data based on Case-Based Reasoning.
- Assess the accuracy of the AURA technology in identifying anomalies (e.g. fraud).
- Show how the technology can be incorporated into existing DWP systems operated
by EDS and others.
The primary benefit of the project will be the reduction of Benefit fraud.
This will be achieved through
a follow-on implementation of the technology in an operational environment.
The project will allow a new technology to be transferred to leading software
consultancy companies and developed ready for market. The technology is applicable
in many similar fraud environments including insurance, banking and e-business.
The AURA techniques developed at York have demonstrated powerful pattern-matching
abilities in a range of domains. Two major features of the technology are:
- its ability to perform an initial fast (but approximate) search followed by a
more detailed analysis; and
- rapid, one-pass training of the system.
The initial fast search removes
all data clearly inapplicable to the problem, leaving a small residual data
set that can be analysed by very powerful but possibly slower methods.
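The two-stage idea can be illustrated with a minimal sketch. It is not the AURA implementation, whose internals are not described here; it assumes that each case is a tuple of numeric attributes scaled to [0, 1], that the coarse stage compares bin indices of attributes, and that the detailed stage is a Euclidean distance. The names `signature`, `two_stage_search` and `min_matches` are hypothetical.

```python
import math

def signature(case, n_bins=4, lo=0.0, hi=1.0):
    """Coarse code for a case: the bin index of each attribute.

    Assumption: attributes are numeric and lie in [lo, hi].
    """
    return tuple(
        min(n_bins - 1, max(0, int((v - lo) / (hi - lo) * n_bins)))
        for v in case
    )

def two_stage_search(query, cases, k, min_matches, n_bins=4):
    """Return indices of up to k cases closest to the query.

    Stage 1 (fast, approximate): keep only cases whose coarse signature
    agrees with the query's signature on at least min_matches attributes.
    Stage 2 (slower, exact): rank the survivors by Euclidean distance.
    """
    q_sig = signature(query, n_bins)
    # In a real system the signatures would be computed once, when cases are stored.
    survivors = [
        i for i, case in enumerate(cases)
        if sum(a == b for a, b in zip(signature(case, n_bins), q_sig)) >= min_matches
    ]
    ranked = sorted(survivors, key=lambda i: math.dist(query, cases[i]))
    return ranked[:k]

# Illustrative data only: four cases with two attributes each.
example_cases = [(0.10, 0.10), (0.15, 0.20), (0.90, 0.90), (0.12, 0.80)]
print(two_stage_search((0.12, 0.15), example_cases, k=2, min_matches=2))
```

Only the survivors of the first stage reach the distance computation, so the cost of the detailed analysis depends on the size of the residual set rather than on the size of the whole case-base.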
A framework is needed to apply AURA within the problem domain of fraud
identification. The research intends to explore the use of a Case-Based
Reasoning (CBR) framework, one of the major Machine Learning frameworks
for pattern matching. CBR provides a methodology for identification
and updating of cases. The main advantages of CBR are:
- it makes explicit the information within each case, facilitating explanation;
- it provides for on-line updating of information; and
- it allows case updating.
As a basis for applying the AURA
technology to benefit claim fraud identification, CBR appears to offer a
suitable and very promising framework within which to structure the problem.
Essentially, data concerning each individual submitting a claim is considered
to form a case, with accumulated, historical claims data being matched against
any new benefit claim. However, with very large numbers of cases, run-times
can become a serious problem.
"Case-based reasoning will be ready for large-scale problems only
when retrieval algorithms are efficient at handling thousands of cases."
I. Watson and F. Marir (1994), Case-Based Reasoning: A Review,
The Knowledge Engineering Review, Vol. 9, No. 4.
The DWP application is at least an order of magnitude larger
than this in terms of case numbers. The difficulty arises because case retrieval
time grows linearly with the number of cases (as opposed to neural network approaches,
which achieve fast run times at the expense of lengthy training times). With
over 5 million new cases per year, the pattern-matching problem can be
immense.
The major issues to be investigated involve how AURA can be used for
case matching, how cases can be updated and how explanations can be produced
from such a system. These issues are discussed in more detail below:
(1) Case matching
The project will build on previous research in methods of using AURA
for k-NN classification. In the DWP application AURA will
be used to identify a suitably small set of the k most similar benefit
claims from a database of known fraudulent and non-fraudulent claims. AURA
will rapidly identify these, permitting a statistical analysis to assess similarity
and ultimately the risk that a claim is fraudulent. The key feature within AURA
is that the class-density modelling is done on-line using the subset of data
returned. This allows the case-base to be continually updated while in use.
Other methods typically perform the probabilistic modelling off-line, so
updates become a time-consuming process. Some inaccuracies in
the modelling may occur, but we expect the impact of these to be negligible
when compared to the gains in speed in training and improved usability.
This approach is only possible because AURA allows very fast access to
the large number of cases. The approach is highly novel and is clearly widely
applicable beyond this project.
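As a rough illustration of on-line class-density modelling over the returned subset, the sketch below estimates fraud risk as a kernel-weighted proportion of fraudulent cases among the retrieved neighbours. The Gaussian kernel, the `bandwidth` parameter and the function name `fraud_risk` are assumptions made for the example; the statistical analysis to be used in the project is not specified here.

```python
import math

def fraud_risk(query, neighbours, bandwidth=1.0):
    """Kernel-weighted share of fraudulent cases among the retrieved neighbours.

    neighbours is a list of (feature_vector, is_fraud) pairs, i.e. the small
    subset returned by the fast search. Closer cases receive larger weights.
    The estimate is computed at query time, so newly added cases are used
    immediately without any off-line re-modelling.
    """
    weight_fraud = 0.0
    weight_total = 0.0
    for features, is_fraud in neighbours:
        squared_distance = sum((a - b) ** 2 for a, b in zip(query, features))
        weight = math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        weight_total += weight
        if is_fraud:
            weight_fraud += weight
    # With no usable neighbours there is no evidence either way; return 0.0.
    return weight_fraud / weight_total if weight_total > 0 else 0.0

# Illustrative data only: one close fraudulent case, one distant genuine case.
retrieved = [((0.1, 0.0), True), ((2.0, 0.0), False)]
print(fraud_risk((0.0, 0.0), retrieved))
```

Because the estimate uses only the k retrieved cases, its cost is independent of the total number of stored cases once the search itself has completed.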
In addition to the simple one-stage process described above, more complex relationships
may be identified through repeated searching of the case-base, thus allowing
a form of on-line reasoning to be attempted. For example, relatives of
a client (claimant) may also be claiming benefits, and these would need to be
identified by a further search.
To be successful, the approach taken must exploit existing knowledge of fraud
methods identified by DWP staff. The project will identify ways to incorporate
this information within the matching framework in conjunction with the similarity-based
search provided by the main AURA matching engine. The research will assess
whether this knowledge can be coded into the system as rules. Alternatively,
conventional expert systems may be considered.
To be effective, the approach must scale with the large number of cases.
Implementing such a large application requires very large resources. Sun Microsystems
have joined the project to provide the computing infrastructure needed in the
work. They will be donating a 16-processor "Throughput Engine"
and the necessary storage for the project, to be used in addition to the extensive
facilities at the University of York. Assessment of the system on a commercial platform
will be achieved through the support provided by Sun Microsystems.
(2) Adding new cases
Case update must allow new claims to be added to the system. Cases are encoded
and stored as individual items in the AURA system. The simplicity of
this operation is one of the main points of the proposed approach. Rule data
extracted from experts will supplement this within the inference engine. It
is not intended to look at methods for extracting rules automatically from the
data as this would exceed the time available for the project, but such methods
may be considered if time becomes available. All new knowledge will be based
on actual claims and expert domain knowledge.
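The simplicity of adding a case can be sketched as follows. This is a stand-in for the AURA storage, not a description of it: it assumes that a case is a dictionary of symbolic attributes and uses an ordinary inverted index so that each new case is stored in a single pass with no retraining. The class name `CaseBase` and the attribute names are hypothetical.

```python
from collections import defaultdict

class CaseBase:
    """Toy case store in which adding a case is a single, one-pass operation."""

    def __init__(self):
        self.cases = []                 # list of (attributes, is_fraud)
        self.index = defaultdict(set)   # (attribute, value) -> ids of matching cases

    def add(self, attributes, is_fraud):
        """Store one case and index each of its attribute values; return its id."""
        case_id = len(self.cases)
        self.cases.append((dict(attributes), is_fraud))
        for item in attributes.items():
            self.index[item].add(case_id)
        return case_id

    def candidates(self, attributes, min_matches=1):
        """Return ids of cases sharing at least min_matches attribute values."""
        counts = defaultdict(int)
        for item in attributes.items():
            for case_id in self.index.get(item, ()):
                counts[case_id] += 1
        return sorted(cid for cid, n in counts.items() if n >= min_matches)

# Illustrative data only; the attribute names are invented for the example.
case_base = CaseBase()
case_base.add({"postcode": "YO10", "benefit": "housing"}, True)
case_base.add({"postcode": "YO10", "benefit": "income"}, False)
print(case_base.candidates({"postcode": "YO10", "benefit": "housing"}, min_matches=2))
```

A case added with `add` is available to the very next `candidates` call, which mirrors the requirement that the case-base can be updated while in use.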
(3) Explanation
As identified in the problem requirements, the system must have an explanation
capability. The use of cases provides a possible means to achieve this. In the
case-matching system the AURA methods will provide a subset of cases
that are "close" to the new claim. These cases will be analysed to
identify whether there is a high likelihood of the claim being fraudulent. The
set of cases used for this analysis can also be used to present an explanation. In its
simplest form, this could be presented just as a set of cases, to be used to
determine the claim status. However, some summarising of the data can be undertaken
to make this more reportable. In addition, information from the rule-based system
will be incorporated (for example, those rules active in identifying possible
fraud).
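One possible shape for such an explanation is sketched below, under the assumption that the matching stage returns (case id, fraud label) pairs and that expert rules are available as named predicates over a claim. The function name `explain`, the claim field and the example rule are all hypothetical.

```python
def explain(claim, neighbours, rules):
    """Build a short textual explanation for a claim.

    neighbours: list of (case_id, is_fraud) pairs returned by case matching.
    rules: list of (rule_name, predicate) pairs; a rule is "active" for the
    claim when its predicate returns True.
    """
    fraudulent = sum(1 for _, is_fraud in neighbours if is_fraud)
    lines = [
        f"{fraudulent} of {len(neighbours)} most similar past claims were fraudulent.",
        "Similar cases: " + ", ".join(str(case_id) for case_id, _ in neighbours),
    ]
    active = [name for name, predicate in rules if predicate(claim)]
    if active:
        lines.append("Rules triggered: " + ", ".join(active))
    return "\n".join(lines)

# Illustrative data only; the rule and the claim field are invented.
example_rules = [
    ("multiple claims at one address", lambda c: c["claims_at_address"] > 3),
]
print(explain({"claims_at_address": 5},
              [(17, True), (42, True), (8, False)],
              example_rules))
```

The summary line corresponds to the "summarising of the data" mentioned above, while the list of case ids lets an investigator inspect the underlying cases directly.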
Evaluation of neural network and other methods
To complement the work based on AURA, the project will build on the work
undertaken at Sema into the use of MLP neural networks for the
identification of fraud. Although AURA-based methods have the advantage
of speed, on-line updating and an ability to explain how the classification
was achieved, they can suffer from reduced accuracy compared to other neural,
machine learning and statistical methods that may iteratively search for the
best model of the data. The work will allow us to answer the question:
"how much accuracy is lost by using AURA-based methods?",
i.e. what is the cost of the speed and flexibility offered by the AURA methods?
The combination of CBR and AURA methods is an exciting possibility and
may allow CBR systems to be built that scale to very large problems, such as
the one addressed in the current work. The work will also show that certain
classes of neural-network-based systems (such as AURA) can be used on very large
symbolic problems. A number of issues need to be addressed to show the benefits
of this coupling, and these form the basis of the research to be undertaken.