Text Matching Technologies

MinerTaur

The proliferation of information in databases and on the Internet leaves users overwhelmed by information overload. Humans cannot index and search such a vast amount of information by hand, so automated indexing and searching techniques are required. A method is needed to: process documents unsupervised and generate a compact, multi-level index; overcome spelling mistakes in the user's query by suggesting alternative spellings for the query terms; and calculate query-to-document similarities from statistics available in the text corpus.

Specifically, we are incorporating the Information Retrieval process into a modular Neural Network architecture [Hodge_THESIS, Hodge_ESANN01, Weeks+Hodge_PDP02, Weeks+Hodge_HPDC02]. Integration aims to exploit the benefits of the incorporated techniques whilst overcoming their respective limitations. Our system comprises three modules: a spell-checking pre-processor, a thesaurus, and a word-document index.

The system autonomously generates the modules from unstructured textual information. We use the AURA modular neural system for the spell checker and index, and we employ our hierarchical, neural clustering algorithm to autonomously induce a hierarchical thesaurus of synonym clusters from corpus statistics. Each query word input by the user passes through each module in turn.
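To make the spell-checking step concrete, the following is a minimal sketch of suggesting alternative spellings by character n-gram overlap. It is a plain-Python stand-in and not the AURA-based spell checker described in the publications; the names `ngrams`, `suggest` and `LEXICON`, the trigram size, and the Jaccard scoring are all assumptions made for illustration.

```python
def ngrams(word, n=3):
    """Return the set of character n-grams of a word, padded with '#' at both ends."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def suggest(query, lexicon, top_k=3):
    """Rank lexicon words by Jaccard overlap between their n-gram set and the query's."""
    q = ngrams(query)
    scored = []
    for word in lexicon:
        w = ngrams(word)
        score = len(q & w) / len(q | w)
        scored.append((score, word))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:top_k] if score > 0]


# Hypothetical lexicon; in the system described above it would come from the corpus.
LEXICON = ["retrieval", "neural", "network", "thesaurus", "index", "cluster"]

if __name__ == "__main__":
    print(suggest("retreival", LEXICON))
    print(suggest("indx", LEXICON))
```

Set overlap of n-grams tolerates transposed or missing letters because most n-grams of the misspelled word still occur in the intended word.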

Publications

Thesis

  • Victoria J. Hodge [Hodge_THESIS]. Integrating Information Retrieval & Neural Networks, PhD Thesis, Department of Computer Science, The University of York, Heslington, YORK, YO10 5DD United Kingdom, 2001. download pdf (.pdf)

Journals, Proceedings, Reports

Unfortunately, copyright restrictions prevent some of my publications from being made available online. However, reprints are available on request.

  • Victoria J. Hodge & Jim Austin [Hodge_TKDE03]. An Evaluation of Standard Spell Checking Algorithms and a Binary Neural Approach. IEEE Transactions on Knowledge and Data Engineering 15(5): pp. 1073–1081, IEEE Computer Society, Sept/Oct 2003. Full Text Article from White Rose Research Online
  • Victoria J. Hodge & Jim Austin [Hodge_PR02]. A Comparison of a Novel Spell Checker and Standard Spell Checking Algorithms. Pattern Recognition 35(11): pp. 2571–2580, Elsevier Science, 2002. Full Text Article from Elsevier Science Journals - Pattern Recognition (pdf)
  • Victoria J. Hodge & Jim Austin [Hodge_NC02]. Hierarchical Word Clustering – automatic thesaurus generation, NeuroComputing 48(1–4): pp. 819–846, Elsevier Science, 2002. Full Text Article from Elsevier Science Journals - NeuroComputing (pdf)
  • M. Weeks, Victoria J. Hodge & Jim Austin [Weeks+Hodge_HPDC02]. Scalability of a Distributed Neural Information Retrieval System, Accepted for presentation at HPDC–2002, 11th IEEE International Symposium on High Performance Distributed Computing. Edinburgh, Scotland, July 24–26, 2002 download zipped abstract (.zip)
  • M. Weeks, Victoria J. Hodge & Jim Austin [Weeks+Hodge_PDP02]. A Hardware Accelerated Novel IR System. In, Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP-2002), Las Palmas de Gran Canaria, Canary Islands, January 9th–11th, 2002. IEEE Computer Society, Los Alamitos, CA. download zipped postscript (.zip)
  • Victoria J. Hodge & Jim Austin [Hodge_TR01]. An Evaluation of Phonetic Spell Checkers. Technical Report YCS 338 (2001), Department of Computer Science, University of York. download postscript (.ps)
  • Victoria J. Hodge & Jim Austin [Hodge_ICANN01]. A Novel Binary Spell Checker. In, Proceedings of the International Conference on Artificial Neural Networks (ICANN'2001), Vienna, Austria, 25–29 August, 2001. Dorffner, Bischof & Hornik (Eds), Lecture Notes in Computer Science (LNCS) 2130, Springer Verlag, Berlin.  download zipped postscript (.zip)
  • Victoria J. Hodge & Jim Austin [Hodge_ESANN01]. An Integrated Neural IR System. In, M.Verleysen (ed.) Proceedings of the 9th European Symposium on Artificial Neural Networks (ESANN'2001), Bruges (Belgium), 25–27 April 2001, D-Facto public., ISBN 2–930307–01–3, pp. 265–270. download zipped pdf (.zip)
  • Victoria J. Hodge & Jim Austin [Hodge_NN01]. An Evaluation of Standard Retrieval Algorithms and a Binary Neural Approach. Neural Networks, 14(3): pp. 287–303, Elsevier Science, 2001. Full Text Article from Elsevier Science – Neural Networks (pdf)
  • Victoria J. Hodge & Jim Austin [Hodge_TKDE01]. Hierarchical Growing Cell Structures: TreeGCS. IEEE Transactions on Knowledge and Data Engineering, Special Issue on Connectionist Models for Learning in Structured Domains, 13(2): pp. 207–218, 2001. Full Text Article from White Rose Research Online
  • Victoria J. Hodge & Jim Austin [Hodge_KES00]. Hierarchical Growing Cell Structures: TreeGCS. In, Proceedings of the Fourth International Conference on Knowledge–Based Intelligent Engineering Systems (KES'2000), Brighton, UK, August 30th to September 1st, 2000. download zipped PDF (.zip)
  • Victoria J. Hodge & Jim Austin [Hodge_IJCNN00]. An Evaluation of Standard Retrieval Algorithms and a Weightless Neural Approach. In, Proceedings of the IEEE–INNS–ENNS International Joint Conference on Neural Networks (IJCNN'2000), Italy, 24–27 July, 2000. download zipped postscript (.zip)

 

MinerTaur: AURA Implementation

The system comprises three modules: a spell-checking pre-processor, a thesaurus, and a word-document index, the indexing data structure that provides the key for the Information Retrieval system. The system autonomously generates the modules from unstructured textual information. We use the AURA modular neural system for the spell checker and index, and we autonomously induce a hierarchical thesaurus of synonym clusters from corpus statistics. Each query word input by the user passes through each module in turn. The word-document index then identifies and ranks the matching documents.
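The flow of a query word through the three modules can be sketched as follows. This is an illustrative plain-Python stand-in, not the AURA implementation: the spelling step uses the standard library's `difflib.get_close_matches`, the synonym clusters in `CLUSTERS` are hand-written rather than induced from corpus statistics, and the TF-IDF-style scoring, the `synonym_weight` parameter and the toy `DOCS` corpus are all assumptions.

```python
import math
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical toy corpus.
DOCS = {
    "d1": "neural retrieval of text documents",
    "d2": "searching large text collections",
    "d3": "spelling correction for search queries",
}

# Hypothetical hand-written synonym clusters (the real thesaurus is induced automatically).
CLUSTERS = [{"retrieval", "search", "searching"}]


def build_index(docs):
    """Build a word-document index: word -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def correct(word, vocabulary):
    """Module 1: map a possibly misspelled word to the closest vocabulary word."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word


def expand(word, clusters):
    """Module 2: return the synonym cluster containing the word, or the word alone."""
    for cluster in clusters:
        if word in cluster:
            return cluster
    return {word}


def search(query, docs, index, clusters, synonym_weight=0.5):
    """Module 3: score documents via the index; synonyms count less than exact terms."""
    scores = defaultdict(float)
    vocabulary = list(index)
    for raw in query.lower().split():
        word = correct(raw, vocabulary)
        for term in expand(word, clusters):
            postings = index.get(term, {})
            if not postings:
                continue
            idf = math.log(1 + len(docs) / len(postings))
            weight = 1.0 if term == word else synonym_weight
            for doc_id, tf in postings.items():
                scores[doc_id] += weight * tf * idf
    # Best score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(search("retreival", DOCS, index, CLUSTERS))
```

In this sketch a misspelled query word is corrected first, then widened to its synonym cluster, and only then looked up, so documents that contain the exact term rank above those that contain only a synonym.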

