Date Friday 23rd January, 2004, 12:15, CS202J

Title: The many dimensions of information mining - Searching textual information sources

Jayasooriya Thimal, MSc Student, Dept of Computer Science, University of York.


Abstract

From the pioneering SMART Project of the 1960s to web search favourite Google, the core problem of sifting through a corpus of text and returning precise, relevant documents has been the focus of many research studies over the years.

This literature survey attempts to summarize the major concepts in building a search tool for textual information sources. The author further introduces /document dimensions/ as a means of adding relevance and value to existing search techniques, and outlines potential applications for this relatively recent approach.

The survey also discusses and compares some of the methods used by current search engines to index and accurately retrieve relevant content.



Date Friday 6th February, 2004, 12:15, CS202J

Student presentations.

Evolving Interesting Cellular Automata Rule-sets using Genetic Algorithms. Matthew Sweet.

Who Does She Look Like? Anthony Sayce.

Wordnet-based Optimal Lexical Semantic Tagging. Julian Sedding.

Supervisor: Dimitar Kazakov

Abstracts.
Matthew Sweet: Evolving Interesting Cellular Automata Rule-sets using Genetic Algorithms
Cellular automata are used in a number of areas: fluid dynamics, ecosystem modelling, etc. In this paper we attempt to provide a method of evolving rule sets that support "interesting" life. Rather than identifying individual creatures, the entropy of the cellular automaton is used to calculate the fitness (or interesting-ness) of a rule set. Genetic algorithms are employed to search for good solutions, including a novel technique for fitness scaling, the advantages of which are demonstrated experimentally.
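As a rough illustration of an entropy-based fitness measure, the sketch below scores a rule by the Shannon entropy of the cell patterns it produces. It rests on assumptions not stated in the abstract: a one-dimensional elementary (two-state, three-cell neighbourhood) automaton, a single live cell as the starting state, and entropy measured over blocks of three cells. The names step, block_entropy and fitness are hypothetical, and the talk's own entropy measure and fitness scaling may well differ.

```python
import math
from collections import Counter

def step(cells, rule):
    """Apply one synchronous update of an elementary CA with wrap-around edges."""
    n = len(cells)
    out = []
    for i in range(n):
        # Encode (left, centre, right) as a number 0..7 and look up that bit of the rule.
        neighbourhood = (cells[i - 1] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        out.append((rule >> neighbourhood) & 1)
    return out

def block_entropy(cells, width=3):
    """Shannon entropy (in bits) of the distribution of width-cell blocks."""
    n = len(cells)
    counts = Counter(
        tuple(cells[(i + j) % n] for j in range(width)) for i in range(n)
    )
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def fitness(rule, size=64, steps=64):
    """Assumed fitness: block entropy of the state reached after `steps` updates."""
    cells = [0] * size
    cells[size // 2] = 1  # assumed initial condition: one live cell
    for _ in range(steps):
        cells = step(cells, rule)
    return block_entropy(cells)
```

A genetic algorithm would treat the rule number (0 to 255 here) as the genome and call fitness(rule) on each candidate. Raw entropy alone also rewards purely random-looking behaviour, so a practical fitness function would probably need further shaping.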

Julian Sedding: WordNet-Based Text Clustering
Text document clustering can greatly simplify browsing large collections of documents by reorganizing them into a smaller number of manageable clusters. Algorithms to solve this task exist; however, the algorithms are only as good as the data they work on. Problems include ambiguity and synonymy, the former allowing for erroneous groupings and the latter causing similarities between documents to go unnoticed. In my research, naive disambiguation is attempted on a syntactic level by assigning POS-tags to the words, and the 'bag of words' representation is enriched with synonyms and hypernyms provided by WordNet.
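The enrichment step can be sketched as follows. The SYNONYMS and HYPERNYMS dictionaries are toy, hypothetical stand-ins for WordNet lookups, and the function names are illustrative only. POS tagging and the clustering algorithm itself are left out.

```python
from collections import Counter

# Toy stand-ins for WordNet lookups (hypothetical data, not real WordNet output).
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}
HYPERNYMS = {"car": ["vehicle"], "automobile": ["vehicle"], "bus": ["vehicle"]}

def enriched_bag(tokens):
    """Bag of words extended with each token's synonyms and hypernyms."""
    bag = Counter(tokens)
    for tok in tokens:
        for related in SYNONYMS.get(tok, []) + HYPERNYMS.get(tok, []):
            bag[related] += 1
    return bag

def overlap(a, b):
    """Number of shared term occurrences between two bags."""
    return sum((a & b).values())

doc1 = ["car", "fast"]
doc2 = ["automobile", "slow"]
plain = overlap(Counter(doc1), Counter(doc2))
enriched = overlap(enriched_bag(doc1), enriched_bag(doc2))
```

In this toy case the plain bags share no terms, while the enriched bags share "car", "automobile" and "vehicle". This is the kind of unnoticed similarity that synonymy causes.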


Date Friday 20th February, 2004, 12:15, CS202J
This is a PhD thesis seminar.

Title: Relevance for question answering systems.

Marco De Boni, PhD Student, Dept of Computer Science, University of York.


Abstract:
While there is a very large amount of written information available in electronic format, there is no easy way to automatically find a reliable answer to simple questions such as "Who is the president of the US?". Research in Question Answering (QA) systems addresses this issue by trying to find a method for answering a question by searching for a precise response in a collection of documents. Current QA systems, however, are no more than prototypes, and, while there is agreement amongst researchers on the generic aim of QA systems, little work has been done on clarifying the problem beyond the establishment of a standard evaluation framework. There is consequently a significant lack of theoretical understanding of QA systems and a considerable amount of confusion about their aims and evaluation.

I will be addressing the need for a theoretical investigation into QA systems by employing the notion of relevance in order to clarify the purpose of QA systems and elucidate their constituent structure. I will then show how the theory developed can be applied in practice to improve a "standard" QA system.


Date Friday 5th March, 2004, 12:15, CS202J

Title: Applying Inductive Logic Programming to the learning of intrusion strategies.

Steve Moyle, Computing Laboratory, Oxford University.

Abstract:
Inductive Logic Programming (ILP) is a form of machine learning that can utilize background knowledge to propose first-order rules to explain examples of phenomena (e.g. hacking attempts).

Intrusion detection is the identification of potential breaches in computer security policy. The objective of an attacker is often to gain access to a system that they are not authorised to use. The attacker achieves this by sending the system a particular input that exploits a (known) software vulnerability.

In this talk, a gentle introduction to ILP will be given, along with an overview of a common intrusion exploit -- the buffer overflow. It will be shown how ILP can learn rules to detect intrusion strategies that exploit buffer overflows.

This is joint work with John Heasman.


Date Wednesday 6th October

Learning with Alkemy.

Kee Siong Ng, The Australian National University.

Abstract:
In this talk, we consider the problem of learning comprehensible rules from structured data in the supervised learning setting and propose a tool in the form of Alkemy for tackling such tasks. Elements of Alkemy, including aspects of its knowledge representation language, which is based on higher-order logic, and some of its learning algorithms will be discussed.

Understanding the nature of learning with such an expressive language is clearly an important issue. We'll outline an approach to characterise the learning-theoretic complexity of the rich function classes we use, and from that give generalisation bounds for Alkemy. We will also try to quantify the price we pay for insisting on comprehensibility.


Date Monday 11th October, 12:15, CS202J

Inductive Programming.

Lloyd Allison,
School of Computer Science & Software Engineering, Monash University, Clayton, Victoria, Australia 3800. lloyd@bruce.cs.monash.edu.au or (Oct-Dec 2004) lloyd@cs.york.ac.uk

Abstract:
The seminar's area is inductive inference, i.e. inferring general models from given data (artificial intelligence, data mining, machine learning). The research problem is: what are statistical models? That is, what do they do? What can be done to them? How can you combine two or more of them, and what do you get? How can you program with them? What are the operations, types, classes and semantics? This talk (http://www.csse.monash.edu.au/~lloyd/Seminars/200410-II/) describes an approach, using Haskell, and gives practical examples.


Date Friday 22nd October


Knowledge Oriented Clustering.

Charlotte Bean, Neural, Emergent and Agent Technologies Group, Department of Computer Science, University of Hull.

Abstract:
This talk presents the design of a knowledge-oriented clustering algorithm that can be applied to data of both single and mixed-attribute type. The algorithm has a simple framework, based on that of hierarchical clustering, and the main clustering tool is a form of indiscernibility relation modified from the field of rough set theory. The research focuses on extracting maximal knowledge from data, both local and global, with minimal human intervention in order to obtain clusters that are meaningful and free from user-bias. This is achieved by employing well-defined numerical procedures to set key threshold parameters and by making use of a cluster accuracy measure to yield representative clusters within the boundaries of the given application. The algorithm is unified in its approach to clustering, which ensures consistency in the results when used to cluster the same data by different users. The talk will conclude with a small worked example to illustrate the use of the algorithm in practice.
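For readers unfamiliar with rough set theory, the sketch below shows the standard, unmodified indiscernibility relation: two objects are indiscernible with respect to a set of attributes if they agree on every one of them. The abstract does not describe the talk's modified relation, its threshold-setting procedures or its cluster accuracy measure, so none of these is attempted here. The function name and the example data are hypothetical.

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    """Partition objects into classes that agree on all the given attributes.

    objects: dict mapping an object name to a dict of attribute values.
    Returns a sorted list of sorted name lists (the equivalence classes).
    """
    classes = defaultdict(list)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].append(name)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical mixed-attribute data: one categorical and one numerical attribute.
data = {
    "a": {"colour": "red", "size": 1},
    "b": {"colour": "red", "size": 2},
    "c": {"colour": "blue", "size": 1},
}
by_colour = indiscernibility_classes(data, ["colour"])
by_both = indiscernibility_classes(data, ["colour", "size"])
```

Using more attributes refines the partition: "a" and "b" fall in the same class under colour alone, but are separated once size is taken into account.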


Date Thursday, 16 December 2004

Title: Allocating a Divisible Resource: An Application to Grid Computing

Alan M. Frisch, Reader, Dept of Computer Science, University of York.


Abstract:

Effective use of the grid will require efficient methods to solve a variety of combinatorial problems such as resource allocation, configuration and scheduling. This talk considers the allocation of a divisible resource such as bandwidth. Given requests for quantities of the resource, each with a price, the task is to accept a subset of the requests that can be fulfilled and that maximises revenue. This talk presents methods for solving realistic instances of this problem.
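In its simplest form the problem as stated resembles a 0/1 knapsack problem. The sketch below makes assumptions the abstract does not: a single resource with one fixed capacity, positive integer request quantities, and no time dimension. It uses textbook dynamic programming, which is not necessarily the method presented in the talk, and the name best_allocation is hypothetical.

```python
def best_allocation(capacity, requests):
    """Choose requests that fit within capacity and maximise total price.

    requests: list of (quantity, price) pairs with positive integer quantities.
    Returns (revenue, list of accepted request indices).
    """
    # best[c] = (revenue, accepted indices) using at most c units of the resource
    best = [(0, [])] * (capacity + 1)
    for idx, (qty, price) in enumerate(requests):
        # Iterate capacities downwards so each request is accepted at most once.
        for c in range(capacity, qty - 1, -1):
            candidate = best[c - qty][0] + price
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - qty][1] + [idx])
    return best[capacity]

# Hypothetical example: 10 units of bandwidth and four priced requests.
revenue, accepted = best_allocation(10, [(6, 30), (5, 20), (5, 25), (4, 10)])
```

Here accepting the two 5-unit requests earns 45, beating the 6-unit plus 4-unit combination, which earns 40. The table has one entry per unit of capacity, so this direct approach only suits small, coarse-grained instances.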

Last updated on 10 March 2011