Semantic based indexing technique for optimisation and intelligent document representation: Application to structured and unstructured document clustering
Barresi, S 2010, Semantic based indexing technique for optimisation and intelligent document representation: Application to structured and unstructured document clustering, PhD thesis, Salford: University of Salford.
Restricted to Repository staff only until 03 October 2014.
Advances in data collection and the growing volume of unstructured and unlabelled text documents have created a need for better disambiguation and indexing techniques. Such techniques allow large document collections to be organised effectively and intelligently into a small number of significant clusters, facilitating analysis, browsing, and searching. Traditionally, document clustering systems have relied on bag-of-words and term frequency approaches to represent and subsequently classify documents, taking into account only document syntax and ignoring semantic aspects. To address this issue, more complex indexing and clustering techniques need further investigation: techniques that consider the semantic associations between the words in a document and differentiate the degree of semantic importance of terms during classification, so as to enable appropriate and automatic contextualisation of text documents and information.
This research proposes a new indexing technique that can be used to represent, and subsequently cluster, collections of structured or unstructured documents effectively. The technique aims to overcome some of the major problems of the bag-of-words approach, such as its lack of consideration for synonyms and its usual failure to differentiate the degree of semantic importance of terms. The main idea is to map each document into a lower-dimensional space by considering the semantic associations between the words it contains. To address the semantic problems posed by traditional indexing, the method focuses on word sense disambiguation and document concepts. It extracts concepts from documents and uses a set of these concepts as indexing units, achieving vector dimensionality reduction as well as more cohesive and better-separated clusters.
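The contrast between bag-of-words indexing and concept-based indexing can be illustrated with a minimal Python sketch. This is not the method developed in the thesis: the `CONCEPTS` lexicon, the function names, and the whitespace tokenisation are all hypothetical stand-ins, and a hand-written lookup table replaces real word sense disambiguation. It only shows how mapping synonyms onto shared concepts shrinks the indexing vocabulary.

```python
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only).
# A real system would obtain concepts through word sense disambiguation.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "doctor": "medicine", "physician": "medicine", "hospital": "medicine",
}


def bag_of_words(doc):
    """Term-frequency vector keyed by raw lowercased tokens."""
    return Counter(doc.lower().split())


def concept_index(doc):
    """Frequency vector keyed by concept; tokens without a concept are dropped."""
    tokens = doc.lower().split()
    return Counter(CONCEPTS[t] for t in tokens if t in CONCEPTS)


docs = [
    "the car and the automobile",
    "a physician at the hospital",
]

# Vocabulary size is the dimensionality of the document vectors.
bow_vocab = set().union(*(bag_of_words(d) for d in docs))
concept_vocab = set().union(*(concept_index(d) for d in docs))

print(len(bow_vocab), "bag-of-words dimensions")
print(len(concept_vocab), "concept dimensions")
```

In this toy example the synonyms "car" and "automobile" fall on the same dimension, which a plain term-frequency vector would keep apart.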
Good results are also achieved in terms of purity and entropy, and in comparison with similar studies in the field of semantic-based concept indexing.
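For readers unfamiliar with the two evaluation measures, the following Python sketch computes purity and entropy under their common textbook definitions. The abstract does not state which variant the thesis uses, so the base-2 logarithm, the absence of normalisation, and the function names are assumptions. Higher purity and lower entropy indicate clusters that agree better with the reference labels.

```python
import math
from collections import Counter, defaultdict


def _group(cluster_ids, true_labels):
    """Collect the reference labels of the documents in each cluster."""
    clusters = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        clusters[cluster].append(label)
    return clusters


def purity(cluster_ids, true_labels):
    """Fraction of documents that carry the majority label of their cluster."""
    clusters = _group(cluster_ids, true_labels)
    majority = sum(max(Counter(m).values()) for m in clusters.values())
    return majority / len(true_labels)


def entropy(cluster_ids, true_labels):
    """Size-weighted average of the label entropy (in bits) of each cluster."""
    clusters = _group(cluster_ids, true_labels)
    n = len(true_labels)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        h = -sum((k / size) * math.log2(k / size)
                 for k in Counter(members).values())
        total += (size / n) * h
    return total


# Toy data (made up for illustration): six documents in two clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b"]
print(purity(cluster_ids, true_labels))
print(entropy(cluster_ids, true_labels))
```

Here one document in cluster 0 carries the minority label, so purity falls below 1 and entropy rises above 0; a clustering that matches the labels exactly gives purity 1 and entropy 0.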
Item Type: Thesis (PhD)
Schools: Colleges and Schools > College of Science & Technology > School of Computing, Science and Engineering
Depositing User: Institutional Repository
Date Deposited: 03 Oct 2012 14:34
Last Modified: 17 Feb 2014 13:14