Semantic based indexing technique for optimisation and intelligent document representation: Application to structured and unstructured document clustering

Barresi, S 2010, Semantic based indexing technique for optimisation and intelligent document representation: Application to structured and unstructured document clustering, PhD thesis, Salford: University of Salford.

PDF (15MB). Restricted to Repository staff only until 03 October 2014.

    Abstract

    Advances in data collection and the growing volume of unstructured, unlabelled text documents have created a need for better disambiguation and indexing techniques that organise large document collections effectively and intelligently into a small number of significant clusters, facilitating analysis, browsing, and searching. Traditionally, document clustering systems have relied on bag-of-words and term-frequency approaches to represent and subsequently classify documents, taking into account only document syntax and giving no consideration to semantic aspects. To address this issue, more complex indexing and clustering techniques need further investigation: techniques that consider the semantic associations between the words in a document and differentiate the degree of semantic importance of terms during classification, so that text documents and information can be contextualised appropriately and automatically.

    This research proposes a new indexing technique that can be used to represent, and subsequently cluster, collections of unstructured or structured documents effectively. The technique aims to overcome some of the major problems of the bag-of-words approach, such as its lack of consideration for synonyms and its usual failure to differentiate the degree of semantic importance of terms. The main idea is to map each document into a lower-dimensional space by considering the semantic associations between the words it contains. To address the semantic problems posed by traditional indexing, the method focuses on word sense disambiguation and document concepts: it extracts concepts from documents and uses a set of these concepts as indexing units, achieving vector dimensionality reduction as well as more cohesive and better-separated clusters. Good results are also achieved in terms of purity and entropy, and in comparison with similar studies in the field of semantic-based concept indexing.
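
    The contrast between word-level and concept-level indexing described in the abstract can be illustrated with a minimal sketch. This is not the thesis's method: the abstract does not specify how concepts are extracted, so the lookup table `CONCEPT_OF`, the helper names, and the two toy documents are all hypothetical and chosen only to show why synonyms defeat a bag-of-words representation.

```python
from collections import Counter

# Hypothetical word-to-concept lexicon, assumed purely for illustration.
# A real system would derive concepts through word sense disambiguation
# rather than a fixed table like this one.
CONCEPT_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "doctor": "medical_professional",
    "physician": "medical_professional",
    "nurse": "medical_professional",
}


def bag_of_words(text):
    """Count raw lowercase tokens: one dimension per distinct word."""
    return Counter(text.lower().split())


def concept_index(text):
    """Count concepts instead of words; words absent from the lexicon are dropped."""
    tokens = text.lower().split()
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)


# Two toy documents about the same topics, written with different words.
doc_a = "car automobile doctor"
doc_b = "truck physician nurse"

words_a, words_b = bag_of_words(doc_a), bag_of_words(doc_b)
concepts_a, concepts_b = concept_index(doc_a), concept_index(doc_b)

# Dimensions the two documents have in common under each representation.
shared_words = set(words_a) & set(words_b)
shared_concepts = set(concepts_a) & set(concepts_b)

# Size of the joint index space under each representation.
vocab_size = len(set(words_a) | set(words_b))
concept_space_size = len(set(concepts_a) | set(concepts_b))

print("shared words:", shared_words)
print("shared concepts:", shared_concepts)
print("dimensions:", vocab_size, "->", concept_space_size)
```

    In this toy case the two documents share no words at all, so a bag-of-words model sees them as unrelated, while both concepts are shared once synonyms are merged. The six-word vocabulary also collapses to two concept dimensions, which is the kind of dimensionality reduction the abstract refers to.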

    Item Type: Thesis (PhD)
    Contributors: Nefti-Meziani, S (Supervisor)
    Schools: Colleges and Schools > College of Science & Technology > School of Computing, Science and Engineering
    Depositing User: Institutional Repository
    Date Deposited: 03 Oct 2012 14:34
    Last Modified: 17 Feb 2014 13:14
    URI: http://usir.salford.ac.uk/id/eprint/26567
