Overlapped speech and music segmentation using singular spectrum analysis and random forests

Mohammed, DY 2017, Overlapped speech and music segmentation using singular spectrum analysis and random forests, PhD thesis, Salford University.

PDF - Submitted Version
Restricted to Repository staff only until March 2018.


Recent years have seen ever-increasing volumes of digital media archives and an enormous amount of user-contributed content. As demand for indexing and searching these resources has increased, and new technologies such as multimedia content management systems, enhanced digital broadcasting, and the semantic web have emerged, audio information mining and automated metadata generation have received much attention. Manual indexing and metadata tagging are time-consuming and subject to the biases of individual workers. An automated architecture able to extract information from audio signals, generate content-related text descriptors or metadata, and enable further information mining and searching would be a tangible and valuable solution. In the field of audio classification, audio signals are often broadly divided into speech and music. Most studies, however, neglect the fact that real audio soundtracks may contain speech, music, or a combination of the two. This is considered the major hurdle to achieving high performance in automatic audio classification, since overlapping can contaminate relevant characteristics and features, causing incorrect classification or information loss.

This research undertakes an extensive review of the state of the art by outlining the well-established audio features and machine learning techniques that have been applied in a broad range of audio segmentation and recognition areas. Audio classification systems and the suggested solutions for the mixed soundtracks problem are presented. The suggested solutions are as follows: developing augmented and modified features for recognising audio classes even in the presence of overlaps between them; robust segmentation of a given overlapped soundtrack stream based on an innovative method of audio decomposition using Singular Spectrum Analysis (SSA), a time series decomposition method with many applications that has been studied extensively and has received increasing attention in the past two decades; adoption and development of driven classification methods; and finally a technique for continuous time series tasks.

In this study, SSA has been investigated and found to be an efficient way to discriminate speech from music in mixed soundtracks by two different methods, each of which has been developed and validated in this research. The first method reduces the overlapping ratio between speech and music in a mixed soundtrack by generating two new soundtracks with a lower level of overlapping. Next, features are computed for the output audio streams, and these are classified using random forests (RF) as either speech or music. One of the distinct characteristics of this method is the separation of the key speech and music features, which leads to improved classification performance.
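The feature-then-classify step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two frame features (zero-crossing rate and log energy), the frame and hop sizes, the synthetic signals standing in for real audio, and the helper name `frame_features`. Only `RandomForestClassifier` from scikit-learn is a real library class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def frame_features(signal, frame_len=400, hop=200):
    """Return per-frame [zero-crossing rate, log energy] (illustrative features only)."""
    signal = np.asarray(signal, dtype=float)
    feats = []
    for start in range(0, signal.size - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Fraction of adjacent sample pairs whose sign differs.
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # Log of mean squared amplitude; small constant avoids log(0).
        log_energy = np.log(np.mean(frame ** 2) + 1e-12)
        feats.append([zcr, log_energy])
    return np.array(feats)


# Synthetic stand-ins, NOT real audio: amplitude-modulated noise vs. a tonal signal.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech_like = rng.normal(size=t.size) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
music_like = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)

speech_feats = frame_features(speech_like)
music_feats = frame_features(music_like)
X = np.vstack([speech_feats, music_feats])
y = np.array([0] * len(speech_feats) + [1] * len(music_feats))  # 0 = speech-like, 1 = music-like

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)  # one label per frame
```

In a real system the features would be computed on each of the two lower-overlap output streams, and the classifier would be evaluated on held-out recordings instead of its own training frames.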

Nevertheless, the first method encountered several obstacles, including excessively long processing time and increased storage requirements (each frame is represented by two outputs), which together lead to a greater computational load than before. The second method, by contrast, employs the SSA technique to decompose a given audio signal into a series of Principal Components (PCs), where each PC corresponds to a particular pattern of oscillation. Then, the transformed well-established feature is measured for each PC in order to classify it as either speech or music, based on the baseline classification system using a random forest (RF) machine learning technique. The classification performance on real-world soundtracks is effectively improved, as demonstrated by comparing speech/music recognition using conventional classification methods and the proposed SSA method. The second method can detect pure speech, pure music, and mixed speech and music with a much lower level of complexity.
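For readers unfamiliar with SSA, the decomposition step can be sketched as follows. This is the generic textbook form of basic SSA (embedding, singular value decomposition, diagonal averaging), not necessarily the variant used in the thesis. The function name `ssa_decompose`, the window length, and the test signal are assumptions chosen for illustration.

```python
import numpy as np


def ssa_decompose(x, window):
    """Basic SSA: return an array of shape (window, len(x)) whose rows are
    elementary reconstructed components that sum back to x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 2 <= window <= n // 2:
        raise ValueError("window must satisfy 2 <= window <= len(x) // 2")
    k = n - window + 1

    # Embedding: column j of the trajectory (Hankel) matrix is x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])

    # Decomposition: SVD of the trajectory matrix.
    u, s, vt = np.linalg.svd(traj, full_matrices=False)

    components = np.empty((s.size, n))
    for i in range(s.size):
        # Rank-one elementary matrix for the i-th singular triple.
        elem = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: entries with row + col == m all estimate sample m,
        # so average each anti-diagonal (a diagonal of the row-flipped matrix).
        flipped = elem[::-1]
        components[i] = [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
    return components


# Hypothetical test signal: two sinusoids with different periods.
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 20) + 0.5 * np.sin(2 * np.pi * t / 7)
parts = ssa_decompose(signal, window=40)
reconstruction = parts.sum(axis=0)  # summing all components recovers the input
```

Each row of `parts` plays the role of one oscillatory component. A classifier such as the one described above would then be applied to features computed per component.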

Item Type: Thesis (PhD)
Schools: Schools > School of Computing, Science and Engineering
Depositing User: Duraid Yehya Mohammed
Date Deposited: 16 Feb 2018 15:52
Last Modified: 16 Feb 2018 15:57
URI: http://usir.salford.ac.uk/id/eprint/43773
