Optimisation techniques for finding connected components in large graphs using GraphX

Turifi, M 2018, Optimisation techniques for finding connected components in large graphs using GraphX , PhD thesis, The University of Salford.

[img]
Preview
PDF - Accepted Version
Download (3MB) | Preview

Abstract

The problem of finding connected components in undirected graphs has been well studied. It is an essential pre-processing step to many graph computations, and a fundamental task in graph analytics applications, such as social network analysis, web graph mining and image processing. Recently, it has been a major area of interest within the field of large graph processing. However, much of the research has focused on solving the problem using High Performance Computers (HPC). In large distributed systems, the MapReduce framework dominates the processing of big data, and has been used for finding connected components in big graphs although iterative processing is not directly supported in MapReduce. Current big data processing systems have developed into supporting iterative processing and providing additional features other than MapReduce. Moreover, current connected component algorithms in large distributed processing system only use the traditional approach to choosing the component identifier based on the lexical ordering of the node ID value. This study investigates how to enhance the performance of finding connected components algorithm for large graph in distributed processing system. It uses the approach to considering the graph degree property in choosing the component identifier, reviewing how this can affect the efficiency of the algorithm. In the design of our proposed algorithm features provided by current new processing systems such as moving the computation more toward the data partition in Spark are integrated. This study thus review how this has affected the performance. The degree approach to choosing the component identifier is experimentally tested using different algorithms. The study then applies the proposed approach on the fastest existing algorithm, and experimentally compare the performance of the connected component algorithm using both the original and our modified algorithm. The results show that using the degree approach has played a vital role in the evolution of the graph size during the process, leading to a faster convergence and significant performance improvement when case vertex pruning is applied in the algorithms. Furthermore, they demonstrate that in many cases optimising the design of the algorithm with local pre-processing of the data has resulted in performance enhancement.

Item Type: Thesis (PhD)
Contributors: Saraee, MH (Supervisor)
Schools: Schools > School of Computing, Science and Engineering > Salford Innovation Research Centre
Depositing User: Maher Turifi
Date Deposited: 19 Sep 2018 14:20
Last Modified: 19 Oct 2018 01:38
URI: http://usir.salford.ac.uk/id/eprint/44945

Actions (login required)

Edit record (repository staff only) Edit record (repository staff only)

Downloads

Downloads per month over past year