Master's thesis, 2014
104 pages, grade: 82.00
Chapter 1 – INTRODUCTION
Chapter 2 – LITERATURE REVIEW
2.1 Advantages of using Hadoop
2.2 Big Data
2.3 Project Architecture
2.4 Goal: HDFS-HMFGR
2.5 INTERCONNECT TECHNOLOGIES (Hardware Solution)
2.5.1 Traditional Interconnect 10GigE Network
2.5.2 Infiniband Technology
2.5.2.1 IPoIB Interconnect Technology
2.5.2.2 RDMA-IB Interconnect Technology
2.6 Memory Allocation with MemCached (Software Solution)
Chapter 3 – Experimental Testbed System
3.1 An Insight into SSD and HDD
Chapter 4 – Installation, Designing and Implementation of System
4.1 Set a LAN and System service of Network
4.2 Hadoop Installation Guide
4.2.1 Steps On each Machine
4.2.1.1 Install prerequisites
4.2.1.2 Adding a dedicated Hadoop system user
4.2.1.3 Setup hostname
4.2.2 Steps On Master
4.2.2.1 Install Hadoop
4.2.2.2 Configure SSH
4.2.2.3 Install NFS
4.2.2.4 Configuration
4.2.3. Run Hadoop
4.2.3.1 Run HDFS
4.2.3.2 Run MapReduce Job
4.2.3.3 Run All Daemon
4.2.3.4 Hadoop Web Interfaces
4.3 Enable Ethernet and InfiniBand (SR-IOV)
4.3.1 Enable Ethernet SR-IOV
4.3.2 Enable InfiniBand SR-IOV
4.4 Steps to setup IPoIB on Quanta machines
4.5 Install MemCached on CentOS
Chapter 5 – Benchmarking
5.1 ATTO Disk Benchmarking
5.2 HD Tune Pro
5.3 Linux Disk Utilities
5.4 HiBench (Hadoop Benchmarking Suite)
5.4.1 Micro-Benchmarks
5.4.1.1 Sort
5.4.1.2 Word Count
5.4.1.3 TeraSort
5.4.1.4 Enhanced DFSIO
5.4.2 Web Search
5.4.2.1 Nutch Indexing
5.4.2.2 Page Ranking
5.4.3 Machine Learning
5.4.3.1 Bayesian Classification
5.4.3.2 K-means Clustering
5.4.4 Analytical Query
5.4.4.1 Hive Join
5.4.4.2 Hive Aggregation
Chapter 6 – Performance Evaluation of SSD and HDD on Hadoop using 10GigE
6.1 Sort Work Load
6.2 Word Count Work Load
6.3 Tera Sort Work Load
Chapter 7 – Performance Evaluation of SSD and HDD on Hadoop using IPoIB
7.1 Sort Work Load
7.2 Word Count Work Load
7.3 Tera Sort Work Load
Chapter 8 – Performance Evaluation of SSD and HDD on Hadoop by RDMA-IB
8.1 Sort Work Load
8.2 Word Count Work Load
8.3 Tera Sort Work Load
Chapter 9 – Performance Comparison between 10GigE and IPoIB
9.1 Performance Comparison of Sort Workload
9.1.1 Performance Comparison of SSD
9.1.2 Performance Comparison of HDD
9.2 Performance Comparison of WordCount Workload
9.2.1 Performance Comparison of SSD
9.2.2 Performance Comparison of HDD
9.3 Performance Comparison of TeraSort Workload
9.3.1 Performance Comparison of SSD
9.3.2 Performance Comparison of HDD
Chapter 10 – Performance Comparison between IPoIB and RDMA-IB
10.1 Performance Comparison of Sort Workload
10.1.1 Performance Comparison of SSD
10.1.2 Performance Comparison of HDD
10.2 Performance Comparison of WordCount Workload
10.2.1 Performance Comparison of SSD
10.2.2 Performance Comparison of HDD
10.3 Performance Comparison of TeraSort Workload
10.3.1 Performance Comparison of SSD
10.3.2 Performance Comparison of HDD
Chapter 11 – Performance Comparison between 10GigE and RDMA-IB
11.1 Performance Comparison of Sort Workload
11.1.1 Performance Comparison of SSD
11.1.2 Performance Comparison of HDD
11.2 Performance Comparison of WordCount Workload
11.2.1 Performance Comparison of SSD
11.2.2 Performance Comparison of HDD
11.3 Performance Comparison of TeraSort Workload
11.3.1 Performance Comparison of SSD
11.3.2 Performance Comparison of HDD
Chapter 12 – Overall comparison of 10GigE, IPoIB and RDMA-IB
Chapter 13 – Conclusion
Chapter 14 – Future Scope
The primary research objective is to analyze and characterize the performance of external sorting algorithms within the Hadoop MapReduce framework, specifically evaluating the impact of storage devices (SSD versus HDD) when connected via different interconnect technologies like 10GigE, IPoIB, and RDMA-IB to optimize big data processing efficiency.
3.1 An Insight into SSD and HDD
A solid-state drive (SSD) (also known as a solid-state disk or electronic disk) (Figure 2) is a data storage device that uses integrated circuit assemblies as memory to store data persistently. SSDs use electronic interfaces that are compatible with conventional block input/output (I/O) HDDs, which allows them to replace HDDs easily in common applications. SSDs use NAND-based flash memory, which retains data without power [12][39].
A hard disk drive (HDD) (Figure 3) is a storage device that stores and retrieves digital data using rapidly rotating disks coated with magnetic material. An HDD is non-volatile, i.e. it retains its data even after power is switched off. Data is accessed in a random-access manner, which means that individual blocks of data can be stored or retrieved in any order. An HDD consists of one or more rigid, rotating disks with magnetic heads arranged on a moving actuator arm that read and write data on the disk surfaces [12].
Solid-state drives offer a number of benefits over conventional hard drives:
1.) SSDs are more durable: An SSD has a non-mechanical design of NAND flash mounted on circuit boards and is shock resistant. Hard disk drives contain various moving parts, which makes them susceptible to shock and damage.
2.) SSDs are faster: SSDs deliver higher throughput, near-instant data access, faster boot times, quicker file transfers, and generally faster computing than hard disks. An HDD can access data faster only the closer the data lies to the read/write heads, whereas all areas of an SSD are accessed at the same speed.
3.) SSDs consume less power: SSDs use considerably less power at peak load than hard drives. Their energy efficiency can make systems more cost effective and deliver longer battery life, less power strain on the system, and a cooler operating environment. (Figure 3.4)
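The difference in access behaviour described in point 2 can be illustrated with a toy model. The following Python sketch is not taken from the thesis: the function names and all parameter values (full-stroke seek time, spindle speed, flash read latency) are hypothetical placeholders chosen only to show that HDD access time grows with head travel while SSD access time does not.

```python
def hdd_access_ms(track_distance, max_tracks=100_000, full_seek_ms=15.0, rpm=7200):
    """Toy HDD model: seek time grows with head travel, plus rotational latency.

    All default values are illustrative assumptions, not measured data.
    """
    # Seek time scaled linearly with how far the actuator arm has to move.
    seek_ms = full_seek_ms * (track_distance / max_tracks)
    # On average the platter turns half a revolution before the sector arrives.
    rotational_ms = 0.5 * 60_000.0 / rpm
    return seek_ms + rotational_ms


def ssd_access_ms(block_distance, flash_read_ms=0.1):
    """Toy SSD model: access time does not depend on where the block is stored."""
    return flash_read_ms


if __name__ == "__main__":
    for distance in (0, 50_000, 100_000):
        print(distance, round(hdd_access_ms(distance), 2), ssd_access_ms(distance))
```

In this simplified model the HDD pays a positional cost on every access, whereas the SSD cost is constant; real devices also have caching, queuing, and controller effects that the sketch ignores.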
Chapter 1 – INTRODUCTION: Outlines the challenges of big data processing in the digital age and introduces Hadoop and MapReduce as essential architectural solutions.
Chapter 2 – LITERATURE REVIEW: Reviews the advantages of Hadoop, characteristics of big data, project architecture, and various interconnect technologies including Ethernet and InfiniBand.
Chapter 3 – Experimental Testbed System: Details the hardware and software specifications of the 4-node Quanta server stack used for experimental evaluation.
Chapter 4 – Installation, Designing and Implementation of System: Provides a comprehensive guide for setting up LAN, NFS, NIS, and Hadoop on the experimental cluster.
Chapter 5 – Benchmarking: Describes the methodology for disk benchmarking using tools like ATTO, HD Tune, and HiBench to evaluate storage device performance.
Chapter 6 – Performance Evaluation of SSD and HDD on Hadoop using 10GigE: Analyzes the execution performance of Sort, Word Count, and TeraSort workloads on SSD and HDD using 10GigE.
Chapter 7 – Performance Evaluation of SSD and HDD on Hadoop using IPoIB: Presents the performance results of various Hadoop workloads using the IPoIB interconnect technology.
Chapter 8 – Performance Evaluation of SSD and HDD on Hadoop by RDMA-IB: Investigates the performance improvements of Hadoop workloads when utilizing the RDMA-IB interconnect.
Chapter 9 – Performance Comparison between 10GigE and IPoIB: Compares the results obtained from 10GigE and IPoIB to demonstrate performance gains in map and reduce phases.
Chapter 10 – Performance Comparison between IPoIB and RDMA-IB: Evaluates the performance differences between IPoIB and RDMA-IB across different workloads and storage types.
Chapter 11 – Performance Comparison between 10GigE and RDMA-IB: Conducts a comparative performance analysis of the traditional 10GigE against the high-performance RDMA-IB interconnect.
Chapter 12 – Overall comparison of 10GigE, IPoIB and RDMA-IB: Synthesizes all performance data into a comprehensive comparison to identify the most effective storage and interconnect combinations.
Chapter 13 – Conclusion: Summarizes research findings, stating that modern interconnects significantly outperform traditional ones for big data tasks.
Chapter 14 – Future Scope: Suggests future research directions, including the implementation of dynamic shared memory models using InfiniBand.
Big Data, Hadoop, MapReduce, SSD, HDD, HiBench, 10GigE, InfiniBand, IPoIB, RDMA-IB, Performance Evaluation, Benchmarking, Storage Systems, Cluster Computing, Latency
The paper focuses on evaluating and characterizing the performance of Hadoop storage systems by testing Solid State Drives (SSD) and Hard Disk Drives (HDD) across various network interconnect technologies such as 10GigE, IPoIB, and RDMA-IB.
Key topics include big data storage, Hadoop performance optimization, benchmark suites like HiBench, hardware vs. software interconnect solutions, and cluster administration for research environments.
The primary goal is to study how modern high-performance storage and network interconnects can overcome I/O bottlenecks in Hadoop clusters to improve throughput and reduce latency for data-intensive applications.
The author utilizes a practical experimental approach, constructing a 4-node cluster using Quanta servers and applying standardized workload benchmarks (Sort, Word Count, TeraSort) provided by the HiBench suite to collect empirical performance data.
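To make the Word Count workload concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps in Python. It is an illustration only: it is not the HiBench or Hadoop implementation, and the function names are hypothetical. On a real cluster the shuffle step moves intermediate data between nodes over the network, which is where the choice of interconnect matters.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done over the network in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```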
The main sections cover the technical design of the Hadoop cluster, the setup of network configurations, a detailed benchmarking phase, and a systematic comparative evaluation of different interconnect technologies on workload execution times.
Keywords include Big Data, Hadoop, MapReduce, SSD, HDD, HiBench, 10GigE, InfiniBand, IPoIB, RDMA-IB, Performance Evaluation, and Cluster Computing.
The experimental results demonstrate that SSDs consistently provide higher throughput, lower latency, and faster completion times for MapReduce workloads compared to HDDs due to their non-mechanical nature.
The research concludes that RDMA-IB significantly outperforms 10GigE, showing drastic improvements in both map and reduce phase times, making it a highly effective solution for real-time big data processing.
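Comparisons of this kind are typically expressed as a relative reduction in execution time. The sketch below shows one way such a figure could be computed in Python. The function name is hypothetical and the timings in the usage example are placeholder values, not results from the thesis.

```python
def improvement_percent(baseline_s, candidate_s):
    """Relative reduction in execution time of candidate versus baseline, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_s - candidate_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only (not measured data).
    baseline_time_s = 200.0
    candidate_time_s = 150.0
    print(improvement_percent(baseline_time_s, candidate_time_s))
```

A positive value means the candidate configuration finished faster than the baseline; a negative value means it was slower.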
The author recommends that cloud systems with high-importance, real-time requirements should upgrade to RDMA-IB, while those without strict real-time requirements can achieve significant performance boosts by upgrading to IPoIB.

