DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems

Yin Huai (1), Rubao Lee (1), Simon Zhang (2), Cathy H. Xia (3), Xiaodong Zhang (1)
(1,3) Department of Computer Science and Engineering, The Ohio State University
(2) Department of Computer Science, Cornell University

ABSTRACT

Traditional parallel processing models, such as BSP, are scale-up based, aiming to achieve high performance by increasing computing power, interconnection network bandwidth, and memory/storage capacity within dedicated systems. In contrast, big data analytics tasks aiming for high throughput demand that large distributed systems scale out by continuously adding computing and storage resources through networks. The scale-up model and the scale-out model each have a different set of performance requirements and system bottlenecks. In this paper, we develop a general model that abstracts critical computation and communication behavior and computation-communication interactions for big data analytics in a scalable and fault-tolerant manner. Our model is called DOT, represented by three matrices for data sets (D), concurrent data processing operations (O), and data transformations (T), respectively. With the DOT model, any big data analytics job execution in various software frameworks can be represented by a specific or non-specific number of elementary/composite DOT blocks, each of which performs operations on the data sets, stores intermediate results, makes necessary data transfers, and performs data transformations in the end. The DOT model achieves the goals of scalability and fault-tolerance by enforcing a data-dependency-free relationship among concurrent tasks. Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and applicable to a wide variety of big data analytics jobs. Finally, we demonstrate the effectiveness of the DOT model through several case studies.

Categories and Subject Descriptors

H.1 [Models and Principles]: Miscellaneous; H.3.4 [Information Storage and Retrieval]: Systems and Software - Distributed systems

General Terms

Design, Performance

Keywords

Big Data Analytics, Distributed Systems, System Modeling, System Scalability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SOCC'11, October 27-28, 2011, Cascais, Portugal. Copyright 2011 ACM /11/10 $10.00.

1 INTRODUCTION

The data explosion has been accelerated by the prevalence of the Internet, e-commerce and digital communication. With the rapid growth of big data, the need to manipulate datasets quickly and efficiently, in a scalable and reliable way, is unprecedentedly high. Big data analytics has become critical for industries and organizations to extract useful information from huge and chaotic data sets to support their core operations in many business and scientific applications. Meanwhile, the computing speed of commodity computers and the capacity of storage systems continue to improve while their unit prices continue to decrease. Nowadays, it is common practice to deploy a large-scale cluster with commodity computers as nodes for big data analytics. In response to the high demand of big data analytics,
several software frameworks on large and distributed cluster systems have been proposed and implemented. Representative systems include Google MapReduce [11], Hadoop [1], Dryad [17] and Pregel [22]. These system frameworks and implementations share two common goals: (1) for distributed applications, to provide a scalable and fault-tolerant system infrastructure and supporting environment; and (2) for software developers and application practitioners, to provide an easy-to-use programming model that hides the technical details of parallelization and fault-tolerance.

Although the above-mentioned systems have been operational to provide several major Internet services, and prior studies have been conducted to improve the performance of software frameworks for big data analytics, e.g., [15] and [20], the following three issues demand more basic and fundamental research efforts.

Behavior Abstraction: The scale-out model of big data analytics mainly concerns two issues: (1) how to maintain scalability, namely to ensure a proportional increase of data processing throughput as the size of the data and the number of computing nodes increase; and (2) how to provide a strong fault-tolerance mechanism in underlying distributed systems, namely to be able to quickly recover processing activities after some service nodes crash. Currently, several software frameworks have either been claimed or experimentally demonstrated to be scalable and fault-tolerant through case studies. However, the basis and principles under which jobs can be executed with scalability and fault-tolerance are not well studied. To address this issue, it is desirable to use a general model to accurately abstract the job execution behavior, because it is the most critical factor for scalability and fault-tolerance. The job execution behavior is reflected by the computation and communication behavior and the computation-communication interactions (called the processing paradigm in the rest of the paper) when the job is running on a large-scale cluster.

Application Optimization: Current practice in application optimization for big data analytics jobs is dependent on the underlying software framework, so that optimization opportunities are only applicable to a specific software framework or a specific system implementation. Several projects have focused on this type of optimization, e.g., [12]. A bridging model between applications and underlying software frameworks would enable us to gain opportunities for framework- and implementation-independent optimization, which can enhance performance and productivity without impairing scalability and fault-tolerance. With this bridging model, system designers and application practitioners can focus on a set of general optimization rules regardless of the structures of software frameworks and underlying infrastructures.

System Comparison, Simulation and Migration: The diverse requirements of various big data analytics applications create the need for system comparison and application migration among existing and/or newly designed software frameworks. However, without a general abstract model for the processing paradigm of various software frameworks for big data analytics, it is hard to fairly compare different frameworks in several critical aspects, including scalability, fault-tolerance and framework functionality. Additionally, a general model can provide a guide for building software framework simulators, which are greatly desirable when designing new frameworks or customizing existing frameworks for certain big data analytics
applications. Moreover, since a bridging model between applications and various underlying software frameworks is not available, application migration from one software framework to another depends strongly on programmers' special knowledge of both frameworks and is hard to do in an efficient way. Thus, it is desirable to have guidance for designing automatic tools for application migration from one software framework to another.

All three of the above issues demand a general model that bridges applications and various underlying software frameworks for big data analytics. In this paper, we propose a candidate for this general model, called DOT, which characterizes the basic behavior of big data analytics and identifies its critical issues. The DOT model also serves as a powerful tool for analyzing, optimizing and deploying software for big data analytics. The three symbols D, O and T are matrix representations for distributed data sets, concurrent data processing operations, and data transformations, respectively. Specifically, in the DOT model, the dataflow of a big data analytics job is represented by a DOT expression containing multiple root building blocks, called elementary DOT blocks, or their extensions, called composite DOT blocks. For every elementary DOT block, a matrix representation is used to abstract the basic behavior of computing and communications for a big data analytics job. The DOT model eliminates data dependency among concurrent tasks executed by concurrent data processing units (called workers in the rest of the paper), which is a critical requirement for achieving scalability and fault-tolerance in a large distributed system. We highlight our contributions in this paper as follows:

- We develop a general-purpose model for analyzing, optimizing and deploying software for big data analytics in distributed systems in a scalable and fault-tolerant manner. In a concise and organized way, the model is represented by matrices that characterize basic operations and communication patterns, along with interactions between computing and data transmissions during job execution.

- We show that the processing paradigm abstracted by the DOT model is scalable and fault-tolerant for big data analytics applications. Using MapReduce and Dryad as two representative software frameworks, we analyze their scalability and fault-tolerance with the DOT model. The DOT model also provides basic principles for designing scalable and fault-tolerant software frameworks for big data analytics.

- Under the DOT model, we provide a set of optimization guidelines, which are framework and implementation independent, and effective for a large scope of data processing applications. We also show the effectiveness of these optimization rules for complex analytical queries.

The rest of this paper is organized as follows. Our model and its properties are introduced in Section 2. Section 3 shows that the processing paradigm of the DOT model is scalable and fault-tolerant. In Section 4, we identify optimization opportunities provided by the DOT model. Section 5 demonstrates the effectiveness of the DOT model through several case studies. Section 6 introduces related work, and Section 7 concludes the paper.

2 THE DOT MODEL

The DOT model consists of three major components to describe a big data analytics job: (1) a root building block, called an elementary DOT block; (2) an extended building block, called a composite DOT block, which is organized by a group of independent elementary DOT blocks; and (3) a method for
building the dataflow of a big data analytics job with elementary/composite DOT blocks.

2.1 An Elementary DOT Block

An elementary DOT block is the root building block in the DOT model. It is defined as the interactions of the following three entities, which are supported by both hardware and software:

1. A big data (multi-)set that is distributed among storage nodes in a distributed system;

2. A set of workers, i.e., concurrent data processing units, each of which can be a computing node that processes and stores data; and

3. Mechanisms that regulate the processing paradigm by which workers interact with the big data (multi-)set in two steps. First, the big data (multi-)set is processed by a number of workers concurrently; each worker processes a part of the data and stores the output as an intermediate result, and there is no dependency among the workers involved in this step. Second, all intermediate results are collected by a single worker, which performs the last-stage data transformations based on the intermediate results and stores the output as the final result.

[Figure 1: The illustration of the elementary DOT block. A three-layer structure: the data is divided into n parts (D_1 to D_n), n workers (o_1 to o_n) perform concurrent operations, and 1 worker (t) collects and transforms the results.]

An elementary DOT block is illustrated by Figure 1 as a three-layer structure. The bottom layer (D-layer) represents the big data (multi-)set, which is divided into n parts (from D_1 to D_n) in a distributed system, where each part is a sub-dataset (called a chunk in the rest of the paper). In the middle layer (O-layer), n workers directly process the data (multi-)set, and o_i is the data-processing operator associated with the i-th worker. Each worker processes only one chunk (as shown by the arrow from D_i to o_i) and stores intermediate results. At the top layer (T-layer), a single worker with operator t collects all intermediate results (as shown by the arrows from o_i to t, i = 1,...,n), then performs the last-stage data transformations based on the intermediate results, and finally outputs the final result. It must be noted that, as shown in Figure 1, the basic rule of an elementary DOT block is that the n workers in the first step are prohibited from communicating with each other; the only communication in the block is the collection of intermediate results, shown by the arrows from o_i to t, i = 1,...,n.

A simple example using one elementary DOT block is calculating the sum of a large collection of integers. In this example, the large collection of integers is partitioned into n chunks, which are stored on n workers. First, each of the n workers calculates the local sum of the integers it holds and stores that local sum as an intermediate result; thus, all of the operators o_1 to o_n are summation. Then, a single worker collects all intermediate results from the n workers. Finally, this single worker calculates the sum of all intermediate results and generates the final result; thus, operator t is also summation.
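The following minimal Python sketch (ours, not the paper's) shows how this example maps onto the three layers. The function names and the use of a local process pool in place of distributed worker nodes are illustrative assumptions:

```python
# A minimal sketch of one elementary DOT block computing a global sum.
# Names (o_sum, t_sum, elementary_dot_block) are illustrative; a real
# framework would run the O-layer workers on separate distributed nodes.
from multiprocessing import Pool

def o_sum(chunk):
    # O-layer operator o_i: each worker processes only its own chunk and
    # produces a local sum as the intermediate result. No O-layer worker
    # communicates with any other O-layer worker.
    return sum(chunk)

def t_sum(intermediate_results):
    # T-layer operator t: a single worker collects all intermediate
    # results and performs the last-stage transformation (another sum).
    return sum(intermediate_results)

def elementary_dot_block(data, n):
    # D-layer: the big data (multi-)set is divided into n chunks.
    chunks = [data[i::n] for i in range(n)]
    with Pool(n) as pool:
        intermediates = pool.map(o_sum, chunks)  # concurrent, dependency-free
    return t_sum(intermediates)  # the only communication: o_i -> t

if __name__ == "__main__":
    print(elementary_dot_block(list(range(1_000_000)), n=4))  # 499999500000
```

The only data movement in the sketch is the return of each local sum to the collecting worker, mirroring the o_i-to-t arrows in Figure 1.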
2.2 A Composite DOT Block

In an elementary DOT block, there is only one worker performing the last-stage transformations on the intermediate results. It is natural to use multiple workers to collect intermediate results and perform the last-stage data transformations, because either the intermediate results tend to be huge, or the cardinality of the set of categories of intermediate results is greater than one. Thus, a group of independent elementary DOT blocks is needed, which we define as a composite DOT block, the extension of the elementary DOT block.

A composite DOT block is organized by a group of independent elementary DOT blocks that have the identical worker set in the O-layer and share the same big data (multi-)set as input, divided in an identical way. Suppose that a composite DOT block is organized by m elementary DOT blocks, each of which has n workers in the O-layer. This composite DOT block combines these elementary DOT blocks (trees) to form the forest structure shown in Figure 2. Each worker in the O-layer has m operators, where operator o_{i,j} originally belongs to worker i of the j-th elementary DOT block. The T-layer of this composite DOT block has m workers, where operator t_j is the operator for the last-stage transformations of the j-th elementary DOT block.

[Figure 2: The illustration of the composite DOT block. m independent elementary DOT blocks share the same data chunks D_1 to D_n; each O-layer worker i holds operators o_{i,1} to o_{i,m}, and the T-layer has m workers t_1 to t_m.]

Based on the definition of the composite DOT block, there are three restrictions on communications among workers: (1) workers in the O-layer cannot communicate with each other; (2) workers in the T-layer cannot communicate with each other; and (3) intermediate data transfers from workers in the O-layer to their corresponding workers in the T-layer are the only communications occurring in a composite DOT block.

An example using one composite DOT block is a job that calculates the sums of the even numbers and of the odd numbers in a large collection of integers; a code sketch of this job appears at the end of this subsection. Similar to the example shown in Section 2.1, the large collection of integers is partitioned into n chunks. Two elementary DOT blocks can be used to finish this job: one for calculating the sum of the even numbers and another for calculating the sum of the odd numbers. In the elementary DOT block for calculating the sum of even/odd numbers, each worker in the O-layer first filters out the odd/even numbers and calculates the local sum of the even/odd numbers as the intermediate result; then, a single worker collects all intermediate results and calculates the sum of the even/odd numbers. A composite DOT block is organized from these two elementary DOT blocks. In the T-layer of this composite DOT block, there are 2 workers: operator t_1 generates the sum of the even numbers, while t_2 generates the sum of the odd numbers.

In a composite DOT block, the execution of its m elementary DOT blocks is flexible: any two of the m elementary DOT blocks can be executed concurrently or sequentially, depending on the specific system implementation.
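Below is a minimal sketch of this composite DOT block (again ours, not the paper's, with illustrative names), extending the summation sketch from Section 2.1. Each O-layer worker now holds m = 2 operators, and each T-layer worker receives only its own column of intermediate results:

```python
# A minimal sketch of one composite DOT block (m = 2 elementary blocks)
# computing the sums of even and odd numbers. Names are illustrative.
from multiprocessing import Pool

def o_worker(chunk):
    # O-layer worker i holds m = 2 operators: o_{i,1} (local sum of evens)
    # and o_{i,2} (local sum of odds). It never talks to other O-layer workers.
    return (sum(x for x in chunk if x % 2 == 0),   # o_{i,1}
            sum(x for x in chunk if x % 2 == 1))   # o_{i,2}

def t_sum(column):
    # T-layer operator t_j: collects the j-th intermediate result from every
    # O-layer worker; T-layer workers do not communicate with each other.
    return sum(column)

def composite_dot_block(data, n):
    chunks = [data[i::n] for i in range(n)]        # D-layer: n chunks
    with Pool(n) as pool:
        intermediates = pool.map(o_worker, chunks)  # dependency-free step
    even_col, odd_col = zip(*intermediates)         # o_{i,j} flows only to t_j
    return t_sum(even_col), t_sum(odd_col)

if __name__ == "__main__":
    evens, odds = composite_dot_block(list(range(10)), n=2)
    print(evens, odds)  # 20 25
```

In this sketch the two elementary DOT blocks happen to execute together inside each O-layer worker, but, as noted above, running them sequentially would be an equally valid implementation choice.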
2.3 Big Data Analytics Jobs

In the DOT model, a big data analytics job is described by its dataflow, global information and halting conditions.

Dataflow of a Job: The dataflow of a big data analytics job is represented by a specific or non-specific number of elementary/composite DOT blocks. Any two elementary/composite DOT blocks in the dataflow are either dependent or independent: if the result generated by one DOT block is directly or indirectly consumed by the other, i.e., one DOT block must finish before the other, they are dependent; otherwise, they are independent. Independent elementary/composite DOT blocks can be executed concurrently.

Global Information: Workers in an elementary/composite DOT block may need to access some lightweight global information, e.g., system configurations. In the DOT model, the global information is available in a common place, such as the coordinator or the global master of the distributed system. Every worker in an elementary/composite DOT block can access the global information at any time.

Halting Conditions: The halting conditions determine when, or under what conditions, a job will stop. If a job is represented by a specific number of elementary/composite DOT blocks, the job simply stops after finishing the given number of blocks; in this case, no specific halting condition is needed. For a job represented by a recurrence relation [2], one or multiple conditions must be given so that the application can determine whether the job should stop. For example, convergence conditions and a maximum number of iterations are two commonly used halting conditions in iterative algorithms, such as PageRank [24] and the k-means algorithm [3].

[Figure 3: An example of a big data analytics job described by the DOT model with five DOT blocks, all sharing access to the global information.]

[Figure 4: An iterative job described by the DOT model with a non-specific number of DOT blocks; after each DOT block, the halting conditions are evaluated to decide whether to stop or iterate again.]

Figure 3 shows an example dataflow described in the DOT model with five DOT blocks. In this example, DOT blocks 1, 2 and 3 process the input data first. Then, DOT block 4 consumes the results generated by DOT blocks 1 and 2. Finally, DOT block 5 takes the results of DOT blocks 3 and 4 as its input and generates the final result. The global information can be accessed or updated by all five DOT blocks. Because this job stops after DOT block 5 stops, no halting condition is needed.

Figure 4 shows an iterative job described in the DOT model with a non-specific number of DOT blocks. In this example, the operators in the O-layer and T-layer of the DOT block are the same in every iteration. After every iteration, the halting conditions are evaluated; if all of the halting conditions are true, the job stops. Similar to the previous example, the global information can be accessed or updated by the DOT block in this iterative job.

2.4 Formal Definitions

In the DOT model, the elementary/composite DOT block can be formally defined using a matrix representation. The dataflo...
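The transcript breaks off above, so the paper's exact notation is not available here. As a hedged reconstruction only, consistent with the structures defined in Sections 2.1 and 2.2, an elementary DOT block could be written as a 1 x n data vector applied to an n x 1 operator vector and a single transformer, and a composite DOT block as the same data vector applied to an n x m operator matrix and an m x m diagonal transformation matrix:

```latex
% A hedged reconstruction, not the paper's verbatim definition.
% Elementary DOT block: n chunks, n O-layer operators, one T-layer operator.
\[
  \begin{pmatrix} D_1 & D_2 & \cdots & D_n \end{pmatrix}
  \begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_n \end{pmatrix}
  \begin{pmatrix} t \end{pmatrix}
\]
% Composite DOT block: m independent elementary blocks over the same chunks;
% column j holds the operators o_{i,j} whose results flow only to t_j.
\[
  \begin{pmatrix} D_1 & D_2 & \cdots & D_n \end{pmatrix}
  \begin{pmatrix}
    o_{1,1} & \cdots & o_{1,m} \\
    \vdots  & \ddots & \vdots  \\
    o_{n,1} & \cdots & o_{n,m}
  \end{pmatrix}
  \begin{pmatrix}
    t_1 &        &     \\
        & \ddots &     \\
        &        & t_m
  \end{pmatrix}
\]
```

Under this reading, "multiplying" D_i by o_{i,j} stands for applying operator o_{i,j} to chunk D_i, and the "summation" within column j stands for collecting the n intermediate results at the single worker t_j, which matches the communication restrictions stated in Section 2.2.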