
2013 IEEE International Congress on Big Data

Towards Cloud-based Analytics-as-a-Service (CLAaaS) for Big Data Analytics in the Cloud

Farhana Zulkernine 1, Patrick Martin 1 and Ying Zou 2
1 School of Computing, 2 Dept. of Electrical and Computer Engineering
Queen's University, Kingston, ON, Canada K7L 3N6

Michael Bauer 1, Femida Gwadry-Sridhar 2
1 Dept. of Computer Science, 2 Dept. of Medicine Pharmacology and Physiology
Western University, London, ON, Canada

Ashraf Aboulnaga
Dept. of Computer Science
University of Waterloo, Waterloo, ON, Canada N6A 3K7

Abstract

Data Analytics has proven its importance in knowledge discovery and decision support in different data and application domains. Big data analytics poses a serious challenge in terms of the necessary hardware and software resources. Cloud technology today offers a promising solution to this challenge by enabling ubiquitous and scalable provisioning of computing resources. However, further challenges remain to be addressed, such as the availability of the required analytic software for various application domains, estimation and subscription of necessary resources for the analytic job or workflow, management of data in the cloud, and design, verification and execution of analytic workflows. We present a taxonomy for analytic workflow systems to highlight the important features in existing systems. Based on the taxonomy and a study of the existing analytic software and systems, we propose the conceptual architecture of CLoud-based Analytics-as-a-Service (CLAaaS), a big data analytics service provisioning platform in the cloud. We outline the features that are important for CLAaaS as a service provisioning system, such as user and domain specific customization and assistance, collaboration, modular architecture for scalable deployment and Service Level Agreement.
Keywords: Analytics, workflow, taxonomy, service, CLAaaS, AaaS, cloud, scientific workflow management system, analysis

I. INTRODUCTION

Data analytics has proven its potential in providing decision support in financial, administrative, and scientific sectors by enabling complex computations to generate knowledge, insights, and experimental proofs for scientific discovery. However, the amount of data that needs to be analyzed is growing at an exponential rate, and experts in data analytics technologies such as data mining and machine learning often do not have the domain knowledge required to understand the data that needs to be analyzed. Therefore, researchers have been working on designing analytic systems to facilitate complex data analysis. However, these systems are often tailored to specific data domains and do not provide ubiquitous access or the desired scalability for big data analysis. Cloud computing technology offers solutions to the above shortcomings, but a service infrastructure must be in place to provision the necessary resources transparently on the cloud [13].

The term Analytics represents a broader scope that includes analysis and inference techniques for decision support. Some analytic jobs are definitive and are geared towards specific outcomes which are fed into decision support systems, while others are more exploratory, where the end results may or may not be useful. The exploratory analytic jobs are dynamic in nature and typically require several revisions before they can be used as definitive jobs [11]. A data analysis process [14][17], which we call an Analytic Workflow, generally comprises a sequence of data cleansing and integration tasks and/or exploratory analytic jobs such as definition and execution of machine learning models or simple analytic queries. A task in a workflow can also be another workflow for more complex multi-level analytics.
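The notion of an analytic workflow whose tasks may themselves be workflows can be pictured as a small recursive data structure. The following is only a minimal sketch under our own assumptions: the class names `Task` and `Workflow`, the method `leaf_tasks`, and the example task names are hypothetical illustrations and are not taken from CLAaaS or any existing workflow system.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Task:
    """A single step, e.g. a data cleansing job or an analytic query (hypothetical)."""
    name: str


@dataclass
class Workflow:
    """An ordered sequence of steps; a step may itself be a Workflow (hypothetical)."""
    name: str
    steps: List[Union[Task, "Workflow"]] = field(default_factory=list)

    def leaf_tasks(self) -> List[str]:
        """Return task names in sequence order, expanding nested sub-workflows."""
        names: List[str] = []
        for step in self.steps:
            if isinstance(step, Workflow):
                names.extend(step.leaf_tasks())  # recurse into the sub-workflow
            else:
                names.append(step.name)
        return names


# A two-level example: a preparation workflow nested inside an analysis workflow.
prep = Workflow("prepare", [Task("cleanse"), Task("integrate")])
analysis = Workflow("analysis", [prep, Task("train_model"), Task("query")])
print(analysis.leaf_tasks())  # ['cleanse', 'integrate', 'train_model', 'query']
```

Treating a sub-workflow as just another step keeps multi-level analytics uniform: the same traversal works at any nesting depth.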
A workflow management system is, therefore, essential for efficient definition and execution of analytic workflows. The software tools for the tasks and analytic jobs and the visualization models vary depending on the data type, size, domain, and business goals. The proximity of the analytic tools to the data sources is important, especially for big data, to avoid data transfer time and network cost. Selecting the appropriate hardware and software resources and defining data and task control flows with dependencies can be a challenge even for an expert data scientist [13]. Therefore, the cloud-based infrastructure should provide not only scalable hardware resources but also a platform equipped with customizable domain-specific software tools and a workflow management system to facilitate the definition and execution of big data analytic workflows [8][9].

A number of analytic workflow management systems exist today, which are also referred to as scientific workflow management systems in the literature [4][16][22][26]. Many of these systems evolved from domain specific analytic research studies. The systems vary in their focus on data domain, workflow types, representations, execution and provenance mechanisms, and support for collaboration and visualization. Taverna and XBaya [5] support composition and distributed execution of workflows using 3rd party web and data services. With respect to the analysis of large data sets, grid technology is used for distributed mapping and enactment of workflows by WINGS, VLAM-G, DAGMan and Kepler [5]. However, grids are not as scalable as clouds. The cloud can provide ubiquitous access, on-demand storage, memory and compute resources, fault tolerance and online collaboration. Distributed and parallel computation frameworks such as Hadoop [10], and the scalable big data storage, management and query tools [9], make the cloud an excellent platform for analytics.
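The task control flows with dependencies mentioned above determine which tasks must run one after another and which may run side by side. As a rough sketch, assuming a workflow is given as a plain mapping from each task to the tasks it depends on, Python's standard `graphlib` module (Python 3.9+) can group tasks into stages. The function name `parallel_stages` and the task names are hypothetical; this is not how any of the cited systems schedule work.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_stages(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; all tasks in one stage are mutually independent.

    `dependencies` maps a task to the set of tasks whose output it needs.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the flow is not acyclic
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        stages.append(ready)
        sorter.done(*ready)  # mark the stage finished to unlock successors
    return stages


# Hypothetical workflow: two cleansing jobs feed an integration step,
# whose output is used by a model and a query before visualization.
deps = {
    "integrate": {"cleanse_a", "cleanse_b"},
    "train_model": {"integrate"},
    "summary_query": {"integrate"},
    "visualize": {"train_model", "summary_query"},
}
stages = parallel_stages(deps)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

Each printed stage could be dispatched to parallel workers, while the stages themselves run in order.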
Although data security is a major concern for the cloud paradigm [3], techniques such as access control [6], intrusion detection [1], data anonymization [6] and encryption [27] are being used and researched as possible solutions. None of the existing analytic software and workflow systems [4][5][16][26] meets the desired accessibility and scalability or provides the customizability to support various user roles such as analysts, domain experts, workflow executors or simple query executors.

Considering the required features of an analytic workflow system as discussed above, we propose the conceptual architecture of CLAaaS, a Cloud-based Analytics-as-a-Service (AaaS) platform for big data analytics. As a Platform-as-a-Service (PaaS), CLAaaS will provide on-demand data storage and analytics services through customized user interfaces, which will include query, decision management, and workflow design and execution services for different user groups. CLAaaS will apply Service Level Agreements (SLAs) to provide controlled access to domain specific software and data resources, and recommendations and guidance in designing, sharing and executing analytic workflows. We use a taxonomy based on a study of existing workflow systems to identify the key features to be included in CLAaaS.

The rest of the paper is organized as follows. Section 2 includes a study of the related work. A taxonomy of analytic workflow systems is presented in Section 3, which is used to identify key features and requirements for designing CLAaaS. Section 4 describes the conceptual architecture of CLAaaS. The paper concludes in Section 5.

II. RELATED WORK

We propose CLAaaS as a service provisioning platform or PaaS and not as a single analytic software package or system. CLAaaS will be configured with one or more analytic software packages and, most importantly, an analytic workflow management system based on the SLA.
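The combination of user roles and SLA-controlled access to software proposed for CLAaaS might be pictured as a simple permission check. The sketch below is purely illustrative: the role-to-service mapping, the `SLA` class, the `may_use` function and the tool names are our own assumptions, since the text does not specify how such checks would be implemented.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical mapping from user roles to the services each role may call.
# The roles and service names echo the text; the assignments are assumed.
ROLE_SERVICES = {
    "analyst": frozenset({"query", "workflow_design", "workflow_execution"}),
    "domain_expert": frozenset({"query", "decision_management"}),
    "workflow_executor": frozenset({"workflow_execution"}),
    "query_executor": frozenset({"query"}),
}


@dataclass(frozen=True)
class SLA:
    """Hypothetical agreement listing the software a subscriber may access."""
    subscriber: str
    licensed_tools: FrozenSet[str]


def may_use(role: str, service: str, tool: str, sla: SLA) -> bool:
    """Allow a request only if the role covers the service and the SLA covers the tool."""
    role_allows = service in ROLE_SERVICES.get(role, frozenset())
    sla_allows = tool in sla.licensed_tools
    return role_allows and sla_allows


sla = SLA("research_lab", frozenset({"weka", "r"}))
print(may_use("analyst", "workflow_design", "weka", sla))        # True
print(may_use("query_executor", "workflow_design", "weka", sla))  # False
print(may_use("analyst", "workflow_design", "sas", sla))          # False
```

Separating the role check from the SLA check keeps the two concerns independent: roles describe what a user may do, while the agreement describes which resources the subscriber has access to.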
There are a number of different analytic software packages and workflow systems [4][5][16][22][26], which we discuss in this section as related work. SAS [21], SPSS, Cognos, InfoSphere BigInsights [12], and Tableau [23] are a few commercial products for statistical, business and scientific data analysis. Most of them provide rich tools for text analysis, analytic modeling, predictive analytics, visualization, collaboration, decision management, or adding 3rd party applications for decision support. R, which is a successor of S, is a popular open source software suite used to develop programs to perform statistical analysis [19]. Several open source data mining tools such as Weka [25] and RapidMiner [20] are used widely in various research domains. RapidMiner includes some libraries of Weka and provides graphical user interfaces (GUIs) for designing simple processes. However, all the above systems currently need to be installed on organizational frameworks and are not offered as services. Google Trends and Analytics [9] is an online analytical service, which mainly allows tracking of search keywords from user inputs for various e-commerce applications. The statistics can be used to understand consumer demands and advertise products better. Google Trends also supports graphical visualization.

Several analytic workflow systems evolved during the last decade from domain specific data analytics research. Some of these are WINGS, Taverna, VLAM-G, SciRun, Kepler, XBaya, Vistrails, and Askalon [5][26]. Most of these workflow systems provide GUIs to design workflows and use grid resources to execute them; however, the concepts and structures of workflow components, their representations, data handling strategy, workflow execution engines, and the underlying architectures vary.

Comparisons of features of the different workflow systems and taxonomies are presented by Cruz et al. [4], Yu et al. [26], Han et al. [11], and Deelman et al. [4]. While Cruz et al.
focus only on a taxonomy for provenance mechanisms, Yu et al. focus on the design, scheduling, fault tolerance and data movement aspects in their taxonomy. Han et al. classify the various types of workflow adaptations, and discuss mechanisms for doing so with reference to the ad-hoc modification requirements for workflows. Deelman et al. discuss a taxonomy specifically for scientific workflows, including some of the more recent ones, under four main categories which constitute the workflow life cycle: composition, mapping, execution and provenance. We add a few more high level categories to provide a more general version of the taxonomy and to include some of the key aspects for AaaS on the cloud.

Some of the above workflow systems have been extended to use cloud resources for workflow enactment. Juve et al. [13] discuss the usability of the cloud for scientific workflows. Morar et al. [16] present an architecture where Askalon is used to design workflows, which are then executed using the best available cloud resources with respect to costs and SLAs. Ostermann et al. [18] propose the use of a mix of grid and cloud resources when grid resources are unavailable for cost-effective execution of workflows. The benefit is measured in terms of the ratio of the cost of cloud resources to the time saved. We propose hosting the analytics system on the cloud to serve multiple users in different roles using the ubiquitous access, data sharing and scalability features of the cloud.

III. FEATURES OF WORKFLOW SYSTEMS

A. A General Taxonomy

As discussed above, the existing taxonomies focus on a specific subset of the functional properties or features of the analytic workflow systems that prevailed at the time. We provide a more general taxonomy, as shown in Fig. 1, to include some of the key aspects that we deem necessary for AaaS on the cloud. The main features of the taxonomy are described below.
1) Structure: Most of the workflow systems are task-based, where each task represents a data processing or analysis job by a software/service, or a workflow. Back end execution systems such as Pegasus [4] map the tasks onto grid resources for execution. Service-based workflow systems such as Taverna [4] focus on the interfaces for the composition and invocation of services and enable distributed enactment. The information flow can be control, data or a hybrid of the two. Complex workflows include sub-workflows or iterative tasks, whereas simple workflows have sequential, parallel or fork type selection structures.

2) Security: For sensitive data, the security methods commonly applied are anonymization [6] or encryption [7], together with access control measures [1]. Most of the existing systems only have some sort of access control. Anonymization of sensitive data is done before the data is used in analysis and needs to be applied to data if it is hosted on the cloud. Researchers are working on encryption mechanisms [7] that would enable the analysis of encrypted data.

3) Workflow Design: Many existing workflow systems such as WINGS, Kepler, Triana, Vistrails and XBaya provide graphical workflow composition tools [5][26]. AVS and SciRun [4] enable users to compose graphical filters and rendering modules to design complex graphics applications. Taverna provides a hierarchical workflow view where a compact high level view can be expanded to see component details. Graphical workflows are typically converted into other representations for storage and execution. Users specify constraints in Resource Description Framework (RDF) format in WINGS, making it a hybrid design method. The constraints allow verification of the workflows. Assistance is provided in the form of suggestions of auxiliary data services, a filtered list of domain-specific software tools and workflow templates, and a mark-up or search functionality.
4) Representation: Workflows can be represented graphically using object-based Unified Modeling Language (UML), the Scufl data flow model, graph-based DAX (Extensible Markup Language representation of Directed Acyclic Graph) or Petri net, or event-based BPMN (Business Process Model and Notation) [4], which are easier to construct for small workflows using GUI tools. Text editors are used for parameter specification and descriptions. High level scripting languages such as Ruby [29] and Python [28] are used to automatically generate the low level complex control structures in workflows. High level programming languages are also used instead of GUI tools or text editors to conveniently create the above workflow representations.

5) Visualization: Effective visualization can add immense value to analytics and can vary based on the resulting data types. Image data, unlike charts or lists, requires good resolution. Visualization models can be predefined by the user or defined intelligently by the system based on the data types, as in Kepler and VisTrails [4].

6) Collaboration: A data analyst often has limited knowledge about the data domain, which is necessary for designing good analytic models or drawing inferences on analyzed data. Collaboration is, therefore, very important for correct interpretation of the results and specification of effective visualization models. Although commercial analytic software provides some collaboration functionality, the open source workflow systems have little or no support for online collaboration. Results, data and workflows are shared and published online, for example, in myExperiment [17] and BioMart [2].

Figure 1. A taxonomy of scientific workflow management systems

7) Execution: Interdependent tasks, where the output from one serves as the input for another, have to be executed sequentially. Independent tasks can take advantage of distributed parallel execution to maximize resource utilization and minimize the execution time.
Pegasus [4][5] takes workflow specifications in XML (Extensible Markup Language) DAG (Directed Acyclic Graph) format and maps them onto grid resources where they are executed using DAGMan [5]. Workflow execution using cloud resources is currently being explored [13][18] due to the scalability required for big data. Depending on the representation of workflows, various workflow engines may be used. WINGS supports multiple engines including shell scripts and Business Process Execution Language (BPEL) [4]. The execution system can be further equipped with monitoring, scheduling, fault tolerance and automated resource provisioning features as in Kepler and WINGS.

TABLE I. CATEGORIZATION OF SOME OF THE OPEN SOURCE SCIENTIFIC WORKFLOW SYSTEMS USING OUR TAXONOMY

System | Structure | Security | Wflow Design | Representation | Visualization | Collaboration | Execution
WINGS | Hybrid, app., data, service, complex | Access control | Hybrid, assistance, verification | Semantic RDF, BPEL, JS, DAGxml | Text, graphviz | - | Dist. / local, multi-format, prov., verification
Kepler | Graphical, app., data, wflow, complex | Access control | Graphical, assistance | Data flow graph XML | Image, text, graph | Real time, within a group | Dist., Globus, execution & log, failure mgmt
Taverna | Hybrid, web service based, complex | Access control | Graphical, search/import wflow | SCUFL, Freefluo | Image, text, graphviz | Offline, through websites | Centralized, provenance
XBaya | Hybrid, web service based | - | Graphical, assistance | Jython script | Text | - | Script, centralized
Triana | Hybrid, web service | - | Hybrid, wizard, verification | WSRF, web services | Graph, image, text | - | Centralized
VisTrails | Hybrid, app., data, simple | Access control | Hybrid with versions | XML with annotation | Image, text | Offline, shared DBMS | Centralized, script, multithread

B. Required Features for an AaaS

We present a general taxonomy of workflow systems based on 7 main features, which are used in Table 1 to categorize some of the popular open source scientific workflow management research tools. We explore the different features up to various depths depending on the focus of our research. Based on our study, an analytic workflow system should have the following key features:

Hierarchical composition of workflows as shown in Fig. 2. The hierarchies include data schema and metadata definition and pre-processing, specification of analytical model(s), software or service configuration (with analytical models, parameters and data links) and verification, and workflow composition (using the above or another workflow). Each level in Fig. 2 represents a small workflow, as a revision may be needed based on expert feedback until the results are satisfactory. Also, for simple analytic jobs where a simple analytic query or a custom application can be used, level 2 may be skipped to move to level 3 from level 1. Moving back to level 1 from level 4 allows modification of pre-defined workflows and re-use of pre-defined analytic models and workflows in new workflows.

Figure 2. Hierarchical workflow composition.

Therefore, the key features of workflow systems should include:

- Support for templates and versioning to enable revision and reuse of templates in multiple workflows
- Support for dynamic design and validation of workflow components before composition
- Support for sequential, parallel, iterative and selective flows
- Support for data and/or control flow
- Specification of constraints and dependencies for verification and optimal parallel execution of workflows
- Quality of Service (QoS) provisioning based on SLA for Analytics-as-a-Service
- Transparent use of scalable cloud resources for cost-effective execution of workflows
- Support for provenance with effective logging
- Support for a recommender system to provide domain specific assistance in the design of analytic and visualization models and execution of workflows
- Focus on user role, context, and business aspects in providing customized GUI and services
- Expandable graphical view of high level compact workflows
- Minimum data movement
- Data security and privacy

C. User Groups

An AaaS should provide custom interfaces for different user groups or roles, which would typically include:

1) Scientific Analysts - Have knowledge about analytical tools and methodologies, may or may not have knowledge about the data domain, e.g. statistician/data mining experts. Require access to most of the functionality provided by CLAaaS including software tool definition/import, workflow template design, scheduling, and execution, visualization, and c