Description of Work

Project acronym: BigFoot
Project full title: Big Data Analytics of Digital Footprints
Project Budget: 3, Euro
Work programme topics addressed: Objective ICT: Cloud Computing, Internet of Services and Advanced Software Engineering
Name of the coordinating person: Pietro Michiardi
Fax:

List of Participants
Role  Number  Name                                       Short Name  Country  Date Enter  Date Exit
CO    1       EURECOM                                    EUR         FR       1           36
CR    2       SYMANTEC                                   SYM         IR       1           36
CR    3       Technische Universität Berlin              TUB         DE       1           36
CR    4       Ecole Polytechnique Federale de Lausanne   EPFL        CH       1           36
CR    5       GridPocket                                 GRIDP       FR       1           36
Role: CO=Coordinator; CR=Contractor.

SEVENTH FRAMEWORK PROGRAMME
THEME FP7-ICT: Cloud Computing, Internet of Services and Advanced Software Engineering

Contents
1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan
   Concept and objectives
   Context
   Motivations
   Objectives: The Approach
   Expected results
   Indicators and success criteria
   Relevance to the topics addressed in the call
   Progress beyond the state-of-the-art
   Application layer
   Parallel data processing
   Interactive query engines
   Distributed data stores
   Virtualization layer
   Relevant EU-funded projects
   Baseline
   S/T methodology and associated work plan
   Introduction
   Methodology
   Workplan Structure and Breakdown
   Overall System Description
   Usage Scenarios
   Risk and mitigation plans
   Work packages list
   Deliverables list
   List of milestones
Implementation
   Consortium as a whole
Impact
   Expected impacts listed in the work programme
   Strategic impact
   Scientific impact
   Social and economic impact
   The European dimension of BigFoot
   Plan for the use and dissemination of foreground
   Dissemination and communication strategy
   Exploitation strategies
   Standardization activities

1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan

1.1 Concept and objectives

The aim of BigFoot is to design, implement and evaluate a scalable system for processing and interacting with large volumes of data. The BigFoot software stack allows automatic and self-tuned deployment of data storage and parallel processing services for private cloud deployments, going beyond the best-effort services currently available in the state of the art. The project addresses performance bottlenecks of current solutions and takes a cross-layer approach to system optimization, which is evaluated with a thorough experimental methodology using realistic workloads and datasets. The ultimate goal of the project is to contribute the BigFoot software stack to the open-source community.

Context

The amount of data in our world has been exploding. E-commerce, Internet security and financial applications, billing and customer services, to name a few examples, will continue to fuel exponential growth of large pools of data that can be captured, communicated, aggregated, stored, and analyzed. As companies and organizations go about their business and interact with individuals, they generate a tremendous amount of digital footprints, i.e., raw, unstructured data (for example, log files) created as a by-product of other activities.
As discussed in the report in [56], there are many broadly applicable ways to leverage data and create value across sectors of the global economy:
- making data access and interaction simple;
- collecting and processing digital footprints to measure and understand the root causes of product performance and bring it to higher levels;
- leveraging large amounts of data to create highly specific user segmentations and to tailor products and services precisely to meet users' needs;
- producing sophisticated analytics to improve decision making with automated algorithms;
- using data analysis to create new products and services, enhance existing ones, and invent entirely new business models.

In summary, the use of data is a key basis of competition and growth: companies that fail to develop their analysis capabilities will fail to understand and leverage the big picture hidden in the data, and hence fall behind. Nowadays, the ability to store, aggregate, and combine large volumes of data and then use the results to perform deep analysis has become ever more accessible, as trends such as Moore's Law in computing, its equivalent in digital storage, and cloud computing continue to lower costs and other technology barriers. However, the means to extract insights from data require remarkable improvements, as the software and systems needed to apply increasingly sophisticated mining techniques are still in their infancy. Large-data problems require a distinct approach that sometimes runs counter to traditional models of computing. In BigFoot, we depart from high-performance computing applications and go beyond traditional techniques developed in the database community. In the remainder of this document we present a comprehensive system to process and interact with large amounts of data, one that can be deployed on top of private, virtualized clusters of commodity hardware. Before delving into the technical objectives we address in BigFoot, in the next section we motivate our approach by focusing on a few use-cases and by clearly indicating the deficiencies of current approaches.

Motivations

We now illustrate the challenges organizations face when dealing with their own digital footprints. Although the following examples apply to a wide range of use-cases, we focus on the context and requirements of two companies that are part of the consortium.

GridPocket: GridPocket provides energy-related, value-added service solutions. The goal of this organization is to use and process consumption data generated by millions of customers to help electric, gas, and water utilities reduce their CO2 emissions. Data analysis tasks apply, for example, to the following cases:
i) Consumer billing: a monthly scan of all customer data to produce consumption reports;
ii) Consumer dashboard Web applications: analysis of the whole consumption data to create personalized customer reports;
iii) Consumer segmentation: execution of sophisticated algorithms to classify consumers based on their consumption patterns and produce, for example, personalized contract offers;
iv) Provisioning applications: analysis of geographical consumption patterns and design of predictive algorithms to help operators in provisioning their electric network.
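The consumer-billing case above boils down to a full, periodic scan of the consumption data. The following minimal Python sketch illustrates that kind of batch aggregation; the record layout and names are hypothetical and do not come from the proposal.

```python
from collections import defaultdict

# Hypothetical meter readings: (customer_id, timestamp, kWh); illustrative only.
readings = [
    ("cust-001", "2012-03-01T00:00", 1.5),
    ("cust-001", "2012-03-01T01:00", 0.5),
    ("cust-002", "2012-03-01T00:00", 2.5),
]

def monthly_billing_scan(records):
    """Scan every reading once and aggregate consumption per customer."""
    totals = defaultdict(float)
    for customer_id, _timestamp, kwh in records:
        totals[customer_id] += kwh
    return dict(totals)

print(monthly_billing_scan(readings))  # {'cust-001': 2.0, 'cust-002': 2.5}
```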
Symantec: Symantec is one of the world's industry leaders in security software, focused on helping customers protect their infrastructures, their information, and their businesses. Through its Global Intelligence Network, Symantec has established some of the most comprehensive sources of Internet threat data in the world, with 240,000 sensors monitoring network attack activity in more than 200 countries through a combination of security products and managed services. However, security analysts are challenged in their daily job of analyzing global Internet threats because of the sheer volume of data Symantec collects around the globe [31]. In the cyber security domain, this is sometimes referred to as attack attribution and situational understanding, which are considered today as critical aspects to effectively deal with Internet attacks [43, 58, 70]. Attribution in cyberspace involves different methods and techniques which, when combined appropriately, can help explain an attack phenomenon by (i) indicating the underlying root cause, and (ii) showing the modus operandi of attackers. The goal is to help analysts answer important questions regarding the organization of cyber-criminal activities by taking advantage of effective tools able to generate security intelligence about known or unknown threats.

What is a private cloud? Private cloud deployments [44] resemble public ones: one or more datacenters (clusters of physical machines, interconnected via a high-speed local area network) host virtual server instances which can be customized from scratch, or which host services and applications exposed to end-users via simple and standard interfaces, e.g., HTTP. Such installations are private in that services and applications are not accessible from the outside world: they are confined to clients that interact with them from within the security perimeter of a company or organization. In addition, data stored and manipulated in a private cloud never leave the company's datacenter, hence policies and rules protecting data from unauthorized access can be enforced with legacy approaches to security and access control.

What are the common requirements, with respect to analytics tasks, that characterize the two scenarios described above? Clearly, there are two main ways of interacting with data: i) batch processing of large amounts of data (for analysis, mining, classification, etc.), and ii) selective queries on subsets of data, issued through manual inspection or by Web applications that read information related to a single user or a subset thereof. Hence the need for a unified system capable of supporting both kinds of interaction with data.
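To make the two interaction modes concrete, here is a minimal, self-contained Python sketch (the event schema and function names are assumptions made for illustration, not part of the proposal). A batch job scans all records, while a selective query answers a per-user lookup from an index without a full scan, which is why a unified system must support both access paths.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, event_type).
events = [
    ("alice", "login"), ("bob", "scan"), ("alice", "scan"), ("carol", "login"),
]

def batch_job(records):
    """Batch interaction: a full pass over the data, e.g. counting events per type."""
    counts = defaultdict(int)
    for _user, event_type in records:
        counts[event_type] += 1
    return dict(counts)

def build_index(records):
    """Selective interaction: an index lets a per-user query avoid a full scan."""
    index = defaultdict(list)
    for user, event_type in records:
        index[user].append(event_type)
    return index

print(batch_job(events))             # {'login': 2, 'scan': 2}
print(build_index(events)["alice"])  # ['login', 'scan'] -- single-user query
```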
The requirements exemplified above can, to some extent, be met with existing technology. What are the typical solutions available today, and what novelty does BigFoot bring to the current state of the art? First, we review the three most prominent approaches to large-data analytics: i) buying database management systems and appliances from big vendors, ii) using public cloud services, and iii) using open-source projects/products.

Data Analytics Appliances: examples of products that target the Big Data market include EMC/Greenplum, Splunk, Oracle Big Data Appliance and many more. Addressing large-data analytics problems with such an approach has the advantage of using a product that bundles together hardware and software and comes with production-level support. However, these products are closed-source, and little is known about their effective performance and how they behave when compared to alternative solutions. Furthermore, the costs associated with a production-level deployment are exorbitant, with licensing fees proportional to the amount of data to be processed. It should also be noted that such systems are difficult (at best) to deploy and tune [62]. Last but not least, this approach suffers from the data lock-in problem: once data is loaded and analytic jobs are written for a specific platform, it is hard to move to another product.

Public-cloud Analytic Services: a prominent example of this kind of approach is Amazon Elastic MapReduce. The idea, for organizations such as the ones discussed above, is to ship their data to a public cloud storage service, prepare analytic jobs in an offline manner (e.g., a MapReduce program, but also SQL-like queries), submit the data processing code to a web application, and select the amount (and quality) of resources that will be dedicated to the computation. Although this seems an appealing approach, it is not exempt from significant drawbacks. First, today's public cloud products offer only a best-effort service: in practice there is no performance guarantee, and current service-level agreements also account for down-time periods in which the service may be unavailable. Furthermore, current EU directives are strict in what concerns privacy issues, which drastically limits the applicability of public-cloud services. Finally, it is often unacceptable for industries to put sensitive data sets on public cloud infrastructure, for obvious confidentiality reasons (e.g., for a security software company, uploading to a public cloud the malware and attack data targeting its customers is not an option).

Open-source projects: as a prominent example we consider Apache Hadoop (we review other projects in Section 1.2), which consists of an ecosystem of sub-projects, each dealing with a particular technology (e.g., a parallel processing framework similar to Google's MapReduce [28], a distributed data store similar to Google's BigTable [17], etc.). Additionally, several commercial products based on Hadoop have appeared in the last two years: IBM BigInsights, MapR, and Cloudera's Hadoop Distribution. The main focus of Hadoop is to reach production-level quality: a lot of effort has been dedicated to addressing bugs, improving interoperability among components, improving custom deployments on dedicated clusters and, recently, eliminating single points of failure in the original architecture. In short, open-source projects lack a structured effort toward system optimization and fail to cover the multiple layers involved in typical deployments.

In summary, although the current state of the art offers a rich set of approaches to tackle large-scale data processing problems, we identify the following key points that are currently not addressed by existing technologies:
- Data interaction is hard. Current approaches lack an integrated interface to inspect and query (processed) data. Moreover, little work has been done in the literature to optimize the efficiency (and not only the performance) of interactive queries that operate on batch-processed data.
- Parallel algorithm design is hard. While the design of parallel algorithms is already a difficult topic per se, current systems make the implementation of even simple jobs a tedious and exhausting experience. As such, parallel programs tend to have limited usability and a short lifetime, i.e., code re-use is limited.
- Lack of optimizations. Current systems entrust users with the task of optimizing their queries and algorithms. Moreover, data-flow and storage mechanisms are data-processing oblivious, which leaves room for several optimizations that have not been addressed by current solutions.
- Deployment tools are poor. Management tools are still in their infancy and target solely bare-metal clusters. In addition, the effects of virtualization have been largely overlooked in the literature.

Illustrative example of a deployment. We conclude this section with an illustrative example to highlight the benefits brought by the project. For this example, we assume BigFoot is used by a company, say Symantec, willing to explore the secrets hiding in the vast amount of data they collect. We assume that Symantec is already in possession of a private cloud deployment, that is, the set of machines on which BigFoot will execute (in addition to existing, collocated services); BigFoot is physically deployed on their premises. What are the steps a Symantec user is required to follow? BigFoot exposes storage, data processing and querying components as a Platform-as-a-Service; in practice this translates into the following steps:
1. Using a standard interface (shell-based or web-based), the user specifies the location of the data she will operate on; the system injects such data into the relevant storage layer (distributed file system or data store).
2. Using a standard interface (shell-based or web-based), the user specifies the data processing tasks that are required to run on her data. Data processing can be delay tolerant (that is, batch-oriented analysis) or latency sensitive (that is, interactive queries), and the system can be instructed to direct such tasks to the corresponding engine.
3. The system automatically deploys the necessary machinery (in terms of virtual machines) to execute the data analysis tasks, and performs the necessary optimizations (data and virtual machine migration, data-flow enhancements, work-sharing optimizations) to obtain an aggregate result (that we also label metadata).
4. Using a standard interface (shell-based or web-based), the user can further inspect aggregate statistics to extract useful information or to refine the data processing tasks.

As the example above illustrates, BigFoot offers a unified setting to store, process and interact with data, exposed to the user through simple and standard interfaces to a cloud service. All the complications related to deployment, tuning and optimization are handled transparently by the system. In addition to the above scenario, it should be noted that BigFoot offers tap-in points for experienced users who are willing to sacrifice the simplicity of this approach for more controlled usage of, for example, the parallel processing layer. This provides the additional freedom for the user to decide how to interact with and use BigFoot.
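The workflow above can be summarized with a short, purely hypothetical client-side sketch in Python. None of the class or method names come from the proposal; they only mimic the Platform-as-a-Service steps (inject data, submit batch or interactive tasks, inspect aggregate results).

```python
class PlatformClient:
    """Stand-in for the shell- or web-based interface mentioned in the example."""

    def __init__(self):
        self.storage = {}   # simulated storage layer (file system / data store)
        self.results = {}   # simulated aggregate results ("metadata")

    def inject(self, dataset, location):
        # Step 1: the user points the system at her data; the system loads it.
        self.storage[dataset] = f"loaded from {location}"

    def submit(self, dataset, task, mode):
        # Steps 2-3: batch ("delay tolerant") and interactive ("latency sensitive")
        # tasks would be routed to different engines; here we only record them.
        assert mode in ("batch", "interactive")
        self.results[(dataset, task)] = f"{mode} result for {task}"

    def inspect(self, dataset, task):
        # Step 4: the user inspects aggregate statistics or refines the task.
        return self.results[(dataset, task)]

client = PlatformClient()
client.inject("threat-logs", "/data/sensors/2012-03")
client.submit("threat-logs", "top-attack-sources", mode="batch")
print(client.inspect("threat-logs", "top-attack-sources"))
```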
Objectives: The Approach

The key challenge of BigFoot is to conduct cutting-edge research on several issues related to Big Data analytics applications and services, producing relevant output for the research community with the potential of being immediately relevant and available to real-world, industrial problems. BigFoot has a number of important scientific and industrial objectives. These include fundamental (scientific and research-oriented) and experimental elements, completed by contributions to the open-source community.

Fundamental Objectives. The fundamental objectives of BigFoot are threefold, and each is described in detail in the following sub-paragraphs. The technical work packages (WP) described in the remainder of this document, including WP2, 3, 4 and 5, contribute to the research-oriented work carried out in BigFoot.

Fundamental objective 1. Given the potential offered by BigFoot, it is important to clearly define use cases and scenarios and to detail the workloads that derive from real-world applications. Moreover, it is important to make an effort to generalize workloads to encompass other applications that share similar traits to those addressed in BigFoot. The work for this first objective is executed in WP2, and it is scheduled to take place in the first phase of the project.

Fundamental objective 2. In BigFoot, the focus is on system optimization, starting from the top layer down to lower layers of the system stack. Each component of the architecture requires special care in achieving efficiency, scalability and reliability goals. Furthermore, little is known about the combination of optimization techniques at different layers of the stack, which we label a cross-layer approach. As such, part of this fundamental objective is the design, implementation and validation of:
1. a novel component aiming at transforming a high-level, declarative query language into parallel programs (see the sketch at the end of this section); this work corresponds to a task in work package 3;
2. optimizations to the inner data-flow of the parallel processing framework adopted in BigFoot; this work is carried out in a task in work package 3;
3. a service-oriented query engine for improving data interaction, and its integration with distributed data stores; this work is done in a task in work package 3;
4. novel data partitioning and placement mechanisms that aim at optimizing the storage layer; this work is carried out in work package 4.

Fundamental objective 3. The BigFoot system stack is designed to work in a virtualized cluster, consisting of a pool of virtual machines and virtual networks. Our ultimate goal is to go beyond best-effort services and offer performance guarantees. An underlying objective is to adopt a cross-layer approach to optimization. This fundamental objective is achieved by the work in WP5, which covers several aspects of infrastructure virtualization and algorithms to support the specification of requirements and constraints dictated by BigFoot components.

Experimental Objectives. BigFoot strives for cutting-edge experimental work that is driven by the applications envisioned in the project. To this end, each partner will work on experiments using several platforms deployed at their respective premises. Each experimental platform will evaluate selected components during the early stages of the development process, and the whole platform once it is available. Experimental objectives are grouped by partner type, academic first, then industrial.

Experimental objective 1. EUR, TUB, and EPFL will work towards the design and deployment of experimental test beds to validate and analyze
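As noted in Fundamental objective 2 above, one goal is to translate a declarative request into a parallel program. The toy Python sketch below shows the shape of such a translation: a SQL-like "group by key and sum a value" specification is "compiled" into map and reduce functions and executed by a sequential stand-in for a MapReduce engine. This is not the BigFoot compiler; the names and the execution model are assumptions made purely for illustration.

```python
from itertools import groupby
from operator import itemgetter

def compile_group_sum(key_field, value_field):
    """'Compile' a declarative group-by-sum into a (map, reduce) pair."""
    def map_fn(row):
        yield row[key_field], row[value_field]

    def reduce_fn(key, values):
        return key, sum(values)

    return map_fn, reduce_fn

def run_mapreduce(rows, map_fn, reduce_fn):
    """Sequential stand-in for a parallel MapReduce execution."""
    intermediate = [pair for row in rows for pair in map_fn(row)]
    intermediate.sort(key=itemgetter(0))  # the shuffle/sort phase
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

rows = [{"customer": "a", "kwh": 1.0}, {"customer": "b", "kwh": 2.0},
        {"customer": "a", "kwh": 0.5}]
map_fn, reduce_fn = compile_group_sum("customer", "kwh")
print(run_mapreduce(rows, map_fn, reduce_fn))  # [('a', 1.5), ('b', 2.0)]
```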