Big Data Working Group: Big Data Analytics for Security Intelligence
Big Data Working Group
Big Data Analytics for Security Intelligence
September 2013

2013 Cloud Security Alliance - All Rights Reserved.

You may download, store, display on your computer, view, print, and link to the Cloud Security Alliance Big Data Analytics for Security Intelligence, subject to the following: (a) the Document may be used solely for your personal, informational, non-commercial use; (b) the Document may not be modified or altered in any way; (c) the Document may not be redistributed; and (d) the trademark, copyright or other notices may not be removed. You may quote portions of the Document as permitted by the Fair Use provisions of the United States Copyright Act, provided that you attribute the portions to the Cloud Security Alliance Big Data Analytics for Security Intelligence (2013).

Contents

Acknowledgments
Introduction
Big Data Analytics
  Data Privacy and Governance
Big Data Analytics for Security
Examples
  Network Security
  Enterprise Events Analytics
  Netflow Monitoring to Identify Botnets
Advanced Persistent Threats Detection
  Beehive: Behavior Profiling for APT Detection
  Using Large-Scale Distributed Computing to Unveil APTs
The WINE Platform for Experimenting with Big Data Analytics in Security
  Data Sharing and Provenance
  WINE Analysis Example: Determining the Duration of Zero-Day Attacks
Conclusions
Bibliography

Acknowledgments

Editors: Alvaro A. Cárdenas, University of Texas at Dallas; Pratyusa K. Manadhata, HP Labs; Sree Rajan, Fujitsu Laboratories of America

Contributors: Alvaro A. Cárdenas, University of Texas at Dallas; Tudor Dumitras, University of Maryland, College Park; Thomas Engel, University of Luxembourg; Jérôme François, University of Luxembourg; Paul Giura, AT&T; Ari Juels, RSA Laboratories; Pratyusa K. Manadhata, HP Labs; Alina Oprea, RSA Laboratories; Cathryn Ploehn, University of Texas at Dallas; Radu State, University of Luxembourg; Grace St. Clair, University of Texas at Dallas; Wei Wang, AT&T; Ting-Fang Yen, RSA Laboratories

CSA Staff: Alexander Ginsburg, Copyeditor; Luciano JR Santos, Global Research Director; Kendall Scoboria, Graphic Designer; Evan Scoboria, Webmaster; John Yeoh, Research Analyst

1.0 Introduction

Figure 1. Big Data differentiators

The term Big Data refers to large-scale information management and analysis technologies that exceed the capability of traditional data processing technologies. Big Data is differentiated from traditional technologies in three ways: the amount of data (volume), the rate of data generation and transmission (velocity), and the types of structured and unstructured data (variety) (Laney, 2001) (Figure 1).

Human beings now create 2.5 quintillion bytes of data per day. The rate of data creation has increased so much that 90% of the data in the world today has been created in the last two years alone. This acceleration in the production of information has created a need for new technologies to analyze massive data sets. The urgency for collaborative research on Big Data topics is underscored by the U.S. federal government's recent $200 million funding initiative to support Big Data research.

This document describes how the incorporation of Big Data is changing security analytics by providing new tools and opportunities for leveraging large quantities of structured and unstructured data. The remainder of this document is organized as follows: Section 2 highlights the differences between traditional analytics and Big Data analytics, and briefly discusses tools used in Big Data analytics. Section 3 reviews the impact of Big Data analytics on security, and Section 4 provides examples of Big Data usage in security contexts. Section 5 describes a platform for experimentation on anti-virus telemetry data.
Finally, Section 6 proposes a series of open questions about the role of Big Data in security analytics.

2.0 Big Data Analytics

Big Data analytics, the process of analyzing and mining Big Data, can produce operational and business knowledge at an unprecedented scale and specificity. The need to analyze and leverage trend data collected by businesses is one of the main drivers for Big Data analysis tools. The technological advances in storage, processing, and analysis of Big Data include (a) the rapidly decreasing cost of storage and CPU power in recent years; (b) the flexibility and cost-effectiveness of datacenters and cloud computing for elastic computation and storage; and (c) the development of new frameworks such as Hadoop, which allow users to take advantage of these distributed computing systems storing large quantities of data through flexible parallel processing. These advances have created several differences between traditional analytics and Big Data analytics (Figure 2).

Figure 2. Technical factors driving Big Data adoption

1. Storage cost has dramatically decreased in the last few years. Therefore, while traditional data warehouse operations retained data for a specific time interval, Big Data applications retain data indefinitely to understand long historical trends.

2. Big Data tools such as the Hadoop ecosystem and NoSQL databases provide the technology to increase the processing speed of complex queries and analytics.

3. Extract, Transform, and Load (ETL) in traditional data warehouses is rigid because users have to define schemas ahead of time. As a result, after a data warehouse has been deployed, incorporating a new schema might be difficult. With Big Data tools, users do not have to use predefined formats. They can load structured and unstructured data in a variety of formats and can choose how best to use the data.
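The schema-on-read approach in point 3 can be sketched in a few lines. The log lines and field names below are invented for illustration; the point is that parsing rules are applied at read time, so a new format can be accommodated without redefining a warehouse schema:

```python
import json
import re

# Raw events are stored as-is; no schema was declared at load time.
raw_events = [
    '{"src": "10.0.0.5", "action": "login", "user": "alice"}',   # JSON
    'Jan 12 03:14:07 fw1 DROP 10.0.0.9 -> 8.8.8.8:53',           # syslog-style text
]

def parse(line):
    """Apply a schema at read time: try JSON first, then a regex."""
    try:
        return json.loads(line)
    except ValueError:
        m = re.search(r'(\S+) (\S+) -> (\S+):(\d+)', line)
        if m:
            return {"action": m.group(1), "src": m.group(2),
                    "dst": m.group(3), "port": int(m.group(4))}
        return {"raw": line}  # keep unparsed data rather than reject it

parsed = [parse(line) for line in raw_events]
print([e["action"] for e in parsed])  # ['login', 'DROP']
```

A traditional ETL pipeline would have rejected the second line at load time; here it is retained and interpreted only when queried.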
Big Data technologies can be divided into two groups: batch processing, which refers to analytics on data at rest, and stream processing, which refers to analytics on data in motion (Figure 3). Real-time processing does not always need to reside in memory, and new interactive analyses of large-scale data sets through new technologies like Drill and Dremel provide new paradigms for data analysis; however, Figure 3 still represents the general trend of these technologies.

Figure 3. Batch and stream processing

Hadoop is one of the most popular technologies for batch processing. The Hadoop framework provides developers with the Hadoop Distributed File System for storing large files and the MapReduce programming model (Figure 4), which is tailored for frequently occurring large-scale data processing problems that can be distributed and parallelized.

Figure 4. Illustration of MapReduce

Several tools can help analysts create complex queries and run machine learning algorithms on top of Hadoop. These tools include Pig (a platform and a scripting language for complex queries), Hive (an SQL-friendly query language), and Mahout and RHadoop (data mining and machine learning algorithms for Hadoop). New frameworks such as Spark were designed to improve the efficiency of data mining and machine learning algorithms that repeatedly reuse a working set of data, thus improving the efficiency of advanced data analytics algorithms. There are also several databases designed specifically for efficient storage and query of Big Data, including Cassandra, CouchDB, Greenplum Database, HBase, MongoDB, and Vertica.

Stream processing does not have a single dominant technology like Hadoop, but is a growing area of research and development (Cugola & Margara, 2012).
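The MapReduce model of Figure 4 is simple enough to sketch on a single machine, far from Hadoop scale. The following illustrative example runs the map, shuffle, and reduce phases in sequence to count event types in log records; the records themselves are invented:

```python
from collections import defaultdict

# Toy input: one log record per line (invented for illustration).
records = ["login alice", "login bob", "drop 10.0.0.9", "login carol"]

def map_phase(record):
    """Emit (key, value) pairs; here: (event_type, 1)."""
    event_type = record.split()[0]
    yield (event_type, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate each key's values; here: sum the counts."""
    return (key, sum(values))

mapped = [pair for rec in records for pair in map_phase(rec)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(result)  # {'login': 3, 'drop': 1}
```

In Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle is performed by the framework; the program structure, however, is the same.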
One of the models for stream processing is Complex Event Processing (Luckham, 2002), which considers information flow as notifications of events (patterns) that need to be aggregated and combined to produce high-level events. Other implementations of stream technologies include InfoSphere Streams, Jubatus, and Storm.

2.1 Data Privacy and Governance

The preservation of privacy largely relies on technological limitations on the ability to extract, analyze, and correlate potentially sensitive data sets. However, advances in Big Data analytics provide tools to extract and utilize this data, making violations of privacy easier. As a result, along with developing Big Data tools, it is necessary to create safeguards to prevent abuse (Bryant, Katz, & Lazowska, 2008). In addition to privacy, data used for analytics may include regulated information or intellectual property. System architects must ensure that the data is protected and used only according to regulations.

The scope of this document is how Big Data can improve information security best practices. CSA is also committed to identifying best practices in Big Data privacy and to increasing awareness of the threat to private information. CSA has specific working groups on Big Data privacy and Data Governance, and we will be producing white papers in these areas with a more detailed analysis of privacy issues.

3.0 Big Data Analytics for Security

This section explains how Big Data is changing the analytics landscape. In particular, Big Data analytics can be leveraged to improve information security and situational awareness. For example, Big Data analytics can be employed to analyze financial transactions, log files, and network traffic to identify anomalies and suspicious activities, and to correlate multiple sources of information into a coherent view. Data-driven information security dates back to bank fraud detection and anomaly-based intrusion detection systems.
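Both fraud detection and anomaly-based intrusion detection are, at heart, stream-processing problems of the kind described in Section 2: low-level events arrive continuously and must be aggregated into higher-level alerts. A minimal sliding-window sketch of that aggregation, with an invented event stream and threshold:

```python
from collections import deque

def failed_login_alerts(events, window_seconds=60, threshold=3):
    """Emit an alert when one user accumulates `threshold` failed
    logins within `window_seconds` (a simple CEP-style aggregation)."""
    recent = {}   # user -> deque of failure timestamps
    alerts = []
    for ts, user, success in events:
        if success:
            continue
        q = recent.setdefault(user, deque())
        q.append(ts)
        while q and ts - q[0] > window_seconds:  # slide the window
            q.popleft()
        if len(q) >= threshold:
            alerts.append((ts, user))
    return alerts

# (timestamp, user, success) tuples, invented for the example
stream = [(0, "bob", False), (10, "bob", False), (20, "alice", True),
          (30, "bob", False), (200, "bob", False)]
print(failed_login_alerts(stream))  # [(30, 'bob')]
```

A production stream processor distributes this state across workers and handles out-of-order events, but the windowed-aggregation pattern is the same.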
Fraud detection is one of the most visible uses of Big Data analytics. Credit card companies have conducted fraud detection for decades. However, the custom-built infrastructure to mine Big Data for fraud detection was not economical to adapt for other fraud detection uses. Off-the-shelf Big Data tools and techniques are now bringing attention to analytics for fraud detection in healthcare, insurance, and other fields.

In the context of data analytics for intrusion detection, the following evolution is anticipated:

1st generation: Intrusion detection systems. Security architects realized the need for layered security (e.g., reactive security and breach response) because a system with 100% protective security is impossible.

2nd generation: Security information and event management (SIEM). Managing alerts from different intrusion detection sensors and rules was a big challenge in enterprise settings. SIEM systems aggregate and filter alarms from many sources and present actionable information to security analysts.

3rd generation: Big Data analytics in security (2nd generation SIEM). Big Data tools have the potential to provide a significant advance in actionable security intelligence by reducing the time for correlating, consolidating, and contextualizing diverse security event information, and also by correlating long-term historical data for forensic purposes.

Analyzing logs, network packets, and system events for forensics and intrusion detection has traditionally been a significant problem; however, traditional technologies fail to provide the tools to support long-term, large-scale analytics for several reasons:

1. Storing and retaining a large quantity of data was not economically feasible. As a result, most event logs and other recorded computer activity were deleted after a fixed retention period (e.g., 60 days).

2. Performing analytics and complex queries on large, structured data sets was inefficient because traditional tools did not leverage Big Data technologies.

3. Traditional tools were not designed to analyze and manage unstructured data. As a result, traditional tools had rigid, defined schemas. Big Data tools (e.g., Pig Latin scripts and regular expressions) can query data in flexible formats.

4. Big Data systems use cluster computing infrastructures. As a result, the systems are more reliable and available, and provide guarantees that queries on the systems are processed to completion.

New Big Data technologies, such as databases related to the Hadoop ecosystem and stream processing, are enabling the storage and analysis of large heterogeneous data sets at an unprecedented scale and speed. These technologies will transform security analytics by: (a) collecting data at a massive scale from many internal enterprise sources and from external sources such as vulnerability databases; (b) performing deeper analytics on the data; (c) providing a consolidated view of security-related information; and (d) achieving real-time analysis of streaming data. It is important to note that Big Data tools still require system architects and analysts to have a deep knowledge of their system in order to properly configure the Big Data analysis tools.

4.0 Examples

This section describes examples of Big Data analytics used for security purposes.

4.1 Network Security

In a recently published case study, Zions Bancorporation announced that it is using Hadoop clusters and business intelligence tools to parse more data more quickly than with traditional SIEM tools. In their experience, the quantity of data and the frequency of event analysis are too much for traditional SIEMs to handle alone. In their traditional systems, searching among a month's load of data could take between 20 minutes and an hour.
In their new Hadoop system running queries with Hive, they get the same results in about one minute. The security data warehouse driving this implementation enables users to mine meaningful security information not only from sources such as firewalls and security devices, but also from website traffic, business processes, and other day-to-day transactions. This incorporation of unstructured data and of multiple disparate data sets into a single analytical framework is one of the main promises of Big Data.

4.2 Enterprise Events Analytics

Enterprises routinely collect terabytes of security-relevant data (e.g., network events, software application events, and people action events) for several reasons, including the need for regulatory compliance and post-hoc forensic analysis. Unfortunately, this volume of data quickly becomes overwhelming. Enterprises can barely store the data, much less do anything useful with it. For example, it is estimated that an enterprise as large as HP currently (in 2013) generates 1 trillion events per day, or roughly 12 million events per second. These numbers will grow as enterprises enable event logging in more sources, hire more employees, deploy more devices, and run more software. Existing analytical techniques do not work well at this scale and typically produce so many false positives that their efficacy is undermined. The problem becomes worse as enterprises move to cloud architectures and collect much more data. As a result, the more data that is collected, the less actionable information is derived from the data.

The goal of a recent research effort at HP Labs is to move toward a scenario where more data leads to better analytics and more actionable information (Manadhata, Horne, & Rao, forthcoming). To do so, algorithms and systems must be designed and implemented in order to identify actionable security information from large enterprise data sets and drive false positive rates down to manageable levels.
In this scenario, the more data that is collected, the more value can be derived from the data. However, many challenges must be overcome to realize the true potential of Big Data analysis. Among these challenges are the legal, privacy, and technical issues regarding scalable data collection, transport, storage, analysis, and visualization.

Despite the challenges, the group at HP Labs has successfully addressed several Big Data analytics for security challenges, some of which are highlighted in this section. First, a large-scale graph inference approach was introduced to identify malware-infected hosts in an enterprise network and the malicious domains accessed by the enterprise's hosts. Specifically, a host-domain access graph was constructed from large enterprise event data sets by adding edges between every host in the enterprise and the domains visited by the host. The graph was then seeded with minimal ground truth information from a blacklist and a whitelist, and belief propagation was used to estimate the likelihood that a host or domain is malicious. Experiments on a 2 billion HTTP request data set collected at a large enterprise, a 1 billion DNS request data set collected at an ISP, and a 35 billion network intrusion detection system alert data set collected from over 900 enterprises worldwide showed that high true positive rates and low false positive rates can be achieved with minimal ground truth information (that is, with limited data labeled as normal events or attack events used to train anomaly detectors).

Second, terabytes of DNS events consisting of billions of DNS requests and responses collected at an ISP were analyzed. The goal was to use this rich source of DNS information to identify botnets, malicious domains, and other malicious activities in a network. Specifically, features that are indicative of maliciousness were identified.
For example, malicious fast-flux domains tend to last for a short time, whereas good domains such as hp.com last much longer and resolve to many geographically distributed IPs. A varied set of features was computed, including features derived from domain names, time stamps, and DNS response time-to-live values. Then, classification techniques (e.g., decision trees and support vector machines) were used to identify infected hosts and malicious domains. The analysis has already identified many malicious activities from the ISP data set.

4.3 Netflow Monitoring to Identify Botnets

This section summarizes the BotCloud research project, which leverages the MapReduce paradigm for analyzing enormous quantities of Netflow data to identify infected hosts participating in a botnet (François et al., 2011, November). The rationale for using MapReduce for this project stemmed from the large amount of Netflow data collected for data analysis: 720 million Netflow records (77 GB) were collected in only 23 hours. Processing this data with traditional tools is challenging. However, Big Data solutions like MapReduce greatly enhance analytics by enabling an easy-to-deploy distributed computing paradigm.

BotCloud relies on BotTrack, which examines host relationships using a combination of PageRank and clustering algorithms to track the command-and-control (C&C) channels in the botnet (François et al., 2011, May). Botnet detection is divided into the following steps: dependency graph creation, PageRank algorithm, and DBSCAN clustering. The dependency graph was constructed from Netflow records by representing each host (IP address) as a node. There is an edge from node A to node B if, and only if, there is at least one Netflow record having A as the source address and B as the destination address.
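That construction rule can be sketched directly. The flow records below are invented; BotCloud of course operates on parsed Netflow at a vastly larger scale:

```python
# Each record: (source_ip, destination_ip); invented sample data.
netflow = [("10.0.0.1", "10.0.0.2"),
           ("10.0.0.1", "10.0.0.2"),   # duplicate flows collapse to one edge
           ("10.0.0.2", "10.0.0.3")]

# Edge A -> B iff at least one record has A as source and B as destination.
graph = {}
for src, dst in set(netflow):
    graph.setdefault(src, set()).add(dst)

print(sorted(graph["10.0.0.1"]))  # ['10.0.0.2']
```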
PageRank will discover patterns in this graph (assuming that P2P communications between bots have similar characteristics, since the bots are involved in the same types of activities), and the clustering phase will then group together hosts having the same pattern. Since PageRank is the most resource-consuming part, it is the only step implemented in MapReduce. BotCloud used a small Hadoop cluster of 12 commodity nodes (11 slaves + 1 master): 6 Intel Core 2 Duo 2.13 GHz nodes with 4 GB of memory and 6 Intel Pentium 4 3 GHz nodes with 2 GB of memory. The dataset contained about 16 million hosts and 720 million Netflow records, which leads to a dependency graph of 57 million edges. The number of edges in the graph is the main parameter affecting the computational complexity. Since scores are propagated through the edges, the number of intermediate MapReduce key-value p
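The PageRank step itself can be sketched on a single machine; the project's actual implementation is in MapReduce, and the tiny graph, damping factor, and iteration count here are illustrative assumptions. Each iteration propagates every node's score along its outgoing edges, which is why the edge count dominates the cost:

```python
def pagerank(graph, nodes, damping=0.85, iterations=20):
    """Iterative PageRank over an adjacency dict {node: [successors]}."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, successors in graph.items():
            if successors:  # propagate this node's score along its edges
                share = damping * rank[node] / len(successors)
                for succ in successors:
                    new_rank[succ] += share
        rank = new_rank
    return rank

# Tiny invented dependency graph: edges point from source to destination IP.
nodes = ["A", "B", "C"]
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph, nodes)
print(max(ranks, key=ranks.get))  # "C", the node most hosts talk to
```

In the MapReduce version, the map phase emits one (successor, share) pair per edge and the reduce phase sums the shares per node, so the volume of intermediate key-value pairs grows with the number of edges.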