A Survey of Cryptographic Approaches to Securing Big-Data Analytics in the Cloud

Sophia Yakoubov, Vijay Gadepally, Nabil Schear, Emily Shen, Arkady Yerukhimovich
MIT Lincoln Laboratory, Lexington, MA
{sophia.yakoubov, vijayg, nabil, emily.shen,
Abstract: The growing demand for cloud computing motivates the need to study the security of data received, stored, processed, and transmitted by a cloud. In this paper, we present a framework for such a study. We introduce a cloud computing model that captures a rich class of big-data use-cases and allows reasoning about relevant threats and security goals. We then survey three cryptographic techniques (homomorphic encryption, verifiable computation, and multi-party computation) that can be used to achieve these goals. We describe the cryptographic techniques in the context of our cloud model and highlight the differences in performance cost associated with each.
I. INTRODUCTION
In today's data-centric world, big-data processing and analytics have become critical to most enterprise and government applications. Thus, there is a need for an appropriate big-data infrastructure that supports storage and processing on a massive scale. Cloud computing has become the tool of choice for big-data processing and analytics due to its reduced cost, broad network access, elasticity, resource pooling, and measured service [1]. Cloud computing enables consumers to store and analyze their data using shared computing resources while easily handling fluctuations in the volume and velocity of the data.
However, cloud computing comes with risks. The shared compute infrastructure introduces many security concerns not present in more traditional computing architectures. The cloud provider and tenants may be untrusted entities who try to tamper with data storage or computation. These concerns motivate the need for a novel framework for analyzing cloud computing security, as well as for the use of cryptographic tools to address cloud computing security goals.
In this paper, we propose a computation model of the cloud for big-data applications, and survey existing cryptographic tools using this model. Our primary contributions are:
- A general computation model that captures a large class of big-data use-cases in the cloud,
- A description of relevant security threats, and
- A survey of cryptographic tools that address these security threats, and their current performance overhead.
This work is sponsored by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract #FA. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
The remainder of this paper is organized as follows. In Section II, we present a model of big-data analytics in the cloud and introduce genomic sequencing as a motivating application. Section III describes the security goals and the threat model. Then, in Section IV, we describe several cryptographic techniques that can be used to address these security goals in various cloud deployments. Finally, in Section V, we conclude and describe some directions for future research in the area of secure cloud computation.
II. A MODEL FOR BIG-DATA ANALYTICS IN THE CLOUD
In this section, we present a general computational model of the cloud that allows reasoning about a wide variety of big-data analytics applications. In this model, we categorize cloud compute nodes by their roles in the big-data analytics pipeline. Extending the notation of Bogdanov et al. [2], we define the following types of nodes:
- I denotes an input node supplying raw data for the application. Input nodes include sensors capturing data and machines used to enter client data.
- C denotes a compute node whose role is to perform the computation for the application. Compute nodes include ingestion nodes, which refine the input data to get it ready for analysis, and enrichment nodes, which perform the actual analysis.
- S denotes a storage node whose role is to store data between computations. Storage nodes store both the original inputs and the computation outputs.
- R denotes a result node which receives the output of some computation, and either makes automated decisions based on that output or conveys the output to a client.
Additionally, $X^+$ (where $X \in \{I, C, S, R\}$) denotes a set of one or more (possibly communicating) nodes of type X. Figure 1 depicts a cloud architecture for big-data analytics using the above terminology.
Fig. 1: A cloud architecture for big-data analytics. Here I refers to an input node, C refers to a compute node, S refers to a storage node, R refers to a result node, and $X^+$ refers to one or more nodes of type X (where $X \in \{I, C, S, R\}$).
This model can be used to describe a wide variety of big-data applications. As an example, consider an application in which the cloud stores and correlates sensitive genomics data with the goal of identifying a specific biological sequence, as described by Kepner et al. [3]. In this scenario, the cloud stores and correlates billions of reference genomic sequences. An agency such as the National Institutes of Health (NIH) may be responsible for maintaining the reference database and inserting new reference datasets as they are characterized. Each dataset has metadata computed periodically by the cloud using a set of enrichment algorithms and user criteria. A scientist can then analyze a sample genomic sequence using the correlation between the sample and the cloud-stored reference sequences, performing computations such as those in the standard sequence alignment tool BLAST [4].
We can map this scenario onto the cloud model shown in Figure 1 in the following manner. An input node in this example is a genomic sequencing tool that obtains reference datasets. Note that there may be multiple input nodes corresponding to multiple sequencers.
These input nodes send the sequence data to a set of ingest nodes, which are compute nodes responsible for parsing the sequences and organizing them into files and/or databases. The ingest nodes then send the persistent files and databases to the storage nodes. The enrichment compute nodes periodically access the stored data to perform additional computations. The enrichment process is typically offline (often batch) processing on the data to update the associated metadata based on user needs. These enriched data sequences and correlations are stored again by the storage nodes. In this application, the result node would be a doctor or medical researcher who wishes to correlate a patient's genomic sequence with reference sequences. This data receiver sends a query to the cloud and receives the appropriate data from the storage nodes. This data additionally passes through a compute node acting as a guard, which checks that the data receiver has the proper authorization to see the data requested. The data receiver can also make requests to modify the enrichment process, based on their analytics needs.
In this application, the NIH may wish to guarantee the privacy of reference datasets containing sensitive information. Furthermore, the queries executed by doctors or researchers may also be sensitive.
As another example, consider an imagery analysis big-data system. Imagery analysis consists of collecting, processing, and analyzing large numbers of images collected from a variety of sensors (e.g., satellite, aerial, smartphone) to extract meaningful information. Unlike the genomic sequencing application, imagery analysis has a heterogeneous set of input nodes, which can include, among other things, distributed sensor networks and image capture devices from aircraft and other platforms. These sensors may be diverse not only in nature but also in geographic location. All of the data gathered by the sensors is ingested into a database and stored on the storage nodes.
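Both examples follow the same pipeline of node roles. The following is a minimal Python sketch of the I, C, S, R taxonomy and of the genomics data flows described above. It is an illustration only: the class names, node labels, and helper functions are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """The four node roles of the cloud model."""
    INPUT = "I"
    COMPUTE = "C"
    STORAGE = "S"
    RESULT = "R"


@dataclass(frozen=True)
class Node:
    name: str   # hypothetical label, for illustration only
    role: Role


def group_by_role(nodes):
    """Return the X+ sets: the names of all nodes of each role, keyed by role letter."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.role.value].append(node.name)
    return dict(groups)


def role_trace(flow):
    """Render a data flow as the sequence of role letters it visits."""
    return " -> ".join(node.role.value for node in flow)


# Hypothetical nodes for the genomics scenario.
sequencer_a = Node("sequencer-a", Role.INPUT)
sequencer_b = Node("sequencer-b", Role.INPUT)
ingest = Node("ingest", Role.COMPUTE)
store = Node("reference-db", Role.STORAGE)
enrich = Node("enrichment", Role.COMPUTE)
guard = Node("guard", Role.COMPUTE)
researcher = Node("researcher", Role.RESULT)

# The three flows described in the text.
ingest_flow = [sequencer_a, ingest, store]   # sequencer -> ingest -> storage
enrich_flow = [store, enrich, store]         # periodic offline enrichment
query_flow = [store, guard, researcher]      # guarded access by the data receiver
```

Calling `role_trace(query_flow)` yields `S -> C -> R`, which makes explicit that every query result passes through a compute node (the guard) before reaching a result node. Tagging each node with its role in this way is one possible starting point for reasoning about which protections each hop needs.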
Enrichment nodes are responsible for user- and mission-specific requests such as correlating imagery data by location, metadata, or description. Many images collected have specific security concerns. The enriched data can be accessed by field operatives and analysts, but only if they have the proper security level or need to know, as checked by the guard compute node. Using such a system, analysts may, for instance, look at imagery of remote areas to determine ground movements.
The taxonomy of cloud node types in this model allows us to reason about a wide variety of different cloud applications. For instance, the type and level of protection required may vary depending on which type of node is involved and on the needs of the application. In the next section, we describe a threat model for categorizing the types of threats present in the cloud.
III. DATA SECURITY IN THE CLOUD
Cloud computing introduces risks to any sensitive data it touches. These risks largely arise from the need to entrust data protection to a third-party cloud provider. Different nodes in the environment may be controlled or administered by different untrusted parties, and could be vulnerable to attacks from other cloud tenants, malicious insiders or external adversaries.
When data owners release control of their data to a cloud environment, they require guarantees that their data remains appropriately protected. Today, these guarantees are typically legal promises that the cloud provider makes to the owner as outlined in a service level agreement (SLA). Cryptography allows data owners to protect their data proactively instead of relying solely on legal agreements that are difficult to monitor or enforce.
To understand the protections offered by cryptography in the cloud, we consider security with respect to three traditional security goals:
- Confidentiality: All sensitive data (computation input, output, and any intermediate state) remains secret from any potentially adversarial or untrusted entities.
- Integrity: Any unauthorized modification of sensitive data is detectable. Furthermore, the outputs of any computation on sensitive data are correct (i.e., consistent with the input data).
- Availability: Data owners (e.g., the output recipients described in the previous section) are ensured access to their data and compute resources.
Since availability is typically addressed in today's cloud environments through non-cryptographic means, we focus only on the confidentiality and integrity of cloud computation and storage.
The measures necessary to achieve confidentiality and integrity depend heavily on how the cloud is deployed, who controls which parts of the cloud, and the trust that exists between these entities. We consider the following three scenarios:
a) Untrusted cloud: One scenario is when the data owners do not trust the cloud or any of the cloud nodes to maintain the confidentiality or integrity of data or computations outsourced to the cloud. Thus, client-side protections are necessary to ensure that confidentiality and integrity are maintained in the face of an adversarial cloud. This scenario will commonly correspond to the public cloud deployment model.
b) Trusted cloud: A second scenario, common in government use-cases, is when the cloud is deployed in an air-gapped environment completely isolated from any outside networks and adversaries. Clients can put their data in the cloud and have assurance that it will remain confidential against outside adversaries. However, even in the isolated environment, some nodes may be corrupted (e.g., due to malware or insiders). These corrupted nodes cannot exfiltrate any private data, but they could attempt to violate data and computation integrity. This scenario will commonly correspond to the private cloud deployment model.
c) Semi-trusted cloud: A third scenario that may be particularly relevant to real-world deployments of cloud resources is a semi-trusted cloud.
In this setting, we neither require the client to fully trust the cloud, nor do we assume that the entire cloud is untrusted. Instead, we assume that some parts of the cloud may be under the control of an adversary at any given time, but that a sufficient fraction of the resources will remain adversary-free. This scenario is consistent with a cloud provider who is trusted to attempt to maintain security, but not to succeed in guarding against every internal or external threat. This scenario may correspond to the hybrid, public, or private cloud deployment models.
To reason about security in the cloud setting, we follow a standard cryptographic approach to modeling the adversary. We model the threat as an adversary that can control cloud nodes of his or her choosing. There are two types of adversaries typically considered in the literature, defined by the capabilities of the parties they corrupt:
- A party corrupted by an honest-but-curious (HBC) adversary carries out all computations and protocols exactly as an honest party would. However, the adversary tries to learn additional information by combining the observations of its set of corrupted parties. In particular, an adversary corrupting multiple parties can learn information that no single party could learn individually.
- A party corrupted by a malicious adversary may deviate arbitrarily from the prescribed protocols (e.g., by sending malformed messages, actively colluding with other malicious parties, etc.) in an effort to violate the confidentiality or integrity of the data or computations.
The design goal for secure cloud computing is to preserve the confidentiality and integrity of data in the presence of such adversaries. The threat model and the chosen architecture dictate the appropriate solutions for a given situation.
IV. CRYPTOGRAPHIC TECHNIQUES
In this section, we survey three cryptographic techniques that are particularly applicable to achieving secure big-data analytics in the cloud: homomorphic encryption (HE), verifiable computation (VC), and secure multi-party computation (MPC). We note that many other cryptographic techniques can be used to help secure cloud computing, including functional encryption, identity-based encryption and attribute-based encryption. However, we focus on the techniques we believe are the most promising and relevant for securely delegating computation to a cloud.
The three cryptographic techniques each address one of the scenarios described in Section III: untrusted, trusted, and semi-trusted cloud. For each cryptographic technique, we describe how to use it in the genome sequencing example from Section II, as well as the functionality and security guarantees it provides. Specifically, we characterize each cryptographic technique in terms of the security properties it provides (confidentiality and/or integrity), the adversaries (honest-but-curious or malicious) it protects against, and whether it requires interaction between parties. Figure 2 summarizes the three techniques with respect to these properties. Figure 3 summarizes the approximate efficiency cost of each of the techniques across a wide range of computations, depicting the multiplicative performance overhead incurred over unsecured computation.
A. Homomorphic Encryption
Suppose that to lower operating costs, the NIH were to store and process its data on a public cloud, such as the Amazon Elastic Compute Cloud (EC2) [5]. The NIH does not own this cloud or trust the cloud to provide security, but wants to outsource its big-data analytics while maintaining the confidentiality of its data. This corresponds to the untrusted cloud model from Section III.
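The idea of a cloud computing on data it cannot read can be made concrete with a toy example. The sketch below is a minimal and deliberately insecure illustration, assuming textbook ElGamal over a prime field; ElGamal is the scheme this subsection cites as [6], under which multiplying two ciphertexts yields an encryption of the product of the plaintexts. The modulus `P`, the base `G`, and all function names are assumptions made for illustration and are not taken from the paper or from any library.

```python
import secrets

# Toy parameters, assumed for illustration only. A real deployment would use
# a vetted group and a vetted library, not hand-rolled textbook ElGamal.
P = 2**127 - 1   # a Mersenne prime used as the modulus
G = 3            # base element


def keygen():
    """Return (secret_key, public_key) with public_key = G^secret_key mod P."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def encrypt(pk, m):
    """Encrypt an integer m with 1 <= m < P under the public key pk."""
    r = secrets.randbelow(P - 2) + 1          # fresh randomness per ciphertext
    return pow(G, r, P), (m * pow(pk, r, P)) % P


def decrypt(sk, ciphertext):
    """Recover m by dividing out the shared mask c1^sk."""
    c1, c2 = ciphertext
    mask = pow(c1, sk, P)
    return (c2 * pow(mask, -1, P)) % P        # modular inverse needs Python 3.8+


def multiply(ct_a, ct_b):
    """Cloud-side step: combine two ciphertexts component-wise, with no key."""
    return (ct_a[0] * ct_b[0]) % P, (ct_a[1] * ct_b[1]) % P


# The data owner encrypts; the cloud multiplies ciphertexts; the owner decrypts.
sk, pk = keygen()
product_ct = multiply(encrypt(pk, 6), encrypt(pk, 7))
print(decrypt(sk, product_ct))  # 42
```

Note that `multiply` never touches `sk`, which is the point of the example: the party holding only ciphertexts can still carry out the computation. This scheme supports only multiplication, so it illustrates a restricted form of homomorphism rather than fully homomorphic encryption.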
One potential approach is for the NIH to encrypt its data and allow the cloud to perform computation over the encrypted data, without giving the cloud the decryption key. Figure 4a illustrates this approach.
Homomorphic encryption is a type of encryption that allows functions to be computed on encrypted data without decrypting it first. That is, given only the encryption of a message, one can obtain an encryption of a function of that message by computing directly on the encryption. More formally, let $E_k(m)$ be an encryption of a message $m$ under a key $k$. An encryption scheme is homomorphic with respect to a function $f$ if there is a corresponding function $f'$ such that $D_k(f'(E_k(m))) = f(m)$, where $D_k$ is the decryption algorithm under key $k$.
A fully homomorphic encryption (FHE) scheme enables arbitrary functions to be computed over encrypted data. FHE is considered the holy grail of confidential outsourced computation because it allows any computation to be performed over multiple encryptions without decryption. Moreover, a single entity in possession of the encryption (such as a cloud node) can perform this computation, without the need for interaction with the data owner or other entities. Alternatively, somewhat homomorphic encryption (e.g., ElGamal encryption [6], under which the multiplication of two ciphertexts produces an encryption of the product of the plaintexts) supports restricted classes of functions.
| Cryptographic technique | Adversary type | Confidentiality | Integrity | Requires interaction |
| Homomorphic Encryption (HE) | Malicious | Y | N | N |
| Verifiable Computation (VC) | Malicious | N | Y | N |
| HE + VC | Malicious | Y | Y | Y |
| Multi-Party Computation (MPC) | Honest-but-curious or malicious | Y | Y | Y |
Fig. 2: Comparison of cryptographic approaches showing the type of adversary the approach handles, the security guarantees provided, and whether computation requires interaction between parties. We include the combination of homomorphic encryption (HE) and verifiable computation (VC) separately to highlight the fact that these two techniques can be combined to offer both confidentiality and integrity.
Gentry introduced the first fully homomorphic encryption scheme in 2009 [7]. This was a revolutionary cryptographic achievement, but the scheme was far too inefficient for any practical use. Since 2009, several works (e.g., [8]-[12]) have improved Gentry's technique, significantly reducing the running time. However, fully homomorphic encryption remains prohibitively slow for most use-cases. For example, HElib, a library developed by IBM that provides a state-of-the-art implementation of homomorphic encryption, currently performs a matrix-vector multiplication for a 256-entry integer vector in approximately 26 seconds [13], [14].
Fig. 3: A graphical depiction of the multiplicative performance overheads over unsecured computation incurred by homomorphic encryption (HE), verifiable computation (VC), and multi-party computation (MPC).
In addition to its inefficiency, homomorphic encryption has other limitations. For instance, homomorphic encryption requires that all sensors and the eventual recipients of the results share a key to encrypt the inputs and decrypt the results, which may be difficult to arrange if they belong to different organizations. Also, homomorphic encryption does not allow for computation on data encrypted using different keys (without incurring additional significant overhead), thus making it impossible for sensors to allow different access to data they contribute to the computation. Some of these limitations can be addressed by other tools such as attribute-based encryption [15] and functional encryption [16]; however, this discussion is beyond the scope of this paper.
Note that homomorphic encryption only guarantees data confidentiality, not integrity. However, it can be combined with verifiable computation (described in Section IV-B) to provide both guarantees. The combination of homomorphic encryption and verifiable computation enables secure computation even on a completely untrusted cloud.
B. Verifiable Computation
Suppose the NIH is willing to expend a significant amount of hardware resources to securely deploy its own private cloud. Specifically, it is willing to set its cloud up in a secure enclave, isolating it completely from the outside world through the use of an air-gap, preventing any data from leaving the cloud. Such a setup would automatically guarantee the confidentiality of data without the need for addi