Abstract of "Sampling-based Randomized Algorithms for Big Data Analytics" by Matteo Riondato, Ph.D., Brown University, May 2014.

Analyzing huge datasets becomes prohibitively slow when the dataset does not fit in main memory. Approximations of the results with guaranteed high quality are sufficient for most applications and can be obtained very fast by analyzing a small random part of the data that fits in memory. We study the use of Vapnik-Chervonenkis (VC) dimension theory to analyze the trade-off between the sample size and the quality of the approximation for fundamental problems in knowledge discovery (frequent itemsets), graph analysis (betweenness centrality), and database management (query selectivity).

We show that the sample size needed to compute a high-quality approximation of the collection of frequent itemsets depends only on the VC-dimension of the problem, which is (tightly) bounded from above by an easy-to-compute characteristic quantity of the dataset. This bound leads to a fast algorithm for mining frequent itemsets, which we also adapt to the MapReduce framework for parallel/distributed computation. We exploit similar ideas to avoid the inclusion of false positives in mining results.

The betweenness centrality index of a vertex in a network measures the relative importance of that vertex by counting the fraction of shortest paths going through it. We show that it is possible to compute a high-quality approximation of the betweenness of all the vertices by sampling shortest paths at random. The sample size depends on the VC-dimension of the problem, which is upper bounded by the logarithm of the maximum number of vertices in a shortest path. The bound is tight, and it collapses to a constant when there is a unique shortest path between any two vertices.

The selectivity of a database query is the ratio between the size of its output and the product of the sizes of its input tables. Database management systems estimate the selectivity of queries for scheduling and optimization purposes. We show that it is possible to bound the VC-dimension of queries in terms of their SQL expressions, and to use this bound to compute a sample of the database that allows a much more accurate estimation of the selectivity than is possible using histograms.

Sampling-based Randomized Algorithms for Big Data Analytics
by Matteo Riondato
Laurea, Università di Padova, Italy, 2007
Laurea Specialistica, Università di Padova, Italy, 2009
Sc.M., Brown University, 2010

A dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in the Department of Computer Science at Brown University. Providence, Rhode Island, May 2014. Copyright 2014 by Matteo Riondato.

This dissertation by Matteo Riondato is accepted in its present form by the Department of Computer Science as satisfying the dissertation requirement for the degree of Doctor of Philosophy. Eli Upfal, Director. Recommended to the Graduate Council: Uğur Çetintemel, Reader; Basilis Gidas, Reader (Applied Mathematics). Approved by the Graduate Council: Peter M. Weber, Dean of the Graduate School.

Vita

Matteo Riondato was born in Padua, Italy, in January. He attended Università degli Studi di Padova, where he obtained a Laurea (Sc.B.) in Information Engineering with an honors thesis on Algorithmic Aspects of Cryptography with Prof. Andrea Pietracaprina. He then obtained a Laurea Specialistica (Sc.M.) cum laude in Computer Engineering with a thesis on Top-K Frequent Itemsets Mining through Sampling, advised by Prof.
Andrea Pietracaprina, Prof. Eli Upfal, and Dr. Fabio Vandin. He spent the last year of the master's as a visiting student at the Department of Computer Science at Brown University. He joined the Ph.D. program in computer science at Brown in Fall 2009, with Prof. Eli Upfal as advisor. He was a member of both the theory group and the data management group. He spent summer 2010 as a visiting student at Chalmers University, Gothenburg, Sweden; summer 2011 as a research fellow at the Department of Information Engineering at the Università degli Studi di Padova, Padua, Italy; summer 2012 as a visiting student at Sapienza University, Rome, Italy; and summer 2013 as a research intern at Yahoo Research Barcelona. While at Brown, he was the teaching assistant for CSCI 1550 (Probabilistic Methods in Computer Science / Probability and Computing) four times. He was also President of the Graduate Student Council (April 2011 to December 2012), the official organization of graduate students at Brown, a student representative on the Graduate Council, and a student representative on the Presidential Strategic Planning Committee for Doctoral Education.

Acknowledgements

"And I applied my heart to know wisdom, and to know madness and folly: I perceived that this also was a striving after wind. For in much wisdom is much grief; and he that increaseth knowledge increaseth sorrow." (Ecclesiastes 1, American Standard Version)

This dissertation is the work of many people. I am deeply indebted to my advisor Eli Upfal. We met in Padova in October 2007, and after two weeks I asked him whether I could come to the States to do my master's thesis. To my surprise, he said yes. Since then he has helped me in every possible way, even when I seemed to run away from his help (and perhaps I foolishly was). This dissertation started when he casually suggested VC-dimension as a tool to overcome some limitations of the Chernoff plus union bound technique. After that, Eli was patient enough to let me find out what it was, internalize it, and start using it. He was patient when I kept using it despite his suggestions to move on (and I am not done with it yet). He was patient when it became clear that I no longer knew what I wanted to do after grad school. His office door was always open for me and I abused his time and patience. He has been my mentor and I will always be proud of having been his pupil. I hope he will be, if not proud, at least not dissatisfied with what I am and will become, no matter where life leads me.

Uğur Çetintemel and Basilis Gidas were on my thesis committee and were incredibly valuable in teaching me about databases and statistics (respectively), and above all in encouraging me and giving me advice when I had doubts about my future prospects.

Fabio Vandin has been, in my opinion, my older brother in academia. We came to Providence almost together from Padua and we are leaving it almost together. Since I first met him, he was the example that I wanted to follow. If Fabio was my brother, Olya Ohrimenko was my sister at Brown, with all the upsides and downsides of being the older sister. I am grateful to her.

Andrea Pietracaprina and Geppino Pucci were my mentors in Padua and beyond.
I would be ungrateful if I forgot to thank them for all the help, the support, the encouragement, the advice, and the frank conversations that we had on both sides of the ocean.

Aris Anagnostopoulos and Luca Becchetti hosted me in Rome when I visited Sapienza University in summer 2012. They taught me more than they probably realize about doing research. Francesco Bonchi was my mentor at Yahoo Labs in Barcelona in summer 2013. He was much more than a boss. He was a colleague in research, a fierce foosball opponent, a companion for cervezitas on the Rambla del Poblenou. He shattered many gangrenous certainties that I had created for myself, and showed me what it means to be a great researcher and manager of people.

The Italian gang at Brown was always supportive and encouraging. They provided moments of fun, food, and festivity: Michela Ronzani, Marco Donato, Bernardo Palazzi, Cristina Serverius, Nicole Gerke, Erica Moretti, and many others. I will cherish the moments I spent with them. Every time I went back to Italy my friends made it feel as if I had only left for a day: Giacomo, Mario, Fabrizio, Nicolò, the various Giovannis, Valeria, Iccio, Matteo, Simone, Elena, and all the others.

Mackenzie has been amazing in the last year and I believe she will be amazing for years to come. She supported me whenever I had a doubt or changed my mind, which happened often. Whenever I was stressed she made me laugh. Whenever I felt lost, she made me realize that I was following the path. I hope I gave her something, because she gave me a lot.

Mamma, Papà, and Giulia have been closest to me despite the oceanic distance. They were the lighthouse and the safe harbour in the nights of doubt, the solid rock when I was falling, and the warm fire when I was cold. I will come back, one day.

This dissertation is dedicated to my grandparents Luciana, Maria, Ezio, and Guido. I wish one day I could be half of what they are and have been.

"Never tell me the odds." (Han Solo)
Contents

1 Introduction
   Motivations
   Overview of contributions
2 The Vapnik-Chervonenkis Dimension
   Use of VC-Dimension in computer science
   Range spaces, VC-dimension, and sampling
   Computational considerations
3 Mining Association Rules and Frequent Itemsets
   Related work
   Preliminaries
   The dataset's range space and its VC-dimension
   Computing the d-index of a dataset
   Speeding up the VC-dimension approximation task
   Connection with monotone monomials
   Mining (top-k) Frequent Itemsets and Association Rules
   Mining Frequent Itemsets
   Mining top-k Frequent Itemsets
   Mining Association Rules
   Other interestingness measures
   Closed Frequent Itemsets
   Discussion
   Experimental evaluation
   Conclusions
4 PARMA: Mining Frequent Itemsets and Association Rules in MapReduce
   Previous work
   Preliminaries
   MapReduce
   Algorithm Design
   Description
   Analysis
   Top-K Frequent Itemsets and Association Rules
   Implementation
   Experimental evaluation
   Performance analysis
   Speedup and scalability
   Accuracy
   Conclusions
5 Finding the True Frequent Itemsets
   Previous work
   Preliminaries
   The range space of a collection of itemsets
   Computing the VC-Dimension
   Finding the True Frequent Itemsets
   Experimental evaluation
   Conclusions
6 Approximating Betweenness Centrality
   Related work
   Graphs and betweenness centrality
   A range space of shortest paths
   Tightness
   Algorithms
   Approximation for all the vertices
   High-quality approximation of the top-k betweenness vertices
   Discussion
   Variants of betweenness centrality
   k-bounded-distance betweenness
   α-weighted betweenness
   k-path betweenness
   Edge betweenness
   Experimental evaluation
   Accuracy
   Runtime
   Scalability
   Conclusions
7 Estimating the Selectivity of SQL Queries
   Related work
   Database queries and selectivity
   The VC-dimension of classes of queries
   Select queries
   Join queries
   Generic queries
   Estimating query selectivity
   The general scheme
   Building and using the sample representation
   Experimental evaluation
   Selectivity estimation with histograms
   Setup
   Results
   Conclusions
8 Conclusions and Future Directions

Parts of this dissertation appeared in proceedings of conferences or in journals. In particular, Chapter 3 is an extended version of [176], Chapter 4 is an extended version of [179], Chapter 5 is an extended version of [177], Chapter 6 is an extended version of [175], and Chapter 7 is an extended version of [178].

List of Tables

3.1 Comparison of sample sizes with previous works.
Values for maximum transaction length and d-bound q for real datasets.
Parameters used to generate the datasets for the runtime comparison between DistCount and PARMA in Figure.
Parameters used to generate the datasets for the runtime comparison between PFP and PARMA in Figure.
Acceptable False Positives in the output of PARMA.
Fractions of times that FI(D, I, θ) contained false positives and missed TFIs (false negatives) over 20 datasets from the same ground truth.
Recall. Average fraction (over 20 runs) of reported TFIs in the output of an algorithm using the Chernoff and union bounds and of the one presented in this work. For each algorithm we present two versions, one (Vanilla) which uses no information about the generative process, and one (Add. Info) in which we assume knowledge that the process will not generate any transaction longer than twice the size of the longest transaction in the original FIMI dataset.
In bold, the best result (highest reported fraction).
Example of database tables.
VC-Dimension bounds and sample sizes, as number of tuples, for the combinations of parameters we used in our experiments.

List of Figures

2.1 Example of range space and VC-dimension. The space of points is the plane R^2 and the set R of ranges is the set of all axis-aligned rectangles. The figure on the left shows graphically that it is possible to shatter a set of four points using 16 rectangles. On the right, instead, one can see that it is impossible to shatter five points, as, for any choice of the five points, there will always be one (the red point in the figure) that is internal to the convex hull of the other four, so it would be impossible to find an axis-aligned rectangle containing the four points but not the internal one. Hence VC((R^2, R)) = 4.
Absolute / Relative ε-close Approximation to FI(D, I, θ).
Relative ε-close approximation to AR(D, I, θ, γ), artificial dataset, v = 33, θ = 0.01, γ = 0.5, ε = 0.05, δ =
Runtime Comparison. The sample line includes the sampling time. Relative approximation to FIs, artificial dataset, v = 33, ε = 0.05, δ =
Comparison of sample sizes for relative ε-close approximations to FI(D, I, θ), v = 50, δ =
A system overview of PARMA. Ellipses represent data, squares represent computations on that data, and arrows show the movement of data through the system.
A runtime comparison of PARMA with DistCount and PFP.
A comparison of runtimes of the map/reduce/shuffle phases of PARMA, as a function of the number of transactions. Run on an 8-node Elastic MapReduce cluster.
4.4 A comparison of runtimes of the map/reduce/shuffle phases of PARMA, as a function of minimum frequency. Clustered by stage. Run on an 8-node Elastic MapReduce cluster.
Speedup and scalability.
Error in frequency estimations as frequency varies.
Width of the confidence intervals as frequency varies.
Example of betweenness values.
Directed graph G = (V, E) with |S_uv| ≤ 1 for all pairs (u, v) ∈ V × V and such that it is possible to shatter a set of four paths.
Graphs for which the bound is tight.
Graph properties and running time ratios.
Betweenness estimation error |b(v) − b̃(v)| evaluation for directed and undirected graphs.
Running time (seconds) comparison between VC, BP, and the exact algorithm.
Scalability on random Barabási-Albert [14] graphs.
Selectivity prediction error for selection queries on a table with uniform independent columns. Two columns (m = 2), five Boolean clauses (b = 5).
Selectivity prediction error for selection queries on a table with correlated columns. Two columns (m = 2), eight Boolean clauses (b = 8).
Selectivity prediction error for join queries. One column (m = 1), one Boolean clause (b = 1).

Chapter 1

Introduction

"A journey of a thousand miles begins with a single step." (Laozi, Tao Te Ching, Ch. 64)

Thesis statement: Analyzing very large datasets is an expensive computational task. We show that it is possible to obtain high-quality approximations of the results of many analytics tasks by processing only a small random sample of the data. We use the Vapnik-Chervonenkis (VC) dimension to derive the sample size needed to guarantee that the approximation is sufficiently accurate with high probability. This results in very fast algorithms for mining frequent itemsets and association rules, avoiding the inclusion of false positives in mining, computing the betweenness centrality of vertices in a graph, and estimating the selectivity of database queries.
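To make the sample-size claim in the thesis statement concrete, the bounds derived in the later chapters follow the pattern of the classical ε-approximation theorem from VC-dimension theory. The display below is only a generic sketch, with c denoting an unspecified universal constant; the exact constants and variants (absolute vs. relative approximations) used in each chapter differ.

\[
|S| \;\ge\; \frac{c}{\varepsilon^{2}}\left(d + \ln\frac{1}{\delta}\right)
\quad\Longrightarrow\quad
\Pr\!\left[\,\left|\frac{|S\cap R|}{|S|}-\frac{|D\cap R|}{|D|}\right|\le\varepsilon
\ \text{ for every range } R\,\right]\;\ge\;1-\delta,
\]

where (D, R) is a range space of VC-dimension at most d and S is a sample of D drawn uniformly and independently at random.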
1.1 Motivations

Advancements in retrieval and storage technologies have led to the collection of large amounts of data about many aspects of natural and artificial phenomena. Analyzing these datasets to find useful information is a daunting task that requires multiple iterations of data cleaning, modeling, and processing. The cost of the analysis can often be split into two parts. One component of the cost is intrinsic to the complexity of extracting the desired information (e.g., it is proportional to the number of patterns to find, or to the number of iterations needed). The second component depends on the size of the dataset to be examined. Smart algorithms can help reduce the first component, while the second seems to grow ever larger as more and more data become available. Indeed, the growth of the dataset may cancel out any improvement in the intrinsic cost. Since many datasets are too large to fit into the main memory of a single machine, analyzing them would require frequent access to disk or even to the network, resulting in excessively long processing times. The use of random sampling is a natural solution to this issue: by analyzing only a small random sample that fits into main memory, there is no need to access the disk or the network. This comes at the inevitable price of only being able to extract an approximation of the results, but the trade-off between the size of the sample and the quality of the approximation can be studied using techniques from probability theory.

The intuition behind the use of random sampling also allows us to look at a different important issue in data analytics: the statistical validation of the results. Algorithms for analytics often make the implicit assumption that the available dataset constitutes the entire reality and contains, in some sense, a perfect representation of the phenomena under study. This is usually not the case: not only do datasets contain errors due to inherently stochastic collection procedures, but they also naturally contain only a limited amount of information about the phenomena. A dataset should instead be seen as a collection of random samples from an unknown data generation process, whose study corresponds to the study of the phenomena. Seen this way, it becomes clear that the results of analyzing the dataset are approximate, not perfect. This leads to the need to validate the results, to discriminate between real information and spurious information that appears interesting only because of the stochastic nature of the data generation process. While the need for statistical validation has long been acknowledged in the life and physical sciences, and is a central matter of study in statistics, computer science researchers have only recently started to pay attention to it and to develop methods to either validate the results of previously available algorithms or to avoid the inclusion of spurious information in the results.

We study the use of sampling in the following data analytics tasks.

Frequent itemsets and association rules mining is one of the core tasks of data analytics. It requires finding sets of items that appear in more than a given fraction of the transactions of a transactional dataset, and using this collection to build high-confidence inference rules between sets of items.
Betweenness centrality is a measure of the importance of a vertex in a graph, corresponding to the fraction of shortest paths going through that vertex. By ranking vertices according to their betweenness centrality, one can find which ones are most important, for example for marketing purposes.

The selectivity of a database query is the ratio between the size of its output and the product of the sizes of its input tables. Database management systems estimate the selectivity of queries to make informed scheduling decisions.

We choose to focus on important tasks from different areas of data analytics, rather than on a single task or on different tasks from the same area. This is motivated by our goal of show…
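As a rough illustration of the sampling idea for the first of these tasks, the minimal Python sketch below estimates itemset frequencies on a uniform random sample of transactions. The dataset, candidate itemsets, and sample size are made-up toy values chosen only for illustration; this is not the algorithm developed in Chapter 3, where the sample size is derived from a VC-dimension bound (the d-index of the dataset) rather than fixed by hand.

import random
from itertools import combinations

def estimate_frequencies(transactions, itemsets, sample_size, seed=0):
    """Estimate each itemset's frequency on a uniform random sample of
    transactions (drawn with replacement) instead of on the full dataset."""
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return {s: sum(1 for t in sample if s <= t) / sample_size for s in itemsets}

# Toy transactional dataset over four items; candidate itemsets are all pairs.
transactions = [frozenset(t) for t in (
    {"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c", "d"},
    {"a", "b", "d"}, {"b", "d"}, {"a", "b", "c", "d"}, {"c", "d"})]
pairs = [frozenset(p) for p in combinations("abcd", 2)]
print(estimate_frequencies(transactions, pairs, sample_size=6))

On a dataset that does not fit in memory, the point is that only the sampled transactions need to be examined; once the sample size is chosen as in the bound sketched earlier, the estimated frequencies are within ε of the true ones, simultaneously for all itemsets, with probability at least 1 − δ.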