DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring

Arian Bär, Pedro Casas
FTW - Telecommunications Research Center Vienna

Lukasz Golab
University of Waterloo

Alessandro Finamore
Politecnico di Torino

Abstract

Network traffic monitoring systems generate high volumes of heterogeneous data streams which have to be processed and analyzed with different time constraints for daily network management operations. Some monitoring applications such as anomaly detection, performance tracking and alerting require fast processing of specific incoming real-time data. Other applications like fault diagnosis and trend analysis need to process historical data and perform deep analysis on generally heterogeneous sources of data. The Data Stream Warehousing (DSW) paradigm provides the means to handle both types of monitoring applications within a single system, providing fast and rich data analysis capabilities as well as data persistence. In this paper, we introduce DBStream, a novel online traffic monitoring system based on the DSW paradigm, which allows fast and flexible analysis across multiple heterogeneous data sources. DBStream provides a novel stream processing language for implementing data processing modules, as well as aggregation, filtering, and storage capabilities for further data analysis. We show multiple traffic monitoring applications running on DBStream, processing real traffic from operational ISPs.

Keywords

DBStream; Data Stream Warehousing; Network Traffic Monitoring and Analysis

I. INTRODUCTION

The complexity of large-scale, Internet-like networks is constantly increasing. With more and more services being offered on the Internet, the massive adoption of Content Delivery Networks (CDNs) for traffic hosting and delivery, and the continuous growth of bandwidth-hungry video-streaming services, network and server infrastructures are becoming extremely difficult to understand and to track.
Network Traffic Monitoring and Analysis (NTMA) has taken on an important role in understanding how such networks function, especially in gaining broader and clearer visibility of unexpected events. The evolution of the Internet calls for better and more flexible measurement and monitoring systems to pinpoint problems and optimize service quality. A variety of methodologies and tools have been devised by the research community to passively monitor network links. Technologies such as NetFlow and more advanced monitoring solutions [1] [3] enable the monitoring of high-speed links using off-the-shelf hardware. Similarly, several solutions are available for active measurements, from simple command-line tools such as the standard ping to more advanced frameworks for topology discovery such as TopHat [4].

All these tools are stand-alone solutions, capable of extracting large amounts of detailed information from live networks. What is sorely missing is a flexible system able to store and process such rich and heterogeneous sources of network monitoring data in order to understand the complicated dynamics of today's Internet. Such a system should be capable of handling different types of NTMA applications, from real-time or near real-time data processing applications such as service performance tracking and anomaly detection and alerting, to more complex big data analysis tasks involving the processing of large amounts of stored historical data.

The Data Stream Warehousing (DSW) paradigm [9] provides the means to handle both types of monitoring applications within a single system, combining the real-time data processing of data stream management systems with the deep analytics over long historical data of traditional warehouses. In this paper we introduce DBStream, a flexible and scalable DSW system tailored to NTMA applications.

/14/$31. © 2014 IEEE
DBStream is a repository system capable of ingesting data streams coming from a wide variety of sources (e.g., passive network traffic data, active measurements, router logs and alerts) and performing complex continuous analysis, aggregation and filtering jobs on them. DBStream can store tens of terabytes of heterogeneous data, and allows both real-time queries on recent data and deep analysis of historical data. Figure 1 shows a standard deployment of DBStream as part of a generic network monitoring architecture for current Internet-like networks, consisting of both passive and active probes, as well as external data sources provided by other data repositories.

One of the main assets of DBStream is the flexibility it provides to rapidly implement new NTMA applications, through the use of a novel stream processing language tailored to continuous network analytics. Advanced analytics can be programmed to run in parallel and continuously over time, using just a few lines of code. The near real-time data analysis is performed through the online processing of batches of data of configurable time length (e.g., batches of one minute of passive traffic measurements), which are then combined with historical collections to keep a persistent collection of the output. Moreover, the processed data can then be easily integrated into visualization tools (e.g., web portals).

To exemplify the kind of NTMA applications which can run on top of DBStream, we present in this paper four different traffic processing applications, considering the incoming data from passive probes installed both at the core of a mobile network and at the edge of a fixed-line ADSL/FTTH network of major European ISPs.
In addition, we evaluate the performance of DBStream in processing real traffic measurements, and compare it both with standard PostgreSQL data repositories and with MapReduce-based frameworks.

Fig. 1. A standard deployment of DBStream in an ISP network (A: active probe, P: passive probe, IXP: Internet exchange Point). DBStream is a data repository capable of processing data streams coming from a wide variety of sources.

The remainder of this paper is organized as follows: Related work is briefly discussed in Sec. II. Sec. III introduces DBStream, describing its design and implementation. System performance and scalability considerations are discussed in Sec. IV. Sec. V shows four different NTMA applications running on DBStream, processing traffic measurements from two different operational networks. Finally, Sec. VI concludes the paper.

II. RELATED WORK

Several technologies are available to potentially implement a data repository system like DBStream, which we can coarsely divide into SQL and NoSQL systems [5]. The former class includes Database Management Systems (DBMSs), which are known to offer excellent performance when accessing the data, but suffer when new data have to be inserted continuously. The selling point of the latter class is great horizontal scalability, but it offers no guarantee on the response time. NoSQL systems include MapReduce [1] systems, supporting a simpler key-value interface rather than the relational/SQL model used by DBMSs. Hadoop [11] and Hive [12] are two popular MapReduce technologies. MapReduce systems are based on batch processing rather than on stream processing, which is specifically required in NTMA applications.

There has been a great deal of effort to improve traditional DBMSs in the last several years. Many data processing and storage systems have been developed to improve both performance and scalability.
Still, a major limitation of such systems is the inability to cope with continuous analytics. Some new solutions have been proposed, including Data Stream Management Systems (DSMSs) and Data Stream Warehouses (DSWs). DSMSs enable continuous processing of data over time; examples include Gigascope [6] and Borealis [7]. These systems consist of in-memory operations with no persistent data storage, which is a critical limitation for traffic analysis purposes. DSWs extend DBMSs with the ability to ingest new data in nearly real-time. DataCell [8] and DataDepot [9] are two examples, as well as the DBStream system presented in this paper. Finally, hybrid systems composed of a mix of SQL and NoSQL technologies have been proposed, for example HadoopDB [13].

None of these systems were designed to address continuous network monitoring applications. The only exception is DataDepot, which is a closed-source system based on proprietary technologies. Furthermore, to the best of our knowledge, only DBStream supports incremental queries defined through a declarative language such as SQL, which are particularly useful for tracking the status of a network.

III. SYSTEM OVERVIEW

DBStream is a novel continuous analytics system. Its main purpose is to process and combine data from multiple sources as they are produced, create aggregations, and store query results for further processing by external analysis or visualization modules. The system targets continuous network monitoring but it is not limited to this context. For instance, smart grids, intelligent transportation systems, or any other use case that requires continuous processing of large amounts of data over time can take advantage of DBStream.

DBStream combines on-the-fly data processing of DSMSs with the storage and analytic capabilities of DBMSs and typical big data analysis systems such as Hadoop. In contrast to DSMSs, data are stored persistently and are directly available for later visualization or further processing.
As opposed to traditional data analytics systems, which typically import and transform data in large batches (e.g., days or weeks), DBStream imports and processes data in small batches (e.g., on the order of minutes). Therefore, DBStream resembles a DSMS in the sense that data can be processed quickly, but streams can also be re-played from past data; the only limitation is the size of available storage. DBStream thus supports a native concept of time. At the same time, DBStream provides a flexible interface for data loading and processing, based on the declarative SQL language used by all relational DBMSs.

Two salient features of DBStream are the following. First, it supports incremental queries defined through a declarative interface based on the SQL query language. Incremental queries are those which update their results by combining newly arrived data with previously generated results rather than being re-computed from scratch. This enables continuous time-series based data analysis, which is a strong requirement for real-time NTMA applications such as anomaly detection. Second, in contrast to many database system extensions, DBStream does not change the query processing engine. Instead, queries over data streams are evaluated as repeated invocations of a process that consumes a batch of newly arrived data and combines it with the previous result to compute the new result. Therefore, DBStream is able to reuse the full functionality of the underlying DBMS, including its query processing engine and query optimizer.

DBStream is built on top of a SQL DBMS back-end. We use the PostgreSQL database in our implementation, but the DBStream concept can easily be used with other databases and it is not dependent on any specific features of PostgreSQL.

A. System Architecture

In DBStream, base tables store the raw data imported into the system, and materialized views (or views for short) store the results of queries such as aggregates and other analytics, which may then be accessed by ad hoc queries and applications in the same way as base tables. Base tables and materialized views are stored in a time-partitioned format inside the PostgreSQL database, which we refer to as Continuous Tables (CTs). Time partitioning makes it possible to insert new data without modifying the entire table; instead, only the newest partition is modified, leading to a significant performance increase.

A job defines how data are processed in DBStream, having one or more CTs as input, a single CT as output and an SQL query defining the processing task. An example job could be: count the distinct destination IPs in the last 1 minutes. This job would be executed whenever 1 new minutes of data have been added to the input table (independently of the wall clock time), and its result would be stored in the corresponding CT.

Fig. 2. General overview of the DBStream architecture. DBStream combines on-the-fly data processing of DSMSs with the storage and analytic capabilities of DBMSs and big data analysis systems such as Hadoop.

Figure 2 gives a high-level overview of the DBStream architecture. DBStream consists of a set of modules running as separate operating system processes. The Scheduler defines the order in which jobs are executed, and besides avoiding resource contention, it ensures that data batches are processed in chronological order for any given table or view. Import modules may pre-process the raw data if necessary, and signal the availability of new data to the Scheduler. The Scheduler then runs jobs that update the base tables with newly arrived data and create indices, followed by incrementally updating the materialized views. Each view update is done by running an SQL query that retrieves the previous state of the view and modifies it to account for newly arrived data; new results are then inserted into a new partition of the view, and indices are created for this partition. View Generation modules register jobs at the Scheduler.
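The interplay between time-partitioned tables and data-driven job execution can be illustrated with a small sketch. It is not DBStream code: the class `ContinuousTable`, the function `run_distinct_dst_job`, the 60-second window and the rule that a window counts as complete once a row from a later window has arrived are all assumptions made for this illustration.

```python
class ContinuousTable:
    """Toy stand-in for a time-partitioned table: one row list per window."""

    def __init__(self, window_s):
        self.window_s = window_s      # window length in seconds (assumed unit)
        self.partitions = {}          # window start -> list of rows

    def insert(self, row):
        """Append a row; only the partition of its own window is touched."""
        start = row[0] - row[0] % self.window_s   # row[0] is the timestamp
        self.partitions.setdefault(start, []).append(row)
        return start

    def closed_windows(self):
        """Windows older than the newest one (assumed here to be complete)."""
        if not self.partitions:
            return []
        newest = max(self.partitions)
        return sorted(s for s in self.partitions if s < newest)


def run_distinct_dst_job(source, target):
    """Count distinct destination IPs per closed window, oldest first."""
    for start in source.closed_windows():
        if start in target.partitions:
            continue                                # window already processed
        dst_ips = {row[1] for row in source.partitions[start]}
        target.partitions[start] = [(start, len(dst_ips))]
```

In this sketch the job is driven purely by the timestamps of the arriving rows, so replaying old data triggers the same sequence of window computations as live data would.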
Finally, the Retention module is responsible for implementing data retention policies. It monitors base tables and views, deleting old data based on predefined storage size quotas and other data retention policies. Since each base table and view is partitioned by time, deleting old data is simple: it suffices to drop the oldest partition(s).

The DBStream system is operated by an application server process called hydra, which reads the DBStream configuration file, starts all modules, and monitors them over time. Status information is fetched from those modules and made available in a centralized location. Modules can be placed on separate machines, and external programs can connect directly to DBStream modules by issuing simple HTTP requests.

The DBStream system features a simple processing language. Below we show an example of a typical aggregation query, counting the number of rows per minute and device class. If the input table A has one flow in each row, the number of rows corresponds to the number of flows.

    <job inputs="a (window 15min)"
         output="b (window 15min)"
         schema="time int4, dev_class int4, cnt int4">
      <query>
        select time - time%1min, dev_class, count(*)
        from A
        group by serial_time, dev_class
      </query>
    </job>

In more detail, the XML attribute inputs is used to define one or more input streams. For each input stream, the batch size is specified with a window definition; in the example, the window size is 15 minutes. The output attribute is used to specify an output stream, which can then be used as input to other queries. The output stream also has a window definition. In addition, for the output stream, the schema is defined as the set of data types returned by the query. Note that the first column must be a monotonically increasing timestamp, which is used in the window definitions. Inside the query XML element, an SQL query defines how the input(s) should be processed. The result of this query is then stored in the new window of the output table.
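The per-minute aggregation described above can be tried out without a DBStream installation. The following sketch uses Python's built-in sqlite3 module instead of PostgreSQL; the helper name `flows_per_minute`, the use of integer timestamps in seconds, and the literal 60 standing in for one minute are assumptions made for this illustration and are not part of the DBStream job language.

```python
import sqlite3


def flows_per_minute(rows):
    """Count rows per (minute, dev_class) for rows given as (time, dev_class)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (time INTEGER, dev_class INTEGER)")
    con.executemany("INSERT INTO a VALUES (?, ?)", rows)
    result = con.execute(
        # Truncate each timestamp to the start of its minute, then group.
        "SELECT time - time % 60 AS minute, dev_class, COUNT(*) AS cnt "
        "FROM a "
        "GROUP BY time - time % 60, dev_class "
        "ORDER BY minute, dev_class"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    sample = [(0, 1), (30, 1), (59, 2), (60, 1), (119, 1)]
    for row in flows_per_minute(sample):
        print(row)
```

Each output row has the shape (time, dev_class, cnt), matching the schema declared for the output stream in the job above.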
In the query, all features of PostgreSQL, including the very flexible User Defined Functions (UDFs), can be used to process the data. Utilizing UDFs, it is easy to add code written in Python, Perl, C, R and other programming languages into the query. In particular, when defining the query to compute a new window of an output table based on a new window of an input stream, it is possible to reference the previous window of the output table in addition to the new window of the input stream. This is useful when, e.g., computing cumulative counts and sums such as upload and download volumes over long periods of time. In this case, it suffices to add the volumes from the new input window to the cumulative sums maintained in the (previous window of the) output table. We call these queries incremental queries in the remainder of this paper.

IV. PERFORMANCE BENCHMARKING

In this section, we compare DBStream implemented on top of PostgreSQL version 9.2 with standard PostgreSQL version 9.2. We perform three tests: a simple benchmark measuring the overhead of DBStream, a simple workload that illustrates the benefits of the scheduler, and a more complex query that illustrates the performance benefits of incremental processing. We use a 1 day-long data set collected at the vantage point of a major European ISP, which corresponds to 496 GB of data in plain text files and 73 GB in DBStream, containing 1.52 billion TCP connections. DBStream runs on a single server, equipped with a XEON E GHz, 32 GB of RAM and four 2TB hard disks running in a RAID1 configuration.

Fig. 3. DBStream performance vs. PostgreSQL. The Scheduler and the usage of incremental queries significantly improve performance.

The first test is a very simple query counting the number of flows in one day. It acts as a baseline for the hardware performance of the server and shows the overhead of DBStream compared to PostgreSQL. Note that running a single query means that the DBStream scheduler is not necessary.
In the considered day, the data amounts to 55 GB, corresponding to 82 million TCP connections. DBStream processed approximately 268 thousand rows per second and PostgreSQL 269 thousand; this is also reflected in the processed MB/s, which is 179 MB/s for DBStream and 18 MB/s for PostgreSQL. This test shows that the overhead of DBStream is minimal with respect to standard PostgreSQL.

The second test considers a more typical use case for DSWs and is meant to illustrate the need for a scheduler. Given one day of traffic data, we compute various aggregate statistics on HTTP traffic. We create 5 views, all with a window size of 1 minute. The first, call it A, contains only the interesting set of columns corresponding to HTTP flows. For example, we discard all P2P traffic and fetch organization names from the MaxMind database using a join query. This materialized view amounts to 4 GB. From view A, we generate four derived views, B, C, D and E, which can directly be used for visualization. These contain percentiles of the HTTP traffic statistics we are interested in: per-connection uploaded bytes in B, downloaded bytes in C, minimum Round Trip Times in D, and server elaboration time in E.

In PostgreSQL, we first load the whole day's worth of data into table A and then run the queries corresponding to the other views and save their results in the respective tables. In DBStream, all views are formulated as jobs with input windows of 1 minutes and created by replaying the same day of data, letting the DBStream Scheduler propagate the changes to all the views. As shown in Figure 3 Workload 1, the roughly three-times better performance of DBStream is achieved by parallelization. Since the Scheduler is aware of the precedence constraints among the views, whenever one 1 minutes window of A is loaded, the corresponding partitions of B, C, D and E can be computed in parallel, and at the same time, processing of the next window of A can start.
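The precedence-aware parallelism described above can be sketched in a few lines. This is an illustration only and not the DBStream Scheduler: the function `run_workload`, its parameters, and the use of a Python thread pool are assumptions made for this example, whereas DBStream itself runs its modules as separate operating system processes on top of PostgreSQL.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workload(windows, build_a, derived_jobs, workers=4):
    """Build view A window by window and fan out the derived views.

    windows      -- window identifiers in chronological order
    build_a      -- function computing the partition of A for one window
    derived_jobs -- dict mapping a view name to a function of A's partition
    """
    pending = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for window in windows:
            partition = build_a(window)   # A is filled in chronological order
            # Once A's partition exists, all derived views of this window can
            # run in parallel, while the loop moves on to the next window of A.
            for name, job in derived_jobs.items():
                pending.append((name, window, pool.submit(job, partition)))
        # Collect results before the pool shuts down.
        return {(name, window): fut.result() for name, window, fut in pending}
```

A derived view is only ever submitted after the matching partition of A has been built, which mirrors the precedence constraint between A and the views B, C, D and E.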
In the last experiment, we evaluate the efficiency of incremental queries, as compared to re-computing the results from scratch, which is done by default in PostgreSQL. We define a job which computes all the active IPs over a moving window. We assume that each window is 1 minute long, and test three variants of this query: finding all the active IPs within the past 1 minutes, 3 minutes and 6 minutes. Intuitively, as the length of the sliding window referenced by the query increases, we expect the performance advantage of incremental processing to increase.
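The difference between the two strategies can be made concrete with a minimal sketch. It is not the query used in the experiment: the function names, the representation of each batch as a Python set of IP strings, and the bookkeeping of a last-seen batch index per IP are assumptions chosen for this illustration.

```python
def active_ips_recompute(batches, k):
    """After each batch, rebuild the active set from the last k batches."""
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - k + 1): i + 1]   # re-reads up to k batches
        results.append(set().union(*window))
    return results


def active_ips_incremental(batches, k):
    """After each batch, update the previous result with the new batch only."""
    last_seen = {}                 # previous result: IP -> index of last batch
    results = []
    for i, batch in enumerate(batches):
        for ip in batch:
            last_seen[ip] = i      # fold in the newly arrived data
        # Expire IPs that were not seen within the last k batches.
        last_seen = {ip: t for ip, t in last_seen.items() if t > i - k}
        results.append(set(last_seen))
    return results
```

The recomputing variant touches up to k batches per step, while the incremental variant reads only the newest batch plus its own previous result, so the gap between the two is expected to widen as k grows.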