Processing Set Expressions over Continuous Update Streams

ABSTRACT
There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. In its most general form, a data stream is actually an update stream, i.e., comprising data-item deletions as well as insertions. Such massive update streams arise naturally in several application domains (e.g., monitoring of large IP network installations, or processing of retail-chain transactions). Estimating the cardinality of set expressions defined over several (perhaps, distributed) update streams is perhaps one of the most fundamental query classes of interest; as an example, such a query may ask "what is the number of distinct IP source addresses seen in passing packets from both router R1 and R2 but not router R3?". Earlier work has only addressed very restricted forms of this problem, focusing solely on the special case of insert-only streams and specific operators (e.g., union). In this paper, we propose the first space-efficient algorithmic solution for estimating the cardinality of full-fledged set expressions over general update streams. Our estimation algorithms are probabilistic in nature and rely on a novel, hash-based synopsis data structure, termed "2-level hash sketch". We demonstrate how our 2-level hash sketch synopses can be used to provide low-error, high-confidence estimates for the cardinality of set expressions (including operators such as set union, intersection, and difference) over continuous update streams, using only small space and small processing time per update. Furthermore, our estimators never require rescanning or resampling of past stream items, regardless of the number of deletions in the stream. We also present lower bounds for the problem, demonstrating that the space usage of our estimation algorithms is within small factors of the optimal. Preliminary experimental results verify the effectiveness of our approach.
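To make the synopsis idea concrete, here is a simplified, deletion-tolerant Flajolet-Martin-style distinct-count sketch. This is an illustrative stand-in, not the paper's 2-level hash sketch: it replaces the classic FM bitmap with per-bucket counters, so a deletion simply cancels the matching insertion and no rescanning of past stream items is ever needed. All names and parameters below are assumptions for illustration, and the sketch assumes each item's net frequency stays non-negative.

```python
import hashlib

class DeletionSafeFMSketch:
    """Simplified FM-style distinct-count sketch that tolerates deletions
    by keeping per-bucket counters instead of bits (illustrative only)."""
    NUM_BITS = 32
    PHI = 0.77351  # Flajolet-Martin bias-correction constant

    def __init__(self, num_copies=64):
        self.num_copies = num_copies
        self.counters = [[0] * self.NUM_BITS for _ in range(num_copies)]

    def _bucket(self, item, copy):
        # Position of the least significant set bit of an item hash.
        h = int(hashlib.md5(f"{copy}:{item}".encode()).hexdigest(), 16)
        b = 0
        while b < self.NUM_BITS - 1 and not (h >> b) & 1:
            b += 1
        return b

    def update(self, item, delta):
        """delta = +1 for an insertion, -1 for a deletion."""
        for c in range(self.num_copies):
            self.counters[c][self._bucket(item, c)] += delta

    def estimate(self):
        # Average, over the copies, the index of the first empty bucket,
        # then apply the standard 2^R / PHI correction.
        total_r = 0
        for row in self.counters:
            r = 0
            while r < self.NUM_BITS and row[r] != 0:
                r += 1
            total_r += r
        return (2 ** (total_r / self.num_copies)) / self.PHI

sk = DeletionSafeFMSketch()
for ip in range(1000):
    sk.update(f"10.0.{ip // 256}.{ip % 256}", +1)
for ip in range(500):              # later deletions cancel exactly
    sk.update(f"10.0.{ip // 256}.{ip % 256}", -1)
print(round(sk.estimate()))        # true distinct count remaining is 500
```

Because counters (unlike bits) can be decremented, an arbitrary number of deletions is absorbed as the stream is seen, which mirrors the "no rescanning, regardless of deletions" property claimed above; the paper's 2-level structure additionally supports intersection and difference, which this toy sketch does not.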
1. INTRODUCTION
Query-processing algorithms for conventional Database Management Systems (DBMS) rely on (possibly) several passes over a collection of static data sets in order to produce an accurate answer to a user query. For several emerging application domains, however, updates to the data arrive on a continuous basis, and the query processor needs to be able to produce answers to user queries based solely on the observed stream of data and without the benefit of several passes over a static data image. As a result, there has been a flurry of recent work on designing effective query-processing algorithms that work over continuous data streams to produce results online while guaranteeing (1) small memory footprints, and (2) low processing times per stream item [1, 10, 13, 15]. Such algorithms typically rely on summarizing the data stream(s) involved in concise synopses that can be used to provide approximate answers to user queries along with some reasonable guarantees on the quality of the approximation.

In their most general form, real-life data streams are actually update streams; that is, the stream is a sequence of updates to data items, comprising data-item deletions as well as insertions. Such continuous update streams arise naturally, for example, in the network installations of large Internet service providers, where detailed usage information (SNMP/RMON packet-flow data, active VPN circuits, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other application domains giving rise to continuous and massive update streams include retail-chain transaction processing (e.g., purchase and sale records), ATM and credit-card operations, logging Web-server usage records, and so on.
The processing of such streams follows, in general, a distributed model where each stream (or part of a stream) is observed and summarized by its respective party (e.g., the element-management system of an individual IP router) and the resulting synopses are then collected (e.g., periodically) at a central site, where queries over the entire collection of streams can be processed [15]. This model is used, for example, in Lucent's Interprenet and Cisco's NetFlow products for IP network monitoring. Clearly, there are several forms of queries that users or applications may wish to pose (online) over such continuous update streams; examples include joins or multi-joins [1, 10], norm computations [2, 19], or quantile estimation [16]. Perhaps one of the most fundamental queries of interest is estimating the result cardinalities of set expressions defined over such update streams.
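The distributed model above can be sketched in code, assuming a simplified counter-based Flajolet-Martin-style estimator at each site (an illustrative assumption, standing in for the paper's 2-level hash sketches): since each site's summary is just an array of counters, the central site can add the summaries entry-wise and estimate the cardinality of the union of the streams without ever seeing the raw data. All function names and constants here are hypothetical.

```python
import hashlib

NUM_COPIES, NUM_BITS, PHI = 64, 32, 0.77351  # PHI: FM correction constant

def bucket(item, copy):
    """Position of the least significant set bit of an item hash."""
    h = int(hashlib.md5(f"{copy}:{item}".encode()).hexdigest(), 16)
    b = 0
    while b < NUM_BITS - 1 and not (h >> b) & 1:
        b += 1
    return b

def new_sketch():
    return [[0] * NUM_BITS for _ in range(NUM_COPIES)]

def update(sketch, item, delta=1):
    for c in range(NUM_COPIES):
        sketch[c][bucket(item, c)] += delta

def merge(a, b):
    """Entry-wise sum: the merged sketch summarizes the union stream."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def estimate(sketch):
    total_r = 0
    for row in sketch:
        r = 0
        while r < NUM_BITS and row[r] != 0:  # first empty bucket
            r += 1
        total_r += r
    return (2 ** (total_r / NUM_COPIES)) / PHI

# Two "routers" each summarize their own (overlapping) source-address stream.
r1, r2 = new_sketch(), new_sketch()
for ip in range(0, 800):
    update(r1, f"src-{ip}")
for ip in range(400, 1200):
    update(r2, f"src-{ip}")
central = merge(r1, r2)            # synopses shipped to the central site
print(round(estimate(central)))    # true union cardinality is 1200
```

The key design point is that the synopsis is mergeable: shipping a few kilobytes of counters per site is enough to answer the union query over all streams, which is exactly why the collect-at-a-central-site model described above is practical. Estimating intersections and differences (e.g., "in R1 and R2 but not R3") requires the richer 2-level structure developed in the paper.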

Download full report
http://googleurl?sa=t&source=web&cd=1&ve...gmod03.pdf&ei=e8BETuPUF4iJrAef5vH5Aw&usg=AFQjCNERW4_YhkyS1L-uiB1JpGAHuXL4eA&sig2=Tw5WOhhccq0DuYJevV3GRw