Login

akancha · 10-04-2017, 08:02 PM

Abstract
Data stream computations in domains such as internet applications are often performed in a highly distributed fashion in order to save time. An example is the class of applications that use the Google Mapreduce framework of scalable distributed processing as presented by (Dean & Ghemawat 2004). A basic question here is: what kind of data stream computations admit scalable and efficient distributed algorithms? We show that the class of data stream computations that approximate functions of the frequency vector of the stream can be computed efficiently in a distributed manner.
1 Introduction
Modern distributed data-centric applications typically process massive amounts of data over a collection of nodes that are organized in the form of a tree. The applications include internet data processing and sensor networks, among others. An example is the class of applications that use the Google Mapreduce framework of scalable and flexible distributed processing as presented by (Dean & Ghemawat 2004). In this model, data resides at the hundreds or even thousands of nodes that form the leaf nodes of a depth one tree. Data at each of the nodes is locally reduced, and the reduced data is sent to the root site, which combines the locally reduced copies into a single copy and then computes an answer from it. In sensor networks, the sensors are organized in the form of a directed spanning tree and local data at nodes is reduced and combined up the tree. Many such practical algorithms are maximally flexible in the sense that all trees over the same set of leaf nodes may be used to compute the answer. There is another practical model of data processing, namely, data stream processing, that also processes massive amounts of data, but, in a sequential, online fashion. Typically, a low space summary of the data stream is constructed and maintained with respect to newly arriving data or updates. Although the data streaming model is sequential, it is an oftspoken virtue of data stream processing that typical stream summary structures are efficiently composable. By composability, we mean that there is a binary composition operation _ that takes instances of the summary structure that are maintained independently over distributed streams and combines them into a single instance whose state is the same as, or Copyright c 2009, Australian Computer Society, Inc. This paper appeared at the Fifteenth Computing: The Australasian Symposium (CATS 2009), Wellington, New Zealand. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 94, Rod Downey and Prabhu Manyem, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included. is equivalent to, the state of the summary structure after processing the concatenation of the distributed streams at a single site. The property of composable summaries makes it viable to have data-parallel distributed streaming computation with substantial flexibility, as illustrated by the following example. Example 1. Consider a distributed computation consisting of k nodes such that the input at each node is an n-dimensional vector fi, i = 1, 2, . . . , k. The problem is to design an efficient distributed algorithm to compute an approximation to the 2 norm of the sum of the vectors f1 + . . . + fk. This is possible using the randomized linear sketch technique by (Alon et al. 1998) for approximating the 2 norm of a vector in the data stream model. Each site computes a random sketch of its input vector using the same random bits and sends it to a central site. The central site composes the sketches received from the sites, by simply adding them like vectors in the sketch space and then applies the 2-approximation function on the composed sketches. The composition operation is time-efficient, as well as associative and commutative. So any binary tree over the set of leaf nodes may be used to compute the answer, making the distributed computation highly flexible. The composability property of stream structures presents the possibility that sequential data stream execution can be molded into data-parallel execution over distributed streams. This was explored in the mud model (acronym for Massive Unordered Distributed algorithms) presented by (Feldman et al. 2008). We first outline the data streaming model and then discuss the mud model and the results by (Feldman et al. 2008).

Download full report
http://googleurl?sa=t&source=web&cd=1&ve...cats09.pdf&ei=NK1ETrz1GIPqrQexloT9Aw&usg=AFQjCNHOzCZZPYxrDsAqZnIWhe-HPn_DLQ&sig2=K4sJ0QjCXrWsBjm5G0tSvg