Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
On Estimating Frequency Moments of Data Streams
#1

Abstract.
Space-economical estimation of the pth frequency moments, defined as Fp = Pn i=1 fi p, for p > 0, are of interest in estimating all-pairs distances in a large data matrix [14], machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector f1, . . . , fn with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating Fp and for data stream computations in general. The concept of p-stable sketches formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating Fp, for 0 < p < 2, and has been further studied in Li [13]. In this paper, we consider the problem of estimating Fp, for 0 < p < 2. A disadvantage of the stable sketches technique and its variants is that they require O( 1 _2 ) inner-products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating Fp, for 0 < p < 2, that does not require the use of stable sketches or its approximations. Our technique is elementary in nature, in that, it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space O( 1 _2+p ) 3 to estimate Fp to within 1 _ factors and requires expected time O(log F1 log 1 _ ) to process each update. Thus, our technique trades an O( 1 _p ) factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for F1.
1 Introduction
Recently, there has been an emergence of monitoring applications in diverse areas including network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring. In these applications, data is generated rapidly and continuously, and must be analyzed efficiently, in real-time and in a single-pass over the data to identify large trends, anomalies, user-defined exception conditions, and so on. In many of these applications, it is often required to continuously track the big picture , or an aggregate view of the data, as opposed to a detailed view of the data. In such scenarios, efficient approximate computation is often acceptable.

Download full report
http://googleurl?sa=t&source=web&cd=1&ve...ndom04.pdf&ei=p7NETpSQBIXnrAeqs4GKCw&usg=AFQjCNF4bINQaGweVG9AEGK7HHUiQbdqaw&sig2=-xOxrel1DMT2LHMrH6_KOw
Reply



Forum Jump:


Users browsing this thread:
1 Guest(s)

Powered By MyBB, © 2002-2024 iAndrew & Melroy van den Berg.