On a Design Space for Aggregating over Sliding Windows on a Stream.
Master's thesis, University of Twente.
On-line, continuous processing over streaming data poses performance challenges in processing time and memory space usage. This research investigates the best-performing way to compute an aggregate function over sliding windows containing the most recent data of a stream. Performance depends on the output semantics: the input, the aggregate function, the accuracy requirements, the retention (the quantity of elements aggregated) and the slide (the quantity of elements to move over the stream). Two approaches are identified that, under the same output semantics, differ in the way sliding windows are materialized. The instantaneous approach takes a window covering the retention and recomputes the aggregate function over that window upon every slide, independent of previous windows. The incremental approach maintains a synopsis structure covering the retention, updated by consecutive disjoint batches each covering one slide.
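The two approaches above can be contrasted in a minimal sketch. This is not the thesis's StreamInsight implementation; it is a hedged Python illustration in which the incremental synopsis is the trivial one (the entire retention, held in a deque), the aggregate is sum, and the function names are hypothetical. Both generators emit identical results under the same output semantics; they differ only in how the window is materialized.

```python
from collections import deque

def instantaneous(stream, retention, slide, agg=sum):
    """Instantaneous approach: keep the most recent `retention`
    elements and recompute the aggregate from scratch on every slide,
    independent of previous windows."""
    window = deque(maxlen=retention)
    for i, x in enumerate(stream, start=1):
        window.append(x)
        if i >= retention and (i - retention) % slide == 0:
            yield agg(window)  # full recomputation over the window

def incremental(stream, retention, slide, agg=sum):
    """Incremental approach: a synopsis covering the retention
    (here the trivial synopsis: all elements), updated by consecutive
    disjoint batches that each cover one slide."""
    assert retention % slide == 0  # simplifying assumption for the sketch
    synopsis = deque(maxlen=retention)
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == slide:
            synopsis.extend(batch)  # apply one disjoint batch
            batch.clear()
            if len(synopsis) == retention:
                yield agg(synopsis)
```

For example, with retention 4 and slide 2 over the stream 0..9, both generators produce the window sums 6, 14, 22, 30.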
A proposed theoretical model relates the approaches to processing time and memory space usage. Scalability is assessed by instantiating that model with the computational complexity of the approaches. It is shown that the incremental approach with a trivial synopsis storing the entire retention can have processing-time benefits over the instantaneous approach. Moreover, approximations of aggregate functions appear to be the only way to reduce memory space usage when sliding one element at a time. Properties of aggregates from the literature are related to the approaches and the theoretical model. Aggregates with the properties of a priori bounded space and incremental / decremental functions are called bounded differential aggregates. Under the incremental approach they require constant processing time and, when the slide is sufficiently large, can significantly reduce memory space usage in practice. By instantiating the theoretical model with empirical approximations of performance, the approaches can be compared by their performance in practice. Performance in practice reveals differences between approaches belonging to the same complexity class, more average-case-like behaviour, and unexpected performance side-effects.
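A bounded differential aggregate can be illustrated with variance, one of the aggregates selected later in this research. The sketch below is an assumption-laden Python illustration (class and method names hypothetical, not from the thesis): the synopsis occupies a priori bounded space (three scalars), and the incremental (insert) and decremental (evict) functions each take constant time. It uses the textbook sum-of-squares formula, which is numerically unstable for large values; a production version would prefer a stabler recurrence.

```python
class DifferentialVariance:
    """Bounded differential aggregate for population variance:
    a-priori bounded space (count, sum, sum of squares) with O(1)
    incremental (insert) and decremental (evict) updates."""

    def __init__(self):
        self.n = 0      # count of elements currently covered
        self.s = 0.0    # running sum
        self.s2 = 0.0   # running sum of squares

    def insert(self, x):
        """Incremental function: add one element in O(1)."""
        self.n += 1
        self.s += x
        self.s2 += x * x

    def evict(self, x):
        """Decremental function: remove one element in O(1)."""
        self.n -= 1
        self.s -= x
        self.s2 -= x * x

    def value(self):
        """Population variance of the currently covered elements."""
        mean = self.s / self.n
        return self.s2 / self.n - mean * mean
```

Sliding the window then costs time proportional to the batch (evict the expired elements, insert the new ones) rather than to the retention, which is where the constant per-element processing time comes from.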
A taxonomy is proposed that categorizes synopsis structures by support for fixed-size (e.g. count-based) or variable-size (e.g. time-based) retention, by exact or approximate aggregates produced, and by the deterministic or randomized / probabilistic result of approximations. A few synopsis structures covering the taxonomy are explored in the context of the selected aggregates, variance and quantiles, of (most) interest to Info Support. Exact computation is explored by the trivial synopsis that stores the entire retention and by panes, which exploit a priori space bounds of aggregates for sufficiently large batches. Approximation is explored by sampling, by histograms (for variance only) and by summaries (for quantiles only). The selected aggregates under both approaches, along with the synopsis structures, are implemented on the streaming data processing system Microsoft StreamInsight 1.1, which is of interest to Info Support. A stream of uniform random numbers is used; accuracy of 95%-100% is of interest to Info Support. The effect of retention is observed by increasing it at a fixed slide; the effect of slide (inversely: overlap) is observed at a fixed retention with increasing slide. The empirical results show that the instantaneous approach outperforms the incremental approach when the slide (and batch) is sufficiently large, at about half the retention.
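The panes idea mentioned above can be sketched briefly. This is a hedged Python illustration under simplifying assumptions (the function name is hypothetical, the aggregate is max, and the retention is a multiple of the slide): each batch of `slide` elements is reduced to a constant-size pane-level partial aggregate, so the window-level aggregate combines only `retention / slide` pane results instead of the raw elements.

```python
from collections import deque

def panes_max(stream, retention, slide):
    """Panes sketch: reduce each disjoint batch to a pane-level
    partial aggregate (its max), then combine the pane maxima to
    obtain the window-level aggregate."""
    assert retention % slide == 0  # panes tile the retention exactly
    panes = deque(maxlen=retention // slide)
    batch = []
    for x in stream:
        batch.append(x)
        if len(batch) == slide:
            panes.append(max(batch))  # pane-level aggregate
            batch.clear()
            if len(panes) == panes.maxlen:
                yield max(panes)      # window-level aggregate
```

The space saving is the a priori bound the abstract refers to: each pane stores one value regardless of the batch size, so large batches shrink the synopsis from `retention` elements to `retention / slide` partial aggregates.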
Sampling achieves high accuracy and the best performance under the incremental approach. Panes have the best performance when the batches are large enough. Quantile summaries reduce memory space usage significantly while maintaining high accuracy and low processing time. Using SPSS 17, the empirical results are generalized over the environment by regression models for processing time and memory space usage per aggregate, per accuracy requirement and per approach (differentiated by synopsis structure). Violations of the assumptions of homoscedasticity, independence or normality of residuals during analysis suggest the use of more advanced analysis. Validation of the models on an independent subset of the measurements indicates errors similar to the model errors from analysis.
A design space is proposed that uses the theoretical model to compare the approaches in terms of performance, either by scalability or by empirical approximations of performance in practice, and is augmented with guidelines derived during the research. Given output semantics, it indicates the best-performing approach for computing an aggregate function over sliding windows. By exploring other aggregates and synopsis structures, the design space can be extended.
|Item Type:||Master's Thesis|
|Research Group:||EWI-DB: Databases|
|Research Program:||CTIT-ASSIST: Applied Science of Services for Information Society Technologies|
|Research Project:||SensorDataLab: a sensor network infrastructure|
|Uncontrolled Keywords:||stream processing|
|Deposited On:||03 October 2011|