Non-parametric clustering of multivariate count data with applications to bulk cache preloading

Robert Bosch Centre for Cyber-Physical Systems

Caching is an important determinant of storage system performance. Bulk cache preloading is the process of preloading large batches of relevant data into cache, minutes or hours in advance of actual requests by the application. We address bulk preloading, by analyzing high-level spatio-temporal motifs from raw and noisy I/O traces by aggregating the trace into a temporal sequence of correlated count vectors. Such temporal multivariate data from trace aggregation arise from a diverse set of workloads leading to diverse data distributions with complex spatio-temporal dependencies.

Motivated by this, we explore predictive models based on non-parametric clustering of multivariate count data, a previously unexplored topic. While Poisson is the most popular distribution for count modeling, the Multivariate Poisson often leads to intractable inference and a sub-optimal fit of the data. Hence, we first explore extensions to the Multivariate Poisson distribution and propose the Sparse Multivariate Poisson distribution along with Dirichlet Process mixture model extensions and efficient techniques for approximate inference. This enables the practical use of Poisson based models for multivariate count data in a variety of real-world applications. Next, we explore techniques to move beyond the limitations of Poisson based models, for clustering multivariate counts, for instance in handling over dispersed data or data with negative correlations. We explore, for the first time, marginal independent inference techniques based on the Gaussian Copula for multivariate count data in the Dirichlet Process mixture model setting, which enables us to circumvent the limiting assumptions of Poisson based marginals. Inference with copulas is hard when data is not continuous. We propose inference based on extended rank likelihood that bypasses specifying marginals, making our inference suitable for count data and even data with a combination of discrete and continuous marginals. This enables the use of Bayesian non-parametric modeling, for several data types, without assumptions on marginals.

Finally, we propose temporal extensions for predictive modeling of multivariate counts based on our count modeling techniques. We outline HULK, a strategy for I/O efficient bulk cache preloading to leverage high-level spatio-temporal motifs in Block I/O traces. Experimentally, we show a dramatic improvement in hit-rates on benchmark traces and lay the groundwork for further research in storage domain to reduce latencies using data mining techniques for trace modeling.

Source

Similar Posts