Naive implementation of a few online statistical algorithms.
The idea behind this implementation is to be used as a tool for stateful streaming computations.
Algos covered so far:
- 4 statistical moments and the derived features:
- average
- variance and standard deviation
- skewness
- kurtosis
- covariance
- exponentially weighted moving averages and variance
Algos to be researched:
- exponentially weighted moving skewness
- exponentially weighted moving kurtosis
Using a more formal and mature library like Apache Commons Math is probably a better idea for production applications, but this is also tested against it.
The main concepts introduced in this library are the Stats, EWeightedStats (exponentially
weighted stats), VectorStats and Covariance. Each of them can be composed using either the
append or the |+| functions.
For example, if we have a sequence of numbers, we can compute the statistics like this:
val xs1 = Seq(1.0, 3.0)
val stats1: Stats = xs1.foldLeft(Stats.Nil)((s, x) => s |+| x)
val xs2 = Seq(5.0, 7.0)
val stats2: Stats = xs2.foldLeft(Stats.Nil)((s, x) => s |+| x)
val totalStats = stats1 |+| stats2
val newStats = totalStats |+| 4.0The Stats type with the |+| operation also form a monoid, since |+| has an identity
(unit) element, Stats.Nil, and it is associative.
Also the |+| operation is also commutative, which makes appealing for distributed computing
as well.
Same goes for VectorStats and Covariance.
EWeightedStats is an exception for now, as two EWeightedStats instances can not be composed.
However, the |+| works between an EWeightedStats instance and a double.
| Feature | Space Complexity (O) | Time Complexity (O) |
|---|---|---|
| Count, Sum, Min, Max | O(1) (1 * MU) | O(1) |
| Average | O(1) (2 * MU) | O(1) |
| Variance, Standard deviation | O(1) (3 * MU) | O(1) |
| Skewness | O(1) (4 * MU) | O(1) |
| Kurtosis | O(1) (5 * MU) | O(1) |
| Exponentially weighted average | O(1) (2 * MU) | O(1) |
| Exponentially weighted variance | O(1) (2 * MU) | O(1) |
MU: Memory Unit, e.g. Int: 4 bytes, Double 8: bytes
The streaming-anomalies-demos project was created to explore and demonstrate some basic use cases for the online-stats library.
- "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments" by Philippe Pebay(http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf)
- "Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights" by Philippe Pebay, Timothy B. Terriberry, Hemanth Kolla, Janine Bennett4
- "The Exponentially Weighted Moving Variance" by J. F. Macgregor and T. J. Harris
- "Incremental calculation of weighted mean and variance" by Tony Finch, February 2009
- Bessel Correction
- Skewness
- Kurtosis
- Pearson's Correlation Coefficient