I'm using Python 3.8. Running
pip install -r requirements.txt
might work. The only dependencies other than the usual (numpy, scipy, sklearn, matplotlib, pandas, tqdm) are ripser for fast persistence computations, persim for computing persistence images, and sklearn-som.
To do a simple viz of raw data from the default data/test set run
python main.py --plot input --show
To run and view (but not save) the default data/test set run
python main.py --preset --interact
Interaction may not work. I think you have to set your matplotlib backend to something like Qt5Agg. Try clicking on the TPers plot. I have
backend: Qt5Agg
in ~/.matplotlib/matplotlibrc.
To replicate the results detailed in the report (saving to figures/{DATASET}/{TESTSET}) call
python main.py --preset --som --plot input pre tpers --save --set {DATASET} --test {TESTSET}
For example,
python main.py --preset --som --plot input pre tpers --save --set SystemSLogs --test cpuhog
is the default behavior. Running the bash script
./mkfigs.sh
will run presets on all data/test sets in the data directory, generating the figures included in the report (hopefully).
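If bash is not available, the loop in mkfigs.sh can probably be approximated in Python. This is only a sketch: it assumes the data directory is laid out as data/{DATASET}/{TESTSET}/ (suggested by the flags above, but not confirmed), and preset_commands is a hypothetical helper that is not part of the repository.

```python
from pathlib import Path


def preset_commands(data_dir="data"):
    """Build one main.py command per data/test set found under data_dir.

    Assumes a {data_dir}/{DATASET}/{TESTSET}/ layout, which is a guess.
    """
    commands = []
    for dataset in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        for testset in sorted(p for p in dataset.iterdir() if p.is_dir()):
            commands.append([
                "python", "main.py", "--preset", "--som",
                "--plot", "input", "pre", "tpers", "--save",
                "--set", dataset.name, "--test", testset.name,
            ])
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True) to run the presets one at a time.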
The --som flag attempts to load a .pkl file containing a pre-trained self-organizing map (SOM) for the specified data/test set.
I don't know if .pkl files will survive.
New ones can be trained by running
python mksom.py --set {DATASET} --test {TESTSET}
The script trains a model on the training data file (tr.log) for the specified data/test set, tests it on the corresponding test data file (te.log), and plots the results against an existing model in cache/som_{DATASET}-{TESTSET}.pkl, if available.
Pass anything (other than n or no) to override the existing model.
If no model exists it just saves.
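The save-or-override logic described above might look roughly like the following. This is a sketch, not the code in mksom.py: the function names are hypothetical, the case-insensitive comparison is an assumption, and a plain dictionary stands in for the trained SOM.

```python
import pickle
from pathlib import Path


def som_cache_path(dataset, testset, cache_dir="cache"):
    """Path of the cached model: {cache_dir}/som_{DATASET}-{TESTSET}.pkl."""
    return Path(cache_dir) / f"som_{dataset}-{testset}.pkl"


def should_overwrite(answer):
    """Anything other than 'n' or 'no' counts as consent to overwrite.

    Ignoring case and surrounding whitespace is an assumption.
    """
    return answer.strip().lower() not in ("n", "no")


def save_model(model, path, answer):
    """Pickle model to path; an existing file is kept if the answer declines."""
    path = Path(path)
    if path.exists() and not should_overwrite(answer):
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return True


def load_model(path):
    """Return the cached model, or None if the file is missing or unreadable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError, AttributeError, ImportError):
        return None
```

Returning None on a failed load lets the caller fall back to training a new model when a .pkl file did not survive a change of Python or library version.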
Passing something like
--pre A B C
will run operations A, B, and C in order.
Default behavior (--preset 0) is
--pre scale pca=4
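To make the name=value form concrete, here is one way such operation tokens could be split into a name and its arguments. parse_ops is a hypothetical helper written for illustration; main.py may parse these differently.

```python
def parse_ops(tokens):
    """Split tokens such as 'pca=4' or 'scale=min,all' into (name, args)."""
    ops = []
    for token in tokens:
        name, sep, value = token.partition("=")
        # No '=' means the operation runs with its default arguments.
        args = value.split(",") if sep else []
        ops.append((name, args))
    return ops


print(parse_ops(["scale", "pca=4"]))
# [('scale', []), ('pca', ['4'])]
```

Arguments are left as strings here; converting '4' to an integer would be up to the operation that receives it.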
Available operations are as follows:
scale: Min-Max scaling on each feature independently.
scale=min: Min scaling only, on each feature independently.
scale=all: Min-Max scaling on the whole dataset (min and max of all entries).
scale=min,all: Min scaling on the whole dataset (min of all entries).
diff: Apply difference transform (discrete derivative) to each feature independently.
power: Apply power transform to each feature independently.
detrend: Detrend each feature independently.
ma: Apply moving average to each feature independently. ma=w convolves with a 2*2+1 point window.
pca: PCA transform. pca=n reduces to n principal components.
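For reference, a few of these operations can be sketched with numpy as below. These are not the implementations in main.py: the function names are hypothetical, rows are assumed to be time steps and columns features, and the moving-average window of 2*w+1 points is one reading of the ma=w description.

```python
import numpy as np


def minmax_scale(x):
    """Min-max scale each column (feature) independently to [0, 1]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    # Constant columns get a span of 1 to avoid dividing by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span


def difference(x):
    """Discrete derivative of each column; the output has one row fewer."""
    return np.diff(x, axis=0)


def moving_average(x, w=2):
    """Average each column over a window of 2*w+1 points (assumed meaning of ma=w)."""
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    return np.column_stack(
        [np.convolve(col, kernel, mode="valid") for col in x.T]
    )
```

With mode="valid" the moving average drops 2*w rows, so downstream windowing sees a slightly shorter series.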
The same operations can be applied to the total persistence curve by passing them as arguments to --post.
Note that this has no effect on kmeans prediction on persistence, but does affect prediction via threshold on tpers (see below).
--length {n}: Set the window length to n.
--overlap {w}: Set the window overlap to w.
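A minimal sketch of how length and overlap interact, assuming consecutive windows share exactly `overlap` samples so that the step between window starts is length - overlap. sliding_windows is a hypothetical helper, and main.py may handle the final partial window differently.

```python
def sliding_windows(data, length, overlap=0):
    """Split a sequence into windows of `length` items sharing `overlap` items."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    # Trailing samples that do not fill a whole window are dropped.
    return [data[i:i + length] for i in range(0, len(data) - length + 1, step)]


print(sliding_windows(list(range(10)), length=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```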
If none of --period, --fft, or --torus are passed persistence will be run on raw windowed data.
--period {t}: Period of the cycle in the complex plane. Equal to the length of the window if passed without an argument. If passed with an argument and --torus, the complex data with the specified period will be provided to the torus transform (untested).
--fft: Run a Fourier transform on each frame with a Blackman window. If passed with --torus, the complex frequency-domain output of the Fourier transform will be passed as phase and amplitude to the torus transform (super cool, kinda works... sometimes).
--torus: Apply the torus transform (warning: don't do the torus transform on more than two values/features without setting --nperm less than ~50; usually 20 works).
--exp {p}: Apply p as an exponent to all data, for fun. Executed before all other transforms.
--abs: Take the absolute value of all data, for science. Executed before all other transforms.
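The --fft step can be pictured as a Blackman-windowed Fourier transform of each frame, split into amplitude and phase. The sketch below uses numpy's real FFT on a single one-dimensional frame; frame_spectrum is a hypothetical name, and whether main.py uses the real or the full complex transform is not something this README states.

```python
import numpy as np


def frame_spectrum(frame):
    """Return (amplitude, phase) of a Blackman-windowed real FFT of one frame."""
    frame = np.asarray(frame, dtype=float)
    # Tapering the frame reduces spectral leakage at the frame edges.
    windowed = frame * np.blackman(len(frame))
    spectrum = np.fft.rfft(windowed)
    return np.abs(spectrum), np.angle(spectrum)
```

For a frame of n samples this returns n // 2 + 1 frequency bins, each with an amplitude and a phase in radians.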
The following arguments are passed to ripser for each frame.
--dim {d}: Maximum Rips/persistence dimension.
--thresh {t}: Maximum distance to compute in the Rips complex.
--nperm {n}: Number of greedy permutations. Probably safe to set to 20 for all applications. Huge speedup.
--metric {euclidean, manhattan, cosine}: Metric for Rips computation. Default: euclidean.
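These options line up with keyword arguments of ripser.ripser (maxdim, thresh, n_perm, metric). The mapping below is a sketch of how the translation could be done; ripser_kwargs is a hypothetical helper, and its defaults follow ripser's own defaults rather than anything confirmed about main.py.

```python
import math


def ripser_kwargs(dim=1, thresh=math.inf, nperm=None, metric="euclidean"):
    """Translate the CLI options into keyword arguments for ripser.ripser."""
    if metric not in ("euclidean", "manhattan", "cosine"):
        raise ValueError(f"unsupported metric: {metric}")
    kwargs = {"maxdim": dim, "thresh": thresh, "metric": metric}
    if nperm is not None:
        # n_perm subsamples the point cloud with a greedy permutation.
        kwargs["n_perm"] = nperm
    return kwargs


# With ripser installed, each frame (an array of points) could be processed as:
#   from ripser import ripser
#   diagrams = ripser(frame, **ripser_kwargs(dim=1, nperm=20))["dgms"]
```

Leaving n_perm out when --nperm is not given keeps ripser on the full point cloud, which is slower but exact.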
--invert {d,...}: Invert the provided dimensions (multiply by -1; inverted in the sum).
--entropy: Compute persistent entropy, for fun (and science).
--average: Compute average total persistence in each dimension.
--pmin {m}: Only include diagram features with total persistence at least m.
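As a reference for what these quantities mean, the sketch below computes total persistence, average persistence, and persistent entropy from one diagram given as (birth, death) pairs. It uses the standard definitions (lifetime = death - birth, entropy of the normalized lifetimes). The function names are hypothetical, and dropping infinite deaths is an assumption, not necessarily what main.py does.

```python
import math


def lifetimes(diagram, pmin=0.0):
    """Finite lifetimes (death - birth) that are at least pmin."""
    return [d - b for b, d in diagram if math.isfinite(d) and d - b >= pmin]


def total_persistence(diagram, pmin=0.0):
    """Sum of the retained lifetimes."""
    return sum(lifetimes(diagram, pmin))


def average_persistence(diagram, pmin=0.0):
    """Mean of the retained lifetimes, or 0.0 if none are retained."""
    ls = lifetimes(diagram, pmin)
    return sum(ls) / len(ls) if ls else 0.0


def persistent_entropy(diagram, pmin=0.0):
    """Shannon entropy of the lifetimes normalized to sum to 1."""
    ls = [l for l in lifetimes(diagram, pmin) if l > 0]
    total = sum(ls)
    if total == 0:
        return 0.0
    return -sum((l / total) * math.log(l / total) for l in ls)
```

Inverting a dimension, as --invert does, would then amount to multiplying that dimension's total by -1 before summing across dimensions.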
Just run
python main.py -h
It's the same thing.
--data: Print available data/test sets.
--dir {DATA_DIR}: Data directory. Default: ./data
--set {DATASET}: Dataset. Default: SystemSLogs.log
--test {TESTSET}: Test set. Default: cpuhog.log
--file {LOGFILE}: File name. Default: te.log
--cache {CACHE}: Cache directory. Currently only for SOM models. Default: ./cache
--preset {?i}: Preset to run. Default for dataset provided if passed without argument.
--show-presets: Print available presets.
--values {COLUMN_NAME1 COLUMN_NAME2 ...}: Data values (features) to use.
--plot {input, pre, window, transform, persistence, tpers, post}: Modules to plot.
--nroc {n}: Number of points on ROC curve.
--frame {f}: Frame to plot (if window or transform passed to plot). For saving purposes. Warning: untested.
--show: Show plot, otherwise it will just quit (if neither --save nor --interact is passed).
--save {?fdir}: Save plots to directory. Default: ./figures/{DATASET}/{TESTSET}
--predict {threshold,SOM,kmeans,minkmeans,maxkmeans}: Don't pass SOM. min/max kmeans are dumb. Threshold is just prediction by thresholding each feature.
--analyze {input, pre, persistence, tpers, post}: Modules to analyze. Pass --analyze {MODULE}={PREDICT} to override prediction type passed by --predict for a given module.
--aplot {input, pre, persistence, tpers, post}: Analyze and plot module.
--interact: Interact with terminal module. Useful for viewing data from individual frames for "framed" modules passed to --plot such as transform, window, and persistence. Default behavior is to plot persistence diagram. Very useful with transform passed to --plot.
--som: Compare with saved SOM model (in cache).
--lead {W}: SOM predict lead (anomaly pending) time. Default: 10
--streak {s}: Streak of anomalies required to raise SOM alarm. Default: 3
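Two of the ideas above, prediction by threshold and requiring a streak of anomalies before raising an alarm, are simple enough to sketch. The functions below are hypothetical illustrations on a single series of values. In the program itself the streak applies to the SOM alarm, so combining the two here is only for demonstration.

```python
def threshold_predict(values, threshold):
    """Flag each value as anomalous (True) when it exceeds the threshold."""
    return [v > threshold for v in values]


def streak_alarms(flags, streak=3):
    """Raise an alarm at every index that ends a run of `streak` or more flags."""
    alarms, run = [], 0
    for flag in flags:
        # A normal sample resets the run of consecutive anomalies.
        run = run + 1 if flag else 0
        alarms.append(run >= streak)
    return alarms


flags = threshold_predict([0.1, 0.9, 0.8, 0.95, 0.2, 0.7], threshold=0.5)
print(streak_alarms(flags, streak=3))
# [False, False, False, True, False, False]
```

Requiring a streak trades detection delay for fewer false alarms from isolated spikes.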
The dataset should be specified and included in main.py.
The UBL data is included (?), and specified in ubl_data.py.
At this time the user is required to specify
AVAIL_DATA: Dictionary of available data/test sets.
AVAIL_VALUES: List of available values (features) for each data point.
DIR: Directory containing data.
DATASET: Default dataset.
TESTSET: Default test set.
LOGFILE: Default test file.
VALUES: Default values (features).
LENGTH: Transformation window length.
OVERLAP: Transformation window overlap.
DIM: Max persistence dimension.
PRESETS: List of argument presets.
PRESET_DICT: Data/test set presets (when --preset is passed without argument).
Definition of an InputData object that extends TimeSeriesData and contains raw input data for a given data/test set.
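A dataset specification along these lines might look like the following. Everything here is a placeholder: the column names, the numeric values, and the shapes of AVAIL_DATA and PRESET_DICT are guesses made for illustration, and the InputData class is omitted because it depends on TimeSeriesData from the repository. See ubl_data.py for the real thing.

```python
# Hypothetical dataset specification; every value below is a placeholder.
DIR = "./data"            # directory containing data
DATASET = "SystemSLogs"   # default dataset
TESTSET = "cpuhog"        # default test set
LOGFILE = "te.log"        # default test file

# Available data/test sets, assumed here to map dataset -> list of test sets.
AVAIL_DATA = {"SystemSLogs": ["cpuhog"]}

# Available and default values (features); these column names are invented.
AVAIL_VALUES = ["cpu_user", "cpu_system", "mem_used", "net_in", "net_out"]
VALUES = list(AVAIL_VALUES)

# Window and persistence defaults (illustrative numbers only).
LENGTH = 50
OVERLAP = 25
DIM = 1

# Argument presets; preset 0 mirrors the documented default preprocessing.
PRESETS = ["--pre scale pca=4"]

# Preset index used when --preset is passed without an argument (assumed shape).
PRESET_DICT = {("SystemSLogs", "cpuhog"): 0}
```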