The `slearn` Python package is designed for learning and processing symbolic sequences, particularly for time series analysis. Symbolic representations reduce the dimensionality of time series data, accelerating tasks such as motif discovery, clustering, classification, forecasting, and anomaly detection. As demonstrated by Elsworth and Güttel (arXiv, 2020), symbolic forecasting reduces the sensitivity of Long Short-Term Memory (LSTM) networks to hyperparameter settings, making it a powerful approach for machine learning on symbolic data.
`slearn` provides APIs for:
- Generating symbolic sequences with controlled complexity using Lempel-Ziv-Welch (LZW) compression.
- Computing distances between symbolic sequences for similarity analysis.
- Benchmarking deep learning models (e.g., LSTMs, GRUs, Transformers) for sequence memorization.
- Supporting symbolic time series representations like SAX and ABBA variants.
This package is ideal for researchers and practitioners working on symbolic time series analysis and machine learning.
Install `slearn` using either pip or conda:

```bash
pip install slearn
# or
conda install -c conda-forge slearn
```

To verify the installed version:

```bash
pip show slearn
# or
conda list slearn
```
Dependencies:
- Python 3.6+
- NumPy
- pandas
- scikit-learn
The `LZWStringLibrary` module generates strings with a specified number of unique symbols and a target LZW complexity, which approximates Kolmogorov complexity. It also computes distances between sequences based on LZW complexity, enabling similarity analysis for symbolic time series.
Example:

```python
from slearn import lzw_string_generator, lzw_string_seeds

# Generate a single string with 2 symbols and target complexity 3
str_, str_complex = lzw_string_generator(2, 3, priorise_complexity=True, random_state=2)
print(f"string: {str_}, complexity: {str_complex}")

# Same, but prioritize symbol count over complexity
str_, str_complex = lzw_string_generator(2, 3, priorise_complexity=False, random_state=2)
print(f"string: {str_}, complexity: {str_complex}")

# Generate a library of strings with varying symbols and complexities
df_strings = lzw_string_seeds(symbols=[2, 3], complexity=[3, 6, 7], priorise_complexity=False, random_state=0)
print(df_strings)
```
Output:

```
string: BAA, complexity: 3
string: BAB, complexity: 3
   nr_symbols  LZW_complexity  length       string
0           2               3       3          ABA
1           2               6       8     BABBABBA
2           2               7      11  BAAABABAAAA
3           3               3       3          BAC
4           3               6       6       ABCACB
5           3               7       8     ABCAAABB
```
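The LZW complexity reported above is essentially the number of phrases an LZW parser emits while scanning the string. As a point of reference, here is a minimal standalone sketch of that count (not the library's internal implementation); it reproduces the values in the table, e.g., 7 for `BAAABABAAAA`:

```python
def lzw_complexity(s: str) -> int:
    """Number of phrases an LZW parser emits for s (standalone sketch)."""
    if not s:
        return 0
    dictionary = set(s)   # start with the alphabet of s
    w = s[0]
    count = 0
    for c in s[1:]:
        if w + c in dictionary:
            w += c                 # extend the current phrase
        else:
            count += 1             # emit the phrase for w
            dictionary.add(w + c)  # learn the new phrase
            w = c
    return count + 1               # emit the final phrase

print(lzw_complexity("BAAABABAAAA"))  # 7, matching the table above
```

A complexity-based distance between two sequences can then be built from such counts, e.g., by comparing the complexity of each sequence with that of their concatenation.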
`slearn` provides tools to benchmark the memorization capabilities of deep learning models (e.g., LSTMs, GRUs, Transformers) on symbolic sequences. The `benchmark_models` function generates performance reports and visualizations.
Example:

```python
from slearn.deep_models import LSTMModel, GRUModel, TransformerModel, GPTLikeModel
from slearn.simulation import benchmark_models

model_list = [LSTMModel, GRUModel, TransformerModel, GPTLikeModel]
benchmark_models(
    model_list,
    symbols_list=[2, 4, 6, 8],               # number of unique symbols
    complexities=[210, 230, 250, 270, 290],  # target LZW complexities
    sequence_lengths=[3500],
    window_size=100,
    validation_length=100,
    stopping_loss=0.1,
    max_epochs=999,
    num_runs=5,
    units=[128],
    layers=[1, 2, 3],
    batch_size=256,
    max_strings_per_complexity=1000,
    learning_rates=[1e-3, 1e-4]
)
```
Custom models can be implemented following the examples in `slearn/deep_models.py`.
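As a rough sketch of what a custom model might look like (assuming a PyTorch backend; the constructor arguments `vocab_size`, `hidden_size`, and `num_layers` are a hypothetical interface for illustration — match the actual signatures in `slearn/deep_models.py`):

```python
import torch.nn as nn

class CustomRNNModel(nn.Module):
    """Hypothetical custom model: embedding -> vanilla RNN -> linear head.
    Constructor arguments are illustrative; mirror the built-in models."""

    def __init__(self, vocab_size, hidden_size=128, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        # x: (batch, window) tensor of integer symbol indices
        h, _ = self.rnn(self.embedding(x))
        return self.head(h[:, -1, :])  # logits for the next symbol
```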
`slearn` supports multiple Symbolic Aggregate Approximation (SAX) variants and the ABBA method for time series symbolization. The following table summarizes the implemented methods:
| Algorithm | Time Series Type | Segmentation | Features Extracted | Symbolization | Reconstruction |
|---|---|---|---|---|---|
| SAX | Univariate | Fixed-size segments | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from PAA values |
| SAX-TD | Univariate | Fixed-size segments | Mean (PAA), slope | Mean to symbol, trend suffix ('u', 'd', 'f') | Linear trends from PAA and slopes |
| eSAX | Univariate | Fixed-size segments | Min, mean, max | Three symbols per segment (min, mean, max) | Quadratic interpolation from min, mean, max |
| mSAX | Multivariate | Fixed-size segments | Mean per dimension | One symbol per dimension per segment | Piecewise constant per dimension |
| aSAX | Univariate | Adaptive segments (local variance) | Mean (PAA) | Gaussian breakpoints, single symbol per segment | Piecewise constant from adaptive segments |
| ABBA | Univariate | Adaptive piecewise linear segments | Length, increment | Clustering (k-means), symbols assigned to clusters | Piecewise linear from cluster centers |
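To make the SAX row concrete before the full example, here is a standalone sketch of plain SAX (z-normalization, PAA segment means, then Gaussian breakpoints), independent of the `slearn` classes and using SciPy only for the normal quantiles:

```python
import numpy as np
from scipy.stats import norm

def sax_sketch(ts, window_size=10, alphabet_size=8):
    """Plain SAX: z-normalize, take PAA means, map to Gaussian breakpoints."""
    ts = (ts - ts.mean()) / ts.std()  # z-normalize
    n_segments = len(ts) // window_size
    paa = ts[: n_segments * window_size].reshape(n_segments, window_size).mean(axis=1)
    # Breakpoints that cut the standard normal into equiprobable regions
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return "".join(chr(ord("a") + i) for i in np.searchsorted(breakpoints, paa))

t = np.linspace(0, 10, 100)
print(sax_sketch(np.sin(t)))  # one symbol per window of 10 points
```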
Example:

```python
import numpy as np
from slearn.symbols import SAX, SAXTD, ESAX, MSAX, ASAX

def test_sax_variant(model, ts, name):
    """Symbolize, reconstruct, and report the reconstruction RMSE."""
    symbols = model.fit_transform(ts)   # symbolic representation
    recon = model.inverse_transform()
    rmse = np.sqrt(np.mean((ts - recon) ** 2))
    print(f"{name} reconstructed length: {len(recon)}, RMSE: {rmse:.4f}")
    return rmse

# Generate test time series
np.random.seed(42)
t = np.linspace(0, 10, 100)
ts = np.sin(t) + np.random.normal(0, 0.1, 100)  # univariate
ts_multi = np.vstack([np.sin(t), np.cos(t)]).T + np.random.normal(0, 0.1, (100, 2))  # multivariate

# Test SAX variants
sax = SAX(window_size=10, alphabet_size=8)
test_sax_variant(sax, ts, "SAX")

saxtd = SAXTD(window_size=10, alphabet_size=8)
test_sax_variant(saxtd, ts, "SAX-TD")

esax = ESAX(window_size=10, alphabet_size=8)
test_sax_variant(esax, ts, "eSAX")

msax = MSAX(window_size=10, alphabet_size=8)
test_sax_variant(msax, ts_multi, "mSAX")

asax = ASAX(n_segments=10, alphabet_size=8)
test_sax_variant(asax, ts, "aSAX")
```
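The ABBA row follows a different recipe: greedily compress the series into piecewise linear pieces, describe each piece by its (length, increment) pair, and cluster those pairs so every cluster gets a symbol. A compact standalone sketch of that idea (the stopping criterion is simplified relative to the published method, and this is not the `slearn` implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def abba_sketch(ts, tol=0.2, k=4):
    """ABBA in miniature: piecewise linear compression, then k-means."""
    # 1) Grow each piece while the straight line from its start point
    #    stays within tol of the series (simplified stopping rule).
    pieces, start = [], 0
    for end in range(1, len(ts)):
        seg = ts[start : end + 1]
        line = np.linspace(seg[0], seg[-1], len(seg))
        if np.max(np.abs(seg - line)) > tol:
            pieces.append((end - 1 - start, ts[end - 1] - ts[start]))
            start = end - 1
    pieces.append((len(ts) - 1 - start, ts[-1] - ts[start]))
    # 2) Cluster the (length, increment) pairs; one symbol per cluster.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(np.array(pieces))
    return "".join(chr(ord("a") + int(l)) for l in labels)

print(abba_sketch(ts))  # reuses the noisy sine wave ts from the example above
```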
`slearn` provides interfaces for computing string distances and similarities, including normalized variants, based on formal definitions.
Example:

```python
from slearn.dmetric import (
    damerau_levenshtein_distance,
    jaro_winkler_distance,
    normalized_damerau_levenshtein_distance,
    normalized_jaro_winkler_distance,
)

print(damerau_levenshtein_distance("cat", "act"))             # Output: 1
print(jaro_winkler_distance("martha", "marhta"))              # Output: 0.961
print(normalized_damerau_levenshtein_distance("cat", "act"))  # Output: 0.333
print(normalized_jaro_winkler_distance("martha", "marhta"))   # Output: 0.961
```
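For orientation: the normalized Damerau-Levenshtein value above (0.333) is consistent with dividing the raw distance by the length of the longer string, and Jaro-Winkler is already a similarity in [0, 1], so its normalized value coincides with the raw one. Below is a standalone sketch of the optimal-string-alignment Damerau-Levenshtein recurrence and that normalization convention; `slearn`'s own implementation may differ in edge cases:

```python
def dl_distance(a: str, b: str) -> int:
    """Optimal-string-alignment Damerau-Levenshtein distance (sketch)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("cat", "act"))                                # 1
print(dl_distance("cat", "act") / max(len("cat"), len("act")))  # 0.333...
```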
`slearn` integrates with scikit-learn classifiers for symbolic sequence analysis:
| Classifier | Parameter Call |
|---|---|
| Multi-layer Perceptron | `MLPClassifier` |
| K-Nearest Neighbors | `KNeighborsClassifier` |
| Gaussian Naive Bayes | `GaussianNB` |
| Decision Tree | `DecisionTreeClassifier` |
| Support Vector Classification | `SVC` |
| Radial-basis Function Kernel | `RBF` |
| Logistic Regression | `LogisticRegression` |
| Quadratic Discriminant Analysis | `QuadraticDiscriminantAnalysis` |
| AdaBoost Classifier | `AdaBoostClassifier` |
| Random Forest | `RandomForestClassifier` |
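As a rough sketch of how one of these classifiers can be applied to a symbolic sequence (the featurization below — sliding windows of integer-coded symbols predicting the next symbol — is an illustrative choice, not necessarily what `slearn` does internally):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy symbolic sequence; in practice this would be SAX/ABBA output.
seq = "abcaabcaabca" * 10
codes = np.array([ord(c) - ord("a") for c in seq])

# Sliding windows: predict the next symbol from the previous w symbols.
w = 5
X = np.array([codes[i : i + w] for i in range(len(codes) - w)])
y = codes[w:]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy on this toy sequence
```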
Comprehensive documentation is available at slearn.readthedocs.io.
If you use `slearn` or the `LZWStringLibrary` in your research, please cite:
R. Cahuantzi, X. Chen, and S. Güttel, “A Comparison of LSTM and GRU Networks for Learning Symbolic Sequences,” in Intelligent Computing, Springer Nature Switzerland, 2023, pp. 771–785.
If you use ABBA-based prediction, please cite:
X. Chen, Fast Aggregation-Based Algorithms for Knowledge Discovery, Ph.D. dissertation, The University of Manchester, 2024.
For questions or issues, contact the maintainers via email.
This project is licensed under the MIT License.
Contributions to `slearn` are welcome! To contribute:
- Fork the repository: github.com/nla-group/slearn.
- Create a branch for your feature or bug fix.
- Submit a pull request with a clear description of changes.
- Ensure tests pass (see the `unittests.yml` workflow).
TODO List:
- Add language modeling functionalities.
- Expand and refine documentation.
- Optimize performance for large-scale sequence generation and processing.