Datapot
An open source tool for machine learning on semi-structured data that creates a numeric object-feature matrix from JSON.
The idea of Datapot is to make the process of data preparation and feature extraction automatic, easy and effective.
Install Datapot
Using pip:
$ pip install datapot
Or clone Datapot repo:
$ git clone https://github.com/bashalex/datapot.git
$ cd datapot
$ pip install .
Usage
To create a Datapot object simply write the following:
>>> import datapot as dp
>>> datapot = dp.DataPot()
The main methods are:
- detect()
- fit()
- transform()
Method detect(data, limit) goes through the first N objects (N = limit) and passes the candidate fields to the Transformers. Each Transformer evaluates whether a feature can be created from the current field or from a group of fields. The result is a dict that maps features to Transformers. Method fit(data) trains the detected Transformers on the given dataset where training is required.
To apply detect() and fit() to a JSON Lines file:
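The detection step described above can be illustrated with a toy sketch. This is not Datapot's implementation: ToyNumericTransformer, toy_detect and the sample records are hypothetical names and data invented for illustration, and only one kind of Transformer is modelled. It shows how scanning the first `limit` objects can yield a dict of fields and the Transformers that accepted them.

```python
import json


class ToyNumericTransformer:
    """Hypothetical stand-in for a Transformer; not Datapot's actual class."""

    def __init__(self):
        # Stays True only while every value seen so far is numeric.
        self.confirmed = True

    def detect(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        self.confirmed = self.confirmed and is_number

    def __repr__(self):
        return "ToyNumericTransformer"


def toy_detect(lines, limit):
    """Scan the first `limit` JSON objects and keep fields a transformer accepts."""
    candidates = {}
    for i, line in enumerate(lines):
        if i >= limit:
            break
        obj = json.loads(line)
        for field, value in obj.items():
            # One candidate transformer per field, created on first sight.
            candidates.setdefault(field, ToyNumericTransformer()).detect(value)
    return {field: [t] for field, t in candidates.items() if t.confirmed}


sample = [
    '{"Id": 1, "Title": "Engineer", "Salary": 25000}',
    '{"Id": 2, "Title": "Chef", "Salary": 30000}',
    '{"Id": 3, "Title": "Nurse", "Salary": "n/a"}',
]

print(toy_detect(sample, limit=2))
print(toy_detect(sample, limit=3))
```

With limit=2 the hypothetical Salary field looks numeric, while with limit=3 the third record rules it out. This suggests why the choice of limit can change which features are detected.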
>>> data = open('datapot/data/job.jsonlines', 'r')
>>> datapot.detect(data, limit=100)
>>> datapot.fit(data)
DataPot class instance
- number of features without transformation: 9
- number of new features: 82
features to transform:
('Id', [NumericTransformer])
('FullDescription', [TfidfTransformer])
('ContractType', [SVDOneHotTransformer])
('ContractTime', [SVDOneHotTransformer])
('Company', [SVDOneHotTransformer])
('Category', [SVDOneHotTransformer])
('SalaryNormalized', [NumericTransformer])
Method transform(data) generates a pandas.DataFrame with the new features that were detected and trained by the detect() and fit() calls.
>>> df = datapot.transform(data)
num of new features: 82
Examples
Look for more examples of using Datapot with different datasets and for more Transformer-specific examples.
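The input used in the calls above is a JSON Lines file, that is, a text file with one JSON object per line. A minimal sketch of writing and reading such a file with the Python standard library follows. The file name toy.jsonlines and the field values are hypothetical; the field names are borrowed from the output shown earlier.

```python
import json
import os
import tempfile

# Hypothetical records; only the field names come from the example output.
records = [
    {"Id": 1, "ContractType": "full_time", "SalaryNormalized": 25000},
    {"Id": 2, "ContractType": "part_time", "SalaryNormalized": 30000},
]

# Write one JSON object per line into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy.jsonlines")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line.
with open(path, "r") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```

A file written this way has the same one-object-per-line layout as the job.jsonlines file opened in the usage example.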
Features
Datapot provides many ways of extracting features from JSON.
Data types that can be processed:
- Boolean
- Numerical
- Numerical array (transform each array to its sum divided by the average array length in the training set)
- Time series (calculate descriptive statistical properties of a given time series)
- Timestamp (date, time, day of week, day of month, etc.)
- Text (bag of words tf-idf, word2vec)
- Categorical (one-hot encoding, dimension reduction)
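Three of the transformations listed above can be sketched with the standard library alone. The function names below are hypothetical and the code is an illustration of the described ideas, not Datapot's own Transformers. In particular, the exact set of timestamp parts and the one-hot category order are assumptions.

```python
from datetime import datetime, timezone


def fit_avg_length(train_arrays):
    """Average array length over the training set."""
    return sum(len(a) for a in train_arrays) / len(train_arrays)


def transform_array(array, avg_length):
    """Numerical array -> its sum divided by the average training length."""
    return sum(array) / avg_length


def timestamp_features(ts):
    """Unix timestamp -> a few calendar features (interpreted in UTC)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "hour": dt.hour,
        "day_of_week": dt.weekday(),  # Monday is 0
        "day_of_month": dt.day,
        "month": dt.month,
    }


def one_hot(value, categories):
    """Categorical value -> indicator vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


train = [[1, 2, 3], [4, 5], [6]]
avg_len = fit_avg_length(train)          # (3 + 2 + 1) / 3 = 2.0
print(transform_array([4, 5], avg_len))  # 9 / 2.0 = 4.5
print(timestamp_features(0))             # 1970-01-01 00:00 UTC, a Thursday
print(one_hot("permanent", ["contract", "permanent"]))
```

Dividing by the average training length rather than by each array's own length keeps longer arrays distinguishable from shorter ones with the same mean.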
Manually selected features:
- Identity (keep the field unchanged)
- Group Dimensionality Reduce (change the dimensionality of features in the same JSON field)
Authors
- Alex Bash
- Yuriy Mokriy
- Nikita Savelyev
- Michal Rozenwald
- Peter Romov
Datapot is a coursework project at the Faculty of Computer Science of the Higher School of Economics.