cat(egorical data analysis).

Raw tabular data semantical mapping using statistical features.

Check the presentation and overview.ipynb notebook for more details

Results

Two best performing models - random forest and nn both gave accuracy ~0.84. The random classifier would give accuracy of ~0.02

Here is an example of model predictions:

Values    :  'Central Missouri', 'unattached', 'unattached', 'Kansas Sta . . .
Predicted :  {'affiliation': 0.3, 'country': 0.2, 'category': 0.2}
Truth     :  affiliation 

Values    :  95, 100, 95, 89, 84, 91, 88, 94, 75, 78, 90, 84, 90, 76, 93 . . .
Predicted :  {'rank': 0.3, 'plays': 0.3, 'education': 0.2}
Truth     :  weight 

Values    :  'Katie Crews', 'Christian Hiraldo', 'Alex Estrada', 'Fredy  . . .
Predicted :  {'jockey': 0.9, 'owner': 0.1, 'year': 0.0}
Truth     :  jockey 

Values    :  'Christian', 'Non-Christian', 'Unreported', 'Jewish', 'Athe . . .
Predicted :  {'type': 0.2, 'language': 0.1, 'name': 0.1}
Truth     :  religion 

Values    :  'AAF-McQuay Canada Inc.', 'AAF-McQuay Canada Inc.', 'Abilit . . .
Predicted :  {'company': 0.3, 'album': 0.2, 'description': 0.1}
Truth     :  company

Feel free to play with the model in the overview.ipynb notebook!

Statistical features

We extract min, mean, max, std statistical features from each sequence of values:

length
percentage of alphabetic characters
percentage of numeric characters
percentage of uppercase characters
each specific character on a fixed position occurence rate
uniqueness

Check the ./modules/FeaturesExtractor.py for more details.

Dataset used

Here I used dataset from this research.

Although there was initially an attempt to create the data from scratch, it ultimately failed due to the inability of humans to properly

Data structure

cat/
|
├─── modules/
|    |
│    ├─ regulars.py      # REGEX-based data typifiers
|    |
│    ├─ analyzer.py      # Statistical features
|    |
│    ├─ labeler.py       # Script for manual data labeling
|    |
│    ├─ cats.py          # Dictionary with all possible semantic types
|    |
│    ├─ formatter.ipynb  # Helper functions for parsing raw-raw data
|
|
├─── cached/             # Cached variables values (exracted features, model weights, etc)
|
├─── data/
|    |
|    ├─ unused           # 300+ GB of mostly unlabeled data
|    |
|    ├─ parquet          # data currently used for training and validating model
|    
|
├─ .gitignore
|
├─ overview.ipynb        # Notebook that demonstrates the project's core from A to Z
|
├─ README.md
|
├─ LABELING.md           # Guide for those who helped me with data labeling :D

Please note that the cached and data are not in the repository due to their large size. identify the categories. The labeler can be used for labeling other, more human-like data though.

The project was in some degree inspired by this research.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

cat(egorical data analysis).

Results

Statistical features

Dataset used

Data structure

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
modules		modules
.gitignore		.gitignore
LABELING.md		LABELING.md
README.md		README.md
denys_zinoviev_ml_proj.pdf		denys_zinoviev_ml_proj.pdf
overview.ipynb		overview.ipynb

Uh oh!

Uh oh!

rureirureirurei/cat

Folders and files

Latest commit

History

Repository files navigation

cat(egorical data analysis).

Results

Statistical features

Dataset used

Data structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages