Predicts the main programming language of real GitHub repositories based purely on the contents of their commit messages, using supervised text learning.
Predictions are currently made at the commit level, not the repository level. Repository-level prediction is planned for the future, along with experiments with other classification algorithms.
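One possible way to get from commit-level to repository-level predictions is a simple majority vote over the commits of each repository. The sketch below is not part of this project; the function name `aggregate_by_repository` and the example data are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def aggregate_by_repository(commit_predictions):
    """Majority vote: map each repository to its most frequently predicted language.

    commit_predictions is an iterable of (repository, predicted_language) pairs,
    one pair per classified commit. This input shape is an assumption made for
    this sketch, not the output format of classify.py.
    """
    votes = defaultdict(Counter)
    for repository, language in commit_predictions:
        votes[repository][language] += 1
    # Ties are not handled specially in this sketch.
    return {
        repository: counter.most_common(1)[0][0]
        for repository, counter in votes.items()
    }


# Hypothetical commit-level predictions, for illustration only.
example_predictions = [
    ("repo-a", "Python"), ("repo-a", "Python"), ("repo-a", "Shell"),
    ("repo-b", "Java"), ("repo-b", "Kotlin"), ("repo-b", "Java"),
]
print(aggregate_by_repository(example_predictions))
```

A vote like this would likely be more robust than any single commit prediction, since individual commit messages are short and often language-neutral.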
- Install Python3
- Clone repository
- Install requirements: run the following command within the project directory, then install any packages that are still missing:
python3 -m pip install -r requirements.txt
- Download the dataset from https://github.com/kvnmlr/ml-commits/blob/files/files.zip and extract it into the project root folder.
- Run classification:
python3 classification/classify.py
- Enter your GitHub username and password in the credentials.json file
- Run the crawler:
python3 crawler.py
- Extract languages and commit messages:
python3 features_extractor.py
- Run classification:
python3 classification/classify.py
In 2 seconds:
Correct: 166
Wrong: 132
Languages considered: 25
Classifier Score: 55.70%
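The reported score is consistent with plain accuracy, that is, correct predictions divided by all predictions. The short check below recomputes it from the counts above; that the score is computed this way is an assumption, although the numbers match.

```python
correct = 166
wrong = 132

# Accuracy = correct predictions / total predictions (assumed definition).
total = correct + wrong
score = correct / total

print(f"Total predictions: {total}")
print(f"Classifier Score: {score:.2%}")
```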