NewsCLF is a text classifier that predicts the main topic of a news article as one of: business · entertainment · politics · sport · tech.
It ships as a clean Python package (importable API) and a single CLI (`newsclf`) for training and inference, and it runs automatically on CPU, Apple Silicon (MPS), or CUDA.
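No device flag is needed; here is a minimal sketch of the kind of auto-detection this implies (illustrative, not necessarily the package's exact code):

```python
import torch

def pick_device() -> torch.device:
    # Prefer CUDA, then Apple Silicon's MPS backend, then fall back to CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```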
Goals:
- Deliver a robust, reproducible BBC-style news classifier with strong accuracy.
- Provide a library API for apps/services and a CLI for quick ops.
- Save artifacts, metrics, reports, and confusion matrices per run for auditability.
Dependencies:
- PyTorch, Hugging Face Transformers (default: `distilbert-base-uncased`)
- scikit-learn, pandas, numpy
- matplotlib
```bash
python -m venv venv
source venv/bin/activate
pip install -e .
```

Expected data file: `dataset/bbc_news_text_complexity_summarization.csv` with columns `text`, `labels`.
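Before training, a quick sanity check that the file has the expected shape (a sketch assuming pandas; column names as above):

```python
import pandas as pd

df = pd.read_csv("dataset/bbc_news_text_complexity_summarization.csv")
assert {"text", "labels"} <= set(df.columns)  # required columns
print(df["labels"].value_counts())  # class balance across the five topics
```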
```bash
newsclf train \
  --data_csv dataset/bbc_news_text_complexity_summarization.csv \
  --model_name distilbert-base-uncased \
  --max_len 256 --batch_size 16 --lr 5e-5 \
  --weight_decay 0.01 --warmup_pct 0.10 \
  --epochs 5 --patience 2 --seed 42
```

Outputs (per run):
```
artifacts/classifier/<run_id>/best/   # model + tokenizer + metrics.json + label_map.json
reports/classification_report.txt     # appended per run
plots/confusion_matrix_<run_id>.png   # confusion matrix
```
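The `--select best` flag (used below) picks a checkpoint across runs by a metric. Here is a hypothetical helper showing how such a selector could scan the per-run `metrics.json` files (not the package's actual implementation):

```python
import json
from pathlib import Path

def best_checkpoint(root="artifacts/classifier", metric="test_macro_f1"):
    # Scan every run's best/metrics.json and keep the highest-scoring run.
    candidates = []
    for metrics_path in Path(root).glob("*/best/metrics.json"):
        metrics = json.loads(metrics_path.read_text())
        if metric in metrics:
            candidates.append((metrics[metric], metrics_path.parent))
    if not candidates:
        raise FileNotFoundError(f"no runs with metric {metric!r} under {root}")
    return max(candidates, key=lambda pair: pair[0])[1]

print(best_checkpoint())  # e.g. artifacts/classifier/<run_id>/best
```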
```bash
# Top-1 prediction from the most recent run
newsclf predict --text "Parliament approves the annual budget"

# Pick the best checkpoint across runs by a metric (e.g., macro-F1)
newsclf predict --text "Chipmaker unveils 3nm processor" --select best --metric test_macro_f1

# Show alternatives (top-3)
newsclf predict --text "Streaming platform renews hit fantasy series" --topk 3
```

Python API:

```python
from newsclf import NewsClassifier, predict
# Load once, predict many times (recommended for services)
clf = NewsClassifier(select="best", metric="test_macro_f1", max_len=256)
label, conf = clf.predict("Parliament approves the annual budget")
print(label, round(conf, 2))
# One-shot helper (loads each call)
print(predict("Star striker scores twice in cup final"))Fast, deterministic end-to-end tests (tiny synthetic dataset):
```bash
pytest -q
# or verbose:
pytest -vv
```

Covers:
- Stratified data split & label mapping
- 1-epoch training smoke test
- Single-text inference from saved checkpoint
- CLI smoke test
- Load & split the CSV (train/val/test = 70/15/15, stratified).
- Tokenize with a HF tokenizer; batches use dynamic padding.
- Train DistilBERT via a clean PyTorch loop (AdamW, linear warmup/decay, grad-clip).
- Early stop on macro-F1; save best checkpoint for the run.
- Evaluate on test set; write
metrics.json,classification_report.txt, and the confusion matrix. - Infer via CLI or
NewsClassifier(top-k confidences supported).
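A minimal sketch of the first two steps (stratified split + dynamic padding), assuming pandas, scikit-learn, and Transformers; the package's exact internals may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, DataCollatorWithPadding

df = pd.read_csv("dataset/bbc_news_text_complexity_summarization.csv")

# 70/15/15 stratified split: hold out 30%, then halve the holdout into val/test.
train_df, holdout = train_test_split(
    df, test_size=0.30, stratify=df["labels"], random_state=42)
val_df, test_df = train_test_split(
    holdout, test_size=0.50, stratify=holdout["labels"], random_state=42)

# Tokenize without padding; DataCollatorWithPadding pads each batch on the fly.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_enc = tokenizer(train_df["text"].tolist(), truncation=True, max_length=256)
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```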
BBC topic-modeling exploration (BERTopic vs. LDA) that inspired the dataset choice and the comparison concept: Kaggle notebook by Jacopo Ferretti.
Quickstart
```bash
# Install
python -m venv venv && source venv/bin/activate
pip install -e .

# Train
newsclf train --data_csv dataset/bbc_news_text_complexity_summarization.csv --epochs 5

# Predict
newsclf predict --text "Parliament approves the annual budget"

# Test
pytest -q
```
Design decisions:
- DistilBERT backbone (speed vs. headroom). I picked `distilbert-base-uncased` because it hits a sweet spot: strong accuracy on BBC-style topics with fast training/inference on CPU/MPS and a smaller memory footprint. The trade-off is somewhat less ceiling than `bert-base-uncased` or larger encoder models.
- Manual PyTorch loop (control vs. convenience). I skipped `transformers.Trainer` and wrote a clean training loop (AdamW, warmup/decay, gradient clipping, early stopping). This gives predictable behavior across library versions and makes it obvious what happens at each step; a sketch follows this list.
- Macro-F1 for model selection. The best checkpoint is picked by macro-F1 on the validation set. Macro-F1 forces the model to respect every class, not just the common ones, and early stopping prevents overfitting while saving the actually best epoch, not the last one. The trade-off is that macro-F1 is slightly less intuitive than accuracy and can be noisier on tiny validation splits, but it yields fairer, more reliable performance across labels.
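To make the last two points concrete, here is a condensed sketch of such a loop under the hyperparameters above (helper names and the checkpoint path are illustrative, not the package's actual code):

```python
import torch
from torch.optim import AdamW
from sklearn.metrics import f1_score
from transformers import get_linear_schedule_with_warmup

def train_loop(model, train_loader, val_loader, device,
               epochs=5, lr=5e-5, weight_decay=0.01, warmup_pct=0.10, patience=2):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, int(warmup_pct * total_steps), total_steps)

    best_f1, stale_epochs = -1.0, 0
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # HF models return the loss when `labels` is passed
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clipping
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        # Validation macro-F1 drives checkpointing and early stopping.
        model.eval()
        preds, golds = [], []
        with torch.no_grad():
            for batch in val_loader:
                batch = {k: v.to(device) for k, v in batch.items()}
                preds += model(**batch).logits.argmax(dim=-1).tolist()
                golds += batch["labels"].tolist()
        macro_f1 = f1_score(golds, preds, average="macro")

        if macro_f1 > best_f1:
            best_f1, stale_epochs = macro_f1, 0
            model.save_pretrained("artifacts/best")  # keep the best epoch, not the last
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # early stop
    return best_f1
```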