Newspaper Classificator
Supervised ML pipeline for news article categorization – 3rd place out of 200 groups
Project Description
This project is a supervised machine learning pipeline, developed at Politecnico di Torino, that categorizes web-scraped news articles into seven topics. It placed 3rd out of 200 groups in the competition, with a public test macro F1 score of 0.741.
Classification Categories
- International News
- Business
- Technology
- Entertainment
- Sports
- General News
- Health
Dataset
Roughly 80,000 labeled training articles and 20,000 unlabeled test articles. Each entry includes the article text, title, news source, timestamp, and PageRank score.
Key Challenges
- Class imbalance across the seven categories
- Noisy HTML artifacts in raw article content
- Semantic overlap between International News and General News
- Short article texts that carry little signal
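The source does not describe how the HTML noise was handled. As a minimal sketch of one common approach, assuming the raw text contains ordinary tags and character entities, a cleaning step could look like this (`clean_article` is a hypothetical helper, not a function from the project):

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces


def clean_article(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)       # replace tags such as <p> or <br/> with a space
    text = html.unescape(text)        # decode entities such as &amp; or &nbsp;
    return WS_RE.sub(" ", text).strip()


cleaned = clean_article("<p>Stocks &amp; bonds<br/>rally</p>")
print(cleaned)
```

A regex is only a rough filter; a real HTML parser would be more robust against malformed markup.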
Feature Engineering
- Metadata boosting: replicating the identifiers of high-purity news sources, i.e. sources whose articles fall almost entirely into one category (e.g. ESPN → Sports, PCWorld → Technology)
- Text representations: word-level and character-level TF-IDF with Snowball stemming
- Temporal features: day-of-week and time-of-day categories derived from the timestamp
- Structural metrics: article and title lengths with a log(1 + x) transformation
- Categorical encoding: OneHotEncoder with frequency thresholds
- Normalization: StandardScaler on numerical features
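Three of the steps above (metadata boosting, temporal features, and log-scaled lengths) can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the `src_` token prefix, the repeat count, the four time-of-day buckets, and all function names are hypothetical, and the purity set contains only the two sources named as examples.

```python
import math
from datetime import datetime

# Hypothetical purity set; ESPN and PCWorld are the examples named in the list above.
HIGH_PURITY_SOURCES = {"ESPN", "PCWorld"}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def boost_source(text: str, source: str, repeats: int = 3) -> str:
    """Append a source token several times so TF-IDF counts it more heavily."""
    if source not in HIGH_PURITY_SOURCES:
        return text
    token = "src_" + source.lower()
    return text + " " + " ".join([token] * repeats)


def temporal_features(timestamp: datetime) -> tuple:
    """Return (day-of-week name, coarse time-of-day bucket)."""
    hour = timestamp.hour
    if hour < 6:
        part = "night"
    elif hour < 12:
        part = "morning"
    elif hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    return DAY_NAMES[timestamp.weekday()], part


def log_length(text: str) -> float:
    """log(1 + x) of the character length, which compresses very long articles."""
    return math.log1p(len(text))


boosted = boost_source("match report", "ESPN", repeats=2)
day, part = temporal_features(datetime(2024, 1, 1, 9, 30))
```

Repeating the token inside the text lets the same TF-IDF vectorizer pick up the source signal without a separate feature column.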
A ColumnTransformer orchestrated all parallel feature extraction pipelines.
Model Selection
Multiple algorithms were evaluated:
| Model | Validation Macro F1 |
|---|---|
| Naive Baseline | 0.443 |
| Random Forest | 0.689 |
| LinearSVC | 0.703 |
| SGDClassifier | 0.726 |
SGDClassifier with an SVM-style loss scored highest and was chosen for its effectiveness on high-dimensional sparse data and its computational scalability.
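The following sketch shows how an SGDClassifier can be tuned with GridSearchCV on macro F1. It assumes `loss="hinge"` as the SVM-style loss and `class_weight="balanced"` as one way to counter class imbalance; neither setting, nor the `alpha` grid, is confirmed by the source, and the toy texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
texts = [
    "team wins the cup final", "striker scores twice in derby",
    "coach praises team defence", "club signs new striker",
    "new phone chip released", "laptop maker unveils new chip",
    "software update fixes phone bug", "startup releases laptop software",
]
labels = ["Sports"] * 4 + ["Technology"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Hinge loss makes SGDClassifier a linear SVM trained by stochastic gradient descent.
    ("clf", SGDClassifier(loss="hinge", class_weight="balanced", random_state=0)),
])

# Search over the regularization strength, scoring each candidate by macro F1.
search = GridSearchCV(
    pipe,
    param_grid={"clf__alpha": [1e-5, 1e-4, 1e-3]},
    scoring="f1_macro",
    cv=2,
)
search.fit(texts, labels)
print(search.best_params_)
```

Wrapping the vectorizer and classifier in one Pipeline keeps the TF-IDF vocabulary fitted only on each training fold, which avoids leaking validation text into the features.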
Technologies Used
Language: Python
Libraries: Scikit-learn, Pandas, NumPy
Techniques: TF-IDF Vectorization, SGDClassifier, SVM, GridSearchCV
Results
- Validation Macro F1: 0.726
- Public Test Macro F1: 0.741
- Competition Ranking: 3rd of 200 groups
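For reference, macro F1 is the unweighted mean of the per-class F1 scores, so each of the seven categories counts equally regardless of its size. A small pure-Python sketch of the metric (the function name `macro_f1` and the toy labels are illustrative):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)   # true positives
        fp = sum(t != c and p == c for t, p in pairs)   # false positives
        fn = sum(t == c and p != c for t, p in pairs)   # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["Sports", "Sports", "Health", "Business"]
y_pred = ["Sports", "Health", "Health", "Business"]
score = macro_f1(y_true, y_pred)
```

In this toy case Sports and Health each score 2/3 and Business scores 1, giving a macro F1 of 7/9.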
Repository: GitHub