Newspaper Classificator

Supervised ML pipeline for news article categorization – 3rd place out of 200 groups

Project Description

A supervised machine learning pipeline developed at Politecnico di Torino for categorizing web-scraped news articles into seven distinct topics. The project competed against 200 groups and achieved 3rd place with a public test macro F1 score of 0.741.

Classification Categories

  • International News
  • Business
  • Technology
  • Entertainment
  • Sports
  • General News
  • Health

Dataset

~80,000 labeled training articles and 20,000 unlabeled test samples. Each entry includes article text, title, news source, timestamp, and PageRank score.

Key Challenges

  • Class imbalance across the seven categories
  • Noisy HTML artifacts in raw article content
  • Semantic overlap between International News and General News
  • Short article texts that carry little signal
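
One way to handle the HTML artifacts is a small cleaning step applied before vectorization. The sketch below is only an illustration using the Python standard library; the function name clean_article and the exact rules (strip tags, decode entities, collapse whitespace) are assumptions, not the cleaning code used in the project.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces

def clean_article(raw: str) -> str:
    """Hypothetical helper: strip tags, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)    # replace tags such as <p> or <br/> with a space
    text = html.unescape(text)     # decode entities, e.g. &amp; becomes &
    return WS_RE.sub(" ", text).strip()

print(clean_article("<p>Stocks &amp; bonds<br/>rally</p>"))
```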

Feature Engineering

  • Metadata boosting: repeating the identifiers of high-purity news sources (e.g. ESPN → Sports, PCWorld → Technology) in the input
  • Text representations: word-level and character-level TF-IDF with Snowball stemming
  • Temporal features: day-of-week and time-of-day categories
  • Structural metrics: article and title lengths with log(1 + x) transformation
  • Categorical encoding: OneHotEncoder with frequency thresholds
  • Normalization: StandardScaler on numerical features
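
The hand-crafted features in the list above can be sketched with the standard library alone. Everything named here is hypothetical: the set HIGH_PURITY_SOURCES, the src_ token prefix, the repeat count, and the four time-of-day buckets are assumptions for illustration, since this README does not give the actual values.

```python
import math
from datetime import datetime

# Assumed list of high-purity sources; only ESPN and PCWorld are named in this README.
HIGH_PURITY_SOURCES = {"ESPN", "PCWorld"}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def boost_source(text: str, source: str, repeats: int = 3) -> str:
    """Append a source token several times so it appears repeatedly in the text."""
    if source not in HIGH_PURITY_SOURCES:
        return text
    token = "src_" + source.lower()
    return text + " " + " ".join([token] * repeats)

def temporal_features(ts: datetime) -> dict:
    """Map a timestamp to a day-of-week name and a coarse time-of-day bucket."""
    if ts.hour < 6:
        part = "night"
    elif ts.hour < 12:
        part = "morning"
    elif ts.hour < 18:
        part = "afternoon"
    else:
        part = "evening"
    return {"day_of_week": DAY_NAMES[ts.weekday()], "time_of_day": part}

def log_length(text: str) -> float:
    """Length in characters under the log(1 + x) transformation."""
    return math.log1p(len(text))
```

The log(1 + x) transformation keeps a zero-length text at 0 and compresses the long tail of very long articles before scaling.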

A ColumnTransformer orchestrated all parallel feature extraction pipelines.

Model Selection

Multiple algorithms were evaluated:

Model            Macro F1
Naive Baseline   0.443
Random Forest    0.689
LinearSVC        0.703
SGDClassifier    0.726

SGDClassifier (SVM-style loss) was chosen for its effectiveness with high-dimensional sparse data and computational scalability.
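
As a rough illustration of the chosen setup, the following sketch tunes an SGDClassifier with GridSearchCV on macro F1. The hinge loss is one reading of "SVM-style loss"; class_weight="balanced" (one common response to class imbalance), the alpha grid, and the toy texts are all assumptions and not the project's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # loss="hinge" gives a linear SVM trained with stochastic gradient descent.
    ("clf", SGDClassifier(loss="hinge", class_weight="balanced", random_state=0)),
])

# Assumed search space: only the regularization strength is tuned here.
param_grid = {"clf__alpha": [1e-5, 1e-4, 1e-3]}
search = GridSearchCV(model, param_grid, scoring="f1_macro", cv=2)

# Made-up toy data with two of the seven categories.
texts = [
    "team wins the cup final", "striker scores two goals",
    "coach praises the defense", "league match ends in a draw",
    "new phone chip announced", "laptop maker ships update",
    "software patch fixes bug", "cloud service adds storage",
]
labels = ["Sports"] * 4 + ["Technology"] * 4

search.fit(texts, labels)
print(search.best_params_)
```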

Technologies Used

Language: Python
Libraries: Scikit-learn, Pandas, NumPy
Techniques: TF-IDF Vectorization, SGDClassifier, SVM, GridSearchCV

Results

  • Validation Macro F1: 0.726
  • Public Test Macro F1: 0.741
  • Competition Ranking: 3rd of 200 groups
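
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so every category counts equally regardless of its size. The sketch below computes it in plain Python on made-up labels. The function name macro_f1 is chosen for this example, and the numbers are unrelated to the competition results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["Sports", "Sports", "Health", "Business"]
y_pred = ["Sports", "Health", "Health", "Business"]
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case Sports and Health each score 2/3 and Business scores 1, giving a macro F1 of 7/9.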

Repository: GitHub