Lucio Baiocchi

Project Description

A supervised machine learning pipeline developed at Politecnico di Torino for categorizing web-scraped news articles into seven distinct topics. The project placed 3rd among 200 competing groups, with a macro F1 score of 0.741 on the public test set.

Classification Categories

International News
Business
Technology
Entertainment
Sports
General News
Health

Dataset

~80,000 labeled training articles and 20,000 unlabeled test samples. Each entry includes article text, title, news source, timestamp, and PageRank score.

Key Challenges

Class imbalance across the seven categories
Noisy HTML artifacts in raw article content
Semantic overlap between International News and General News
Short article texts that carry little signal
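
The HTML noise mentioned above could be handled with a small cleaning step before vectorization. The following is a minimal sketch using only the Python standard library; the function name clean_article and the regular expressions are illustrative assumptions, not the project's actual cleaning code.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")       # runs of whitespace, including non-breaking spaces


def clean_article(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)      # drop tags such as <p> or <a href=...>
    unescaped = html.unescape(no_tags)  # decode entities such as &amp; and &nbsp;
    return WS_RE.sub(" ", unescaped).strip()


sample = "<p>Stocks&nbsp;rose &amp; bonds fell.</p>"
cleaned = clean_article(sample)
print(cleaned)
```

A regex-based tag stripper is deliberately simple; heavily malformed markup may be better served by a real HTML parser.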

Feature Engineering

Metadata boosting: replicating the identifiers of high-purity news sources, i.e. sources whose articles fall almost entirely into one category (e.g. ESPN → Sports, PCWorld → Technology)
Text representations: word-level and character-level TF-IDF with Snowball stemming
Temporal features: day-of-week and time-of-day categories
Structural metrics: article and title lengths with log(1 + x) transformation
Categorical encoding: OneHotEncoder with frequency thresholds
Normalization: StandardScaler on numerical features
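
Two of the items above, metadata boosting and the log(1 + x) length features, can be sketched as follows. This is one possible reading, assuming that "replicating" means appending a source token several times so that TF-IDF weights it more heavily. The column names, the HIGH_PURITY_SOURCES set, the src_ token prefix, and the repeat count are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical set of sources that map almost exclusively to one category.
HIGH_PURITY_SOURCES = {"ESPN", "PCWorld"}


def boost_source(text: str, source: str, repeats: int = 3) -> str:
    """Append a source token several times so it gains weight in TF-IDF."""
    if source not in HIGH_PURITY_SOURCES:
        return text
    token = f"src_{source.lower()}"
    return text + " " + " ".join([token] * repeats)


df = pd.DataFrame({
    "article": ["Late goal wins the final", "Local council meets"],
    "title": ["Final decided late", "Council"],
    "source": ["ESPN", "Unknown Daily"],
})

# Metadata boosting: only rows from high-purity sources are modified.
df["boosted_text"] = [boost_source(t, s) for t, s in zip(df["article"], df["source"])]

# Structural metrics: character lengths compressed with log(1 + x).
df["log_article_len"] = np.log1p(df["article"].str.len())
df["log_title_len"] = np.log1p(df["title"].str.len())
```

The log transform keeps a few very long articles from dominating the scaled length features.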

A ColumnTransformer orchestrated all parallel feature extraction pipelines.

Model Selection

Multiple algorithms were evaluated:

Model	Macro F1
Naive Baseline	0.443
Random Forest	0.689
LinearSVC	0.703
SGDClassifier	0.726

SGDClassifier (SVM-style loss) was chosen for its effectiveness with high-dimensional sparse data and computational scalability.
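
As a rough illustration of the chosen model, the sketch below trains an SGDClassifier on TF-IDF features and scores it with macro F1. It assumes that "SVM-style loss" refers to the hinge loss, and it adds class_weight="balanced" as one common response to class imbalance; neither setting is confirmed by the project description. The toy texts and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Invented toy data covering three of the seven categories.
texts = [
    "team wins the cup final", "striker scores late goal",
    "new phone chip released", "laptop maker ships update",
    "stocks rally on earnings", "bank raises profit forecast",
]
labels = ["Sports", "Sports", "Technology", "Technology", "Business", "Business"]

model = make_pipeline(
    TfidfVectorizer(),
    # Hinge loss gives a linear SVM trained with stochastic gradient descent.
    SGDClassifier(loss="hinge", class_weight="balanced", random_state=0),
)
model.fit(texts, labels)

pred = model.predict(texts)
# Macro F1 averages per-class F1 scores, so every class counts equally.
macro_f1 = f1_score(labels, pred, average="macro")
print(macro_f1)
```

The score here is computed on the training texts only to keep the example short; a held-out validation split is needed for a meaningful estimate.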

Technologies Used

Language: Python
Libraries: Scikit-learn, Pandas, NumPy
Techniques: TF-IDF Vectorization, SGDClassifier, SVM, GridSearchCV
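
The listed GridSearchCV step could be combined with the classifier roughly as follows. The parameter grid, the toy data, and the two-fold split are assumptions made for illustration and do not reflect the search space actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data: four articles per class so a 2-fold split is possible.
texts = [
    "team wins the cup final", "striker scores late goal",
    "coach praises the defence", "club signs new keeper",
    "new phone chip released", "laptop maker ships update",
    "browser patch fixes bug", "startup launches cloud tool",
]
labels = ["Sports"] * 4 + ["Technology"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(loss="hinge", random_state=0)),
])

# Step names prefix the parameter names, e.g. clf__alpha.
param_grid = {
    "clf__alpha": [1e-4, 1e-3],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)
best = search.best_params_
print(best)
```

Scoring with f1_macro keeps model selection aligned with the competition metric.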

Results

Validation Macro F1: 0.726
Public Test Macro F1: 0.741
Competition Ranking: 3rd of 200 groups

Repository link: GitHub

Newspaper Classificator