LSA with Randomized SVD

Latent semantic analysis of large text corpora using a custom rSVD implementation with GUI and semantic search

Project Description

This project demonstrates how to extract latent concepts from a large text corpus by reducing the dimensionality of the term-document matrix. Using rSVD (Randomized Singular Value Decomposition), the system achieves significantly faster performance than classical SVD on large matrices while maintaining high approximation accuracy.

Dataset

20 Newsgroups — ~18,000 Usenet posts across 20 topic categories. The dataset was preprocessed with Snowball Stemming and custom stop-word filtering.
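A minimal sketch of the preprocessing step described above, using NLTK's SnowballStemmer with a custom stop-word set (the stop-word list and function name here are illustrative, not the project's actual code):

```python
import re

from nltk.stem.snowball import SnowballStemmer

# Illustrative custom stop-word list; the project's real list may differ.
CUSTOM_STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}
stemmer = SnowballStemmer("english")

def preprocess(text: str) -> list[str]:
    # Lowercase, keep purely alphabetic tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in CUSTOM_STOPWORDS]
```

For example, `preprocess("The running dogs")` reduces both tokens to their stems, so morphological variants of a word collapse into one term-matrix row.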

Key Features

  • Custom rSVD: Implementation with Power Iterations to enhance singular value approximation accuracy.
  • Optimal Clustering: Automated search for the best number of clusters using the Silhouette Score.
  • GUI Mode: Interactive charts via Plotly and a built-in Semantic Search engine.
  • CLI Mode: Horizontal bar plots via Matplotlib for fast headless analysis.
  • Preprocessing Pipeline: Advanced tokenization with Snowball Stemming and custom stop-word lists.
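The built-in Semantic Search works by folding a query into the latent concept space and ranking documents by cosine similarity there. A minimal sketch under the standard LSA folding-in formula q_k = S^-1 U^T q (names and signatures are illustrative, not the project's API):

```python
import numpy as np

def semantic_search(query_vec, U_k, S_k, doc_vecs, top_n=5):
    """Hypothetical LSA semantic-search sketch.

    query_vec: TF-IDF vector of the query in term space, shape (m,)
    U_k, S_k:  truncated left singular vectors / singular values of the
               term-document matrix
    doc_vecs:  documents in concept space, one row per document
    """
    # Fold the query into the k-dimensional concept space: q_k = S^-1 U^T q.
    q_k = (U_k.T @ query_vec) / S_k
    # Rank documents by cosine similarity in concept space.
    sims = doc_vecs @ q_k / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k) + 1e-12
    )
    return np.argsort(sims)[::-1][:top_n]
```

Because similarity is measured in concept space rather than raw term space, a query can match documents that share no exact words with it, only related ones.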

Technical Details

The rSVD Algorithm

Classical SVD scales as O(mn·min(m,n)) — prohibitive for large matrices. The randomized approach projects the matrix onto a lower-dimensional subspace first, then applies SVD to the smaller matrix, achieving near O(mnk) complexity for rank-k approximation. Power Iterations further refine the approximation quality.
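The projection-then-decompose scheme above can be sketched in a few lines of NumPy. This is a generic illustration of the randomized SVD recipe (random Gaussian sketch, power iterations with QR re-orthonormalization, SVD of the small projected matrix), not the project's exact implementation; parameter names and defaults are assumptions:

```python
import numpy as np

def rsvd(A, k, n_oversamples=10, n_power_iter=2, seed=0):
    """Randomized SVD sketch: rank-k approximation of A (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1. Sketch: project A onto a random (k + p)-dimensional subspace.
    Omega = rng.standard_normal((n, k + n_oversamples))
    Y = A @ Omega
    # 2. Power iterations sharpen the subspace when singular values
    #    decay slowly; QR after each step keeps the basis stable.
    for _ in range(n_power_iter):
        Y, _ = np.linalg.qr(A @ (A.T @ Y))
    Q, _ = np.linalg.qr(Y)
    # 3. Exact SVD of the small matrix B = Q^T A, then lift U back up.
    B = Q.T @ A
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat[:, :k], S[:k], Vt[:k, :]
```

Only the (k + p) x n matrix B ever sees a full SVD, which is where the near O(mnk) cost comes from.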

Clustering Workflow

After dimensionality reduction, documents are clustered in the latent concept space. The Silhouette Score is computed across a range of k values to automatically identify the most cohesive grouping.
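The automated k search can be sketched as a loop over candidate cluster counts, keeping the one with the highest silhouette score (function and parameter names here are illustrative, not the project's actual API):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 10), seed=0):
    """Pick the cluster count with the highest silhouette score.

    X: documents in the latent concept space (docs x latent dims).
    """
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Return the best k along with all scores for inspection/plotting.
    return max(scores, key=scores.get), scores
```

The full score dictionary is returned as well, since plotting silhouette versus k is a useful sanity check on the chosen grouping.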

Technologies Used

Language: Python 3
Libraries: NumPy, SciPy, Scikit-Learn, Matplotlib, Plotly, NLTK
Techniques: Randomized SVD, LSA, K-Means, Silhouette Score, TF-IDF, Snowball Stemming
Interfaces: GUI (Plotly + Tkinter), CLI (Matplotlib)

Repository: GitHub