LSA with Randomized SVD
Latent semantic analysis of large text corpora using a custom rSVD implementation with GUI and semantic search
Project Description
This project demonstrates how to extract latent concepts from a large text corpus by reducing the dimensionality of the term-document matrix. By using rSVD (Randomized Singular Value Decomposition), the system achieves significantly faster performance than classical SVD on large matrices while maintaining high approximation accuracy.
Dataset
20 Newsgroups — ~18,000 Usenet posts across 20 topic categories. The dataset was preprocessed with Snowball Stemming and custom stop-word filtering.
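A minimal sketch of such a pipeline, assuming scikit-learn and NLTK are available; the stop-word set and `tokenize` helper below are illustrative, not the project's actual code (the real corpus would come from `sklearn.datasets.fetch_20newsgroups`):

```python
import re

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset of a custom stop-word list (assumption)
STOP = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
stemmer = SnowballStemmer("english")

def tokenize(text):
    # Lowercase, keep alphabetic runs, drop stop words, then stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP]

# Tiny stand-in corpus; in practice: fetch_20newsgroups(subset="all").data
docs = [
    "Graphics cards render images quickly",
    "The hockey game ended in overtime",
]

vec = TfidfVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None)
X = vec.fit_transform(docs)  # documents x stemmed terms; transpose for term-document
```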
Key Features
- Custom rSVD: Implementation with Power Iterations to enhance singular value approximation accuracy.
- Optimal Clustering: Automated search for the best number of clusters using the Silhouette Score.
- GUI Mode: Interactive charts via Plotly and a built-in Semantic Search engine.
- CLI Mode: Horizontal bar plots via Matplotlib for fast headless analysis.
- Preprocessing Pipeline: Advanced tokenization with Snowball Stemming and custom stopword lists.
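One standard way to implement LSA semantic search, shown here as a hedged sketch (the function names, the fold-in formula, and the toy matrix are illustrative assumptions, not the project's exact code): fold the query's TF-IDF vector into the concept space via the truncated factors, then rank documents by cosine similarity.

```python
import numpy as np

def fold_in_query(q_tfidf, U_k, s_k):
    """Project a TF-IDF query vector (terms,) into the k-dim concept space.

    With A ~ U_k S_k V_k^T (terms x docs), documents live in rows of V_k S_k,
    and a query folds in as q^T U_k S_k^{-1}.
    """
    return (q_tfidf @ U_k) / s_k

def search(q_tfidf, U_k, s_k, doc_vecs, top_n=5):
    q = fold_in_query(q_tfidf, U_k, s_k)
    # Cosine similarity against each document's concept-space vector
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(sims)[::-1][:top_n]

# Tiny illustrative term-document matrix (terms x docs); real data would be TF-IDF
A = np.array([[3., 0., 0.],
              [3., 0., 0.],
              [0., 2., 0.],
              [0., 0., 1.],
              [0., 0., 1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T * s[:k]                 # each row: one document in concept space
query = np.array([1., 1., 0., 0., 0.])      # query hits the terms of document 0
top = search(query, U[:, :k], s[:k], doc_vecs, top_n=1)
```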
Technical Details
The rSVD Algorithm
Classical SVD scales as O(mn·min(m,n)), which is prohibitive for large matrices. The randomized approach first projects the matrix onto a lower-dimensional subspace, then applies SVD to the much smaller projected matrix, achieving roughly O(mnk) complexity for a rank-k approximation. Power iterations refine the approximation, which matters most when the singular spectrum decays slowly.
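The steps above can be sketched in NumPy; this is a generic textbook-style rSVD, and the parameter names (`n_oversample`, `n_iter`) are assumptions rather than the project's actual API:

```python
import numpy as np

def rsvd(A, k, n_oversample=10, n_iter=2, seed=0):
    """Sketch of randomized SVD: rank-k approximation of A (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1. Random Gaussian projection onto a (k + oversampling)-dim subspace
    Omega = rng.standard_normal((n, k + n_oversample))
    Y = A @ Omega
    # 2. Power iterations sharpen the captured spectrum (a production version
    #    would re-orthonormalize between iterations for numerical stability)
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    # 3. Orthonormal basis for the sampled range of A
    Q, _ = np.linalg.qr(Y)
    # 4. Exact SVD of the small matrix B = Q^T A, then lift back
    B = Q.T @ A
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :k], s[:k], Vt[:k, :]
```

Since the SVD in step 4 runs on a (k + oversample) x n matrix instead of m x n, the expensive work scales with k rather than min(m, n).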
Clustering Workflow
After dimensionality reduction, documents are clustered in the latent concept space. The Silhouette Score is computed across a range of k values to automatically identify the most cohesive grouping.
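A minimal sketch of that selection loop with scikit-learn, assuming K-Means as the clusterer and silhouette maximization as the selection rule (the `best_k` helper and the synthetic demo data are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k(X, k_values=range(2, 7), seed=0):
    """Return the k whose K-Means labeling maximizes the silhouette score."""
    scores = {
        k: silhouette_score(
            X, KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        )
        for k in k_values
    }
    return max(scores, key=scores.get)

# Synthetic demo: three well-separated blobs stand in for the LSA document vectors
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
k_best = best_k(X_demo)
```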
Technologies Used
Language: Python 3
Libraries: NumPy, SciPy, Scikit-Learn, Matplotlib, Plotly, NLTK
Techniques: Randomized SVD, LSA, K-Means, Silhouette Score, TF-IDF, Snowball Stemming
Interfaces: GUI (Plotly + Tkinter), CLI (Matplotlib)
Repository: GitHub