Comparative Analysis of Clustering and Dimensionality Reduction Techniques

Overview

This project, completed for the Statistical Learning for Data Science course, presents a comprehensive comparison of clustering and dimensionality reduction methods: K-means, Soft K-means, PCA, and Linear AutoEncoder. Experiments were conducted on both tabular data (wheat seed dataset) and image data to evaluate each algorithm’s strengths and failure modes across different data types.

[Figure: PCA and AutoEncoder dimensionality reduction results]

Results

Clustering on the Wheat Seed Dataset (210 samples, 7 features, 3 true classes):

Algorithm            Initialization   Split & Merge   Accuracy Score
K-means (k=3)        Optimized        No              0.7166
Soft K-means (k=3)   Optimized        Yes             0.6717
K-means (k=10)       Optimized        No              0.3750
Soft K-means (k=10)  Random           Yes             0.6085

Key finding: when k matches the true number of classes (k=3), hard K-means outperforms Soft K-means. When k is over-specified (k=10), the flexible soft assignments of Soft K-means substantially outperform hard assignments.
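For reference, the maximum-pairwise-distance initialization behind the "Optimized" runs above can be sketched in a few lines of numpy. This is a minimal illustration on toy data, not the project code; the function name is mine:

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Pick k initial centroids greedily: start from a random sample,
    then repeatedly add the point farthest from all chosen centroids."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance from every point to its nearest chosen centroid
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
            axis=1,
        )
        centroids.append(X[np.argmax(d)])
    return np.stack(centroids)

# toy data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(loc=c, scale=0.1, size=(20, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])
C = farthest_point_init(X, k=3)
```

Because the greedy rule maximizes spread, each chosen centroid lands in a different blob, which is exactly what prevents the degenerate clusterings that random initialization can produce.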

Dimensionality Reduction on Image Data:

  • PCA and Linear AutoEncoder produce visually comparable reconstructions for n=1,2,3 principal components.
  • When n=3, the AutoEncoder loss converges most rapidly, as the larger latent dimension captures more of the image's structure.
  • Subsequent Soft K-means color clustering (k=1,2,3) produced meaningful image compression at all tested dimensions.

Technical Details

  • K-means Enhancements:
    • Optimized centroid initialization: Selected initial centroids by maximum pairwise distance rather than random sampling, consistently avoiding degenerate clustering.
    • Non-local split and merge: Split the largest cluster and merged the two closest; this let the algorithm escape local minima, though the size of the improvement varied between runs.
  • Soft K-means:
    • Responsibilities computed as r_k^(n) = exp(−β·d(m_k, x^(n))) / Σ_j exp(−β·d(m_j, x^(n))).
    • Explored the effect of β: high β approaches hard K-means; low β causes cluster centers to collapse toward each other.
  • PCA:
    • Covariance matrix eigendecomposition; top-M eigenvectors used as projection basis.
    • Image reconstructed by projecting to M-dimensional subspace and back.
  • Linear AutoEncoder:
    • Encoder z = Wx, decoder x̂ = Vz; trained by minimizing MSE reconstruction loss.
    • With linear activations, mathematically equivalent to PCA; trained via gradient descent.
  • Implementation: All algorithms implemented from scratch in Python without ML frameworks for clustering; PyTorch used for the AutoEncoder.
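The responsibility formula and center update above can be sketched in numpy. This is a minimal illustration on toy data (Euclidean distances, my own function names), not the project code:

```python
import numpy as np

def responsibilities(X, M, beta):
    """r[n, k] = exp(-beta * d(m_k, x_n)) / sum_j exp(-beta * d(m_j, x_n))."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # (n_points, k)
    logits = -beta * d
    logits -= logits.max(axis=1, keepdims=True)  # numerical stabilization
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def soft_kmeans_step(X, M, beta):
    """One update: each center moves to the responsibility-weighted mean."""
    r = responsibilities(X, M, beta)
    return (r.T @ X) / r.sum(axis=0)[:, None]

# two tight groups of points and two starting centers
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
M0 = np.array([[1.0, 0.0], [4.0, 0.0]])
M_hard = soft_kmeans_step(X, M0, beta=100.0)  # ~hard assignments
M_soft = soft_kmeans_step(X, M0, beta=0.01)   # centers pulled toward the global mean
```

Running one step with the two β values shows the behavior described above: at β=100 each center moves to the mean of its own group, while at β=0.01 the near-uniform responsibilities drag both centers toward each other.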

Challenges

  • K-means sensitivity to initialization: Random initialization frequently led to incorrect cluster assignments on the wheat dataset. The optimized initialization method (furthest-distance selection) reliably improved results.
  • β tuning in Soft K-means: Too low a β caused cluster centers to collapse toward each other; too high a β reproduced hard K-means behavior exactly, removing the benefit of soft assignment.
  • AutoEncoder convergence: Loss curves for n=1 were much flatter than for n=3, reflecting that a lower-dimensional projection has fewer degrees of freedom with which to reduce reconstruction error.

Reflection and Insights

The contrast between K-means and Soft K-means under different k values illustrates a fundamental trade-off: hard, definitive assignments are powerful when the number of clusters is known, but soft probabilistic assignments provide robustness when cluster assumptions are violated. The equivalence between a Linear AutoEncoder and PCA — proven analytically and confirmed empirically — demonstrated that gradient-based learning can recover classical statistical results, offering a framework for extending dimensionality reduction to nonlinear settings.
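The PCA/linear-AutoEncoder equivalence can be checked numerically. The sketch below uses synthetic data and plain numpy gradient descent (rather than the project's PyTorch code): it trains a linear autoencoder and compares its reconstruction error against the PCA optimum, which any rank-M linear reconstruction can at best match:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 5-D data whose variance lies mostly in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))
X -= X.mean(axis=0)

# PCA: project onto the top-M eigenvectors of the covariance matrix and back
M = 2
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
U = eigvecs[:, -M:]                      # top-M principal directions
pca_err = np.mean((X - X @ U @ U.T) ** 2)

# linear autoencoder z = W x, x_hat = V z, trained by gradient descent on MSE
W = 0.1 * rng.normal(size=(M, 5))
V = 0.1 * rng.normal(size=(5, M))
lr = 0.01
for _ in range(2000):
    Z = X @ W.T                  # encode
    Xh = Z @ V.T                 # decode
    G = 2 * (Xh - X) / len(X)    # gradient of the per-sample squared error
    V -= lr * (G.T @ Z)          # dL/dV
    W -= lr * ((G @ V).T @ X)    # dL/dW via the chain rule through Z
ae_err = np.mean((Xh - X) ** 2)
# pca_err <= ae_err always holds; a converged linear AE approaches equality
```

The autoencoder never beats PCA here, and after training its error closes in on the PCA value, which is the empirical side of the analytical equivalence described above.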

Team and Role

  • Solo project: All algorithms implemented, experiments designed, and analysis conducted independently.

Mathematical Contest In Modeling

Overview

As part of the Mathematical Contest in Modeling (MCM), our team developed models to analyze and predict trends in the game “Wordle.” Leveraging ARIMA for time-series forecasting, BP neural networks for player behavior prediction, and K-means clustering for difficulty classification, we provided actionable insights for game developers and researchers. The project demonstrated the applicability of machine learning techniques to analyze complex patterns in player data.

Results

  • Achievements:
    • Modeled and predicted participant numbers with ARIMA, achieving high accuracy (R² = 0.982).
    • Classified Wordle words into “easy,” “medium,” and “difficult” categories using K-means clustering.
    • Predicted distribution of attempts for words based on their features with BP neural networks.
  • Model Outputs:
    • Predicted the number of participants on March 1, 2023, to be between 10,288 and 10,624.
    • Classified the word EERIE as medium difficulty based on clustering results.
    • Visualized predicted attempt distributions for multiple words, closely matching observed player behavior.

Technical Details

  • Time-Series Forecasting (ARIMA):
    • Applied ARIMA(1,1,0) to model participant trends over time.
    • Conducted statistical tests (ADF and Ljung-Box) to ensure model validity.
    • Predicted the number of participants with high reliability.
  • Attempts Percentage Prediction (BP Neural Network):
    • Extracted word features, including isolation level, priority, and elimination value.
    • Designed and trained a BP neural network on 70% of the data to predict attempt distributions.
    • Key Metrics: RMSE = 3.57, MAPE = 45.28%.
  • Difficulty Classification (K-means):
    • Clustered words into three difficulty levels using player attempt distributions.
    • Evaluated and classified the word EERIE as medium difficulty based on clustering results.
    • Improved model robustness by excluding outliers and noisy data.
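For reference, ARIMA(1,1,0) reduces to an AR(1) model on the once-differenced series. The sketch below fits it by ordinary least squares on synthetic data (not the contest dataset, and with no intercept term, so it is a simplification of the model actually used):

```python
import numpy as np

def arima_110_forecast(y, steps=1):
    """Fit ARIMA(1,1,0) by OLS: difference once, regress dy_t on dy_{t-1},
    then integrate the AR(1) forecasts back to the original level."""
    dy = np.diff(y)
    x, t = dy[:-1], dy[1:]
    phi = (x @ t) / (x @ x)          # AR(1) coefficient (no intercept)
    preds, level, last_diff = [], y[-1], dy[-1]
    for _ in range(steps):
        last_diff = phi * last_diff  # forecast the next difference
        level = level + last_diff    # undo the differencing
        preds.append(level)
    return np.array(preds)

# synthetic trending series resembling a slowly changing participant count
rng = np.random.default_rng(0)
y = np.cumsum(5 + rng.normal(size=100))
f = arima_110_forecast(y, steps=3)
```

In practice one would also run the ADF and Ljung-Box checks mentioned above before trusting the fit; a library such as statsmodels handles both the estimation and the diagnostics.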

Challenges

  • Data limitations: A small dataset (~359 samples) constrained model generalization.
  • Non-linear relationships between features and outcomes made interpretability challenging.
  • Overfitting risks in ARIMA required careful parameter tuning and validation.

Reflection and Insights

This competition provided valuable experience in applying statistical and machine learning techniques to real-world problems. It reinforced the importance of data cleaning, feature extraction, and model validation. The integration of multiple models (ARIMA, BP neural networks, and K-means) highlighted the versatility of these methods in addressing diverse challenges.

Team and Role

  • Team: Collaborated with two teammates on data analysis, modeling, and report writing.
  • My Role:
    • Played a leading role in modeling tasks, including the design and implementation of ARIMA, BP neural networks, and K-means clustering.
    • Assisted in programming tasks, focusing on feature engineering and model optimization.
    • Contributed significantly to the report writing by summarizing results, interpreting findings, and drafting technical sections.