IDSS: Interpretable Diving Action Quality Assessment Platform

Overview

Built IDSS (Interpretable Diving Scoring System), an end-to-end action quality assessment framework for competitive diving, as a course project at CMU (Fall 2025), with teammates Xin Lin and Vincent Nie.

Standard AQA systems output a numeric score but provide no explanation — coaches and athletes cannot learn from them. IDSS addresses this by combining a procedure-aware deep learning backbone (based on FineDiving [Xu et al., CVPR 2022]) with a Heuristic Quality Assessment Pipeline (HQAP): a rule-based, pose-driven system that computes five physically interpretable kinematic indicators and uses them as auxiliary supervision during training.

The result: IDSS not only outperforms the baseline on all metrics but also generates structured, athlete-readable diagnostic reports with frame-aligned GIF visual evidence.

Results

Performance on the FineDiving dataset (3000 diving videos, 52 action types):

| Metric | Baseline | IDSS | Improvement |
|---|---|---|---|
| Spearman Rank Correlation (ρ) | 0.9272 | 0.9302 | +0.33% |
| tIoU@0.5 | 0.9373 | 0.9559 | +1.99% |
| tIoU@0.75 | 0.5407 | 0.5714 | +5.68% |
| Relative L2 Distance (R-ℓ2 ×100) | 0.3313 | 0.3099 | −6.46% |

Convergence acceleration: IDSS achieves performance comparable to the baseline’s best 200-epoch checkpoint in ~30 epochs, and surpasses it by epoch 88. At epoch 10, IDSS achieves a 62.56% improvement in R-ℓ2 over the baseline — the pose supervision provides an immediate learning signal.

Technical Details

Heuristic Quality Assessment Pipeline (HQAP)

HQAP runs three models in parallel on each video frame to extract object and pose signals:

  • Two Detectron2 (Mask R-CNN) instances for platform, splash, and diver detection.
  • One HRNet model for 16-keypoint diver pose estimation.

Raw outputs are fused into a per-video JSON. Cleaning: a Savitzky–Golay filter smooths pose trajectories (interior gaps only), linear interpolation stabilizes the platform track, and splash data is kept raw.

Five kinematic metrics, each computed only during its relevant dive phase:

  1. Somersault Tightness (TUCK phase): shoulder–hip–knee angle; lower = tighter tuck. Temporal mean.
  2. Body Straightness (ENTRY phase): shoulder–hip–ankle angle; 180° = perfectly straight. Temporal mean.
  3. Entry Verticalness (ENTRY phase): body vector vs. vertical; 0° = perfect vertical entry. Temporal mean.
  4. Splash Size (ENTRY phase): total pixel area of splash bounding boxes. Maximum over entry frames.
  5. Distance from Platform (PIKE/TUCK phases): absolute horizontal distance from diver hip to smoothed platform centroid (sliding window average). 5th percentile value (robust minimum distance during flight).
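As an illustration, the angle-based metrics all reduce to a vector-angle computation at a joint. A minimal sketch, not the project code — keypoint names and the frame format are hypothetical:

```python
import math

def joint_angle(a, b, c):
    """Angle at vertex b (degrees) formed by points a-b-c, e.g. shoulder-hip-knee."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    # clamp to [-1, 1] to guard against floating-point drift before acos
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def somersault_tightness(frames):
    """Temporal mean of the shoulder-hip-knee angle over TUCK-phase frames."""
    angles = [joint_angle(f["shoulder"], f["hip"], f["knee"]) for f in frames]
    return sum(angles) / len(angles)
```

Collinear shoulder–hip–ankle points give 180° (perfectly straight body); a right-angle tuck gives 90°.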

For the Distance metric, the score distribution is bimodal, corresponding to two natural dive classes — a Gaussian Mixture Model clusters the scores and assigns one of four labels (Too Close / Close / Reasonable / Far).

Each metric is converted to a 3-tier label (“Excellent” / “Average” / “Need Improvement”) using the 25th/75th percentile thresholds across the training set.
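A minimal sketch of this thresholding, assuming lower metric values are better by default (the direction flag is an illustrative assumption, not project code):

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of floats."""
    s = sorted(values)
    k = (len(s) - 1) * q / 100.0
    lo, hi = int(k), min(int(k) + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def tier(value, train_values, lower_is_better=True):
    """3-tier label from 25th/75th percentile thresholds over the training set."""
    p25, p75 = percentile(train_values, 25), percentile(train_values, 75)
    if lower_is_better:
        if value <= p25: return "Excellent"
        if value <= p75: return "Average"
        return "Need Improvement"
    if value >= p75: return "Excellent"
    if value >= p25: return "Average"
    return "Need Improvement"
```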

IDSS Model Architecture

The backbone follows the procedure-aware FineDiving formulation:

  1. Temporal Segmentation: I3D features → segmentation module predicts L step-transition probabilities, dividing each dive into L+1 sub-actions (take-off, flight, entry, …).
  2. Procedure-Aware Cross-Attention: step-level features from query and exemplar videos passed through Multi-head Cross-Attention to capture relative quality differences per phase.
  3. Multi-Task Regression Head:
    • Score head: predicts relative score difference per step, aggregated to final AQA score.
    • Pose metric head (auxiliary): predicts the 5-dimensional HQAP vector from the same procedure-aware embeddings.

Joint loss: L = L_AQA (pairwise MSE) + L_TAS (temporal segmentation BCE) + λ · L_Pose (pose metric MSE). The auxiliary pose supervision acts as a structural prior, guiding learned features toward physically interpretable quality signals.

Report Generation

A deterministic statistics-driven template system computes each metric’s percentile, retrieves the matching natural-language template from a pre-defined library, and dynamically fills in the precise value and qualitative evaluation tier. No LLM required. Reports rendered as interactive HTML dashboards with frame-aligned GIFs highlighting the detected issue windows — enabling visual verification by athletes and coaches.
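The template mechanism can be sketched roughly as follows — the template strings, metric names, and percentile-to-tier mapping are illustrative, not the project’s actual library (direction handling is also simplified: a lower percentile is treated as better):

```python
TEMPLATES = {  # hypothetical template library keyed by (metric, tier)
    ("splash_size", "Excellent"): "Splash was minimal ({value:.0f} px², top {pct:.0f}% of dives).",
    ("splash_size", "Need Improvement"): "Splash area of {value:.0f} px² is larger than {pct:.0f}% of training dives.",
}

def render_line(metric, value, train_values):
    """Percentile rank -> tier -> filled template (deterministic, no LLM)."""
    rank = sum(v <= value for v in train_values) / len(train_values) * 100
    label = "Excellent" if rank <= 25 else "Need Improvement" if rank >= 75 else "Average"
    tmpl = TEMPLATES.get((metric, label), "{metric}: {value:.2f} ({label}).")
    return tmpl.format(metric=metric, value=value, pct=rank, label=label)
```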

A lightweight Flask web interface handles video upload, processing dispatch, and report delivery.

Challenges

  1. Pose estimation reliability during water entry: HRNet confidence degrades when limbs are submerged. Handled by Savitzky-Golay smoothing on pose trajectories and preserving NaN values at sequence ends rather than extrapolating — preventing corrupted late-frame estimates from affecting metric aggregation.

  2. Bimodal distance distribution: The distance-from-platform metric has a dual-peak histogram due to structurally different dive types. A simple percentile threshold would mislabel one cluster entirely. GMM fitting discovered the two underlying distributions and enabled cluster-conditioned labeling.

  3. Auxiliary supervision calibration: Incorrect λ weighting for the pose loss destabilized early training. Tuning λ and verifying that pose metric regression errors decreased monotonically before score prediction improved confirmed that the auxiliary head was providing useful rather than noisy gradients.

Reflection and Insights

The central lesson: interpretability and performance are complementary, not opposed, when the interpretable component encodes genuine domain knowledge. The HQAP metrics are not post-hoc explanations added after training — they are causal signals that correlate directly with score deductions. Using them as auxiliary training supervision rather than just evaluation labels is what drove both the accuracy improvements and the convergence acceleration. The structural prior “works” precisely because it is physically grounded.

The convergence result is particularly striking: reaching 200-epoch baseline quality in 30 epochs means the interpretable supervision provides a strong inductive bias that dramatically reduces the search space the optimizer must explore. This generalizes: domain-specific supervision is often more sample-efficient than scaling model size or training longer.

Stack

Python, PyTorch, Detectron2, HRNet, I3D, OpenCV, Flask, HTML/CSS (report generation), FineDiving dataset

Cloud-Native IoT Network Security Analysis Pipeline

Overview

Built a cloud-native, end-to-end intrusion detection pipeline for IoT network traffic as a course project in 18763 System Toolchains (Fall 2025, CMU), in collaboration with Yiqiao Zhou. The system processes the MQTTset dataset — 20 million MQTT records from a simulated smart home environment — to perform 6-class attack classification across normal traffic and five attack types (brute force, DoS, flood, SlowITe, malformed packet).

The pipeline is structured as four tasks: database design and population (Task I), large-scale analytics (Task II), ML modeling (Task III), and full cloud deployment on GCP (Task IV/Bonus).

Results

Machine Learning (Task III):

| Model | Framework | Best Test Accuracy |
|---|---|---|
| Logistic Regression | Spark ML | 74.97% |
| Shallow MLP | PyTorch | 75.40% |
| Deep MLP | PyTorch | 75.40% |
| Random Forest | Spark ML | 78.52% |

Random Forest (30 trees, depth 7) was the best-performing model — the tree-based ensemble’s non-linear boundaries were more effective than neural networks on this structured tabular feature set. Key finding: maxDepth=7 vs maxDepth=5 improved test accuracy by ~4 percentage points, while increasing from 30 to 50 trees provided no benefit.

Feature engineering: 47-dimensional feature vectors generated from 34 raw MQTT/TCP columns via one-hot encoding of 4 categorical flag fields (32 dims), standardization of 15 numerical features, and removal of constant columns.
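A toy version of this assembly, in plain Python for clarity (the real pipeline uses Spark ML transformers; field names and vocabularies here are made up):

```python
def one_hot(value, vocab):
    """One-hot encode a categorical value against a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocab]

def fit_stats(column):
    """Mean and (population) std for standardizing a numeric column."""
    mean = sum(column) / len(column)
    std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
    return mean, std

def assemble_row(row, cat_fields, vocabs, num_fields, stats):
    """Concatenate one-hot categorical dims and standardized numeric dims."""
    feats = []
    for f in cat_fields:
        feats += one_hot(row[f], vocabs[f])
    for f in num_fields:
        mean, std = stats[f]
        feats.append((row[f] - mean) / (std or 1.0))
    return feats
```

With 4 flag vocabularies totaling 32 categories plus 15 numeric columns, the concatenation yields the 47-dim vector described above.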

Technical Details

Data Ingestion (Task I):

  • Loaded train70_augmented.csv (14M rows) and test30_augmented.csv (6M rows) from GCS into a Dockerized PostgreSQL 16 database via JDBC, running on a GCE VM.
  • 16-partition parallel JDBC write; combined dataset includes a split column distinguishing train/test.

Analytics (Task II — PySpark):

  • Average MQTT message length by attack class; TCP statistics and MQTT header flag distributions.
  • Top TCP flags filtered by time delta; target class distribution histograms.
  • Kafka streaming pipeline (Task II-Q5): YouTube API producer publishes cybersecurity video comments to a Kafka topic; Spark Streaming consumer performs real-time keyword-frequency analysis.

Distributed Feature Engineering (Task III — Spark ML Pipeline):

  • Removed 10 near-zero-variance columns (stddev < 1e-6).
  • StringIndexer + OneHotEncoder on 4 categorical flag columns → 32 dimensions.
  • VectorAssembler + StandardScaler on 15 numerical columns → 15 dimensions.
  • Combined via VectorAssembler → 47-dim features column; labels indexed to integers 0–5.
  • Full dataset checkpointed to GCS Parquet to break the Spark execution graph before training.

ML Models:

  • Spark ML: Logistic Regression (L1/Lasso regularization, regParam=0.001) and Random Forest (numTrees=30, maxDepth=7, subsamplingRate=0.5), tuned via 80/20 TrainValidationSplit.
  • PyTorch: Shallow MLP (47→96→128→6, ~7K params) and Deep MLP (47→128→128→64→6 with BatchNorm+Dropout, ~25K params). Features exported to GCS Parquet, loaded via custom ParquetArrayIterable DataLoader. Adam optimizer, linear warmup (10K steps), gradient clipping, early stopping.

Cloud Deployment (Task IV):

  • All three notebooks run on a GCP Dataproc cluster (Apache Spark 3.5.3) with JupyterLab.
  • PostgreSQL runs in a Docker container on a separate GCE VM, accessed via private IP within a VPC-secured network — no managed Cloud SQL.
  • Full pipeline: GCS CSV → Spark ingestion → Dockerized PostgreSQL → Spark analytics → Spark ML + PyTorch training.

Challenges

  1. JDBC write contention at scale: 20 parallel Spark tasks writing to the same PostgreSQL table caused severe lock contention. Diagnosed via pg_locks and slow-query logs; resolved by pre-hashing records into non-overlapping partition ranges before the JDBC write.
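The pre-hashing idea can be sketched as follows — a stable hash assigns each record to one of 16 disjoint buckets, so writer tasks never contend for the same key range (the key scheme is hypothetical):

```python
import hashlib

def partition_id(key, num_partitions=16):
    """Deterministic bucket for a record key; stable across processes."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def split_into_partitions(keys, num_partitions=16):
    """Group keys into disjoint buckets, one per parallel writer task."""
    buckets = [[] for _ in range(num_partitions)]
    for k in keys:
        buckets[partition_id(k, num_partitions)].append(k)
    return buckets
```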

  2. Parquet-to-PyTorch data pipeline: Spark’s sparse vector format is not directly consumable by PyTorch. Implemented a custom ParquetArrayIterable class using pyarrow.dataset to convert sparse vectors to dense tensors in a streaming fashion, with configurable batch sizes and value clipping ([-10, 10]) for training stability.
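The core conversion inside such a loader might look like this (a simplified stand-alone sketch; the real class streams record batches via pyarrow.dataset):

```python
def sparse_to_dense(size, indices, values, clip=(-10.0, 10.0)):
    """Expand a Spark-ML-style sparse vector (size, indices, values) into a
    dense list, clipping values into [clip_lo, clip_hi] for training stability."""
    lo, hi = clip
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = min(hi, max(lo, v))
    return dense
```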

  3. Neural networks underperforming tree-based models: Shallow and Deep MLPs achieved only 75.40% — 3.5pp below Random Forest. Root cause: the engineered tabular features with protocol flag one-hot vectors are well-suited for tree-based splits that can isolate specific flag combinations, whereas MLP architectures require different inductive biases to capture equivalent patterns.

Reflection and Insights

This project demonstrated concretely why Random Forest often outperforms neural networks on structured tabular data: the tree’s per-split logic naturally handles mixed feature types (categorical one-hot + numerical) without requiring normalization, and ensemble voting reduces variance effectively. The Shallow and Deep MLPs achieved nearly identical accuracy, confirming that added depth and regularization provided no benefit once the architectural bottleneck was at feature representation rather than model capacity.

The infra experience also made explicit how the choice of data partitioning scheme (JDBC write partitioning, GCS Parquet layout) dominates end-to-end pipeline throughput more than algorithmic choices.

Stack

GCP (Dataproc, GCS, GCE VM, VPC), Apache Spark 3.5.3 (PySpark), PyTorch, PostgreSQL 16 (Docker), JDBC, Apache Kafka, Neo4j, Python 3.9+

Intelligent Scissors: Interactive Image Segmentation Tool

Overview

Implemented an Intelligent Scissors (Live-Wire) interactive image segmentation tool in Java, as a course project at CMU (Spring 2025), with teammates Tingjun Huang and Guoheng Ma. The tool allows a user to click seed points on an image and have the system automatically trace optimal boundaries between objects in real time.

The implementation models the image as an 8-neighborhood weighted pixel graph with edge costs derived from Sobel gradient magnitudes, then computes live boundary paths via Dijkstra and A* shortest-path search from the last seed to the current cursor position.

Results

Processing time (measured on Intel Core i9-12900H, 32 GB RAM, Windows 11, vs. Adobe Photoshop 2020 manual polygon-lasso):

| Test Image | Photoshop | Intelligent Scissors | Speedup |
|---|---|---|---|
| Image 1 (Mickey Mouse cartoon, low-contrast boundary, 661×494 px) | 551 s | 106 s | 5.2× |
| Image 2 (two parrots, high-contrast textured, 707×480 px) | 279 s | 25 s | 11.2× |

Segmentation accuracy:

| Test Image | IoU | Dice Coefficient |
|---|---|---|
| Image 1 | 0.88 | 0.94 |
| Image 2 | 0.97 | 0.99 |

Both results exceed the 0.85 Dice threshold commonly considered “high-quality” in interactive segmentation literature.

Technical Details

Image Processing and Cost Function:

  • Input image converted to grayscale (ITU-R BT.709 weighted average).
  • Sobel operators compute horizontal and vertical gradients G_x, G_y; gradient magnitude G(x,y) = sqrt(G_x² + G_y²).
  • Pixel cost: C(x,y) = 1 / (1 + G(x,y)) — strong edges yield low cost, making them preferred by the pathfinder.
  • Border regions optionally enhanced by boosting edge strength at image boundaries to ensure complete closed paths.
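The cost computation can be sketched in a few lines (plain Python rather than the project’s Java; border handling is simplified to leaving boundary pixels at maximum cost):

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradient_cost(gray):
    """Per-pixel cost C = 1 / (1 + |G|) from Sobel gradients, so strong
    edges (large |G|) get near-zero cost and attract the pathfinder."""
    h, w = len(gray), len(gray[0])
    cost = [[1.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            cost[y][x] = 1.0 / (1.0 + (gx * gx + gy * gy) ** 0.5)
    return cost
```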

Graph and Shortest-Path Search:

  • 8-neighborhood graph: each pixel node is connected to its 8 neighbors; edge weight = d · (C(i,j) + C(k,l)) / 2, where d = 1 for orthogonal neighbors and √2 for diagonal neighbors.
  • Dijkstra computes shortest paths from the active seed to all pixels; results cached as a per-pixel cost map + predecessor map — path retrieval on cursor move is O(1) (no re-computation until seed changes).
  • A* used in ROI-constrained mode with Euclidean-distance heuristic (α = 0.005), focusing search within a dynamic window around the start/end points for faster interactive response.
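A compact sketch of the seed-cached search (plain Python rather than the project’s Java): one Dijkstra run per seed fills a distance map and predecessor map, after which any cursor path is recovered by walking predecessors — work proportional to the path length only.

```python
import heapq, math

def dijkstra(cost, seed):
    """Single-source shortest paths from seed over an 8-connected pixel grid,
    using the averaged-cost edge weights; returns (dist map, predecessor map)."""
    h, w = len(cost), len(cost[0])
    dist, pred = {seed: 0.0}, {}
    pq = [(0.0, seed)]
    while pq:
        d, (y, x) = heapq.heappop(pq)
        if d > dist.get((y, x), math.inf):
            continue  # stale queue entry
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == dx == 0:
                    continue
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w):
                    continue
                step = math.sqrt(2) if dy and dx else 1.0
                nd = d + step * (cost[y][x] + cost[ny][nx]) / 2
                if nd < dist.get((ny, nx), math.inf):
                    dist[(ny, nx)] = nd
                    pred[(ny, nx)] = (y, x)
                    heapq.heappush(pq, (nd, (ny, nx)))
    return dist, pred

def trace(pred, seed, cursor):
    """Walk predecessors from cursor back to seed -- no recomputation."""
    path = [cursor]
    while path[-1] != seed:
        path.append(pred[path[-1]])
    return path[::-1]
```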

Cursor Snap:
Within radius r, the cursor position is attracted to the pixel maximizing:
S(x', y') = E(x', y') / (1 + α · d((x, y), (x', y')))
where E is precomputed edge strength and α = 0.3 is a decay factor. This snaps the active endpoint to nearby strong edges, reducing required precision and stabilizing boundary tracing along prominent features.
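A sketch of the snap rule (plain Python; radius and grid layout are illustrative):

```python
def snap(cursor, edge_strength, r=5, alpha=0.3):
    """Move the cursor to the pixel within radius r that maximizes
    S = E / (1 + alpha * distance), i.e. strong edges win unless far away."""
    cy, cx = cursor
    h, w = len(edge_strength), len(edge_strength[0])
    best, best_s = cursor, -1.0
    for y in range(max(0, cy - r), min(h, cy + r + 1)):
        for x in range(max(0, cx - r), min(w, cx + r + 1)):
            d = ((y - cy) ** 2 + (x - cx) ** 2) ** 0.5
            if d > r:
                continue
            s = edge_strength[y][x] / (1.0 + alpha * d)
            if s > best_s:
                best, best_s = (y, x), s
    return best
```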

Path Cooling (auto-seeding):
When the live path remains stable across N consecutive frames (measured by Euclidean distance between uniformly sampled points), and a candidate point c satisfies minimum spacing (||c − s_last|| > D_min) and minimum edge strength (E(c) > E_thresh), a new seed is automatically inserted. This progressively locks completed path segments without explicit clicks, reducing total click count.
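The stability test might be sketched as follows (sample count and thresholds are illustrative, not the project’s tuned values):

```python
def resample(path, n=16):
    """n points sampled uniformly by index along the path."""
    idx = [round(i * (len(path) - 1) / (n - 1)) for i in range(n)]
    return [path[i] for i in idx]

def is_stable(history, tol=2.0, frames=5):
    """Path is 'cool' when its resampled points have moved less than tol
    pixels over the last `frames` frames."""
    if len(history) < frames:
        return False
    ref = resample(history[-1])
    for past in history[-frames:-1]:
        pts = resample(past)
        if any(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 > tol
               for a, b in zip(ref, pts)):
            return False
    return True
```

Once a path is stable, the candidate seed is additionally checked for minimum spacing and edge strength before being committed, as described above.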

Performance Optimizations:

  • Precomputed cost matrix cached after edge detection — no recomputation during pathfinding.
  • Dynamic window restriction: graph construction limited to a bounding region around start/end points; window size adapts to the distance between them.
  • IndexMinPQ priority queue for efficient A* open-set management.
  • Boolean mask matrix for O(1) pixel state queries in segmentation and path cooling.

Challenges

  1. Real-time Dijkstra at megapixel scale: A naïve re-run on every cursor move (O(N log N), N ≈ 300K for a 661×494 image) was too slow for interactive use. The seed-caching architecture — precomputing the full Dijkstra result once per seed and recovering paths via predecessor-map traversal — reduced per-frame work from O(N log N) to O(path length).

  2. Path instability in low-gradient regions: In flat-texture regions, small cursor perturbations caused large path jumps. Path Cooling’s stability detection (consecutive-frame similarity check) added path inertia, while tuning the sharpening kernel parameter (value 5) improved edge detectability in low-contrast areas at the cost of some visual noise.

  3. Closed-path formation: The algorithm naturally traces from a seed toward the cursor; closing the path requires the endpoint to meet the starting seed. Implemented arc-length parameterization and a minimum-seed-distance threshold to prevent the path from “collapsing” prematurely near the starting seed.

Reflection and Insights

Two lessons dominated this project: caching architecture matters more than algorithmic choice, and UX features drive perceived performance more than raw compute.

The 5.2×–11.2× speedup over Photoshop rested on two pillars. Interactive responsiveness came almost entirely from the Dijkstra seed-cache — a single structural insight eliminating redundant computation. Meanwhile, Cursor Snap and Path Cooling — entirely orthogonal to the shortest-path algorithm — reduced user effort (click count, repositioning) far more than algorithmic refinements to the path cost function did, and drove most of the measured segmentation-time improvement in user testing.

This generalizes to interactive system design broadly: the component the user interacts with most frequently (cursor movement, click placement) has disproportionate leverage on end-to-end task time.

Team and Role

Course project at CMU with Tingjun Huang (12112619) and Guoheng Ma (12211611). My contributions: graph construction and Dijkstra/A* implementation, cost matrix caching architecture, Path Cooling algorithm design and implementation, and performance benchmarking.

Stack

Java 17, IntelliJ IDEA, Swing (GUI), Princeton algorithm library (EdgeWeightedDigraph, DijkstraSP, IndexMinPQ)

Reinforcement Learning for Inverted Pendulum Control

Overview

This project involved applying Reinforcement Learning (RL) algorithms to control both single and double inverted pendulum systems. Using algorithms such as Q-Learning, DQN, DDPG, and PPO, we implemented controllers to achieve swing-up and stabilization tasks. The project explored the dynamic complexities of inverted pendulum systems and highlighted the effectiveness of RL techniques for non-linear control problems.

Results

  • Single Inverted Pendulum:
    • Achieved a 100% success rate for swing-up and stabilization tasks under ideal conditions.
    • Maintained a 90% success rate under noisy conditions with a simulation time of 30 seconds.
  • Double Inverted Pendulum:
    • Successfully stabilized the pendulum but encountered challenges in achieving swing-up with model-free RL methods.
  • Performance Metrics:
    • Trained RL models for swing-up and stabilization tasks in under 50,000 episodes.
    • Demonstrated the effectiveness of custom reward functions for dynamic control tasks.

GitHub (Chinese README)

Swing-up from a stationary state
Swing-up under noisy conditions
Stabilization of the double inverted pendulum

Technical Details

  • Algorithms Applied:
    • Q-Learning and DQN: Explored discrete action spaces for initial experiments.
    • A2C and PPO: Achieved robust performance for stabilization tasks in continuous action spaces.
    • DDPG: Provided smooth control for swing-up tasks with deterministic policy gradients.
  • Custom Toolkit:
    • Developed RL agents from scratch using PyTorch, including functions for initialization, model updates, and action sampling.
    • Designed visualization tools to monitor reward curves and training metrics.
  • Reward Design:
    • Swing-up Task: Rewarded higher pendulum angles while penalizing velocity at the peak.
    • Stabilization Task: Encouraged minimal deviation from the vertical position and low angular velocity.
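Hedged sketches of such reward shapes — the angle convention (θ = 0 upright) and all coefficients are assumptions, not the project’s tuned functions:

```python
import math

def swing_up_reward(theta, theta_dot):
    """Reward height (cos(theta): -1 hanging, +1 upright) and penalize
    angular velocity near the top so the pendulum can be 'caught'."""
    height = math.cos(theta)
    near_top = max(0.0, height)
    return height - 0.1 * near_top * theta_dot ** 2

def stabilize_reward(theta, theta_dot, w_angle=1.0, w_vel=0.1):
    """Quadratic penalty on deviation from vertical and on angular velocity."""
    return -(w_angle * theta ** 2 + w_vel * theta_dot ** 2)
```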

Challenges

  • Swing-Up Task:
    • Coordinating motion during the throw-and-catch process was challenging, especially under noisy conditions.
    • Solution: Implemented collaborative agents for swing-up and stabilization, with separate reward functions for each sub-task.
  • Double Inverted Pendulum:
    • Model-free RL struggled with the system’s chaotic behavior.
    • Solution: Transitioned to model-based approaches like PILCO for better state-action-reward predictions.

Reflection and Insights

This project deepened my understanding of reinforcement learning and its application to real-world control problems. It highlighted the importance of tailored reward functions and robust algorithm selection for dynamic systems. The challenges in handling chaotic behaviors inspired further exploration into model-based strategies to enhance RL performance.

Team and Role

  • Team: Worked collaboratively with two teammates on RL model implementation and evaluation.
  • My Role:
    • Focused on the single inverted pendulum tasks, including algorithm selection and reward function design.
    • Developed custom RL agents using PyTorch, optimizing hyperparameters for efficient training.
    • Led the implementation of the collaborative “throw-catch” process for swing-up tasks.

Statistical Learning for Data Science

Overview

This project series was completed as part of the Statistical Learning for Data Science course at Southern University of Science and Technology. The work covered two major tasks — frozen pre-trained feature extraction paired with classical classifiers, and end-to-end fine-tuning — for medical image classification, focusing on fundus lesion diagnosis. The project explored how feature representations from pre-trained deep networks can be combined with classical classifiers to achieve high accuracy with reduced computational cost.

Results

  • Task 1 — Pre-trained Feature Extraction: Used ResNet18 as a frozen feature extractor; downstream classifiers (Linear Regression, KNN, SVM) achieved 100% accuracy on the test set, demonstrating the quality of ResNet18’s learned representations.
  • Task 2 — Fine-tuned ResNet18: Fine-tuned ResNet18 end-to-end on the 3-class fundus dataset, converging in ~3 epochs with 100% test accuracy and near-perfect AUC across all classes.
  • Bonus — Custom CNN: Designed a lightweight CNN from scratch using PyTorch, achieving 99.53% accuracy in 155 s training time vs. 348 s for ResNet18, demonstrating favorable speed-accuracy trade-off.
  • Extension — 7-class Classification: Extended the fine-tuned ResNet18 to a 7-class problem; all classes achieved AUC = 1.00, validating the method’s scalability.

Technical Details

  • Dataset: Fundus lesion images categorized into 3 (and later 7) classes; standard preprocessing with resize, normalization, and contrast adjustment.
  • Hybrid Pipeline:
    • ResNet18 (pre-trained on ImageNet) used as a backbone to extract 1000-dimensional feature vectors.
    • Traditional classifiers (Linear Regression, KNN, SVM, MLP) trained on the extracted features using sklearn.
  • Custom CNN Architecture:
    • Convolutional channels: [16, 32, 64], kernel size 3×3, max pooling with stride 2.
    • Grayscale edge-detected preprocessing (Canny, Gaussian blur) to reduce input redundancy.
    • Fully connected MLP head for multi-class output.
  • Training Setup: SGD optimizer (lr=0.001, momentum=0.9), cross-entropy loss, 3–5 epochs.
  • Evaluation: Accuracy, ROC curves, and AUC per class; all reported in the final report.
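To illustrate the hybrid idea, here is a from-scratch KNN vote over frozen feature vectors (a sketch only — the project used sklearn’s implementations on 1000-dim ResNet18 features):

```python
def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a query feature vector by majority vote among its k nearest
    training vectors (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f, query)), lbl)
        for f, lbl in zip(train_feats, train_labels)
    )
    votes = {}
    for _, lbl in dists[:k]:
        votes[lbl] = votes.get(lbl, 0) + 1
    return max(votes, key=votes.get)
```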

Challenges

  • Speed vs. accuracy trade-off: The custom CNN was significantly faster (2.2×) but slightly less accurate than ResNet18. The gap was attributed to the simplicity of convolution layers and grayscale conversion that discards color information.
  • Feature quality vs. training cost: Frozen ResNet18 features were so discriminative that even linear classifiers achieved perfect accuracy, raising the question of when fine-tuning is truly necessary.
  • 7-class generalization: Extending to a harder 7-class scenario required careful dataset balancing and preprocessing to maintain generalization.

Reflection and Insights

This project reinforced a key principle in applied machine learning: strong pre-trained feature representations can often substitute for expensive end-to-end training, especially when labeled data is limited. The hybrid approach — deep features paired with classical classifiers — offers a practical and interpretable alternative to black-box deep models in medical contexts. Designing the custom CNN from scratch also deepened understanding of how architectural choices (depth, width, pooling strategy) affect both accuracy and training efficiency.

Team and Role

  • Team: Collaborated with two teammates on methodology design, experiments, and report writing.
  • My Role: Led the custom CNN design and preprocessing pipeline; contributed to the hybrid pipeline experiments and analysis of width/depth trade-offs.

eMeritBox

Overview

The eMeritBox project combines traditional Buddhist cultural elements with modern technology, creating an interactive gravity-sensing electronic donation box. Built with a Raspberry Pi, the system integrates PWM control, motion sensing, and a web-based interface to modernize the concept of a traditional donation box. This innovative design bridges traditional practices with digital solutions, offering a seamless and meaningful user experience.

Results

  • System Features:
    • Automatic wooden fish strikes with real-time donation ball accumulation.
    • Gravity-sensing motion control for dynamic donation ball movement.
    • Dual operational modes: manual and auto donation switching.
  • Achievements:
    • Successfully implemented a complete hardware-software system using Raspberry Pi and Flask.
    • Developed reusable classes for matrix display and gravity sensing, enabling future adaptations.

GitHub (Chinese README)

eMeritBox system overview

eMeritBox functional demonstration

Technical Details

  • System Architecture:
    • Controller: Raspberry Pi handles signal processing, PWM control, and web server operations.
    • Modules:
      • MG-90 servo for wooden fish strikes.
      • GY-25 gyroscope for motion sensing.
      • MAX7219 matrix display for donation ball visualization.
  • Key Functionalities:
    • Gravity-Sensing Donation: Balls dynamically move based on the box’s tilt angle.
    • Flask Web Server: Supports browser-based remote operation of wooden fish strikes.
    • Matrix Display: Visualizes donation balls in real-time, reflecting their position and state.
  • Software Implementation:
    • Developed Python classes for modular control:
      • GY25Ctrl for gyroscope data processing.
      • MatrixCtrl for donation ball display updates.
      • BGMPlayer for background music playback.
    • Solved hardware conflicts by reconfiguring UART ports and enabling additional I²C channels.

Challenges

  • UART hardware resource conflicts: Raspberry Pi’s default UART settings caused resource contention.
    • Solution: Re-mapped hardware and mini UARTs (ttyAMA0 ↔ ttyS0) and configured multiple UART ports (+ttyAMA1, 2, …) for simultaneous operation.
  • I²C channel conflicts: Dual I²C channels on Raspberry Pi conflicted with camera usage.
    • Solution: Disabled the camera function and enabled additional I²C channels with dtparam=i2c_vc=on.
  • SPI and I²C competing with UART ports: Enabling SPI and I²C modules on the Raspberry Pi caused UART port contention.
    • Solution: Adjusted hardware configurations to optimize resource allocation.
  • Synchronization of Multiple Modules: Managing the simultaneous operation of PWM, matrix display, and motion sensing.
    • Solution: Utilized multi-threading to ensure real-time responsiveness and system stability.
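The boot-time side of these fixes might look roughly like the following /boot/config.txt fragment — a sketch assuming a Raspberry Pi 4; overlay names and availability vary by model and OS version:

```ini
# /boot/config.txt (sketch, Pi 4 assumed)
enable_uart=1          # expose the primary UART on GPIO14/15
dtoverlay=disable-bt   # move Bluetooth off the PL011 so ttyAMA0 is free
dtoverlay=uart2        # additional hardware UART (ttyAMA1) for the GY-25
dtparam=spi=on         # SPI bus for the MAX7219 matrix display
dtparam=i2c_arm=on     # primary I²C bus
dtparam=i2c_vc=on      # extra I²C channel (camera disabled, per the fix above)
```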

Reflection and Insights

The eMeritBox represents a modernized approach to traditional Buddhist donation practices, seamlessly integrating spiritual elements with advanced technology. By reimagining the donation process with dynamic visuals and interactive controls, this project demonstrates the potential of technology to preserve and innovate cultural traditions. The challenges in hardware-software integration further highlighted the importance of modular design and multi-threaded programming in building robust embedded systems.

ROS Robot Intelligent Navigation and Control System

Overview

This project was the final deliverable for the Robot Perception and Intelligence course (EE211) at Southern University of Science and Technology, built on the ROS2 platform. The goal was to develop a fully autonomous robot capable of navigating to a target location, recognizing and grasping an object using a robotic arm, and avoiding obstacles — all with custom-implemented planning and control modules.

Robot navigation and arm control demonstration

Results

  • Navigation: Successfully navigated to target points using the Nav2 stack with a custom global planner plugin.
  • Object Recognition and Grasping: Detected target objects via Aruco markers; the robotic arm computed inverse kinematics and executed reliable grasps within the reachable workspace.
  • Path Planning: Implemented a custom A* global planner and a trajectory feedback local controller as Nav2 plugins.
  • Extra Challenge: Handled randomly placed objects by dynamically querying IK solvability during slow-approach phases.

Technical Details

  • System Architecture:
    • Finite State Machine (FSM): Coordinated high-level task sequencing (navigate → approach → grasp → return).
    • Navigation: Nav2 stack with tuned parameters for global_costmap, local_costmap, planner_server, and controller_server.
    • Aruco-based Target Recognition: Used camera-based Aruco detection to estimate target pose; TF tree handled all coordinate transformations automatically.
  • Custom A* Planner (MyPlanner):
    • Implemented as a Nav2 global planner plugin in C++.
    • Standard A* graph search on the occupancy grid with heuristic tuning for smooth paths.
  • Custom Trajectory Feedback Controller (MyController):
    • Local controller plugin computing velocity commands to track the reference path.
    • Feedback control based on cross-track error and heading error.
  • Robotic Arm Controller:
    • Queried IK solver (grasp_query_solved()) in a loop during slow approach to determine when the target entered the reachable envelope.
    • Designed custom grasp points with direction information from Aruco pose estimates.
  • PTZ (Pan-Tilt) Tracking:
    • Drove the camera gimbal to track the target during navigation, preventing loss of visibility.
    • Coordinate compensation handled via TF tree rather than manual recalibration.
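A minimal version of such a grid A* (plain Python rather than the project’s C++ Nav2 plugin; 4-connectivity and unit step costs are simplifications):

```python
import heapq, math

def astar(grid, start, goal):
    """A* over a 4-connected occupancy grid (0 = free, 1 = occupied) with a
    Euclidean heuristic; returns the path as a list of (row, col) cells."""
    h, w = len(grid), len(grid[0])
    heur = lambda c: math.hypot(c[0] - goal[0], c[1] - goal[1])
    open_set = [(heur(start), 0.0, start, None)]
    came, g_best = {}, {start: 0.0}
    while open_set:
        _, g, cell, parent = heapq.heappop(open_set)
        if cell in came:
            continue  # already expanded with a better or equal cost
        came[cell] = parent
        if cell == goal:
            path = [cell]
            while came[path[-1]] is not None:
                path.append(came[path[-1]])
            return path[::-1]
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = cell[0] + dy, cell[1] + dx
            if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 0:
                ng = g + 1.0
                if ng < g_best.get((ny, nx), math.inf):
                    g_best[(ny, nx)] = ng
                    heapq.heappush(open_set, (ng + heur((ny, nx)), ng, (ny, nx), cell))
    return None  # goal unreachable
```

In the Nav2 plugin the same search runs on the global costmap, with cell costs and heuristic weighting tuned for smoother paths.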

Challenges

  • Odometry Drift: Wheel odometry accumulated error over longer paths, causing the robot to lose accurate positioning relative to the target. Resolved by switching reference to the Aruco marker position during the final approach phase.
  • IK Feasibility Window: The robotic arm’s reachable workspace was constrained, requiring continuous IK queries and a slow-approach strategy to enter the feasible zone before executing a grasp.
  • Costmap Configuration: Getting Nav2’s costmap inflation and obstacle layers tuned for the specific robot geometry required iterative testing.

Reflection and Insights

This project provided hands-on experience with the full stack of autonomous robotics: perception, planning, and control. Implementing A* and the trajectory controller as actual Nav2 plugin classes — rather than standalone scripts — deepened understanding of how ROS2’s modular architecture enables component reuse and testing. The challenge of handling coordinate frames across navigation, perception, and manipulation highlighted why a well-structured TF tree is foundational to multi-component robotic systems.

Team and Role

  • Team: Three-person team, each responsible for different subsystems.
  • My Role: Led the development of the custom A* global planner plugin and the trajectory feedback controller; contributed to the FSM design and arm approach strategy.

Comparative Analysis of Clustering and Dimensionality Reduction Techniques

Overview

This project, completed for the Statistical Learning for Data Science course, presents a comprehensive comparison of clustering and dimensionality reduction methods: K-means, Soft K-means, PCA, and Linear AutoEncoder. Experiments were conducted on both tabular data (wheat seed dataset) and image data to evaluate each algorithm’s strengths and failure modes across different data types.

PCA and AutoEncoder dimensionality reduction results

Results

Clustering on the Wheat Seed Dataset (210 samples, 7 features, 3 true classes):

Algorithm Initialization Split & Merge Accuracy Score
K-means (k=3) Optimized No 0.7166
Soft K-means (k=3) Optimized Yes 0.6717
K-means (k=10) Optimized No 0.3750
Soft K-means (k=10) Random Yes 0.6085

Key finding: when k matches the true number of classes (k=3), hard K-means outperforms Soft K-means. When k is over-specified (k=10), Soft K-means’s flexible assignments significantly outperform hard K-means.

Dimensionality Reduction on Image Data:

  • PCA and Linear AutoEncoder produce visually comparable reconstructions for n=1,2,3 principal components.
  • With n=3, the AutoEncoder loss converges fastest, as the larger latent space captures more of the image structure.
  • Subsequent Soft K-means color clustering (k=1,2,3) produced meaningful image compression at every tested cluster count.

Technical Details

  • K-means Enhancements:
    • Optimized centroid initialization: Selected initial centroids by maximum pairwise distance rather than random sampling, consistently avoiding degenerate clustering.
    • Non-local split and merge: Split the largest cluster and merged the two closest; this let the algorithm escape local minima, though the improvement depended on random factors.
  • Soft K-means:
    • Responsibility: r_k^(n) = exp(−β·d(m_k, x^(n))) / Σ_j exp(−β·d(m_j, x^(n))).
    • Explored the effect of β: high β approaches hard K-means; low β causes cluster centers to collapse toward each other.
  • PCA:
    • Covariance matrix eigendecomposition; top-M eigenvectors used as projection basis.
    • Image reconstructed by projecting to M-dimensional subspace and back.
  • Linear AutoEncoder:
    • Encoder z = Wx, decoder x̂ = Vz; trained by minimizing MSE reconstruction loss.
    • With linear activations and MSE loss, the optimum spans the same subspace as the top-M principal components, making it mathematically equivalent to PCA; trained via gradient descent.
  • Implementation: Clustering and PCA implemented from scratch in Python without ML frameworks; the Linear AutoEncoder trained with PyTorch.
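The farthest-distance initialization and soft responsibility update described above can be sketched in NumPy. This is a minimal sketch with illustrative names, using squared Euclidean distance for d; it is not the project's actual code:

```python
import numpy as np

def init_centroids(X, k):
    # Farthest-point initialization: start from the point farthest from the
    # data mean, then greedily add the point farthest from all chosen centroids.
    chosen = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(chosen) < k:
        d = np.min(np.linalg.norm(X[:, None] - X[chosen][None], axis=2), axis=1)
        chosen.append(int(np.argmax(d)))
    return X[chosen].copy()

def soft_kmeans(X, k, beta=1.0, iters=100):
    m = init_centroids(X, k)
    for _ in range(iters):
        # Responsibilities: r[n, j] proportional to exp(-beta * d(m_j, x_n))
        d2 = ((X[:, None] - m[None]) ** 2).sum(axis=2)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # Centroid update: responsibility-weighted mean of the data
        m = (r.T @ X) / r.sum(axis=0)[:, None]
    return m, r
```

With large β the responsibilities saturate toward 0/1 and the update reduces to hard K-means; with small β every point contributes to every centroid and the centers drift together, matching the collapse behavior noted above.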

Challenges

  • K-means sensitivity to initialization: Random initialization frequently led to incorrect cluster assignments on the wheat dataset. The optimized initialization method (furthest-distance selection) reliably improved results.
  • β tuning in Soft K-means: Too low a β caused cluster centers to merge; too high a β effectively reproduced hard K-means, removing the benefit of soft assignment.
  • AutoEncoder convergence: Loss curves for n=1 were much flatter than for n=3, reflecting that lower-dimensional projections have fewer degrees of freedom with which to reduce reconstruction error.

Reflection and Insights

The contrast between K-means and Soft K-means under different k values illustrates a fundamental trade-off: hard, definitive assignments are powerful when the number of clusters is known, but soft probabilistic assignments provide robustness when cluster assumptions are violated. The equivalence between a Linear AutoEncoder and PCA — proven analytically and confirmed empirically — demonstrated that gradient-based learning can recover classical statistical results, offering a framework for extending dimensionality reduction to nonlinear settings.
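The PCA/linear-AutoEncoder equivalence can be checked numerically: by the Eckart–Young theorem, the best rank-M linear reconstruction under MSE (the loss floor any linear AutoEncoder can reach) coincides with PCA's. A quick illustrative check on random data, not the project's code:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 7))
Xc = X - X.mean(axis=0)          # center the data, as PCA assumes
M = 3

# PCA: top-M eigenvectors of the covariance matrix as projection basis
cov = Xc.T @ Xc / len(Xc)
vals, vecs = np.linalg.eigh(cov)
W = vecs[:, np.argsort(vals)[::-1][:M]]   # (7, M) basis
X_pca = Xc @ W @ W.T                      # project to M dims and back

# Best rank-M linear reconstruction via truncated SVD (Eckart-Young optimum),
# i.e. the reconstruction an optimally trained linear AutoEncoder converges to
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_svd = U[:, :M] * s[:M] @ Vt[:M]

assert np.allclose(X_pca, X_svd)  # identical reconstructions
```

A linear AutoEncoder trained by gradient descent approaches this same reconstruction, though its learned weights need not be the orthonormal eigenvectors themselves — only the subspace they span is determined.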

Team and Role

  • Solo project: All algorithms implemented, experiments designed, and analysis conducted independently.

Binary Classification Using Logistic Regression and MLP

Overview

This project, completed for the Statistical Learning for Data Science course, involved implementing Logistic Regression and a flexible Multi-Layer Perceptron (MLP) from scratch using Python and NumPy — without any deep learning framework. The goal was to deeply understand gradient-based optimization and the impact of architecture choices on classification performance.

MLP decision boundary and loss curve

Results

Logistic Regression on synthetic binary dataset:

Metric Value
Precision 0.9915
Recall 1.0000
F1 Score 0.9957
Final Loss 0.0180

MLP — Width Experiment (lr=0.025, 250,000 epochs):

Neurons per Layer F1 Score Final Loss
3 0.5983 0.6342
6 0.7009 0.7484
20 0.8226 0.5845
50 0.9077 0.6618

MLP — Depth Experiment (lr=0.01, 160,000 epochs):

Layers F1 Score Final Loss
2 0.9552 0.0346
3 0.9848 0.0127
6 0.9925 0.0043
10 1.0000 5.11e-06

Technical Details

  • Logistic Regression:
    • Forward pass: sigmoid activation on the linear combination of inputs.
    • Loss: binary cross-entropy (with clipping for numerical stability).
    • Optimization: vanilla gradient descent; weights and bias updated per iteration.
  • MLP:
    • Configurable layer sizes (e.g., [2, 20, 20, 1] for a 2-hidden-layer network).
    • Hidden layer activation: ReLU; output activation: sigmoid.
    • Full backpropagation implemented manually via chain rule.
    • Batch gradient descent; tolerance-based early stopping when loss reduction is negligible.
  • Dataset: Synthetic 2D binary classification datasets with controlled overlap; fixed random seeds for reproducibility.
  • Evaluation: Precision, Recall, F1 score, and loss curves; decision boundary visualization with contour plots.
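The logistic-regression training loop described above can be sketched in NumPy. This is a minimal sketch with illustrative names; the clipping and the vanilla gradient-descent update follow the description, not the project's exact code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=2000):
    # Vanilla gradient descent on binary cross-entropy with clipping
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)                 # forward pass
        p = np.clip(p, 1e-12, 1 - 1e-12)       # numerical stability
        grad_z = p - y                         # dL/dz for BCE + sigmoid
        w -= lr * X.T @ grad_z / len(y)        # weight update
        b -= lr * grad_z.mean()                # bias update
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return w, b, loss
```

The convenient `p - y` gradient is why the sigmoid/BCE pairing is standard: the sigmoid's derivative cancels against the cross-entropy's, leaving a well-conditioned residual even when predictions saturate.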

Challenges

  • Vanishing gradients with deep networks: Observed that very deep networks sometimes failed to converge with sigmoid activations in hidden layers; resolved by switching hidden activations to ReLU.
  • Learning rate sensitivity: High learning rates (e.g., lr=0.3) caused the loss to diverge rather than converge. Empirically tuned learning rates per architecture; recorded the number of iterations for loss to drop to 30% of its initial value as a diagnostic.
  • Loss oscillation in MLP training: Unlike logistic regression, MLP loss sometimes fluctuated throughout training without reaching a clean plateau. Removing the tolerance-based early stopping and monitoring the full curve helped distinguish genuine convergence from oscillation.

Reflection and Insights

Implementing backpropagation by hand — rather than relying on autograd — forced a precise understanding of how gradients flow through a network and why architectural choices matter. The depth experiment aligned with theoretical expectations: deeper networks can represent more complex functions, but require careful learning rate selection and may suffer from instability. The width experiment confirmed that wider layers can compensate for shallower depth when the task requires capturing many parallel low-level features.

A particularly interesting observation was that in high-dimensional loss landscapes, saddle points become more common than local minima — meaning that with sufficient width, gradient descent rarely gets permanently stuck.

Team and Role

  • Solo project: Both models fully implemented and analyzed independently.

Lever-based Non-strain Mass Measurement Sensing System

Overview

This project was completed for the Sensing and Measurement lab course (SDM273). The goal was to design a mass measurement system for objects in the 75–750 g range with error within ±5 g, without using any strain-gauge sensors. The solution used a lever-balance principle, driven by a stepper motor, with a GY-25 gyroscope detecting the tilt angle of the lever arm.

Results

  • Measurement accuracy: Relative error consistently within ±1% across the 75–750 g test range.
  • Linear calibration: Fitted a first-order relationship between stepper motor position and mass: d = −1.4194m + 1193.7, validated through least-squares regression on calibration data.
  • Notable features:
    • Voice broadcast of measurement results.
    • No additional distance sensor needed — displacement computed from stepper motor step count.
    • Efficient use of the limited pin set by combining hardware and software UART.
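A first-order calibration of this kind can be reproduced with an ordinary least-squares fit. The data points below are hypothetical, generated around the reported line rather than taken from the actual calibration run:

```python
import numpy as np

# Hypothetical calibration pairs: reference mass (g) vs. slider position
# (steps), sampled around the reported line d = -1.4194*m + 1193.7
mass = np.array([75, 150, 300, 450, 600, 750], dtype=float)
pos = -1.4194 * mass + 1193.7 + np.random.default_rng(2).normal(0, 1.0, mass.size)

# First-order least-squares fit d = a*m + c
a, c = np.polyfit(mass, pos, 1)

def estimate_mass(d):
    # Invert the fitted line to turn a balanced slider position into mass
    return (d - c) / a
```

Spreading the reference masses across the full 75–750 g range, as the project did, keeps the slope well-conditioned: the slope variance scales inversely with the spread of the calibration points.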

System Design

Hardware:

  • Core structure: Roller-screw linear module acting as a lever arm, pivoting on flange-bearing supports to minimize friction.
  • Counterweight: Two 500 g weights on a slider; the stepper motor drives the slider along the lever to balance the unknown mass.
  • Sensing: GY-25 module (MPU6050-based) attached to the lever to read real-time pitch angle and determine which side is heavier.
  • Controller: Arduino UNO; stepper motor driver board (HPD970).

Software:

  • Balance detection: The system drives the slider in the direction that reduces tilt angle; balance is detected when the pitch sign reverses on consecutive readings (zero-crossing logic).
  • Friction compensation: Due to bearing friction, the equilibrium position differs depending on approach direction. The final position estimate averages two convergence positions: d = (d₁ + d₂) / 2.
  • Kalman filter: Applied to MPU6050 angular data to suppress noise and drift; the raw sensor exhibited >2% fluctuation even at rest.
  • Serial communication: Combined hardware UART and software serial (SoftwareSerial) to manage the GY-25 module, voice broadcast module, and PC debug output simultaneously within Arduino UNO’s limited pin set.
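The angle smoothing can be illustrated with a minimal one-dimensional Kalman filter. The constant-angle process model and the noise parameters q and r below are assumptions for illustration, not the project's actual tuning:

```python
def kalman_1d(measurements, q=1e-3, r=0.5):
    # Scalar Kalman filter: state is the pitch angle, assumed constant
    # between updates; q = process noise, r = measurement noise variance.
    x, p = measurements[0], 1.0   # initial estimate and its variance
    out = []
    for z in measurements:
        p += q                    # predict: uncertainty grows by q
        k = p / (p + r)           # Kalman gain
        x += k * (z - x)          # correct with the measurement residual
        p *= (1 - k)              # shrink uncertainty after the update
        out.append(x)
    return out
```

The q/r ratio sets the trade-off: a small ratio (as here) trusts the model and smooths aggressively, which suits a lever settling toward balance; a large ratio tracks the raw sensor and passes its noise through.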

Challenges

  1. MPU6050 noise and zero-drift: Raw gyroscope data was highly noisy; addressed with Kalman filtering and GY-25’s onboard attitude fusion.
  2. Friction asymmetry: The approaching-from-above vs. approaching-from-below slider positions differed due to pulley friction, causing measurement bias. Averaging the two crossing positions significantly reduced this error.
  3. Limited Arduino UNO pins: Needed simultaneous UART communication with GY-25 (115200 baud), the voice module, and the PC. Solved by remapping hardware/software serial ports and enabling additional UART channels.
  4. Small calibration dataset: Only a limited number of reference masses were available, risking underfitting/overfitting of the linear model; addressed by careful selection of calibration points across the full range.

Reflection and Insights

This project demonstrated that non-electrical sensing principles — in this case, mechanical leverage — can achieve high-precision measurements when combined with careful signal processing and systematic calibration. The Kalman filter was essential to making the gyroscope data useful in practice, and the friction-averaging technique was a simple but highly effective engineering heuristic. The resource constraints of the Arduino UNO also provided practical experience with low-level embedded systems optimization.

Team and Role

  • Team: Two-person team sharing hardware construction and software development.
  • My Role: Primarily responsible for the control algorithm, Kalman filter implementation, and serial communication configuration.