IDSS: Interpretable Diving Action Quality Assessment Platform

Overview

Built IDSS (Interpretable Diving Scoring System), an end-to-end action quality assessment (AQA) framework for competitive diving, as a course project at CMU (Fall 2025) with teammates Xin Lin and Vincent Nie.

Standard AQA systems output a numeric score but provide no explanation — coaches and athletes cannot learn from them. IDSS addresses this by combining a procedure-aware deep learning backbone (based on FineDiving [Xu et al., CVPR 2022]) with a Heuristic Quality Assessment Pipeline (HQAP): a rule-based, pose-driven system that computes five physically interpretable kinematic indicators and uses them as auxiliary supervision during training.

The result: IDSS not only outperforms the baseline on all metrics but also generates structured, athlete-readable diagnostic reports with frame-aligned GIF visual evidence.

Results

Performance on the FineDiving dataset (3000 diving videos, 52 action types):

| Metric | Baseline | IDSS | Improvement |
| --- | --- | --- | --- |
| Spearman Rank Correlation (ρ) | 0.9272 | 0.9302 | +0.33% |
| tIoU@0.5 | 0.9373 | 0.9559 | +1.99% |
| tIoU@0.75 | 0.5407 | 0.5714 | +5.68% |
| Relative L2 Distance (R-ℓ2 ×100, lower is better) | 0.3313 | 0.3099 | −6.46% |

Convergence acceleration: IDSS achieves performance comparable to the baseline’s best 200-epoch checkpoint in ~30 epochs, and surpasses it by epoch 88. At epoch 10, IDSS achieves a 62.56% improvement in R-ℓ2 over the baseline — the pose supervision provides an immediate learning signal.

Technical Details

Heuristic Quality Assessment Pipeline (HQAP)

HQAP runs three models in parallel on each video frame to extract object and pose signals:

  • Two Detectron2 (Mask R-CNN) instances for platform, splash, and diver detection.
  • One HRNet model for 16-keypoint diver pose estimation.

Raw outputs are fused into a per-video JSON. Cleaning: Savitzky-Golay filter on pose trajectories (internal gaps only), linear interpolation for platform stability; splash data preserved raw.
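The cleaning step can be sketched as follows: a minimal version assuming each keypoint coordinate is a 1-D array over time with NaNs marking missed detections (the function name, window size, and polynomial order are illustrative, not the project's exact settings):

```python
import numpy as np
from scipy.signal import savgol_filter

def clean_keypoint_track(y, window=9, poly=2):
    """Interpolate internal NaN gaps, smooth with Savitzky-Golay,
    and leave leading/trailing NaNs untouched (no extrapolation)."""
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(y)
    if valid.sum() < window:
        return y  # too few detections to smooth reliably
    idx = np.arange(len(y))
    first, last = idx[valid][0], idx[valid][-1]
    filled = y.copy()
    # linear interpolation over internal gaps only
    filled[first:last + 1] = np.interp(idx[first:last + 1], idx[valid], y[valid])
    smoothed = filled.copy()
    smoothed[first:last + 1] = savgol_filter(filled[first:last + 1], window, poly)
    return smoothed  # edge NaNs preserved for downstream masking
```

Preserving edge NaNs (rather than extrapolating) is what keeps corrupted late-frame pose estimates out of the metric aggregation, as described under Challenges.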

Five kinematic metrics, each computed only during its relevant dive phase:

  1. Somersault Tightness (TUCK phase): shoulder–hip–knee angle; lower = tighter tuck. Temporal mean.
  2. Body Straightness (ENTRY phase): shoulder–hip–ankle angle; 180° = perfectly straight. Temporal mean.
  3. Entry Verticalness (ENTRY phase): body vector vs. vertical; 0° = perfect vertical entry. Temporal mean.
  4. Splash Size (ENTRY phase): total pixel area of splash bounding boxes. Maximum over entry frames.
  5. Distance from Platform (PIKE/TUCK phases): absolute horizontal distance from diver hip to smoothed platform centroid (sliding window average). 5th percentile value (robust minimum distance during flight).

For the Distance metric, analysis revealed a bimodal distribution corresponding to two natural dive classes; a Gaussian Mixture Model (GMM) therefore clusters the scores first and assigns one of four labels (Too Close / Close / Reasonable / Far) relative to the cluster each dive belongs to.
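Cluster-conditioned labeling can be sketched with scikit-learn; the synthetic bimodal scores below stand in for the real HQAP distance values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic bimodal distance scores (pixels) standing in for HQAP output.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(40, 5, 200),
                         rng.normal(120, 10, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
clusters = gmm.predict(scores)          # which dive class each score belongs to
means = sorted(gmm.means_.ravel())
# Label thresholds (Too Close / Close / Reasonable / Far) are then set
# relative to each cluster's own distribution, not one global cutoff.
```

A single global percentile threshold on data like this would sit between the two modes and mislabel one cluster wholesale, which is exactly the failure mode described under Challenges.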

Each metric is converted to a 3-tier label (“Excellent” / “Average” / “Need Improvement”) using the 25th/75th percentile thresholds across the training set.
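The tiering step is a plain percentile lookup. A sketch, with the caveat that the "better" direction is per metric (lower is better for tightness, verticalness, and splash; higher is better for straightness):

```python
import numpy as np

def tier_label(train_values, value, higher_is_better=False):
    """Map a metric value to a 3-tier label using the training set's
    25th/75th percentiles as thresholds."""
    lo, hi = np.percentile(train_values, [25, 75])
    if higher_is_better:
        return ("Excellent" if value >= hi
                else "Need Improvement" if value <= lo else "Average")
    return ("Excellent" if value <= lo
            else "Need Improvement" if value >= hi else "Average")
```

For example, with training values 0–99 and a lower-is-better metric, a value of 5 falls below the 25th percentile and is labeled "Excellent".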

IDSS Model Architecture

The backbone follows the procedure-aware FineDiving formulation:

  1. Temporal Segmentation: I3D features → segmentation module predicts L step-transition probabilities, dividing each dive into L+1 sub-actions (take-off, flight, entry, …).
  2. Procedure-Aware Cross-Attention: step-level features from query and exemplar videos passed through Multi-head Cross-Attention to capture relative quality differences per phase.
  3. Multi-Task Regression Head:
    • Score head: predicts relative score difference per step, aggregated to final AQA score.
    • Pose metric head (auxiliary): predicts the 5-dimensional HQAP vector from the same procedure-aware embeddings.

Joint loss: L = L_AQA (pairwise MSE) + L_TAS (temporal segmentation BCE) + λ · L_Pose (pose metric MSE). The auxiliary pose supervision acts as a structural prior, guiding learned features toward physically interpretable quality signals.
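A minimal PyTorch sketch of this joint objective (the λ value and tensor shapes here are illustrative; the actual λ was tuned, per the Challenges section):

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_diff, true_diff, step_logits, step_targets,
               pose_pred, pose_target, lam=0.1):
    """L = L_AQA (pairwise MSE) + L_TAS (segmentation BCE) + lam * L_Pose (MSE)."""
    l_aqa = F.mse_loss(pred_diff, true_diff)                # relative score diff
    l_tas = F.binary_cross_entropy_with_logits(step_logits, step_targets)
    l_pose = F.mse_loss(pose_pred, pose_target)             # 5-dim HQAP vector
    return l_aqa + l_tas + lam * l_pose
```

Because L_Pose shares the procedure-aware embeddings with the score head, its gradient pushes those embeddings toward the physically interpretable signals.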

Report Generation

A deterministic statistics-driven template system computes each metric’s percentile, retrieves the matching natural-language template from a pre-defined library, and dynamically fills in the precise value and qualitative evaluation tier. No LLM required. Reports rendered as interactive HTML dashboards with frame-aligned GIFs highlighting the detected issue windows — enabling visual verification by athletes and coaches.

A lightweight Flask web interface handles video upload, processing dispatch, and report delivery.

Challenges

  1. Pose estimation reliability during water entry: HRNet confidence degrades when limbs are submerged. Handled by Savitzky-Golay smoothing on pose trajectories and preserving NaN values at sequence ends rather than extrapolating — preventing corrupted late-frame estimates from affecting metric aggregation.

  2. Bimodal distance distribution: The distance-from-platform metric has a dual-peak histogram due to structurally different dive types. A simple percentile threshold would mislabel one cluster entirely. GMM fitting discovered the two underlying distributions and enabled cluster-conditioned labeling.

  3. Auxiliary supervision calibration: Incorrect λ weighting for the pose loss destabilized early training. Tuning λ and verifying that pose metric regression errors decreased monotonically before score prediction improved confirmed that the auxiliary head was providing useful rather than noisy gradients.

Reflection and Insights

The central lesson: interpretability and performance are complementary, not opposed, when the interpretable component encodes genuine domain knowledge. The HQAP metrics are not post-hoc explanations added after training — they are causal signals that correlate directly with score deductions. Using them as auxiliary training supervision rather than just evaluation labels is what drove both the accuracy improvements and the convergence acceleration. The structural prior “works” precisely because it is physically grounded.

The convergence result is particularly striking: reaching 200-epoch baseline quality in 30 epochs means the interpretable supervision provides a strong inductive bias that dramatically reduces the search space the optimizer must explore. This generalizes: domain-specific supervision is often more sample-efficient than scaling model size or training longer.

Stack

Python, PyTorch, Detectron2, HRNet, I3D, OpenCV, Flask, HTML/CSS (report generation), FineDiving dataset

Cloud-Native IoT Network Security Analysis Pipeline

Overview

Built a cloud-native, end-to-end intrusion detection pipeline for IoT network traffic as a course project in 18763 System Toolchains (Fall 2025, CMU), in collaboration with Yiqiao Zhou. The system processes the MQTTset dataset — 20 million MQTT records from a simulated smart home environment — to perform 6-class attack classification across normal traffic and five attack types (brute force, DoS, flood, SlowITe, malformed packet).

The pipeline is structured as four tasks: database design and population (Task I), large-scale analytics (Task II), ML modeling (Task III), and full cloud deployment on GCP (Task IV/Bonus).

Results

Machine Learning (Task III):

| Model | Framework | Best Test Accuracy |
| --- | --- | --- |
| Logistic Regression | Spark ML | 74.97% |
| Shallow MLP | PyTorch | 75.40% |
| Deep MLP | PyTorch | 75.40% |
| Random Forest | Spark ML | 78.52% |

Random Forest (30 trees, depth 7) was the best-performing model — the tree-based ensemble’s non-linear boundaries were more effective than neural networks on this structured tabular feature set. Key finding: maxDepth=7 vs maxDepth=5 improved test accuracy by ~4 percentage points, while increasing from 30 to 50 trees provided no benefit.

Feature engineering: 47-dimensional feature vectors generated from 34 raw MQTT/TCP columns via one-hot encoding of 4 categorical flag fields (32 dims), standardization of 15 numerical features, and constant-column removal.

Technical Details

Data Ingestion (Task I):

  • Loaded train70_augmented.csv (14M rows) and test30_augmented.csv (6M rows) from GCS into a Dockerized PostgreSQL 16 database via JDBC on GCE VM.
  • 16-partition parallel JDBC write; combined dataset includes a split column distinguishing train/test.

Analytics (Task II — PySpark):

  • Average MQTT message length by attack class; TCP statistics and MQTT header flag distributions.
  • Top TCP flags filtered by time delta; target class distribution histograms.
  • Kafka streaming pipeline (Task II-Q5): YouTube API producer publishes cybersecurity video comments to a Kafka topic; Spark Streaming consumer performs real-time keyword-frequency analysis.

Distributed Feature Engineering (Task III — Spark ML Pipeline):

  • Removed 10 near-zero-variance columns (stddev < 1e-6).
  • StringIndexer + OneHotEncoder on 4 categorical flag columns → 32 dimensions.
  • VectorAssembler + StandardScaler on 15 numerical columns → 15 dimensions.
  • Combined via VectorAssembler → 47-dim features column; labels indexed to integers 0–5.
  • Full dataset checkpointed to GCS Parquet to break the Spark execution graph before training.
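The 34 → 47 dimensional bookkeeping can be mimicked in plain numpy. The per-field cardinalities below are hypothetical placeholders; in the real pipeline they come from StringIndexer on the MQTTset flag columns:

```python
import numpy as np

# Hypothetical cardinalities for the 4 categorical flag fields (sum = 32).
cat_cards = [8, 8, 8, 8]
n_numeric = 15

def featurize(cat_idx, numeric, mean, std):
    """One-hot the categorical indices, standardize the numerics, concatenate."""
    one_hot = np.concatenate(
        [np.eye(card)[i] for card, i in zip(cat_cards, cat_idx)])
    scaled = (np.asarray(numeric, dtype=float) - mean) / std
    return np.concatenate([one_hot, scaled])   # 32 + 15 = 47 dims
```

This mirrors what StringIndexer + OneHotEncoder, StandardScaler, and the final VectorAssembler jointly produce per row.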

ML Models:

  • Spark ML: Logistic Regression (L1/Lasso regularization, regParam=0.001) and Random Forest (numTrees=30, maxDepth=7, subsamplingRate=0.5), tuned via 80/20 TrainValidationSplit.
  • PyTorch: Shallow MLP (47→96→128→6, ~7K params) and Deep MLP (47→128→128→64→6 with BatchNorm+Dropout, ~25K params). Features exported to GCS Parquet, loaded via custom ParquetArrayIterable DataLoader. Adam optimizer, linear warmup (10K steps), gradient clipping, early stopping.
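The Deep MLP can be sketched as follows; the layer widths match the reported 47→128→128→64→6 layout, while the exact BatchNorm/Dropout placement and dropout rate are assumptions:

```python
import torch
import torch.nn as nn

class DeepMLP(nn.Module):
    """Sketch of the deep 6-class head over the 47-dim engineered features."""
    def __init__(self, d_in=47, n_classes=6, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)
```

In training this would be paired with Adam, linear warmup, gradient clipping, and early stopping as listed above.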

Cloud Deployment (Task IV):

  • All three notebooks run on a GCP Dataproc cluster (Apache Spark 3.5.3) with JupyterLab.
  • PostgreSQL runs in a Docker container on a separate GCE VM, accessed via private IP within a VPC-secured network — no managed Cloud SQL.
  • Full pipeline: GCS CSV → Spark ingestion → Dockerized PostgreSQL → Spark analytics → Spark ML + PyTorch training.

Challenges

  1. JDBC write contention at scale: 20 parallel Spark tasks writing to the same PostgreSQL table caused severe lock contention. Diagnosed via pg_locks slow query logs; resolved by pre-hashing records into non-overlapping partition ranges before JDBC write.

  2. Parquet-to-PyTorch data pipeline: Spark’s sparse vector format is not directly consumable by PyTorch. Implemented a custom ParquetArrayIterable class using pyarrow.dataset to convert sparse vectors to dense tensors in a streaming fashion, with configurable batch sizes and value clipping ([-10, 10]) for training stability.

  3. Neural networks underperforming tree-based models: Shallow and Deep MLPs achieved only 75.40% — 3.5pp below Random Forest. Root cause: the engineered tabular features with protocol flag one-hot vectors are well-suited for tree-based splits that can isolate specific flag combinations, whereas MLP architectures require different inductive biases to capture equivalent patterns.

Reflection and Insights

This project demonstrated concretely why Random Forest often outperforms neural networks on structured tabular data: the tree’s per-split logic naturally handles mixed feature types (categorical one-hot + numerical) without requiring normalization, and ensemble voting reduces variance effectively. The Shallow and Deep MLPs achieved nearly identical accuracy, confirming that added depth and regularization provided no benefit once the architectural bottleneck was at feature representation rather than model capacity.

The infra experience also made explicit how the choice of data partitioning scheme (JDBC write partitioning, GCS Parquet layout) dominates end-to-end pipeline throughput more than algorithmic choices.

Stack

GCP (Dataproc, GCS, GCE VM, VPC), Apache Spark 3.5.3 (PySpark), PyTorch, PostgreSQL 16 (Docker), JDBC, Apache Kafka, Neo4j, Python 3.9+

Adaptive Model Selection for Real-Time Heart Disease Detection

Overview

As a Research Assistant at North Carolina State University (under Prof. Zhishan Guo), I contributed to building and evaluating an Adaptive Model Selection (AMS) framework for real-time cardiovascular disease detection on wearable embedded hardware — targeting deployment on a Raspberry Pi 4.

The core problem: ECG inference latency is bounded by the patient’s instantaneous heart rate (higher HR = shorter beat deadline), but accuracy increases with a heavier model. A fixed-complexity model either misses deadlines at high heart rate or wastes capacity at low heart rate. Our AMS framework solves this by dynamically selecting from three model tiers at every beat window based on real-time HR.

The work was published as a research paper at an IEEE conference.

Publication: “Adaptive Model Selection for Real-Time Heart Disease Detection on Embedded Systems” (2nd author)
IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2025)

Task Definition

The system performs 5-class cardiac severity classification (severity levels 0–4) on single-lead ECG data in real time. Rather than classifying individual disease types (the full dataset covers 72 disease categories), the system assigns a severity score to each heartbeat cycle — enabling timely alerts and continuous risk monitoring on a wearable without requiring a full diagnostic workup.

Dataset: PhysioNet 2021 Challenge — filtered to 22,359 single-label ECG recordings, split 64% training / 16% validation / 20% test.

Model Architecture

Each input segment contains β consecutive R–R cycles, each resampled to 256 samples. The model has two parallel branches fused via global attention:

ECG branch: A stem convolution → three Residual Blocks (channel widths 8β → 16β → 32β → 64β) each containing a Squeeze-and-Excitation (SE) unit for channel recalibration → adaptive average pooling to length α.

Period branch: The inter-beat period vector passes through a FC block (Linear → BatchNorm → ELU), mapping to the same 64β feature space.

Global Attention fusion: The flattened ECG features and period embedding are concatenated, fed to a two-layer attention module that produces a sigmoid mask modulating the ECG features — allowing the network to weight cycle regions by their rhythm context.

Output: The attended features pass through two FC layers to produce 5 logits (severity 0–4).

This architecture couples morphological feature extraction (ResBlocks + SE) with rhythm-aware re-weighting (period branch + global attention), kept compact for embedded deployment.

AMS Framework and Anytime CNN

Three model tiers share a common parameter-shared Anytime CNN backbone with early-exit heads:

  • High HR (≥ 90 bpm): Lightweight exit — fastest path, 0.57 ms, handles tight deadlines.
  • Moderate HR (70–90 bpm): Moderate exit — adds one ResBlock+SE, 1.79 ms.
  • Low HR (< 70 bpm): Advanced exit — full depth with global attention, 1.94 ms, highest accuracy.

At every shifted window, the AMS controller reads instantaneous HR and routes to the shallowest model that can meet the beat’s timing deadline. All three exits are jointly trained with deep supervision (equal-weight loss summing), so each head remains independently accurate while sharing the backbone weights — keeping the total checkpoint under 5 MB in the two-cycle configuration.
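The routing policy itself is only a few lines. A sketch using the HR bands above (placing the 90 bpm boundary in the high-HR band, per the "≥ 90" definition):

```python
def select_exit(hr_bpm):
    """Route each beat to the shallowest exit that fits its deadline.
    The per-beat deadline shrinks as HR rises (60000 / HR ms per beat)."""
    if hr_bpm >= 90:
        return "lightweight"   # fastest path for the tightest deadlines
    if hr_bpm >= 70:
        return "moderate"      # adds one ResBlock+SE
    return "advanced"          # full depth with global attention
```

Because all three exits share one backbone, switching tiers at runtime needs no weight reload.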

Results

| Model | Cycles | Accuracy | F1 | Inference (ms) | Deadline Misses |
| --- | --- | --- | --- | --- | --- |
| AMS + Anytime | 2 | 91.5% | 90.6% | 1.33 | 0 |
| Advanced (standalone) | 2 | 92.6% | 91.1% | 1.94 | 431/1000 |
| Moderate | 2 | 87.8% | 87.7% | 1.79 | 259/1000 |
| Lightweight | 2 | 86.5% | 86.6% | 1.05 | 0 |
| CNN-LSTM (baseline) | 2 | 87.3% | 87.6% | 3.33 | 1000/1000 |

Key finding: Two cardiac cycles is the optimal input length — one extra beat provides enough temporal context to improve accuracy meaningfully, while three or four cycles push latency past the real-time budget. The AMS+Anytime configuration achieves the accuracy sweet spot (91.5%) with zero deadline misses across all heart-rate regimes.

Technical Details

Preprocessing:

  • R-peaks detected using Hamilton’s algorithm (BioSPPy library); heartbeat cycles extracted as R–R intervals and resampled to 256 points.
  • Labels assigned per-cycle based on the recording’s severity score; multi-label recordings excluded to eliminate annotation ambiguity.
  • Fixed preprocessing/label-alignment issues in the PhysioNet dataset that caused unstable cross-fold metrics.
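Per-cycle resampling is linear interpolation onto a fixed grid. A sketch (in the real pipeline the R-peak indices that bound each cycle come from BioSPPy's Hamilton segmenter):

```python
import numpy as np

def resample_cycle(cycle, n=256):
    """Resample one R-R cycle of ECG samples to a fixed length n
    via linear interpolation on a normalized time axis."""
    cycle = np.asarray(cycle, dtype=float)
    old = np.linspace(0.0, 1.0, len(cycle))
    new = np.linspace(0.0, 1.0, n)
    return np.interp(new, old, cycle)
```

Fixing every cycle to 256 samples is what lets β consecutive cycles be stacked into a constant-shape model input regardless of heart rate.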

Scheduling:

  • EDF (Earliest-Deadline-First) schedulability analysis verified the system can co-exist with other concurrent tasks (UI, Bluetooth, sensor fusion) on a uniprocessor without deadline violations.
  • A microsecond-resolution watchdog can pre-empt inference at a configurable fraction of the beat budget and fall back to a shallower exit if needed.
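For implicit-deadline periodic tasks on a uniprocessor, the EDF feasibility check reduces to the utilization bound. A sketch in which the inference budget is taken from the measured latencies and the UI/Bluetooth/sensor-fusion budgets are assumed for illustration:

```python
def edf_feasible(tasks):
    """EDF on a uniprocessor with implicit deadlines is feasible
    iff total utilization sum(C/T) <= 1."""
    return sum(c / t for c, t in tasks) <= 1.0

# (C_ms, T_ms) pairs: lightweight inference on a 120 bpm beat (500 ms period),
# plus assumed budgets for UI, Bluetooth, and sensor fusion.
tasks = [(0.57, 500), (5, 50), (2, 100), (1, 20)]
```

Even with the concurrent tasks, total utilization stays well under 1, which is the schedulability argument made above.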

Training:

  • Adam optimizer, lr 0.001, batch size 128, early stopping on validation loss.
  • Multi-exit deep supervision: losses from all three exit heads summed with equal weights.
  • Evaluated on Raspberry Pi 4 (quad-core ARM Cortex-A72) as a proxy for commercial wearable SoCs.

Challenges

  1. Latency–accuracy trade-off at the per-beat level: No single fixed model can meet deadlines at high HR while maximizing accuracy at low HR. The AMS+Anytime design resolves this by making depth selection a runtime policy rather than a design-time choice.

  2. Label-alignment bugs in PhysioNet preprocessing: Early experiments showed high cross-fold metric variance. Root cause was windowing misalignment causing future-label leakage. Fixing alignment via Hamilton R-peak anchoring eliminated the variance.

  3. Memory budget on embedded SoC: Three independent checkpoints would exceed wearable SRAM. Parameter sharing via early-exit architecture brings the two-cycle AMS model to under 5 MB — feasible for a smartwatch.

Reflection and Insights

The most important insight from this project: adaptive depth selection is not an optimization — it is a prerequisite for correctness in real-time embedded ML. A model that achieves 92.6% accuracy in batch evaluation but misses 431 out of 1000 deadlines on-device is not a working real-time system. Framing the problem through the lens of schedulability analysis (EDF, utilization bounds) made this explicit and led directly to the AMS design. The secondary insight is that multi-exit parameter sharing is the right architectural response to memory-constrained deployment: all complexity levels coexist in one checkpoint, switchable with zero weight reload overhead.

Team and Role

Research at NCSU under Prof. Zhishan Guo. My responsibilities: co-designing the CNN architecture (ResBlocks + SE + Global Attention), debugging the PhysioNet preprocessing pipeline, benchmarking model tiers on Raspberry Pi, contributing to AMS framework design, and co-authoring the RTCSA 2025 paper.

Reinforcement Learning for Inverted Pendulum Control

Overview

This project involved applying Reinforcement Learning (RL) algorithms to control both single and double inverted pendulum systems. Using algorithms such as Q-Learning, DQN, A2C, DDPG, and PPO, we implemented controllers to achieve swing-up and stabilization tasks. The project explored the dynamic complexities of inverted pendulum systems and highlighted the effectiveness of RL techniques for non-linear control problems.

Results

  • Single Inverted Pendulum:
    • Achieved a 100% success rate for swing-up and stabilization tasks under ideal conditions.
    • Maintained a 90% success rate under noisy conditions with a simulation time of 30 seconds.
  • Double Inverted Pendulum:
    • Successfully stabilized the pendulum but encountered challenges in achieving swing-up with model-free RL methods.
  • Performance Metrics:
    • Trained RL models for swing-up and stabilization tasks in under 50,000 episodes.
    • Demonstrated the effectiveness of custom reward functions for dynamic control tasks.

GitHub (Chinese README)

Demo clips:
  • Swing-up from a stationary state
  • Swing-up under noisy conditions
  • Stabilization of the double inverted pendulum

Technical Details

  • Algorithms Applied:
    • Q-Learning and DQN: Explored discrete action spaces for initial experiments.
    • A2C and PPO: Achieved robust performance for stabilization tasks in continuous action spaces.
    • DDPG: Provided smooth control for swing-up tasks with deterministic policy gradients.
  • Custom Toolkit:
    • Developed RL agents from scratch using PyTorch, including functions for initialization, model updates, and action sampling.
    • Designed visualization tools to monitor reward curves and training metrics.
  • Reward Design:
    • Swing-up Task: Rewarded higher pendulum angles while penalizing velocity at the peak.
    • Stabilization Task: Encouraged minimal deviation from the vertical position and low angular velocity.
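The two reward shapes can be sketched as follows (coefficients are illustrative; θ = 0 at upright):

```python
import numpy as np

def swingup_reward(theta, theta_dot):
    """Reward pendulum height, penalize angular velocity near the peak."""
    height = np.cos(theta)                       # +1 upright, -1 hanging down
    peak_penalty = 0.1 * theta_dot**2 * max(height, 0.0)
    return height - peak_penalty

def stabilize_reward(theta, theta_dot):
    """Encourage small deviation from vertical and low angular velocity."""
    return -(theta**2 + 0.1 * theta_dot**2)
```

Gating the velocity penalty on height lets the agent build momentum at the bottom of the swing while still being pushed toward a slow, catchable arrival at the top.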

Challenges

  • Swing-Up Task:
    • Coordinating motion during the throw-and-catch process was challenging, especially under noisy conditions.
    • Solution: Implemented collaborative agents for swing-up and stabilization, with separate reward functions for each sub-task.
  • Double Inverted Pendulum:
    • Model-free RL struggled with the system’s chaotic behavior.
    • Solution: Transitioned to model-based approaches like PILCO for better state-action-reward predictions.

Reflection and Insights

This project deepened my understanding of reinforcement learning and its application to real-world control problems. It highlighted the importance of tailored reward functions and robust algorithm selection for dynamic systems. The challenges in handling chaotic behaviors inspired further exploration into model-based strategies to enhance RL performance.

Team and Role

  • Team: Worked collaboratively with two teammates on RL model implementation and evaluation.
  • My Role:
    • Focused on the single inverted pendulum tasks, including algorithm selection and reward function design.
    • Developed custom RL agents using PyTorch, optimizing hyperparameters for efficient training.
    • Led the implementation of the collaborative “throw-catch” process for swing-up tasks.

Statistical Learning for Data Science

Overview

This project series was completed as part of the Statistical Learning for Data Science course at Southern University of Science and Technology. The work covered two major tasks, pre-trained feature extraction with classical classifiers and end-to-end fine-tuning, applied to medical image classification with a focus on fundus lesion diagnosis. The project explored how feature representations from pre-trained deep networks can be combined with classical classifiers to achieve high accuracy with reduced computational cost.

Results

  • Task 1 — Pre-trained Feature Extraction: Used ResNet18 as a frozen feature extractor; downstream classifiers (Linear Regression, KNN, SVM) achieved 100% accuracy on the test set, demonstrating the quality of ResNet18’s learned representations.
  • Task 2 — Fine-tuned ResNet18: Fine-tuned ResNet18 end-to-end on the 3-class fundus dataset, converging in ~3 epochs with 100% test accuracy and near-perfect AUC across all classes.
  • Bonus — Custom CNN: Designed a lightweight CNN from scratch using PyTorch, achieving 99.53% accuracy in 155 s training time vs. 348 s for ResNet18, demonstrating favorable speed-accuracy trade-off.
  • Extension — 7-class Classification: Extended the fine-tuned ResNet18 to a 7-class problem; all classes achieved AUC = 1.00, validating the method’s scalability.

Technical Details

  • Dataset: Fundus lesion images categorized into 3 (and later 7) classes; standard preprocessing with resize, normalization, and contrast adjustment.
  • Hybrid Pipeline:
    • ResNet18 (pre-trained on ImageNet) used as a backbone to extract 1000-dimensional feature vectors.
    • Traditional classifiers (Linear Regression, KNN, SVM, MLP) trained on extracted features using sklearn.
  • Custom CNN Architecture:
    • Convolutional channels: [16, 32, 64], kernel size 3×3, max pooling with stride 2.
    • Grayscale edge-detected preprocessing (Canny, Gaussian blur) to reduce input redundancy.
    • Fully connected MLP head for multi-class output.
  • Training Setup: SGD optimizer (lr=0.001, momentum=0.9), cross-entropy loss, 3–5 epochs.
  • Evaluation: Accuracy, ROC curves, and AUC per class; all reported in the final report.
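The hybrid pipeline's second stage can be sketched with scikit-learn. Here, well-separated synthetic vectors stand in for the frozen ResNet18 features; in the real pipeline each fundus image is first pushed through the backbone to get its 1000-dim vector:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for frozen-backbone features: 3 classes, 60 images each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (60, 1000)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 60)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                      random_state=0, stratify=y)
clf = SVC().fit(Xtr, ytr)          # classical classifier on deep features
acc = clf.score(Xte, yte)
```

When the backbone's features are this discriminative, even simple classifiers separate the classes perfectly, which is the behavior observed in Task 1.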

Challenges

  • Speed vs. accuracy trade-off: The custom CNN was significantly faster (2.2×) but slightly less accurate than ResNet18. The gap was attributed to the simplicity of convolution layers and grayscale conversion that discards color information.
  • Feature quality vs. training cost: Frozen ResNet18 features were so discriminative that even linear classifiers achieved perfect accuracy, raising the question of when fine-tuning is truly necessary.
  • 7-class generalization: Extending to a harder 7-class scenario required careful dataset balancing and preprocessing to maintain generalization.

Reflection and Insights

This project reinforced a key principle in applied machine learning: strong pre-trained feature representations can often substitute for expensive end-to-end training, especially when labeled data is limited. The hybrid approach — deep features paired with classical classifiers — offers a practical and interpretable alternative to black-box deep models in medical contexts. Designing the custom CNN from scratch also deepened understanding of how architectural choices (depth, width, pooling strategy) affect both accuracy and training efficiency.

Team and Role

  • Team: Collaborated with two teammates on methodology design, experiments, and report writing.
  • My Role: Led the custom CNN design and preprocessing pipeline; contributed to the hybrid pipeline experiments and analysis of width/depth trade-offs.