IDSS: Interpretable Diving Action Quality Assessment Platform

Overview

IDSS (Interpretable Diving Scoring System) is an end-to-end action quality assessment framework for competitive diving, built as a course project at CMU (Fall 2025) with teammates Xin Lin and Vincent Nie.

Standard AQA systems output a numeric score but provide no explanation — coaches and athletes cannot learn from them. IDSS addresses this by combining a procedure-aware deep learning backbone (based on FineDiving [Xu et al., CVPR 2022]) with a Heuristic Quality Assessment Pipeline (HQAP): a rule-based, pose-driven system that computes five physically interpretable kinematic indicators and uses them as auxiliary supervision during training.

The result: IDSS not only outperforms the baseline on all metrics but also generates structured, athlete-readable diagnostic reports with frame-aligned GIF visual evidence.

Results

Performance on the FineDiving dataset (3000 diving videos, 52 action types):

| Metric | Baseline | IDSS | Improvement |
|---|---|---|---|
| Spearman Rank Correlation (ρ) | 0.9272 | 0.9302 | +0.33% |
| tIoU@0.5 | 0.9373 | 0.9559 | +1.99% |
| tIoU@0.75 | 0.5407 | 0.5714 | +5.68% |
| Relative L2 Distance (R-ℓ2 ×100) | 0.3313 | 0.3099 | −6.46% |

Convergence acceleration: IDSS reaches performance comparable to the baseline’s best 200-epoch checkpoint in ~30 epochs, and surpasses it by epoch 88. At epoch 10, IDSS already shows a 62.56% improvement in R-ℓ2 over the baseline — the pose supervision provides an immediate learning signal.

Technical Details

Heuristic Quality Assessment Pipeline (HQAP)

HQAP runs three models in parallel on each video frame to extract object and pose signals:

  • Two Detectron2 (Mask R-CNN) instances for platform, splash, and diver detection.
  • One HRNet model for 16-keypoint diver pose estimation.

Raw outputs are fused into a per-video JSON. Cleaning: a Savitzky-Golay filter smooths pose trajectories (with linear interpolation over internal gaps only), platform positions are linearly interpolated for stability, and splash data is preserved raw.
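The gap-aware smoothing step can be sketched as follows. This is a minimal illustration, not the project code: `clean_pose_track` is a hypothetical helper, and the window/polyorder values are assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

def clean_pose_track(track, window=9, polyorder=2):
    """Smooth one keypoint coordinate over time (hypothetical helper).

    Internal NaN gaps are linearly interpolated before smoothing;
    leading/trailing NaNs are preserved so unreliable end frames
    never contribute to metric aggregation.
    """
    track = np.asarray(track, dtype=float)
    valid = ~np.isnan(track)
    if valid.sum() < window:
        return track  # too few observations to smooth reliably
    first = int(np.argmax(valid))
    last = len(track) - 1 - int(np.argmax(valid[::-1]))
    idx = np.arange(len(track))
    inner = track[first:last + 1].copy()
    inner_valid = ~np.isnan(inner)
    # fill internal gaps only, never extrapolate past the ends
    inner[~inner_valid] = np.interp(idx[first:last + 1][~inner_valid],
                                    idx[first:last + 1][inner_valid],
                                    inner[inner_valid])
    smoothed = track.copy()
    smoothed[first:last + 1] = savgol_filter(inner, window, polyorder)
    return smoothed
```

Keeping the boundary NaNs intact is what lets later aggregation steps simply skip frames where HRNet had no reliable estimate.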

Five kinematic metrics, each computed only during its relevant dive phase:

  1. Somersault Tightness (TUCK phase): shoulder–hip–knee angle; lower = tighter tuck. Temporal mean.
  2. Body Straightness (ENTRY phase): shoulder–hip–ankle angle; 180° = perfectly straight. Temporal mean.
  3. Entry Verticalness (ENTRY phase): body vector vs. vertical; 0° = perfect vertical entry. Temporal mean.
  4. Splash Size (ENTRY phase): total pixel area of splash bounding boxes. Maximum over entry frames.
  5. Distance from Platform (PIKE/TUCK phases): absolute horizontal distance from diver hip to smoothed platform centroid (sliding window average). 5th percentile value (robust minimum distance during flight).
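The angle-based metrics above (tightness, straightness) all reduce to the angle at a middle joint. A minimal sketch, assuming 2D keypoints; `joint_angle` is an illustrative helper, not the project’s API:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at vertex b (degrees) formed by points a-b-c,
    e.g. shoulder-hip-knee for somersault tightness."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

# A straight body (shoulder, hip, ankle collinear) gives ~180 degrees;
# a tight tuck gives a much smaller shoulder-hip-knee angle.
```

Averaging this angle over the frames of the relevant phase yields the temporal-mean metrics (1)–(3).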

For the Distance metric, the histogram is bimodal, corresponding to two structurally different dive classes — so a Gaussian Mixture Model (GMM) separates the two modes and assigns one of four labels (Too Close / Close / Reasonable / Far) relative to each dive’s own cluster.
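The cluster-conditioned labeling can be sketched like this. The synthetic distances, the σ-based thresholds, and `distance_label` are all illustrative assumptions; the write-up does not specify the real cutoffs.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for per-dive platform distances (pixels):
# two dive populations with different typical clearances.
rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(15, 3, 200), rng.normal(45, 6, 200)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(d.reshape(-1, 1))
cluster = gmm.predict(d.reshape(-1, 1))  # which mode each dive belongs to

def distance_label(x, mean, std):
    """Label a distance relative to its own cluster (hypothetical rule)."""
    if x < mean - 2 * std:
        return "Too Close"
    if x < mean - std:
        return "Close"
    if x <= mean + std:
        return "Reasonable"
    return "Far"
```

Labeling against the dive’s own cluster mean is the key point: a single global percentile threshold would systematically mislabel whichever cluster it was not centered on.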

Each metric is converted to a 3-tier label (“Excellent” / “Average” / “Need Improvement”) using the 25th/75th percentile thresholds across the training set.
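The tiering rule is a pair of percentile cutoffs per metric. A minimal sketch, with the direction flag as an assumption (some metrics, like splash size, improve as they decrease):

```python
import numpy as np

def tier_thresholds(train_values):
    """25th/75th percentile cutoffs computed once over the training set."""
    lo, hi = np.nanpercentile(train_values, [25, 75])
    return lo, hi

def tier(value, lo, hi, higher_is_better=True):
    """Map a raw metric value to one of the three report tiers."""
    if higher_is_better:
        if value >= hi:
            return "Excellent"
        if value >= lo:
            return "Average"
        return "Need Improvement"
    # e.g. splash size or tuck angle: lower is better
    if value <= lo:
        return "Excellent"
    if value <= hi:
        return "Average"
    return "Need Improvement"
```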

IDSS Model Architecture

The backbone follows the procedure-aware FineDiving formulation:

  1. Temporal Segmentation: I3D features → segmentation module predicts L step-transition probabilities, dividing each dive into L+1 sub-actions (take-off, flight, entry, …).
  2. Procedure-Aware Cross-Attention: step-level features from query and exemplar videos passed through Multi-head Cross-Attention to capture relative quality differences per phase.
  3. Multi-Task Regression Head:
    • Score head: predicts relative score difference per step, aggregated to final AQA score.
    • Pose metric head (auxiliary): predicts the 5-dimensional HQAP vector from the same procedure-aware embeddings.

Joint loss: L = L_AQA (pairwise MSE) + L_TAS (temporal segmentation BCE) + λ · L_Pose (pose metric MSE). The auxiliary pose supervision acts as a structural prior, guiding learned features toward physically interpretable quality signals.
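In PyTorch terms, the joint objective looks roughly like the sketch below. The tensor shapes and the value of λ are illustrative, not the project’s actual hyperparameters:

```python
import torch
import torch.nn.functional as F

def joint_loss(score_pred, score_gt, step_logits, step_gt,
               pose_pred, pose_gt, lam=0.1):
    """L = L_AQA + L_TAS + lam * L_Pose (shapes and lam are illustrative)."""
    l_aqa = F.mse_loss(score_pred, score_gt)          # pairwise score regression
    l_tas = F.binary_cross_entropy_with_logits(        # step-transition prediction
        step_logits, step_gt)
    l_pose = F.mse_loss(pose_pred, pose_gt)            # 5-dim HQAP metric vector
    return l_aqa + l_tas + lam * l_pose
```

Because the pose head shares the procedure-aware embeddings with the score head, gradients from `l_pose` push those embeddings toward the physically interpretable quality signals.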

Report Generation

A deterministic statistics-driven template system computes each metric’s percentile, retrieves the matching natural-language template from a pre-defined library, and dynamically fills in the precise value and qualitative evaluation tier. No LLM required. Reports rendered as interactive HTML dashboards with frame-aligned GIFs highlighting the detected issue windows — enabling visual verification by athletes and coaches.
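The template mechanism is ordinary keyed string formatting. The keys, wording, and percentile fields below are hypothetical; the real template library is not shown in this write-up:

```python
# Hypothetical template library keyed by (metric, tier).
TEMPLATES = {
    ("splash_size", "Excellent"):
        "Splash area of {value:.0f} px sits at the {pct:.0f}th percentile "
        "- minimal splash on entry.",
    ("splash_size", "Need Improvement"):
        "Splash area of {value:.0f} px sits at the {pct:.0f}th percentile "
        "- focus on a tighter, more vertical entry.",
}

def render_line(metric, tier, value, pct):
    """Fill one report line from the template library."""
    return TEMPLATES[(metric, tier)].format(value=value, pct=pct)
```

Because the output is fully determined by the metric values and the fixed template text, the reports are reproducible and auditable, which is exactly why no LLM is needed.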

A lightweight Flask web interface handles video upload, processing dispatch, and report delivery.

Challenges

  1. Pose estimation reliability during water entry: HRNet confidence degrades when limbs are submerged. Handled by Savitzky-Golay smoothing on pose trajectories and preserving NaN values at sequence ends rather than extrapolating — preventing corrupted late-frame estimates from affecting metric aggregation.

  2. Bimodal distance distribution: The distance-from-platform metric has a dual-peak histogram due to structurally different dive types. A simple percentile threshold would mislabel one cluster entirely. GMM fitting discovered the two underlying distributions and enabled cluster-conditioned labeling.

  3. Auxiliary supervision calibration: Incorrect λ weighting for the pose loss destabilized early training. Tuning λ and verifying that pose metric regression errors decreased monotonically before score prediction improved confirmed that the auxiliary head was providing useful rather than noisy gradients.

Reflection and Insights

The central lesson: interpretability and performance are complementary, not opposed, when the interpretable component encodes genuine domain knowledge. The HQAP metrics are not post-hoc explanations added after training — they are causal signals that correlate directly with score deductions. Using them as auxiliary training supervision rather than just evaluation labels is what drove both the accuracy improvements and the convergence acceleration. The structural prior “works” precisely because it is physically grounded.

The convergence result is particularly striking: reaching 200-epoch baseline quality in 30 epochs means the interpretable supervision provides a strong inductive bias that dramatically reduces the search space the optimizer must explore. This generalizes: domain-specific supervision is often more sample-efficient than scaling model size or training longer.

Stack

Python, PyTorch, Detectron2, HRNet, I3D, OpenCV, Flask, HTML/CSS (report generation), FineDiving dataset

https://liferli.com/2025/12/31/projects/diving-aqa/

Author

Zhiling Li

Posted on

2025-12-31

Updated on

2026-02-27
