Cloud-Native IoT Network Security Analysis Pipeline
Overview
Built a cloud-native, end-to-end intrusion detection pipeline for IoT network traffic as a course project in 18763 System Toolchains (Fall 2025, CMU), in collaboration with Yiqiao Zhou. The system processes the MQTTset dataset — 20 million MQTT records from a simulated smart home environment — to perform 6-class attack classification across normal traffic and five attack types (brute force, DoS, flood, SlowITe, malformed packet).
The pipeline is structured as four tasks: database design and population (Task I), large-scale analytics (Task II), ML modeling (Task III), and full cloud deployment on GCP (Task IV/Bonus).
Results
Machine Learning (Task III):
| Model | Framework | Best Test Accuracy |
|---|---|---|
| Logistic Regression | Spark ML | 74.97% |
| Shallow MLP | PyTorch | 75.40% |
| Deep MLP | PyTorch | 75.40% |
| Random Forest | Spark ML | 78.52% |
Random Forest (30 trees, depth 7) was the best-performing model: the tree-based ensemble's non-linear decision boundaries were more effective than the neural networks on this structured tabular feature set. Key finding: `maxDepth=7` vs. `maxDepth=5` improved test accuracy by ~4 percentage points, while increasing from 30 to 50 trees provided no benefit.
Feature engineering: 47-dimensional feature vectors generated from 34 raw MQTT/TCP columns via one-hot encoding of 4 categorical flag fields + standardized 15 numerical features + constant-column removal.
Technical Details
Data Ingestion (Task I):
- Loaded `train70_augmented.csv` (14M rows) and `test30_augmented.csv` (6M rows) from GCS into a Dockerized PostgreSQL 16 database via JDBC on a GCE VM.
- 16-partition parallel JDBC write; the combined dataset includes a `split` column distinguishing train/test rows.
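The parallel write works because each task targets a disjoint bucket of rows, so no two writers contend for the same range. A minimal, framework-free sketch of the idea (the key format and the CRC32 hash are illustrative assumptions, not the project's actual partitioner):

```python
import zlib

NUM_PARTITIONS = 16

def partition_for(record_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to one of `num_partitions` non-overlapping buckets
    using a deterministic hash (CRC32), so concurrent JDBC writers never
    touch the same bucket."""
    return zlib.crc32(record_key.encode("utf-8")) % num_partitions

def bucketize(keys):
    """Group a batch of record keys by their target partition before writing."""
    buckets = {p: [] for p in range(NUM_PARTITIONS)}
    for k in keys:
        buckets[partition_for(k)].append(k)
    return buckets

buckets = bucketize(f"pkt-{i}" for i in range(10_000))
```

In Spark terms, the same effect comes from repartitioning the DataFrame on the hashed key before calling the JDBC writer, so each executor task owns one bucket end to end.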
Analytics (Task II — PySpark):
- Average MQTT message length by attack class; TCP statistics and MQTT header flag distributions.
- Top TCP flags filtered by time delta; target class distribution histograms.
- Kafka streaming pipeline (Task II-Q5): YouTube API producer publishes cybersecurity video comments to a Kafka topic; Spark Streaming consumer performs real-time keyword-frequency analysis.
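Per micro-batch, the consumer's keyword-frequency step reduces to tokenize, filter, count. A plain-Python sketch of that core logic (the watchlist and tokenizer are illustrative stand-ins, not the project's actual code):

```python
import re
from collections import Counter

KEYWORDS = {"mqtt", "iot", "ddos", "botnet", "malware"}  # illustrative watchlist

def keyword_frequencies(comments):
    """Count occurrences of watched keywords across a batch of comment
    strings -- the aggregation a streaming consumer would apply to each
    Kafka micro-batch."""
    counts = Counter()
    for text in comments:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in KEYWORDS:
                counts[token] += 1
    return counts

batch = ["New MQTT botnet found", "DDoS via IoT devices", "mqtt mqtt"]
freqs = keyword_frequencies(batch)
```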
Distributed Feature Engineering (Task III — Spark ML Pipeline):
- Removed 10 near-zero-variance columns (stddev < 1e-6).
- `StringIndexer` + `OneHotEncoder` on 4 categorical flag columns → 32 dimensions; `VectorAssembler` + `StandardScaler` on 15 numerical columns → 15 dimensions.
- Combined via `VectorAssembler` → 47-dim `features` column; labels indexed to integers 0–5.
- Full dataset checkpointed to GCS Parquet to break the Spark execution graph before training.
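The encoding arithmetic (one-hot widths plus standardized numeric columns concatenated into one dense vector) can be sketched without Spark. The vocabularies and statistics below are illustrative stand-ins, not the dataset's real values:

```python
import numpy as np

# Illustrative categorical vocabularies (the real flag columns have their own).
flag_vocabs = {
    "mqtt_flag_a": ["0", "1", "4"],
    "mqtt_flag_b": ["0", "2"],
}
num_means = np.array([100.0, 0.5])  # illustrative column means
num_stds = np.array([25.0, 0.1])    # illustrative column stddevs

def encode_row(flags: dict, numerics: np.ndarray) -> np.ndarray:
    """One-hot encode categorical flags, standardize numeric columns, and
    concatenate -- mirroring StringIndexer/OneHotEncoder + StandardScaler
    + VectorAssembler."""
    parts = []
    for col, vocab in flag_vocabs.items():
        onehot = np.zeros(len(vocab))
        onehot[vocab.index(flags[col])] = 1.0
        parts.append(onehot)
    parts.append((numerics - num_means) / num_stds)
    return np.concatenate(parts)

# Dimensionality = sum of vocab sizes + number of numeric columns = 3 + 2 + 2.
vec = encode_row({"mqtt_flag_a": "4", "mqtt_flag_b": "0"}, np.array([125.0, 0.6]))
```

Note that Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), which affects the final one-hot width relative to a naive encoding like this sketch.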
ML Models:
- Spark ML: Logistic Regression (L1/Lasso regularization, `regParam=0.001`) and Random Forest (`numTrees=30`, `maxDepth=7`, `subsamplingRate=0.5`), tuned via 80/20 `TrainValidationSplit`.
- PyTorch: Shallow MLP (47→96→128→6, ~7K params) and Deep MLP (47→128→128→64→6 with BatchNorm+Dropout, ~25K params). Features exported to GCS Parquet and loaded via a custom `ParquetArrayIterable` DataLoader. Adam optimizer, linear warmup (10K steps), gradient clipping, early stopping.
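The Deep MLP described above can be sketched as a standard `nn.Sequential`; layer sizes follow the text, while the dropout rate and activation choice are assumptions:

```python
import torch
import torch.nn as nn

def make_deep_mlp(in_dim: int = 47, n_classes: int = 6,
                  p_drop: float = 0.2) -> nn.Sequential:
    """47 -> 128 -> 128 -> 64 -> 6 MLP with BatchNorm + Dropout, matching
    the architecture in the write-up (p_drop and ReLU are assumed)."""
    def block(d_in, d_out):
        return [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                nn.ReLU(), nn.Dropout(p_drop)]
    return nn.Sequential(
        *block(in_dim, 128),
        *block(128, 128),
        *block(128, 64),
        nn.Linear(64, n_classes),  # logits for the 6 traffic classes
    )

model = make_deep_mlp()
model.eval()  # disable dropout, use running BatchNorm stats for inference
with torch.no_grad():
    logits = model(torch.randn(8, 47))  # batch of 8 feature vectors
```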
Cloud Deployment (Task IV):
- All three notebooks run on a GCP Dataproc cluster (Apache Spark 3.5.3) with JupyterLab.
- PostgreSQL runs in a Docker container on a separate GCE VM, accessed via private IP within a VPC-secured network — no managed Cloud SQL.
- Full pipeline: GCS CSV → Spark ingestion → Dockerized PostgreSQL → Spark analytics → Spark ML + PyTorch training.
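Standing up the database container on the VM is a one-liner; the image tag matches the PostgreSQL 16 mentioned above, while the credentials, database name, and volume name are illustrative:

```shell
# Run PostgreSQL 16 in a container on the GCE VM, persisting data to a
# named volume. Credentials and DB name are illustrative; Dataproc reaches
# this instance over the VM's private IP inside the VPC.
docker run -d --name pg-iot \
  -e POSTGRES_USER=iot \
  -e POSTGRES_PASSWORD=change-me \
  -e POSTGRES_DB=mqttset \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16
```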
Challenges
- **JDBC write contention at scale:** 20 parallel Spark tasks writing to the same PostgreSQL table caused severe lock contention. Diagnosed via `pg_locks` and slow-query logs; resolved by pre-hashing records into non-overlapping partition ranges before the JDBC write.
- **Parquet-to-PyTorch data pipeline:** Spark's sparse vector format is not directly consumable by PyTorch. Implemented a custom `ParquetArrayIterable` class using `pyarrow.dataset` to convert sparse vectors to dense tensors in a streaming fashion, with configurable batch sizes and value clipping to [-10, 10] for training stability.
- **Neural networks underperforming tree-based models:** Shallow and Deep MLPs achieved only 75.40%, 3.5 pp below Random Forest. Root cause: the engineered tabular features with protocol-flag one-hot vectors are well suited to tree-based splits that can isolate specific flag combinations, whereas MLP architectures lack the inductive bias to capture equivalent patterns as readily.
Reflection and Insights
This project demonstrated concretely why Random Forest often outperforms neural networks on structured tabular data: the tree’s per-split logic naturally handles mixed feature types (categorical one-hot + numerical) without requiring normalization, and ensemble voting reduces variance effectively. The Shallow and Deep MLPs achieved nearly identical accuracy, confirming that added depth and regularization provided no benefit once the architectural bottleneck was at feature representation rather than model capacity.
The infra experience also made explicit how the choice of data partitioning scheme (JDBC write partitioning, GCS Parquet layout) dominates end-to-end pipeline throughput more than algorithmic choices.
Stack
GCP (Dataproc, GCS, GCE VM, VPC), Apache Spark 3.5.3 (PySpark), PyTorch, PostgreSQL 16 (Docker), JDBC, Apache Kafka, Neo4j, Python 3.9+
Project page: https://liferli.com/2025/11/30/projects/iot-security-pipeline/