Cloud-Native IoT Network Security Analysis Pipeline
Overview
Built a cloud-native, end-to-end intrusion detection pipeline for IoT network traffic as a course project in 18763 System Toolchains (Fall 2025, CMU), in collaboration with Yiqiao Zhou. The system processes the MQTTset dataset — 20 million MQTT records from a simulated smart home environment — to perform 6-class attack classification across normal traffic and five attack types (brute force, DoS, flood, SlowITe, malformed packet).
The pipeline is structured as four tasks: database design and population (Task I), large-scale analytics (Task II), ML modeling (Task III), and full cloud deployment on GCP (Task IV/Bonus).
Results
Machine Learning (Task III):
| Model | Framework | Best Test Accuracy |
|---|---|---|
| Logistic Regression | Spark ML | 74.97% |
| Shallow MLP | PyTorch | 75.40% |
| Deep MLP | PyTorch | 75.40% |
| Random Forest | Spark ML | 78.52% |
Random Forest (30 trees, depth 7) was the best-performing model: the tree-based ensemble's non-linear decision boundaries were more effective than the neural networks on this structured tabular feature set. Key finding: `maxDepth=7` vs. `maxDepth=5` improved test accuracy by ~4 percentage points, while increasing from 30 to 50 trees provided no benefit.
Feature engineering: 47-dimensional feature vectors generated from 34 raw MQTT/TCP columns via one-hot encoding of 4 categorical flag fields + standardized 15 numerical features + constant-column removal.
Technical Details
Data Ingestion (Task I):
- Loaded `train70_augmented.csv` (14M rows) and `test30_augmented.csv` (6M rows) from GCS into a Dockerized PostgreSQL 16 database via JDBC on a GCE VM.
- 16-partition parallel JDBC write; the combined dataset includes a `split` column distinguishing train/test rows.
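The parallel write works because each task targets a disjoint bucket of rows, so no two writers contend for the same range. A minimal, framework-free sketch of the idea (the key format and the CRC32 hash are illustrative assumptions, not the project's actual partitioner):

```python
import zlib

NUM_PARTITIONS = 16

def partition_for(record_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to one of `num_partitions` non-overlapping buckets
    using a deterministic hash (CRC32), so concurrent JDBC writers never
    touch the same bucket."""
    return zlib.crc32(record_key.encode("utf-8")) % num_partitions

def bucketize(keys):
    """Group a batch of record keys by their target partition before writing."""
    buckets = {p: [] for p in range(NUM_PARTITIONS)}
    for k in keys:
        buckets[partition_for(k)].append(k)
    return buckets

buckets = bucketize(f"pkt-{i}" for i in range(10_000))
```

In Spark terms, the same effect comes from repartitioning the DataFrame on the hashed key before calling the JDBC writer, so each executor task owns one bucket end to end.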
Analytics (Task II — PySpark):
- Average MQTT message length by attack class; TCP statistics and MQTT header flag distributions.
- Top TCP flags filtered by time delta; target class distribution histograms.
- Kafka streaming pipeline (Task II-Q5): YouTube API producer publishes cybersecurity video comments to a Kafka topic; Spark Streaming consumer performs real-time keyword-frequency analysis.
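Per micro-batch, the consumer's keyword-frequency step reduces to tokenize, filter, count. A plain-Python sketch of that core logic (the watchlist and tokenizer are illustrative stand-ins, not the project's actual code):

```python
import re
from collections import Counter

KEYWORDS = {"mqtt", "iot", "ddos", "botnet", "malware"}  # illustrative watchlist

def keyword_frequencies(comments):
    """Count occurrences of watched keywords across a batch of comment
    strings -- the aggregation a streaming consumer would apply to each
    Kafka micro-batch."""
    counts = Counter()
    for text in comments:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in KEYWORDS:
                counts[token] += 1
    return counts

batch = ["New MQTT botnet found", "DDoS via IoT devices", "mqtt mqtt"]
freqs = keyword_frequencies(batch)
```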
Distributed Feature Engineering (Task III — Spark ML Pipeline):
- Removed 10 near-zero-variance columns (stddev < 1e-6).
- `StringIndexer` + `OneHotEncoder` on 4 categorical flag columns → 32 dimensions; `VectorAssembler` + `StandardScaler` on 15 numerical columns → 15 dimensions.
- Combined via `VectorAssembler` → 47-dim `features` column; labels indexed to integers 0–5.
- Full dataset checkpointed to GCS Parquet to break the Spark execution graph before training.
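The encoding arithmetic (one-hot widths plus standardized numeric columns concatenated into one dense vector) can be sketched without Spark. The vocabularies and statistics below are illustrative stand-ins, not the dataset's real values:

```python
import numpy as np

# Illustrative categorical vocabularies (the real flag columns have their own).
flag_vocabs = {
    "mqtt_flag_a": ["0", "1", "4"],
    "mqtt_flag_b": ["0", "2"],
}
num_means = np.array([100.0, 0.5])  # illustrative column means
num_stds = np.array([25.0, 0.1])    # illustrative column stddevs

def encode_row(flags: dict, numerics: np.ndarray) -> np.ndarray:
    """One-hot encode categorical flags, standardize numeric columns, and
    concatenate -- mirroring StringIndexer/OneHotEncoder + StandardScaler
    + VectorAssembler."""
    parts = []
    for col, vocab in flag_vocabs.items():
        onehot = np.zeros(len(vocab))
        onehot[vocab.index(flags[col])] = 1.0
        parts.append(onehot)
    parts.append((numerics - num_means) / num_stds)
    return np.concatenate(parts)

# Dimensionality = sum of vocab sizes + number of numeric columns = 3 + 2 + 2.
vec = encode_row({"mqtt_flag_a": "4", "mqtt_flag_b": "0"}, np.array([125.0, 0.6]))
```

Note that Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), which affects the final one-hot width relative to a naive encoding like this sketch.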
ML Models:
- Spark ML: Logistic Regression (L1/Lasso regularization, `regParam=0.001`) and Random Forest (`numTrees=30`, `maxDepth=7`, `subsamplingRate=0.5`), tuned via 80/20 `TrainValidationSplit`.
- PyTorch: Shallow MLP (47→96→128→6, ~7K params) and Deep MLP (47→128→128→64→6 with BatchNorm+Dropout, ~25K params). Features exported to GCS Parquet and loaded via a custom `ParquetArrayIterable` DataLoader. Adam optimizer, linear warmup (10K steps), gradient clipping, early stopping.
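The Deep MLP described above can be sketched as a standard `nn.Sequential`; layer sizes follow the text, while the dropout rate and activation choice are assumptions:

```python
import torch
import torch.nn as nn

def make_deep_mlp(in_dim: int = 47, n_classes: int = 6,
                  p_drop: float = 0.2) -> nn.Sequential:
    """47 -> 128 -> 128 -> 64 -> 6 MLP with BatchNorm + Dropout, matching
    the architecture in the write-up (p_drop and ReLU are assumed)."""
    def block(d_in, d_out):
        return [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                nn.ReLU(), nn.Dropout(p_drop)]
    return nn.Sequential(
        *block(in_dim, 128),
        *block(128, 128),
        *block(128, 64),
        nn.Linear(64, n_classes),  # logits for the 6 traffic classes
    )

model = make_deep_mlp()
model.eval()  # disable dropout, use running BatchNorm stats for inference
with torch.no_grad():
    logits = model(torch.randn(8, 47))  # batch of 8 feature vectors
```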
Cloud Deployment (Task IV):
- All three notebooks run on a GCP Dataproc cluster (Apache Spark 3.5.3) with JupyterLab.
- PostgreSQL runs in a Docker container on a separate GCE VM, accessed via private IP within a VPC-secured network — no managed Cloud SQL.
- Full pipeline: GCS CSV → Spark ingestion → Dockerized PostgreSQL → Spark analytics → Spark ML + PyTorch training.
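Standing up the database container on the VM is a one-liner; the image tag matches the PostgreSQL 16 mentioned above, while the credentials, database name, and volume name are illustrative:

```shell
# Run PostgreSQL 16 in a container on the GCE VM, persisting data to a
# named volume. Credentials and DB name are illustrative; Dataproc reaches
# this instance over the VM's private IP inside the VPC.
docker run -d --name pg-iot \
  -e POSTGRES_USER=iot \
  -e POSTGRES_PASSWORD=change-me \
  -e POSTGRES_DB=mqttset \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16
```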
Challenges
- **JDBC write contention at scale:** 20 parallel Spark tasks writing to the same PostgreSQL table caused severe lock contention. Diagnosed via `pg_locks` and slow-query logs; resolved by pre-hashing records into non-overlapping partition ranges before the JDBC write.
- **Parquet-to-PyTorch data pipeline:** Spark's sparse vector format is not directly consumable by PyTorch. Implemented a custom `ParquetArrayIterable` class using `pyarrow.dataset` to convert sparse vectors to dense tensors in a streaming fashion, with configurable batch sizes and value clipping to [-10, 10] for training stability.
- **Neural networks underperforming tree-based models:** Shallow and Deep MLPs achieved only 75.40%, 3.5 pp below Random Forest. Root cause: the engineered tabular features with protocol-flag one-hot vectors are well suited to tree-based splits that can isolate specific flag combinations, whereas MLP architectures lack the inductive bias to capture equivalent patterns as readily.
Reflection and Insights
This project demonstrated concretely why Random Forest often outperforms neural networks on structured tabular data: the tree’s per-split logic naturally handles mixed feature types (categorical one-hot + numerical) without requiring normalization, and ensemble voting reduces variance effectively. The Shallow and Deep MLPs achieved nearly identical accuracy, confirming that added depth and regularization provided no benefit once the architectural bottleneck was at feature representation rather than model capacity.
The infra experience also made explicit how the choice of data partitioning scheme (JDBC write partitioning, GCS Parquet layout) dominates end-to-end pipeline throughput more than algorithmic choices.
Stack
GCP (Dataproc, GCS, GCE VM, VPC), Apache Spark 3.5.3 (PySpark), PyTorch, PostgreSQL 16 (Docker), JDBC, Apache Kafka, Neo4j, Python 3.9+
Project page: https://liferli.com/2025/11/30/projects/iot-security-pipeline/