Cross-Modal LLM-Based Robotic Arm Interaction and Control System

Overview

As a Research Assistant at Southern University of Science and Technology (SUSTech), I led a research project integrating large language models (LLMs) with physical robotic arm control. The core idea was to allow a user to issue natural-language instructions to a robotic arm — and have those instructions reliably translated into executable, safe robot motion sequences — by designing an LLM agent with a structured tool/function-calling interface over a ROS action layer.

I led the proposal defense that secured research funding for the project, then drove the full system from design to physical hardware validation.

Results

  • Successfully demonstrated end-to-end natural-language-to-motion on a JAKA Zu5 robotic arm: spoken/typed instructions → structured action sequence → physical execution.
  • Validated the full stack in RViz/Gazebo simulation and on the physical JAKA Zu5 arm, confirming sim-to-real transfer with no manual re-tuning.
  • The LLM middleware correctly handled asynchronous inference latency: instructions were queued, execution proceeded without blocking, and feedback was synchronized on completion.

Technical Details

LLM Agent Design:

  • Designed an LLM agent with a tool/function-calling interface that maps natural-language commands to a predefined catalog of ROS action primitives (e.g., move-to-pose, grasp, release, home).
  • Implemented schema-based argument validation for each tool: the LLM must produce structured JSON arguments matching the action schema before any motion is dispatched, preventing malformed or unsafe commands from reaching the robot.
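The schema-gated dispatch described above can be sketched as follows. This is a minimal illustration, not the project's actual API: the tool names mirror the primitives listed above, but `TOOL_SCHEMAS` and `validate_args` are hypothetical stand-ins for the real validation layer.

```python
# Minimal sketch of schema-gated tool dispatch. Each ROS action
# primitive has a simple argument schema; LLM-produced arguments must
# pass validation before any motion command is dispatched.
# (Illustrative names, not the project's actual API.)

TOOL_SCHEMAS = {
    "move_to_pose": {"x": float, "y": float, "z": float},
    "grasp": {"width": float, "force": float},
    "release": {},
    "home": {},
}

def validate_args(tool, args):
    """Return (ok, error_message); reject unknown tools, missing or
    extra keys, and wrong types before anything reaches the robot."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False, f"unknown tool: {tool}"
    missing = set(schema) - set(args)
    extra = set(args) - set(schema)
    if missing or extra:
        return False, f"missing={sorted(missing)} extra={sorted(extra)}"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return False, f"{key} must be {expected.__name__}"
    return True, ""

ok, err = validate_args("move_to_pose", {"x": 0.3, "y": 0.1, "z": 0.25})
assert ok
ok, err = validate_args("grasp", {"width": 0.04})  # missing "force"
assert not ok
```

The strictness is the point: a rejected call costs one more LLM round-trip, while a malformed call that reaches the arm costs hardware.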

Sim-to-Real Stack (JAKA Zu5):

  • Refined the robot’s URDF: corrected link/joint frame origins and collision meshes to match the physical arm’s geometry.
  • Integrated the end-effector gripper into the URDF and MoveIt configuration.
  • Configured MoveIt motion planning with per-joint velocity and acceleration limits to ensure smooth, collision-free trajectories within hardware safety bounds.

ROS Action Middleware:

  • Built a custom ROS action middleware layer that bridges the asynchronous nature of LLM inference with real-time robot execution: actions are queued and dispatched with proper scheduling; the middleware provides feedback callbacks to the LLM agent so it can reason about the current execution state before issuing the next instruction.
  • Integrated vision-based pose estimation for object localization, feeding spatial information back into the LLM context to support pick-and-place style interactions.
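The decoupling pattern at the heart of the middleware can be sketched with a worker thread and an action queue. All names here (`ActionMiddleware`, `submit`, `wait_all`) are illustrative stand-ins, assuming a single execution thread and in-order dispatch:

```python
import queue
import threading

# Sketch of the decoupling pattern: the LLM side enqueues actions
# without blocking on execution; a dispatcher thread executes them in
# order and records feedback the agent can inspect before issuing the
# next instruction. (Illustrative names, not the project's actual API.)

class ActionMiddleware:
    def __init__(self, execute):
        self._queue = queue.Queue()
        self._execute = execute          # callable(action) -> result
        self._feedback = []              # completed-action log
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, action):
        """Called from the LLM side; returns immediately (non-blocking)."""
        self._queue.put(action)

    def _run(self):
        while True:
            action = self._queue.get()
            result = self._execute(action)
            with self._lock:
                self._feedback.append((action, result))
            self._queue.task_done()

    def wait_all(self):
        """Synchronize: block until every queued action has completed,
        then return the feedback log."""
        self._queue.join()
        with self._lock:
            return list(self._feedback)

mw = ActionMiddleware(execute=lambda a: f"done:{a}")
for act in ["move_to_pose", "grasp", "move_to_pose", "release"]:
    mw.submit(act)
log = mw.wait_all()
assert [r for _, r in log] == ["done:move_to_pose", "done:grasp",
                               "done:move_to_pose", "done:release"]
```

In the real stack the `execute` callable would wrap a ROS action client goal; the queue discipline is what lets slow, bursty LLM inference coexist with steady robot-side dispatch.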

Challenges

  1. LLM latency vs. real-time execution: LLM inference is slow and non-deterministic in timing, but robot control requires timely, predictable action dispatch. Solved by decoupling inference and execution into separate threads with an action queue and explicit feedback synchronization rather than naive sequential calls.

  2. Schema drift and hallucination: LLMs sometimes generate structurally invalid tool arguments. The schema-based validation layer acts as a strict interface contract — invalid arguments are rejected with an error message re-injected into the LLM context, prompting self-correction before execution.
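The reject-and-reinject loop in point 2 can be sketched as below. The LLM is replaced by a stub that corrects itself once the error message appears in its context; the validation check is a toy, but the control flow matches the pattern described above.

```python
# Sketch of the reject-and-reinject self-correction loop. `llm_call`
# is a stub standing in for model inference: it produces a bad
# argument first and corrects after seeing the validation error in
# its context. (Hypothetical names, illustrative only.)

def valid(args):
    return isinstance(args.get("x"), float)   # toy "schema" check

def llm_call(context):
    if "error" in context:
        return {"x": 0.5}                     # corrected output
    return {"x": "half"}                      # hallucinated type

def plan_with_self_correction(max_retries=3):
    context = "user: move to x=0.5"
    for _ in range(max_retries):
        args = llm_call(context)
        if valid(args):
            return args                       # safe to dispatch
        # Re-inject the validation error instead of executing.
        context += "\nerror: x must be a float, got " + repr(args["x"])
    raise RuntimeError("LLM failed to produce valid arguments")

assert plan_with_self_correction() == {"x": 0.5}
```

Bounding the retries matters: a model that cannot self-correct within a few attempts should fail loudly rather than loop against the interface.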

  3. URDF fidelity for sim-to-real transfer: Early Gazebo simulations showed trajectory drift on the physical arm due to inaccurate link inertia and joint offset values in the URDF. Systematically measuring and correcting these values eliminated the gap between simulated and physical behavior.

Reflection and Insights

This project made concrete the gap between LLM capability and deployment reliability: the model can understand intent fluently, but without an explicit typed interface contract and validation layer, the output is too unpredictable to safely actuate physical hardware. The schema/tool-calling design pattern — essentially treating the LLM as a high-level planner that calls typed APIs — was the key architectural insight that made the system robust. This pattern has become broadly influential in agentic LLM system design, and experiencing it in the context of a safety-critical physical system gave me a deep intuition for where it is and isn’t sufficient.

Team and Role

Research conducted at SUSTech under faculty supervision. My responsibilities included leading the proposal defense to secure funding, designing the LLM agent architecture and tool-calling interface, building the ROS action middleware, refining the JAKA Zu5 URDF and MoveIt configuration, integrating vision-based pose estimation, and coordinating system validation in simulation and on physical hardware.

ROS Robot Intelligent Navigation and Control System

Overview

This project was the final deliverable for the Robot Perception and Intelligence course (EE211) at Southern University of Science and Technology, built on the ROS2 platform. The goal was to develop a fully autonomous robot capable of navigating to a target location, recognizing and grasping an object using a robotic arm, and avoiding obstacles — all with custom-implemented planning and control modules.

Robot navigation and arm control demonstration

Results

  • Navigation: Successfully navigated to target points using the Nav2 stack with a custom global planner plugin.
  • Object Recognition and Grasping: Detected target objects via Aruco markers; the robotic arm computed inverse kinematics and executed reliable grasps within the reachable workspace.
  • Path Planning: Implemented a custom A* global planner and a trajectory feedback local controller as Nav2 plugins.
  • Extra Challenge: Handled randomly placed objects by dynamically querying IK solvability during slow-approach phases.

Technical Details

  • System Architecture:
    • Finite State Machine (FSM): Coordinated high-level task sequencing (navigate → approach → grasp → return).
    • Navigation: Nav2 stack with tuned parameters for global_costmap, local_costmap, planner_server, and controller_server.
    • Aruco-based Target Recognition: Used camera-based Aruco detection to estimate target pose; TF tree handled all coordinate transformations automatically.
  • Custom A* Planner (MyPlanner):
    • Implemented as a Nav2 global planner plugin in C++.
    • Standard A* graph search on the occupancy grid with heuristic tuning for smooth paths.
  • Custom Trajectory Feedback Controller (MyController):
    • Local controller plugin computing velocity commands to track the reference path.
    • Feedback control based on cross-track error and heading error.
  • Robotic Arm Controller:
    • Queried IK solver (grasp_query_solved()) in a loop during slow approach to determine when the target entered the reachable envelope.
    • Designed custom grasp points with direction information from Aruco pose estimates.
  • PTZ (Pan-Tilt) Tracking:
    • Drove the camera gimbal to track the target during navigation, preventing loss of visibility.
    • Coordinate compensation handled via TF tree rather than manual recalibration.
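The core of the A* search described under MyPlanner can be sketched in a few lines. The project version is a C++ Nav2 `GlobalPlanner` plugin operating on the costmap; this Python sketch shows only the grid search itself, with a plain Manhattan heuristic standing in for the tuned one.

```python
import heapq

# Python sketch of the A* core (the project version is a C++ Nav2
# global planner plugin on the occupancy grid; this shows only the
# search). grid: 2D list, 0 = free, 1 = occupied; 4-connected moves.

def astar(grid, start, goal):
    rows, cols = len(grid), len(grid[0])

    def h(cell):                         # Manhattan heuristic
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0, start)]    # (f, g, cell)
    came_from = {}
    best_g = {start: 0}
    while open_set:
        _, g, cell = heapq.heappop(open_set)
        if cell == goal:                 # reconstruct path
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if g > best_g.get(cell, float("inf")):
            continue                     # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    came_from[(nr, nc)] = cell
                    heapq.heappush(open_set,
                                   (ng + h((nr, nc)), ng, (nr, nc)))
    return None                          # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
assert astar(grid, (0, 0), (2, 0)) == [(0, 0), (0, 1), (0, 2),
                                       (1, 2), (2, 2), (2, 1), (2, 0)]
```

The Nav2 plugin wraps exactly this kind of search behind the `createPlan` interface, which is what makes it swappable with the stock planners during tuning.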

Challenges

  • Odometry Drift: Wheel odometry accumulated error over longer paths, causing the robot to lose accurate positioning relative to the target. Resolved by switching reference to the Aruco marker position during the final approach phase.
  • IK Feasibility Window: The robotic arm’s reachable workspace was constrained, requiring continuous IK queries and a slow-approach strategy to enter the feasible zone before executing a grasp.
  • Costmap Configuration: Getting Nav2’s costmap inflation and obstacle layers tuned for the specific robot geometry required iterative testing.
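The slow-approach strategy behind the IK feasibility window can be sketched as a polling loop. `ik_solvable` stands in for the project's `grasp_query_solved()` IK query; modeling reachability as a distance threshold is an illustrative simplification so the loop is runnable.

```python
# Sketch of the slow-approach / IK-polling strategy. `ik_solvable`
# stands in for the real grasp_query_solved() IK query; ARM_REACH and
# STEP are assumed values, not measured ones.

ARM_REACH = 0.6            # assumed reachable radius (metres)
STEP = 0.05                # slow-approach increment per control cycle

def ik_solvable(target_dist):
    return target_dist <= ARM_REACH

def slow_approach(initial_dist, max_steps=100):
    """Creep toward the target, polling IK each cycle; stop and grasp
    as soon as the target enters the reachable envelope."""
    dist = initial_dist
    for _ in range(max_steps):
        if ik_solvable(dist):
            return dist                  # grasp is triggered here
        dist -= STEP                     # issue a small forward motion
    return None                          # never became reachable

stop_dist = slow_approach(1.0)
assert stop_dist is not None and stop_dist <= ARM_REACH
```

Polling the solver instead of precomputing a stop point is what made the strategy robust to the randomly placed objects mentioned in the results.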

Reflection and Insights

This project provided hands-on experience with the full stack of autonomous robotics: perception, planning, and control. Implementing A* and the trajectory controller as actual Nav2 plugin classes — rather than standalone scripts — deepened understanding of how ROS2’s modular architecture enables component reuse and testing. The challenge of handling coordinate frames across navigation, perception, and manipulation highlighted why a well-structured TF tree is foundational to multi-component robotic systems.

Team and Role

  • Team: Three-person team, each responsible for different subsystems.
  • My Role: Led the development of the custom A* global planner plugin and the trajectory feedback controller; contributed to the FSM design and arm approach strategy.