
3. Cognitive Planning Using LLMs

Explanation: Translating Intent into Executable Robot Actions

The recent advancements in Large Language Models (LLMs) have opened up unprecedented opportunities for robotics, particularly in cognitive planning. This involves enabling robots, especially humanoids, to understand high-level, abstract natural language commands (e.g., "clean the room," "prepare coffee") and autonomously decompose them into a sequence of concrete, executable low-level robot actions. This capability significantly bridges the gap between human intent and robot execution, making robots more accessible, versatile, and intelligent.

The Role of LLMs in Robotic Planning

Traditional robotic planning often relies on symbolic planners that require a meticulously defined model of the world and the robot's capabilities (e.g., PDDL, the Planning Domain Definition Language). While powerful, these systems are brittle in unfamiliar situations and require expert knowledge to set up. LLMs offer a more flexible approach by leveraging their vast world knowledge, common sense reasoning, and ability to process natural language.

The cognitive planning process using LLMs typically involves these steps:

  1. Natural Language Input: The robot receives a high-level command from a human user (e.g., through a Voice-to-Action pipeline, as discussed in the previous chapter).
  2. Prompt Engineering: The core of LLM-based planning. A carefully crafted prompt is constructed, providing the LLM with:
    • The user's command.
    • The robot's current state and its perception of the environment (e.g., a list of visible objects with their properties and poses, the robot's current location). This information typically comes from the vision system.
    • A list of the robot's available primitive actions or skills (e.g., navigate_to(location), pick_up(object_id), open_gripper()). These are the fundamental, reliably executable building blocks of robot behavior.
    • Instructions on the desired output format for the plan (e.g., a JSON list of action calls).
  3. LLM Inference: The LLM processes the prompt and generates a response, ideally a step-by-step plan composed of the specified primitive actions. The LLM acts as a high-level cognitive engine, reasoning about the task and the environment based on its training data.
  4. Plan Parsing and Validation: The textual plan generated by the LLM is parsed into a structured, robot-executable format. This step also includes validation to ensure the plan is syntactically correct, semantically meaningful in the robot's context, and potentially safe (see the sketch after this list).
  5. Action Execution: An orchestration layer (e.g., a Behavior Tree or State Machine) executes the plan by invoking the appropriate low-level robot controllers via ROS 2 services or actions.
  6. Feedback and Replanning: The robot continuously monitors the execution of the plan and the state of the environment. If an action fails, or the environment changes unexpectedly, feedback is provided, and the LLM can be re-queried for replanning or recovery strategies.
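
To make steps 2, 4, and 5 concrete, the sketch below shows the kind of JSON plan the LLM is asked to emit and a naive loop that dispatches each step to a skill implementation. The skill functions and the SKILL_REGISTRY mapping are hypothetical placeholders; on a real humanoid they would wrap ROS 2 service or action clients.

# Minimal sketch: the structured plan format and a naive executor (placeholder skills).

def navigate_to(location: str):
    print(f"[robot] navigating to {location}")

def pick_up(object_id: str, from_location: str):
    print(f"[robot] picking up {object_id} from {from_location}")

# Maps skill names (as the LLM emits them) to callable implementations.
SKILL_REGISTRY = {
    "navigate_to": navigate_to,
    "pick_up": pick_up,
}

# The plan format requested from the LLM: a JSON array of skill calls.
plan = [
    {"skill": "navigate_to", "args": {"location": "kitchen"}},
    {"skill": "pick_up", "args": {"object_id": "red_apple", "from_location": "kitchen_table"}},
]

for step in plan:
    skill_fn = SKILL_REGISTRY.get(step["skill"])
    if skill_fn is None:
        # Reject hallucinated skills instead of silently skipping them.
        raise ValueError(f"Unknown skill in plan: {step['skill']}")
    skill_fn(**step["args"])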

Advantages for Humanoid Robotics

  • Generalization: LLMs can generalize to new, unseen instructions and tasks, drawing on their broad knowledge base and reducing the need for explicit pre-programming for every possible scenario.
  • Intuitive Interface: Enables humans to interact with robots using natural language, making humanoids more accessible and user-friendly.
  • Common Sense Reasoning: LLMs can infuse common sense into robotic tasks, helping to resolve ambiguities or infer implicit steps in a command.
  • Contextual Awareness: By incorporating environmental context into the prompt, LLMs can generate plans that are grounded in the robot's current perception.
  • Decomposition: Effectively breaks down complex, abstract goals into smaller, manageable sub-goals.

Code Examples

LLM-Powered Plan Generation with Context and Skills (Python)

This example demonstrates a conceptual Python component that uses an LLM (via the OpenAI Chat Completions API) to generate a plan from a natural language command, the current environment context (from a simulated vision system), and a defined set of robot skills.

import openai
import json
import os

# Assume OPENAI_API_KEY is set in environment variables
# os.environ["OPENAI_API_KEY"] = "sk-..."


class RobotPerception:
    """Simulates a robot's perception system providing environment context."""

    def get_current_scene(self):
        # In a real robot, this would come from a vision system (e.g., Isaac ROS)
        return {
            "robot_location": "living_room",
            "objects": [
                {"id": "red_apple", "type": "fruit", "color": "red", "location": "on_kitchen_table"},
                {"id": "blue_cup", "type": "container", "color": "blue", "location": "on_living_room_coffee_table", "state": "empty"},
                {"id": "door_kitchen", "type": "door", "location": "between_living_room_and_kitchen", "state": "closed"},
                {"id": "tray", "type": "container", "color": "silver", "location": "on_kitchen_counter"}
            ],
            "human_presence": {"location": "living_room", "status": "idle"}
        }


class RobotSkills:
    """Defines the primitive actions a robot can execute."""

    def get_skills_definition(self):
        return [
            {"name": "navigate_to", "description": "Moves the robot to a specified location.", "parameters": {"location": "string"}},
            {"name": "open_door", "description": "Opens a specified door.", "parameters": {"door_id": "string"}},
            {"name": "pick_up", "description": "Picks up an object from a specified location.", "parameters": {"object_id": "string", "from_location": "string"}},
            {"name": "place_on", "description": "Places an object onto a specified target location.", "parameters": {"object_id": "string", "target_location": "string"}},
            {"name": "report_status", "description": "Reports a message to the human operator.", "parameters": {"message": "string"}}
        ]


class LLMPlanner:
    def __init__(self, openai_api_key=None):
        if openai_api_key:
            self.client = openai.OpenAI(api_key=openai_api_key)
        else:
            self.client = openai.OpenAI()  # Reads from OPENAI_API_KEY env var
        self.perception_system = RobotPerception()
        self.robot_skills = RobotSkills()

    def generate_plan(self, human_command: str):
        current_scene = self.perception_system.get_current_scene()
        available_skills = self.robot_skills.get_skills_definition()

        system_message = {
            "role": "system",
            "content": f"""
You are a helpful humanoid robot assistant. Your goal is to generate a step-by-step plan
to fulfill a given natural language command.

The plan must be a JSON array of skill calls, where each skill call is an object
with "skill" (name of the skill) and "args" (a dictionary of parameters).

Use only the following available skills:
{json.dumps(available_skills, indent=2)}

Current Scene Information:
{json.dumps(current_scene, indent=2)}

Reason step-by-step before providing the final JSON plan.
"""
        }

        user_message = {
            "role": "user",
            "content": f"Natural Language Command: '{human_command}'"
        }

        try:
            response = self.client.chat.completions.create(
                model="gpt-4o-mini",  # Or 'gpt-4o', 'gpt-4' for more complex reasoning
                messages=[system_message, user_message],
                temperature=0.0,  # Make the output deterministic
                max_tokens=500
            )

            # The LLM's response might contain reasoning before the JSON.
            # We need to extract the JSON part.
            full_response_content = response.choices[0].message.content
            # Find the first opening and last closing square bracket of the JSON array
            json_start = full_response_content.find('[')
            json_end = full_response_content.rfind(']')

            if json_start != -1 and json_end != -1:
                json_plan_str = full_response_content[json_start : json_end + 1]
                print("--- LLM Raw Response ---")
                print(full_response_content)
                print("--- Extracted JSON ---")
                print(json_plan_str)
                return json.loads(json_plan_str)
            else:
                print("--- LLM Raw Response (No JSON found) ---")
                print(full_response_content)
                print("LLM did not return a valid JSON plan.")
                return []
        except openai.APIError as e:
            print(f"OpenAI API Error: {e}")
            return []
        except json.JSONDecodeError as e:
            print(f"Failed to decode JSON from LLM response: {e}")
            return []
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            return []


if __name__ == "__main__":
    # Ensure OPENAI_API_KEY is set in your environment
    if "OPENAI_API_KEY" not in os.environ:
        print("Please set the OPENAI_API_KEY environment variable.")
        exit()

    planner = LLMPlanner()

    # Simple command
    command1 = "Pick up the red apple from the kitchen table and put it on the tray."
    plan1 = planner.generate_plan(command1)
    print(f"\nPlan for '{command1}':")
    for step in plan1:
        print(f" - {step['skill']}({step['args']})")

    # More complex command (requires opening the kitchen door along the way)
    command2 = "Bring the blue cup from the coffee table to the kitchen and place it on the tray."
    plan2 = planner.generate_plan(command2)
    print(f"\nPlan for '{command2}':")
    for step in plan2:
        print(f" - {step['skill']}({step['args']})")

Diagrams (in Markdown)

LLM-Based Cognitive Planning Pipeline

graph TD
    A[Human Natural Language Command] --> B["Transcribed Text (from Whisper)"];
    C["Robot's Primitive Skills Definition"];
    D["Robot's Current Environment State (from Vision)"];

    subgraph LLM Cognitive Planner
        E[Prompt Engineering] --> F["LLM Inference (e.g., GPT-4o)"];
        F --> G["Textual Plan Output (JSON)"];
    end

    B -- "Command" --> E;
    C -- "Skills" --> E;
    D -- "Scene Info" --> E;

    G --> H["Plan Parser & Validator"];
    H --> I["Executable Robot Actions (Structured Plan)"];
    I --> J["Robot Control System (ROS 2 Execution)"];
    J --> K[Humanoid Robot];
style A fill:#cfc,stroke:#333,stroke-width:1px;
style B fill:#add8e6,stroke:#333,stroke-width:1px;
style C fill:#ffc,stroke:#333,stroke-width:1px;
style D fill:#a2e0ff,stroke:#333,stroke-width:1px;
style E fill:#dee,stroke:#333,stroke-width:1px;
style F fill:#f9f,stroke:#333,stroke-width:2px;
style G fill:#ffc,stroke:#333,stroke-width:1px;
style H fill:#bde,stroke:#333,stroke-width:1px;
style I fill:#a2e0ff,stroke:#333,stroke-width:1px;
style J fill:#dee,stroke:#333,stroke-width:1px;
style K fill:#bde,stroke:#333,stroke-width:1px;

Figure 4.3: LLM-Based Cognitive Planning Pipeline. The LLM takes a human command, the robot's current environment state (from vision), and a definition of available robot skills as input. Through prompt engineering, it generates a textual plan (often in JSON) that is then parsed, validated, and executed by the robot's control system.

Tables

| Stage | Description | Key Challenges | LLM Contribution |
| --- | --- | --- | --- |
| NLU (Intent) | Understanding the user's high-level goal. | Ambiguity, context-dependency. | Directly interprets natural language. |
| Task Decomposition | Breaking abstract goal into sub-goals. | Identifying necessary steps, ordering. | Leverages common sense; breaks down tasks based on knowledge. |
| Action Grounding | Mapping sub-goals to robot's primitive skills. | Matching abstract concepts to concrete actions. | Selects appropriate skills; fills in parameters from context. |
| State Representation | Providing the LLM with relevant environment info. | Keeping context window small; real-time updates. | Consumes structured (e.g., JSON) world state. |
| Constraint Satisfaction | Ensuring plans respect robot capabilities & environment. | Avoiding impossible or unsafe actions. | Can be prompted to respect constraints; requires external validation. |
| Error Recovery | Handling unexpected failures during execution. | Identifying root cause; generating alternative plans. | Can suggest recovery steps; replan based on feedback. |

Callouts

tip

Prompt Templates are Crucial: Developing effective prompt templates that consistently guide the LLM to output well-formatted and executable plans is an iterative process. Clearly define the robot's capabilities, the environment state, and the desired output format.
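
One lightweight way to keep prompts consistent across runs is to factor the template out of the planner code, for example with Python's string.Template. The structure below mirrors the fields used in the planner example above; it is one possible layout, not a prescribed format.

import json
from string import Template

# A reusable prompt skeleton; $skills, $scene, and $command are filled per request.
PLANNER_PROMPT = Template(
    "You are a helpful humanoid robot assistant.\n"
    "Return ONLY a JSON array of skill calls, each with \"skill\" and \"args\".\n\n"
    "Available skills:\n$skills\n\n"
    "Current scene:\n$scene\n\n"
    "Command: $command\n"
)

def build_prompt(command: str, scene: dict, skills: list) -> str:
    # Dumping the structured context as JSON keeps the format explicit for the LLM.
    return PLANNER_PROMPT.substitute(
        skills=json.dumps(skills, indent=2),
        scene=json.dumps(scene, indent=2),
        command=command,
    )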

warning

Hallucinations & Grounding: LLMs can "hallucinate" (generate factually incorrect or illogical outputs). It's critical to validate the LLM's generated plan against the robot's physical capabilities and the real-world environment before execution to ensure safety and feasibility.

Step-by-step Guides

Implementing an LLM-Based Cognitive Planner for a Humanoid:

  1. Define Robot Primitive Skills: Create a comprehensive, unambiguous definition of every action your humanoid can reliably perform. Each skill should have a name, a clear description, and defined parameters (type hints are useful). Represent these as functions or API endpoints the robot can call.
  2. Integrate Perception System: Ensure your robot's perception system (from Module 3) can provide a structured, up-to-date representation of the environment. This typically involves a list of detected objects with their types, properties (color, size), and 3D poses.
  3. Construct the Prompt:
    • Start with a clear system message instructing the LLM on its role (e.g., "You are a helpful robot assistant...").
    • Include the user's natural language command.
    • Provide the current environment state (e.g., as a JSON dump from your perception system).
    • List the available robot skills with their descriptions and parameter schemas.
    • Explicitly instruct the LLM on the desired output format for the plan (e.g., JSON array of skill calls).
  4. Implement LLM Query Logic: Use an LLM API client (e.g., openai Python library) to send the constructed prompt and receive the generated plan.
  5. Develop a Plan Parser and Executor:
    • Create a component that can parse the LLM's textual output into an executable data structure (e.g., a list of Python function calls or ROS 2 actions).
    • Implement a dispatcher that can invoke the actual robot primitive skills (which would internally call ROS 2 services/actions); a minimal sketch follows this guide.
    • Incorporate validation checks to ensure the parsed plan is valid and safe before execution.
  6. Integrate Feedback Loop: Design the system to monitor plan execution. If an action fails or the environment changes, use this feedback to inform the LLM for replanning.
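
The sketch below ties steps 5 and 6 together: a dispatcher maps skill names from the parsed plan onto callables (which on a real humanoid would wrap ROS 2 service or action clients), and execution halts with a failure report that can be appended to a replanning prompt. The function name and return structure are illustrative, not part of any particular framework.

def execute_plan(plan, skill_registry):
    """Run each step; on failure, return context that can be fed back to the LLM for replanning."""
    for i, step in enumerate(plan):
        handler = skill_registry.get(step["skill"])
        if handler is None:
            return {"status": "failed", "step": i, "reason": f"unknown skill '{step['skill']}'"}
        try:
            handler(**step["args"])  # wraps a ROS 2 service/action call on a real robot
        except Exception as exc:
            # This failure description is what would be sent back to the LLM (step 6).
            return {"status": "failed", "step": i, "reason": str(exc)}
    return {"status": "succeeded"}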

Summary

This chapter delved into the transformative potential of Large Language Models (LLMs) for cognitive planning in humanoid robotics. By enabling robots to translate high-level natural language commands into sequences of executable actions, LLMs provide a powerful engine for task decomposition, common sense reasoning, and generalizing to new scenarios. The process involves meticulous prompt engineering that integrates human intent, environmental context, and the robot's primitive skills. While challenges like grounding and hallucination remain, LLM-based planning promises to make humanoids more intuitive to interact with, vastly expanding their operational versatility and intelligence in complex human environments.

Exercises

  1. Prompt Engineering for Ambiguity: A humanoid robot needs to handle the command "Make the room look nice." Design a prompt for an LLM that would allow the robot to:
    • Query the user for clarification if the command is too vague.
    • Propose a list of possible actions to "make the room look nice" (e.g., "put items away," "straighten cushions").
    • Generate a plan once a specific sub-task is chosen.
  2. Skill-Set Expansion: You add a new skill to your humanoid: detect_person(location). How would you update the LLM planner's prompt and skill definitions to enable the robot to use this new skill to fulfill a command like "Go greet the visitor at the door"?
  3. Visual Grounding in Planning: How would you modify the get_current_scene function in the code example to include visual information about objects (e.g., their exact 3D coordinates, their color from camera input), and how would you adapt the LLM prompt to leverage this for more precise planning (e.g., "pick up the red mug at (X, Y, Z)")?
  4. Failure and Replanning: A humanoid is executing a plan generated by an LLM, but a pick_up(object) action fails repeatedly (e.g., the object slips). Describe how a feedback mechanism could inform the LLM, and what kind of prompt you would generate to ask the LLM for a revised plan or a recovery strategy.
  5. Ethical Planning: Discuss a potential ethical dilemma that could arise from an LLM-generated plan (e.g., prioritizing one task over another in an emergency, or performing an action that could cause minor damage but completes a primary task). How would you incorporate "ethical constraints" into the LLM's prompt or the plan validation stage?