Making an LLM the Brain of a Robot - Part 1: Combining a Camera and Servo Motors with GPT

We'll conduct a simple experiment combining a camera, servo motors, and GPT on a Raspberry Pi.

Introduction

When I first tried ChatGPT, I instantly felt like it could be used as a robot's brain.

I created a minimal prototype and am sharing it here as a record.

Note: This article was translated from my original post, which was written in 2023.

Combining GPT with Raspberry Pi, Camera, and Servo Motors

Prototype Concept

Here's the idea for the experiment:

  1. GPT perceives real-world information
  2. GPT decides the next action in the real world by itself
  3. Repeat steps 1–2 so that GPT builds an understanding of its environment
  4. Ask GPT, "How are you feeling now?"

The key is having GPT decide the next real-world action itself.

In the future, it would be fun if a robot using ChatGPT as its brain could move freely and learn about its surroundings autonomously.

Setup

Hardware

Hardware setup

Two servo motors and one camera are connected to a Raspberry Pi.

Servo motor control is done the same way as described in this previous post:

en.bioerrorlog.work

The camera and servo motors are attached using double-sided tape and rubber bands—like a school craft project.

A simple 'head-turning robot' made by taping a camera and servo motors together with rubber bands

This setup allows for horizontal and vertical head movement and image capture using the camera.
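
To make the hardware side concrete, here's a minimal Python sketch of driving the two servos and grabbing a still image. It assumes the gpiozero and picamera2 libraries and GPIO pins 17 and 18; the actual wiring and libraries may differ from what's described in the linked post.

# Minimal sketch, assuming gpiozero + picamera2 and servos on GPIO 17/18.
# The actual pins and libraries used in the project may differ.
from gpiozero import AngularServo
from picamera2 import Picamera2

# Horizontal and vertical servos, both mapped to -90..90 degrees.
horizontal = AngularServo(17, min_angle=-90, max_angle=90)
vertical = AngularServo(18, min_angle=-90, max_angle=90)

camera = Picamera2()
camera.start()

def look_and_capture(h_angle: int, v_angle: int, path: str = "capture.jpg") -> str:
    """Point the head at the given angles and save a still image."""
    horizontal.angle = h_angle
    vertical.angle = v_angle
    camera.capture_file(path)
    return path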

Software

Here's the source code as of this writing:

github.com

Here's what the software does, in simple steps (a minimal sketch of this loop follows the list):

  1. Capture image from the camera
  2. Run object detection on the captured image
  3. Send the current servo angles and detected object names to GPT
  4. Receive the next servo angles and some "free talk" from GPT
  5. Control the servo motors as instructed by GPT
  6. Repeat 1–5 several times
  7. Ask GPT to describe the surrounding environment
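
Here's the promised minimal sketch of that loop. The helper names (capture_image, detect_objects, query_gpt, move_servos, ask_final_question) are hypothetical placeholders for the actual functions in robot.py; the point is the structure, not the names.

# Structural sketch of the control loop; all helper names are hypothetical.
def run_robot(cycles: int = 3) -> str:
    history = []                                      # accumulated conversation messages
    angles = {"Horizontal": 0, "Vertical": 0}
    for _ in range(cycles):
        image_path = capture_image()                  # 1. capture an image from the camera
        objects = detect_objects(image_path)          # 2. run object detection
        reply = query_gpt(angles, objects, history)   # 3-4. send state, receive JSON reply
        for command in reply["NextServoMotor"]:       # 5. move as GPT instructed
            move_servos(command["Horizontal"], command["Vertical"])
            angles = command
    # 7. finally, ask GPT to describe its surroundings
    return ask_final_question("How are you feeling now?", history)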

One key point is that GPT is the one choosing the next servo angles. The response is parsed in Python and used to control the motors. It was exciting to give GPT, previously only a digital entity, a way to act in the physical world.

This is done by constraining GPT's responses to a JSON format, which is easy to parse in Python.
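
Parsing that JSON on the Python side can be as simple as the sketch below; the clamping of angles is my own addition, not necessarily something robot.py does.

import json

def parse_gpt_reply(raw_reply: str):
    """Parse GPT's JSON reply into servo commands and free talk."""
    try:
        data = json.loads(raw_reply)
        commands = data["NextServoMotor"]
        free_talk = data["FreeTalk"]
    except (json.JSONDecodeError, KeyError) as err:
        # Let the caller decide whether to retry the API call.
        raise ValueError(f"Unexpected GPT reply: {raw_reply!r}") from err
    # Clamp angles to the servos' physical range, just in case.
    for command in commands:
        command["Horizontal"] = max(-90, min(90, int(command["Horizontal"])))
        command["Vertical"] = max(-90, min(90, int(command["Vertical"])))
    return commands, free_talk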

To keep responses in JSON format, the API's temperature parameter must be set low. Setting it to 0.2 worked well. At higher values (like the default 1.0), it might not return valid JSON.
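
For illustration, a request with the 2023-era openai Python SDK looks roughly like this; the model name is an assumption, not necessarily what the project uses.

import openai  # the 0.x SDK current when this was written; reads OPENAI_API_KEY from the environment

def call_gpt_api(messages) -> str:
    """Send the conversation so far and return GPT's raw text reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # assumed model
        messages=messages,      # system prompt + accumulated history
        temperature=0.2,        # low temperature keeps the reply in valid JSON
    )
    return response["choices"][0]["message"]["content"]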

Another note is that GPT API responses are currently very slow (10–20 seconds, sometimes longer). Many people on the OpenAI forums report the same, and there's little we can do about it for now.

Search results for 'API response too slow' - OpenAI Developer Community

To run the robot more smoothly, the number of API calls is kept low: multiple servo angle commands are packed into a single API response to avoid repeated waiting.

Here’s the final system prompt used:

You are a robot with a camera, composed of 2 servo motors: horizontal & vertical.
Horizontal: min -90 right, max 90 left.
Vertical: min -90 down, max 90 up.
Your behavior principles: [curiosity, inquisitiveness, playfulness].
Your answer MUST be in this JSON format: {"NextServoMotor": [{"Horizontal": int(-90~90), "Vertical": int(-90~90)}], "FreeTalk": string}
Constraint: len(your_answer["NextServoMotor"]) == 5
Answer example: {"NextServoMotor": [{"Horizontal": -60, "Vertical": -30},{"Horizontal": 0, "Vertical": 0},{"Horizontal": 90, "Vertical": -45},{"Horizontal": 0, "Vertical": 60},{"Horizontal": -30, "Vertical": -60}],"FreeTalk": "Based on what I've seen, I'm curious about the PC and mouse. I wonder what you use them for and what kind of work or play they are involved in?"}

Ref. robot-gpt/robot_gpt/robot.py at 85c256e3366f57532e74ee5c1294b69717647df9 · bioerrorlog/robot-gpt · GitHub
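
For reference, the messages handed to that API call would be assembled roughly along these lines; the user-message wording here is my own illustration, not the project's exact format.

# Illustrative assembly of the messages list; the exact user-message format may differ.
SYSTEM_PROMPT = "You are a robot with a camera, composed of 2 servo motors: ..."  # the full prompt shown above

def build_messages(history, angles, objects):
    user_message = {
        "role": "user",
        "content": f"Current servo angles: {angles}. Detected objects: {objects}. "
                   "Decide your next moves.",
    }
    return [{"role": "system", "content": SYSTEM_PROMPT}] + history + [user_message]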

Results

Here’s the head-turning robot in action:

After multiple scans, asking it about its surroundings gave this response:

Based on what I have seen, there are a few objects in the room, including a bottle and a laptop. However, I have not seen much else yet. I am curious to explore more and see what other interesting things I can find!

I was able to realize the minimal version of the concept I had imagined.

The possibilities are exciting.

Challenges

Although I got the minimal prototype working, many challenges remain.

Object Detection

Object detection isn't very accurate. I used YOLOv3 to detect objects in the captured images, but it rarely detects anything properly.

After some research, I realized YOLOv3 is quite old. Using a newer model might improve both accuracy and speed.
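
For example, swapping in YOLOv8 via the ultralytics package (a suggestion of mine, not what the project currently uses) would look roughly like this:

# Sketch of detection with a newer YOLO model via the ultralytics package.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # smallest pretrained variant

def detect_objects(image_path: str) -> list[str]:
    """Return the class names of objects detected in the image."""
    results = model(image_path)
    classes = results[0].boxes.cls.tolist()
    return sorted({model.names[int(c)] for c in classes})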

If multimodal GPT becomes available, we might send the image directly to GPT for recognition. Let’s hope the pricing is reasonable.

Token Usage

Right now, every GPT response is appended to the conversation history as an assistant message, so the longer the robot runs, the more tokens each request uses.

For small experiments this is okay, but it's not sustainable long-term. Summarizing key info and updating the context selectively would help reduce token use without losing important details.
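
One possible way to do that, sketched below and not implemented in the project, is to periodically fold older messages into a short summary:

# Sketch of context compression; call_gpt_api is the helper sketched earlier.
def compress_history(messages, keep_last: int = 4):
    """Replace older messages with a one-message summary to save tokens."""
    if len(messages) <= keep_last + 1:  # system prompt plus a few recent messages: nothing to compress
        return messages
    system, old, recent = messages[0], messages[1:-keep_last], messages[-keep_last:]
    summary = call_gpt_api([system] + old + [{
        "role": "user",
        "content": "Summarize the observations above in a few sentences, "
                   "keeping object names and their approximate directions.",
    }])
    return [system, {"role": "assistant", "content": summary}] + recent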

Memory Retention

Currently, it forgets everything after one run. Every startup is a fresh scan from zero.

By adding external memory, such as a database, we could give the robot a form of long-term memory, letting it build on what it has already seen instead of starting from scratch each run.
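
A minimal sketch of such external memory, assuming SQLite as the store (any database or even a flat file would do):

# Sketch of persistent memory using SQLite; not part of the current project.
import json
import sqlite3

conn = sqlite3.connect("robot_memory.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS observations "
    "(ts TEXT DEFAULT CURRENT_TIMESTAMP, horizontal INT, vertical INT, objects TEXT)"
)

def remember(horizontal: int, vertical: int, objects: list) -> None:
    """Persist one observation so it survives restarts."""
    conn.execute(
        "INSERT INTO observations (horizontal, vertical, objects) VALUES (?, ?, ?)",
        (horizontal, vertical, json.dumps(objects)),
    )
    conn.commit()

def recall() -> list:
    """Load past observations to seed the prompt at startup."""
    rows = conn.execute("SELECT horizontal, vertical, objects FROM observations").fetchall()
    return [{"Horizontal": h, "Vertical": v, "Objects": json.loads(o)} for h, v, o in rows]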

Perceiving the World

Currently, we store:

  • Servo motor angles (horizontal & vertical)
  • Detected object names

This setup allows GPT to form a perception of its environment. It can even respond correctly when asked about the positions of objects.
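
Concretely, each stored observation is just an angle-object pairing, something like the following (field names are illustrative, not the project's exact keys):

# Illustrative shape of one stored observation; the real keys may differ.
observation = {
    "CurrentServoMotor": {"Horizontal": -30, "Vertical": 15},
    "DetectedObjects": ["bottle", "laptop"],
}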

Still, I believe there are more LLM-native, less mechanical ways to handle this: for example, storing rich descriptive text instead of angle-object pairs, or using a multimodal GPT for more flexible interaction.

We should be careful not to limit what's possible by thinking too rigidly, as programmers tend to do.

Conclusion

This was the first step in using GPT as a robot brain.

With the rise of LLMs, the excitement of creating something fun is real.

I want to keep experimenting and playing more.

[Related posts]

en.bioerrorlog.work

References