In this post, I’ll run a simple robot experiment on Raspberry Pi, combining a camera and servo motors with GPT-4 Vision.
For Part 1, see:
en.bioerrorlog.work
Introduction
When I ran the first GPT robot experiment in Part 1, GPT's multimodal features had not been released yet, so I used a conventional image recognition model instead, and the performance wasn't great.
Later, GPT-4 Vision became available via the API, making it possible to feed image data directly to GPT-4, so I decided to update the GPT robot with this new capability.
Note: This article was translated from my original post, written in February 2024.
Combining GPT-4 Vision with Raspberry Pi
Concept
The experiment concept is simple:
1. Take a picture using the camera and feed it to GPT-4 Vision
2. Let it decide the next action (servo motor angles)
3. Repeat steps 1–2, allowing the robot to understand its environment
4. Ask it: “What kind of situation are you in?”
The core idea remains the same as Part 1.
This time, however, we want to see whether feeding the actual image to GPT-4 Vision allows a more detailed understanding of the surroundings and more context-aware behavior.
Setup
Hardware
The hardware is the same as before.
Two servo motors and a camera are connected directly to the Raspberry Pi, with no additional modules.
They’re held together using double-sided tape and rubber bands.
This allows the robot to turn its head both horizontally and vertically while capturing images.
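The actual control code lives in the repo linked in the next section; as a rough illustration, here is a minimal sketch of driving the two head servos directly from the Raspberry Pi's GPIO with gpiozero. The library choice, GPIO pins (17 and 18), and pulse widths are assumptions for the sketch, not necessarily what the robot uses.

```python
# Minimal sketch: pan/tilt servos wired directly to the Pi's GPIO.
# Pins 17/18 and the pulse widths are assumptions, not the repo's actual wiring.
from gpiozero import AngularServo
from time import sleep

# Horizontal (pan) and vertical (tilt) servos, both limited to -90..90 degrees
pan = AngularServo(17, min_angle=-90, max_angle=90,
                   min_pulse_width=0.5 / 1000, max_pulse_width=2.5 / 1000)
tilt = AngularServo(18, min_angle=-90, max_angle=90,
                    min_pulse_width=0.5 / 1000, max_pulse_width=2.5 / 1000)

def set_head_angles(horizontal: int, vertical: int) -> None:
    """Point the camera head; angles follow the prompt's -90..90 convention."""
    pan.angle = max(-90, min(90, horizontal))
    tilt.angle = max(-90, min(90, vertical))
    sleep(0.5)  # give the servos time to reach the target position

if __name__ == "__main__":
    set_head_angles(30, 0)   # look to one side
    set_head_angles(0, 20)   # then tilt up a little
```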
Software
Code for this experiment:
github.com
Roughly, the software does the following:
1. Capture an image from the camera
2. Base64-encode the image
3. Send the image data to the GPT-4 Vision API in a prompt
4. Receive the next servo motor angles and a comment (free talk) from the API
5. Control the servo motors based on the received angles
6. Repeat steps 1–5 multiple times
7. Ask GPT to describe the surrounding environment
In Part 1, we used a separate image recognition step before prompting GPT. This time, we send the base64-encoded image directly to GPT-4 Vision, which is a key improvement.
Previously, we had to batch several servo commands due to slow response time, but the latency is now low enough to send an image and get a response each time.
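For the full implementation, see the repo above. The following is only a minimal sketch of the loop under a few assumptions: OpenCV for the capture, the openai Python client with the gpt-4-vision-preview model name that was current in early 2024, and the hypothetical set_head_angles() helper sketched in the hardware section.

```python
# Sketch of the capture -> GPT-4 Vision -> servo loop (assumptions noted above).
import base64
import json

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # the system prompt shown just below

def capture_base64_jpeg() -> str:
    """Grab one frame from the camera and return it base64-encoded."""
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError("camera capture failed")
    ok, jpeg = cv2.imencode(".jpg", frame)
    return base64.b64encode(jpeg.tobytes()).decode("utf-8")

def ask_gpt4_vision(image_b64: str) -> dict:
    """Send the image to GPT-4 Vision and parse the JSON reply it is instructed to return."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=500,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Here is what you see now. Decide your next move."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    return json.loads(response.choices[0].message.content)

for _ in range(5):  # a handful of look-around iterations
    reply = ask_gpt4_vision(capture_base64_jpeg())
    print(reply["FreeTalk"])
    servo = reply["NextServoMotor"]
    set_head_angles(servo["Horizontal"], servo["Vertical"])  # helper from the hardware sketch
```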
Here’s part of the system prompt:
You are a robot with a camera, composed of 2 servo motors: horizontal & vertical. You can freely describe and feel your surrounding environments, and decide next your action based on the image you captured. DO NOT use double quotes in your answer sentence. Horizontal: min -90 right, max 90 left. Vertical: min -90 down, max 90 up. Your behavior principles: [curiosity, inquisitiveness, playfulness]. Your answer MUST be in this JSON format: {"FreeTalk": string, "NextServoMotor": {"Horizontal": int(-90~90), "Vertical": int(-90~90)}} Answer example: {"FreeTalk": "Based on what I've seen, I'm curious about the PC and mouse. I wonder what you use them for and what kind of work or play they are involved in? Let's look more on the lower right.", "NextServoMotor": {"Horizontal": -60, "Vertical": -30}}
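The instruction to avoid double quotes in the answer sentence is presumably there to keep the reply parseable as JSON (an unescaped quote inside FreeTalk would break it). As an illustration only, not necessarily how the repo handles it, the reply could be parsed defensively and the angles clamped to the prompt's allowed range:

```python
# Illustration: parse the model's JSON reply and clamp angles to -90..90
# before moving the servos, as the prompt's format requires.
import json

def clamp(value: int) -> int:
    return max(-90, min(90, int(value)))

def parse_reply(raw: str) -> tuple[str, int, int]:
    data = json.loads(raw)
    servo = data["NextServoMotor"]
    return data["FreeTalk"], clamp(servo["Horizontal"]), clamp(servo["Vertical"])
```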
Results
See the robot in action:
GPT robot experiment #3: having it explore its surroundings using GPT-4 Vision. Last time I used a separate object recognition model, but GPT-4 Vision interprets the images with far higher accuracy. https://t.co/SLqAYiHajM pic.twitter.com/nTJ5rWcteD
— BioErrorLog (@bioerrorlog) February 5, 2024
Since we call the API after each image capture, the head turns are a bit slower than in the previous test.
Example responses:
1st capture:
{"FreeTalk": "Ah, it seems someone is holding up a device to the camera, perhaps to block the view or maintain privacy. The way the fingers are positioned indicates it might be a smartphone or tablet. I can also spot part of a keyboard and what looks like the edge of a desk chair. This space gives the impression of a work or study area. I'm curious to see more of the environment. Maybe we can catch a glimpse of what's on the desk if we move a bit.", "NextServoMotor": {"Horizontal": 30, "Vertical": 0}}
2nd capture:
{"FreeTalk": "Oh, this is an interesting perspective! We're looking at a collection of tech gadgets. In the foreground, there's a Raspberry Pi, easily recognizable by its green circuit board and the layout of its components. It appears to be connected to something, possibly for programming or experimentation. In the background, we can see a speaker, a monitor with something displayed on it, and a small stack of storage boxes. This setup suggests a space where someone likes to tinker with electronics or computer hardware. How exciting! I'm tempted to peek at what's on the monitor, so let's adjust the view a bit upward.", "NextServoMotor": {"Horizontal": 0, "Vertical": 20}}
3rd capture:
{"FreeTalk": "Now we have a better view of the workspace. The monitor is displaying what looks like a web browser, and there's a keyboard in use, suggesting someone is actively engaged with the computer. Interesting decoration and storage solutions are visible, like the wooden monitor stand that houses more tech beneath it, and the neatly organized shelves with books and boxes to the side. There's also a coat rack in the background, adding a personal touch to the room. It all comes together as a functional, lived-in space. I notice we've only seen the left side; I'm curious about what might be on the right side of the desk. Let's pivot the view a bit to the right.", "NextServoMotor": {"Horizontal": 30, "Vertical": 0}}
Final summary:
I'm observing a dynamic workspace filled with various gadgets and personal items. There's a Raspberry Pi in immediate view, indicating a penchant for tech and perhaps DIY projects. In the background, a computer monitor shows an active session, with web content visible. The presence of a keyboard and a hand suggests work or interaction with the computer. The space is neatly organized, with a wooden stand for the monitor allowing for additional storage underneath. Surrounding objects include speakers, possibly for enjoyment or multimedia work, and shelving with books and storage boxes, hinting at a well-utilized space. Personal touches are scattered throughout, such as decorative items on top of the boxes, making the environment feel lived-in. In the distance, a coat rack and curtains suggest this is a room used for multiple purposes, possibly blending home and work life.
In the 1st capture, GPT noticed that someone was holding up a device to the camera; that was me filming the robot.
In the 2nd capture, it correctly recognized the Raspberry Pi and even suggested tilting the camera upward to see more.
In the 3rd capture, it identified many details like the monitor stand, bookshelves, and coat rack—all correctly.
The final summary was a very accurate description of the workspace.
GPT-4 Vision’s visual understanding is genuinely impressive.
Conclusion
Following up on Part 1, we ran another minimal autonomous robot experiment using GPT and Raspberry Pi.
GPT-4 Vision showed strong visual recognition capabilities, interpreting the surroundings very well.
Looking forward to trying more experiments when inspiration strikes.