Back to all posts
35 min read

yet another shitty robot

RobotTTSLLMTTSESP32VADPythonPI-Agent

The Idea

A few weeks ago I came across Mario Zechner's post about building a tiny robot: https://mariozechner.at/posts/2026-05-30-shitty-robot/

It immediately caught my attention. I have two young kids, so I was looking for a fun project we could build together. At the same time, I wanted an excuse to dive deeper into the intersection of robotics and AI. So heres yet another shitty robot πŸ˜‰

Beyond just getting a robot to move, I wanted to learn more about running language models locallyβ€”model inference, quantization, and how to optimize everything for the hardware I already have.

One thing that quickly became clear is that real-time interaction is what makes a robot feel alive. It's not just about generating responses; it's about knowing when to listen, detecting when someone has finished speaking, and responding naturally without awkward pauses or interruptions. Those small details are what make human-robot interaction feel intuitive.

To get started, I ordered two Octobots. Just look at those cuties dancing together, the chinese dancing robots can't compete with them πŸ˜….

robots-dancing

My plan was to reuse my old smartphone as the robot's "brain," and make use of the toys motor. The approach was to try building it form scratch based on Marios blog post. To learn as much as possible.

Text-to-speech: making it fast and making it sound like our robot

Before this project, I spent some time experimenting with Home Assistant, Whisper (for speech-to-text), and Piper (for text-to-speech) on an Intel NUC (i5, 32GB RAM). Honestly? The results were pretty underwhelming.

So, I decided to give it another shotβ€”this time unleashing the power of my Mac M1 Pro (32GB RAM). I wanted that perfect, charming "robot" voice. Here is how I went from a painfully slow, crackly mess to a blazing-fast, naturally sounding robo-companion.

πŸ›‘ The Pitfalls: Slow Loads and Terrible DSP Filters

Initially, I tried running standard German voices. While they were fast on the M1, they lacked character. I wanted that classic "robot" vibe, so I ran the audio through a digital signal processor (DSP) with a DIY robot filter (bitcrush, ring modulation and so on).

The result? It sounded incredibly harsh, crackly, and downright unsatisfying.

Next, I tried running Qwen3- TTS. This brought two major headaches: The Lag: It took almost two minutes before the audio even started playing. The Quality: It still sounded pretty bad. Why was it so slow? I realized I was running Qwen3-TTS through the default PyTorch/transformers path. On Apple Silicon, this runs float32 on the Metal (MPS) backend.

It resulted in a real-time factor of about 4β€”meaning it took 4 seconds of intense computation to generate just 1 second of audio. A 10-second clip took a whopping 40 seconds to compute, plus another 10 seconds just to load the model every single time. Combined with my heavy-handed robot DSP effect, the final voice was rough, slow, and unusable.

πŸ’‘ The Breakthrough: In-Context Learning & Apple MLX The turning point came when I looked at how Mario built pibot. He bypassed the slow PyTorch path entirely and skipped the fake DSP audio effects.

Instead, he did two incredibly smart things:

He used the Qwen3-TTS Base model, which features In-Context Learning (ICL). This allows the model to clone any voice on the fly using just a short reference audio clip and its transcript.

He ran it using native, quantized runtimes (MLX 6-bit for Apple Silicon).

For his reference voice, he simply used a clean, friendly clip generated on ElevenLabs.

πŸ› οΈ The New Setup: 5x Faster and Crystal Clear I decided to replicate this Python/MLX workflow:

The Voice: Together with my kids we iterated and ceated a custom reference voice on ElevenLabs until we were happy with it (aiming for a friendly, small teaching robot) and exported the MP3.

The Model: Loaded the lightweight, 6-bit quantized Base model (mlx-community/Qwen3-TTS-12Hz-0.6B-Base-6bit) and cloned my voice on the fly using model.generate() using the reference mp3 and the text transcript

⚑ Soundcheck Successful!

The difference is night and day.

πŸš€ The Speed: The real-time factor dropped from ~4 to under 1 on longer texts (0.85). That is roughly 4 to 5 times faster, generating audio quicker than it takes to speak it!

🎡 The Quality: The voice actually sounds like the friendly robot I designed.

So heres how my first version looked like.

first-avatar-test

πŸš€ Making the Robot Responsive: Streaming Audio and Smarter Architecture

Building a great voice assistant isn't just about getting the tech to workβ€”it's about making the interaction feel alive and natural.

After getting the initial voice model running on my M1 Mac, I hit a few classic "gotchas" that made the experience feel sluggish. Here is how I debugged the early confusion, implemented instant audio streaming, and completely overhauled the pipeline for a 60% boost in responsiveness.

πŸ’‘ Clearing Up the "Hello" Lag

When I first started running tests, there was one major point of confusion during early development:

"Why is a simple 'Hello' so incredibly slow?" My initial test script was starting from scratch every single time, spending about 7 seconds just loading the massive AI models into memory before speaking. In a real application, the model stays loaded in the background, making this startup lag a non-issue.

Solution: Streaming Audio Instead of Waiting

Initially, our text-to-speech (TTS) setup had a massive bottleneck: it was buffering everything.

For a longer 70-word reply, the system would wait until the entire paragraph was completely generated and converted to audio before playing a single sound. This resulted in 17 agonizing seconds of total silence followed by a massive wall of speech.

The Fix: Speak as You Think The Qwen3-TTS model is naturally capable of generating audio in tiny, bite-sized chunks. Instead of waiting for the whole response:

We now grab the very first chunk of audio (representing just the first fraction of a second of speech).

We pipe it straight to the speaker immediately.

The robot starts speaking in just 1 to 2 seconds, while the rest of the sentence is still being synthesized in the background.

βš™οΈ Phase 1 Complete: The New Event-Driven Architecture To support this instant streaming, we had to move away from a slow, step-by-step pipeline (Record βž” Transcribe βž” Think βž” Speak) and rewrite it into a smart, modular background service.

Here are the three pillars of the new architecture:

The Sentence Chunker: A smart text-splitter that watches the AI's thoughts stream in and groups them into complete sentences. It is clever enough not to get tripped up by German abbreviations (like "z. B.") or decimals (like "3.14").

The Orchestrator: This acts as the brain of the operation. It runs a background thread that overlaps tasksβ€”allowing the AI to keep "thinking" and generating the next sentence while the robot is already busy speaking the first one.

A Lightweight Control Server: We separated the voice engine from the user interface. By setting up a lightweight background server, any client (like a terminal or a web page) can now easily trigger the robot and receive a live stream of audio.

πŸ“Š The Headline Metric: A 60% Cut in Wait Time To prove this new architecture works, I benchmarked a 3-sentence reply. The difference between waiting for the entire process to finish versus streaming the first sentence is night and day:

Metric Processing Time Speech-to-Text (STT) 451 ms LLM Thinking (Full Reply) 2,297 ms Text-to-Speech (Full Reply) 6,841 ms First Audio Played (Perceived Latency) 3,332 ms πŸš€ Total Turnaround Time 9,589 ms

Without streaming, you'd wait nearly 10 seconds in dead silence before hearing a word.

With our new streaming setup, the robot starts talking in just 3.3 secondsβ€”a massive 60% reduction in perceived latency. The longer the response, the bigger this victory feels. The robot now feels alert, snappy, and ready to converse.

Phase 2 β€” fleet distribution (foundation laid, hardware pending)

Phase 2 makes placement real: LLM on the on-demand Gaming-PC GPU, always-on services on the NUC, with graceful fallback. The code that doesn't need the other hosts is in and tested on the Mac; the actual cross-host benchmark runs wait on bringing the NUC + GPU box online.

Routed LLM + Wake-on-LAN. src/llm/routed_llm.py composes a remote-GPU primary with a local fallback behind the same stream() interface (LLM_BACKEND=routed). On a request it checks the primary port; if closed it sends a WoL magic packet (src/net/wol.py) and polls /api/tags until the box is up (logged as a COLD run); if it doesn't wake in time it downgrades to the local small model and notes the downgrade. An idle timer suspends the GPU box over SSH. Proven locally: with an unreachable primary the turn completed on the local fallback, emitting downgrade -> local fallback llama3.2 β€” orchestrator unchanged.

Standalone services. services/stt_server and services/tts_server run the STT/TTS backends as HTTP services so they can live on the NUC. The STT service was validated end-to-end on the Mac: http_stt β†’ stt_server β†’ faster-whisper round-tripped "Hallo, das ist ein Test." correctly. (Multipart parsing is hand-rolled to avoid the deprecated cgi module β€” future-proof for Python 3.13.)

Presets. src/presets.py declares 4 host layouts (mac-local, nuc-gpu, nuc-only, distributed-stt-tts); python -m src.presets <key> prints the matching .env. Real MACs/IPs get filled in as hosts come online.

Still pending (needs the hardware): actually waking the Gaming PC, and the side-by-side fleet benchmark rows (Mac-only vs NUC+GPU warm/cold vs NUC-only). The WoL packet builder is unit-tested; the wake itself waits on a configured MAC and a box that's plugged in.

The phone is the face, the ESP32 is the body

A design fork worth recording. The plan (Phase 4) imagined an ESP32-S3-Box as the audio front-end. But the board I actually have is a bare ESP32-S3-DevKitC-1 (N16R8) β€” WiFi, 36 GPIO, one WS2812 RGB LED on GPIO48, and no mic, speaker, camera, or display. Meanwhile the robot is "a small robot with a smartphone" (pibot's framing) and I have a Pixel 3 with all of that I/O plus a great screen.

So the roles invert from the naive "ESP32 = edge device" reading:

  • Phone = front-end + caller. It runs the web-face in the browser, captures the mic, plays TTS, shows the camera, renders the animated avatar, and is the one that calls the pipeline (over the control-server WebSocket).
  • ESP32 = body. It's a second subscriber to the same WebSocket. It never calls the pipeline; it reacts to phase events (RGB LED now, motors later).
  • Fleet = brains. STT β†’ LLM β†’ TTS + the broadcast hub.

The one change this forced: the control server used to give each WS connection its own orchestrator, so the phone's turn was invisible to the ESP32. It's now one shared orchestrator + a broadcast hub β€” any client can send input, and every client (phone face + ESP32 body) receives the same event stream. One robot, one face, one body, one pipeline.

The avatar (phone web-face)

An SVG robot face β€” two eyes with pupils + glowing irises, two eyebrows, and a mouth β€” rigged in src/server/static/face.js. A tiny tween engine eases between per-phase expression targets, with idle micro-behaviours (blink every few seconds, subtle pupil drift) so it feels alive. Phase β†’ expression:

PhaseEyesBrowsMouth
inactivehalf-liddedneutralgentle smile
listeningopenraisedfriendly smile
thinkinglook up/sideone raisedsmall, focused
speakingopen, livelyneutralanimated talking + smile
errornarrowedfurrowedconcerned frown

The mouth is a quadratic curve whose middle dips down for a smile or up for a frown; talking is a speaking-gated oscillation plus a per-tts_audio-chunk twitch. app.js is the WS client: it maps events to expressions, queues + plays TTS segments in order, records mic audio via MediaRecorder and ships the blob over the WS as a binary frame (the server ffmpeg-converts β†’ STT β†’ turn), and has an optional local webcam preview (a hook for future vision).

Kiosk on the phone

To make the Pixel behave like a robot face and not a browser tab: a PWA manifest (display: fullscreen) so Add to Home screen launches it chrome-less, a navigator.wakeLock so the screen never sleeps while it's up, and tap-the-face β†’ requestFullscreen. First mic + webcam test on real hardware: the avatar heard the prompt, answered, and the camera preview worked.

Secure-context gotcha. getUserMedia (mic + camera) only works in a secure context: localhost is exempt, but a LAN IP over plain HTTP is not β€” so the phone at http://<mac-ip>:8010 gets the mic blocked (the browser reports it as a generic "denied"). Two fixes landed: the client now names the real cause (permission vs insecure-origin vs no-device vs busy) instead of a blanket message, and the server gained optional HTTPS (SERVER_TLS=1) with an auto-generated self-signed cert. Over TLS the page is a secure context, mic and camera work from the phone, and the WebSocket auto-upgrades to wss://. On the Mac itself the cause is usually OS-level: Chrome needs Microphone access in macOS System Settings β–Έ Privacy & Security.

The segfault after one turn. Driving the live server with Qwen3-TTS (MLX) crashed the process β€” a Segmentation fault: 11, no Python traceback β€” right after a successful turn or two. The cause: the orchestrator synthesizes on a fresh per-turn worker thread, and MLX/Metal is not thread-safe, so touching the GPU from a different thread each turn eventually faults natively. nohup buffering had been hiding the evidence, so the first fix was a debug launcher (now cli/robot start, formerly tools/dev_server.sh) that runs unbuffered with PYTHONFAULTHANDLER=1 and tees live logs β€” which is what surfaced the faulthandler dump. The same MLX path is also the slow one (~15s cold, ~2.7s/turn warm). Switching the live server to Piper (a subprocess: thread-safe + fast) fixed both at once β€” turns dropped to ~1.4–1.6s and the server survived repeated turns. Qwen3-TTS stays the quality voice for offline generation; bringing it back to the live loop needs a single dedicated TTS thread so Metal is always touched from the same thread.

Borrowing badlogic's worker pattern

Mario Zechner's pibot solved this exact problem, and reading his code + write-up confirmed the fix. His architecture is the one we converged on independently: the server is the brain (STT/LLM/TTS/agent), the phone is a "dumb renderer" (mic up, audio down, tools) β€” he even skipped the ESP32 and drives the motor from the phone over USB. The key move: STT and TTS each run as a separate, long-lived worker process that the server talks to over a binary stdio protocol, streaming audio. The model loads once and stays warm; a worker crash fails one turn and respawns instead of taking down the server.

And the language question answers itself: his Rust TTS worker uses MLX-C β€” the same Metal kernels as Python MLX (he even had to patch an MLX Metal-kernel bug himself). It's parity performance "without all the Python gunk"; his default worker is actually C++/GGML because it's cross-platform (Metal + Vulkan). So Rust isn't better at Metal β€” the win is process isolation, not the language.

So I built the same thing in Python. services/tts_worker/worker.py is a persistent process that loads the engine once and synthesizes on its main thread only (no cross-thread Metal), with stdout reserved as a clean JSON protocol channel (an fd-dup forces all model chatter to stderr). WorkerTTS (src/tts/worker_tts.py) is the client: it implements the same synthesize() protocol so the orchestrator is unchanged, spawns the worker, auto-restarts it on crash, and is selected with TTS_BACKEND=worker (the DSP effect still wraps it in-process; the worker runs raw). The control server pre-warms it at startup.

Result, driving the qwen3 voice that used to segfault: the server survived repeated turns, the worker stayed warm as one process, and with pre-warm the first turn dropped from ~40s cold to ~5.7s, then ~2.9s warm. Same cloned voice, no crash. (Piper is still the snappy ~1.4s default for fast iteration; the worker is the quality option.) Footnote on placement: this also clarified the "server on the gaming PC?" question β€” MLX is Apple-only, so the RTX 2080's real job is the LLM (via the Phase-2 Wake-on-LAN routing), while STT/TTS/orchestrator stay on the always-on host.

STT: dropping Whisper for Parakeet (and the same Metal-thread trap)

Whisper large-v3-turbo was accurate but slow on the M1 β€” ~3.6s per turn, which is the whole latency budget. Swapped in NVIDIA Parakeet TDT 0.6b v3 (the multilingual release β€” 25 European languages incl. German) on MLX: ~150ms warm, ~25x faster, with equal-or-better German transcripts (it nailed the test clip word-for-word). It was already half-wired as a Phase-1 candidate; only the model ID needed bumping v2(English)β†’v3(multilingual).

But switching engines re-tripped the Metal-thread trap from the TTS saga: the first time I actually spoke to the robot, MLX threw no stream gpu in current thread. Same root cause β€” the control server spawns a fresh thread per turn, so the model loaded on the prewarm thread but transcribed on a turn thread, and MLX keeps a per-thread GPU stream. This time it's a recoverable exception, not a segfault, so the fix is lighter than the TTS subprocess: PinnedSTT (src/stt/pinned.py) runs the backend's load and every transcribe on one dedicated single-thread executor. The server also warms the JIT with a 1s silent clip at startup so the first real turn is ~150ms, not ~420ms. Lesson, twice over: any MLX/Metal model must be touched from a single consistent thread β€” pin it (STT, recoverable) or isolate it in a process (TTS, segfault-prone).

The ESP32 firmware

firmware/esp32_face_led/ β€” an Arduino sketch (arduinoWebSockets + ArduinoJson + Adafruit NeoPixel) that joins the same ws://host:8010/ws, parses phase events, and eases the onboard WS2812 between colours (listening = blue, thinking = pulsing purple, speaking = green, error = red). It sends nothing; it's pure output. The motor/motion tool events are a marked TODO for the next Phase-4 step.

LLM upgrade β€” Gemma 4 and the thinking-mode trap

After the pipeline was stable I wanted a better model than llama3.2 β€” snappier German, better child-appropriate phrasing. Looked at the 2026 landscape:

  • Gemma 4 (Google, April 2026, Apache 2.0): two variants β€” a 26B MoE with only 3.8 B active parameters (fast!) and a 31B dense (quality). German-language reviews specifically called it out: "Gemma 4 formuliert auf Deutsch spuerbar natuerlicher und fluessiger" (noticeably more natural and fluent German than Qwen 3.5). For a robot that talks to kids in German that matters.
  • Qwen 3.5 (Alibaba): hybrid thinking + direct mode, 201 languages, 262K context -- strong all-rounder but slightly below Gemma 4 on German.
  • Mistral Small 3 7B: fastest raw tok/s on Apple Silicon (~50 tok/s), good if you need rock-bottom latency at some quality cost.

With 32 GB RAM the gemma4:12b variant fits easily (Q4_K_M, 8 GB, 100% GPU offloaded via Metal). Pulled it, switched LLM_MODEL=gemma4:12b -- done.

The trap: thinking mode is on by default

First real turn: STT 206ms, then 80 seconds of silence, then finally a German kid-joke. Total turn 101 s. The model was running 100% GPU, RAM pressure fine. So what was taking 76 seconds before the first token?

Running a quick test revealed it immediately:

$ echo "Hi" | ollama run gemma4:12b
Thinking...
The user said "Hi". This is a standard greeting.
Acknowledge the greeting and offer assistance.

Hallo! ...

Gemma 4 ships with a hidden chain-of-thought (CoT) thinking mode enabled by default. Before outputting a single word it silently generates hundreds of reasoning tokens. For a voice assistant those tokens are pure latency -- the user stares at a silent robot while the model debates how to say "Hallo".

The same capability flag exists on Qwen 3, DeepSeek-R1, and any Ollama model that lists thinking under its capabilities.

The fix: one line

The Ollama /api/chat endpoint accepts a top-level think boolean. Adding it to the payload disables CoT globally for that request:

# src/llm/ollama_llm.py
payload = {
    "model": self.model,
    "messages": messages,
    "stream": True,
    "think": False,          # disable CoT/thinking mode (Gemma4, Qwen3, etc.)
    "options": {"temperature": 0.7},
}

Result:

BeforeAfter
LLM first token76,001 ms~400 ms
LLM total80,321 ms~3,000 ms
Full turn101,869 ms~6,000 ms

27x speedup from one boolean. Quality is identical for child-appropriate conversation -- the thinking tokens add nothing for "tell me a joke". Reasoning mode is useful for hard math/logic tasks, not for a chatty robot.

Rule of thumb going forward

For any interactive / voice use case: always check whether the model has a thinking capability and explicitly set think: false. Load the model in the CLI with ollama run <model> and see if it prints Thinking... before answering. If it does, the API caller must opt out -- Ollama does not disable it automatically just because you are streaming a voice loop.

Phase A/B/C β€” making it an actual conversation (no button, barge-in)

Up to here the robot worked, but the interaction was a lie. You held a push-to-talk button, let go, waited, and a complete sentence came back as a WAV file the phone downloaded and played. That is a walkie-talkie, not a conversation. Two things bugged the kids immediately: the gaps between sentences ("why does it pause like that?") and the fact that you can't interrupt it β€” once it starts a 4-sentence answer you're stuck listening to all of it.

I sat down, traced exactly where the time goes, and then ported the three mechanisms that make pibot feel alive. Wrote it up as ADR 0003 first, then implemented all of it.

Where the gaps actually came from

I'd assumed the gaps were the LLM being slow. They weren't. The LLM streams fine. The problem was the consumer loop in the orchestrator: it was fully serial β€” synthesize sentence 1, play sentence 1, synthesize sentence 2, play sentence 2. While sentence 1 is playing, nothing is synthesizing sentence 2. So every sentence boundary costs you a full TTS synth time of silence. The more natural the LLM's punctuation, the worse it sounded, because more sentences = more gaps.

And within a sentence there was no streaming at all: PiperTTS and the macOS say backend both call subprocess.run(...), which only returns once the whole WAV is written. Even the Qwen3-MLX backend β€” whose underlying model.generate() is a generator that yields audio chunks as it produces them β€” was being wrapped in list(...), which throws the streaming away and waits for the last chunk. I was paying for streaming-capable models and then buffering them by hand.

The three phases

Phase A β€” kill the button. The phone now opens a single AudioContext, captures the mic continuously, resamples to 16 kHz PCM16 in an onaudioprocess callback, and streams raw frames to the server over a binary WebSocket. The server runs the voice-activity detection now, not the human finger: a per-client StreamingSTT (src/stt/streaming.py) gates frames through a VAD (src/stt/vad.py β€” a zero-dependency energy VAD by default, optional Silero ONNX), keeps a short preroll so it doesn't clip your first syllable, emits interim transcripts every 250 ms for live captions, and fires final after ~800 ms of silence. That final is what starts a turn. No button, and it reuses the existing STT.transcribe so I didn't have to touch Parakeet.

Phase B β€” stream the audio out, end to end. I gave the orchestrator an AudioSink (start / pcm / done callbacks) and iter_sentence_pcm(), and taught the Qwen3-MLX backend a real stream_pcm() that stops wrapping the generator in list() and yields PCM frames as the model makes them. The server forwards those frames to the phone as binary WS messages (tts_start{sample_rate} β†’ binary PCM β†’ tts_done), and the phone schedules them gaplessly with the Web Audio API β€” createBuffer, convert Int16β†’Float32, and a nextPlayTime accumulator with an 80 ms jitter buffer so chunks butt up against each other seamlessly instead of each being a separate <audio> download. This one change fixes both problems: first audio starts after the first chunk of the first sentence, and there are no inter-sentence gaps because playback is a continuous scheduled stream, not a sequence of files.

Phase C β€” barge-in. This is the bit that makes it feel human: you can talk over it. The naive approach (just keep the mic open while the robot talks) fails because the mic hears the robot's own voice from the speaker and treats it as you interrupting. Browser echo cancellation wasn't clean enough for STT, so I ported pibot's trick (src/server/static/barge-in.js): keep a ring buffer of the audio we're playing, and for each mic frame correlate the mic signal against that playback reference at delays of 20–420 ms to estimate how much of the mic energy is just the robot bleeding back in. Barge-in only fires when the mic is loud and the unexplained residual is high for several consecutive frames β€” i.e. you're really talking, not just picking up the speaker. When it fires, the client flushes the buffered mic preroll so the server transcribes your interruption from its true start, and sends barge_in. The orchestrator cancels cooperatively via a threading.Event the LLM and TTS loops check β€” no thread is killed, the partial reply is kept in history so context stays coherent β€” and stop-words ("stopp", "halt", "sei still") on the interim transcript abort instantly without even waiting for the full sentence.

Why a single shared AudioContext matters

One subtle bug I hit: the barge-in correlation only works if the mic frames and the playback-reference frames are at the same sample rate. If mic capture and TTS playback live in separate AudioContexts with different rates, the correlation silently returns "all residual" and barge-in fires on the robot's own voice. The fix is to use one AudioContext for both capture and playback, and tap the actual output samples (via a pass-through ScriptProcessor) to feed the reference ring β€” so what you correlate against is literally what came out of the speaker.

What it cost / what's left

The whole thing is behind a CONVERSATION_MODE flag; the old push-to-talk + WAV-download path still works as a fallback. I validated the Python end with lightweight fakes β€” VAD boundary detection, PCM framing, AudioSink streaming, and "cancel mid-stream truncates the turn" all pass β€” and confirmed the server boots, serves the new assets, and advertises conversation_mode in /api/config.

What I can't validate from the dev box is the stuff that needs the real hardware loop: the phone mic into Silero, Qwen3-MLX stream_pcm on Metal, and β€” most importantly β€” the barge-in thresholds. The 0.018 mic-RMS / 0.62 residual / 5-frame defaults are pibot's numbers for his speaker and room; mine will need tuning against the actual octobot speaker and the phone mic before it feels right. That's the next on-hardware session.

Face refresh: visor-style robot expression set

The original SVG face worked, but it looked more like floating eyes than a robot head. I replaced it with a visor-style panel face in src/server/static/face.js while keeping the same public API so the rest of the web app (app.js) did not need changes.

What changed:

  • New head shell + visor panel geometry (rounded frame, internal grid texture).
  • Rectangular eye modules with glow layers and square pupils.
  • Animated lids and brows still map to the same phases (inactive, listening, thinking, speaking, error).
  • Mouth is now a filled path that morphs between smile/frown/open speech shapes instead of only resizing a flat bar.
  • Status cheek LEDs pulse with phase glow to make idle/listening/speaking states clearer from a distance.

Behavior contracts that stayed stable:

  • setPhase(...) still drives phase expressions.
  • setTalking(true|false) still gates speech animation to real playback.
  • pulseMouth() still adds chunk-level talking twitches.

Net result: same control logic, more intentional "robot face" styling.

Fix: face dropped to "inactive" on the last sentence while still speaking

Symptom: at the very end of a reply, the avatar's face snapped to the inactive expression even though the speaker was still playing the final sentence.

Root cause β€” a client/server race, not an animation bug. In orchestrator.respond() the phase is flipped back to inactive (push-to-talk) or listening (conversation) as soon as the last sentence is synthesized, and then assistant_end / latency / phase events are emitted. But in the browser the final WAV segment is still sitting in the playback queue (audioQ), playing. When the phase=inactive event arrived, app.js called face.setPhase("inactive"), which both forces _talking = false and switches the base expression β€” mid-sentence.

Fix (client-side, src/server/static/app.js): defer any non-speaking face expression until local playback actually drains. setPhase() now parks the incoming phase in pendingFacePhase while audio is busy (audioBusy() = WAV queue still playing or conv._isPlaying() for streamed PCM) and keeps the face in speaking. The pending expression is applied via applyPendingPhase() when the WAV queue empties (nextAudio) or when the conversation engine reports talking stopped (onTalking(false)). The header status label still updates immediately; only the face is held in sync with what you actually hear. A new speaking phase (next turn / next segment) applies right away and clears any pending phase.

Fix: speaking mouth was two lines instead of a solid area

The mouth <path> is a closed shape (M … Q … Q … Z) but was drawn with fill: "none", so only its stroked outline showed. When the mouth opened to speak, the top-lip curve and bottom-lip curve separated and read as two parallel lines rather than an open mouth.

Fix (src/server/static/face.js): give the mouth path a fill (same cyan as the stroke) so the enclosed region renders as one solid area, add stroke-linejoin: round for clean corners, and stop growing the stroke width with mouth openness (it was 8 + open*4, which exaggerated the outline). The stroke is now a fixed thin edge that just rounds the filled shape β€” open speech is a single solid blob, idle is a thin solid bar.

Tweak: open mouth fill matches the lip lines

The open-mouth fill was a dark inner color (rgba(4,10,28,0.92)) to mimic a real open mouth. Changed it to the same cyan as the lip lines (#8be8ff) so the inside reads as a solid colored area rather than a dark cavity.

Revert to circle eyes, keep the solid-color mouth

Went back to the round-eye avatar (circular sockets + glowing iris + sliding eyelids) since it read better than the rectangular visor. Kept the mouth improvement: the open mouth fills solid with var(--face) β€” the same color as the lip lines β€” instead of the old translucent tint (rgba(126,224,255,0.18)).

Mouth: shorter + always-solid inner fill

Two tweaks: narrowed the mouth (mw 150 β†’ 104) so it isn't so wide, and made the inner mouth fill solid (var(--face)) in every phase β€” previously only filled when open past a threshold, so inactive/listening showed a hollow outline. Now the mouth reads as a solid colored shape in all states.


Hardware build log

Background: Octobot PCB

The Octobot (Silverlit) runs on a single-layer PCB with three subsystems:

  • Brain IC (center): handles IR remote, LED effects, and motor direction.
  • IR receiver (right): receives commands from the toy remote.
  • H-bridge (small black IC, bottom): drives the motor from the brain's direction signals.

There is only one brushed DC motor. The gearbox has two gear paths: reversing the motor direction engages a different set of gears, so one direction walks and the other turns the head/platform. This means the robot can only walk forward and turn in one direction (counter-clockwise).

Decision: keep old PCB for sound / light / IR, replace motor path

Why not parallel the old H-bridge with DRV8833

Two H-bridges driving the same motor simultaneously is a short-circuit hazard. If one drives forward while the other drives reverse (or into brake mode), the outputs fight each other and can destroy one or both drivers. Do not connect both H-bridge outputs to the motor at the same time.

Chosen approach

  1. Keep the original PCB powered β€” IR remote, music, and LEDs continue to work.
  2. Disconnect the two motor wires from the original H-bridge outputs (cut traces or unsolder).
  3. Connect the motor wires to DRV8833 AOUT1 / AOUT2 instead.
  4. ESP32 drives AIN1 / AIN2 on the DRV8833.

The old "brain IC" can stay. It will try to drive its own H-bridge outputs, but those are now floating (disconnected from the motor), so it causes no harm.

Wiring

SignalESP32-S3 GPIODRV8833 pin
AIN1GPIO 4AIN1
AIN2GPIO 5AIN2
Motor +β€”AOUT1
Motor βˆ’β€”AOUT2
VMBattery +VM
GNDBattery βˆ’GND (shared with ESP32 GND)

Add a 100 Β΅F electrolytic cap across VM / GND close to the DRV8833 for motor inrush protection (same role as the caps on the original PCB).

Motor direction

DRV8833 truth table (xIN1=H, xIN2=L β†’ forward; xIN1=L, xIN2=H β†’ reverse):

CommandAIN1 (GPIO 20)AIN2 (GPIO 21)
forwardHIGHLOW
turn_leftLOWHIGH
stop/coastLOWLOW

If forward and turn are physically reversed after assembly, swap the AIN1 / AIN2 pin assignments in esp32/octobot.ino (the #define lines at the top).

Power

  • DRV8833 VM: wire directly to the battery positive rail (4Γ—AA β‰ˆ 6 V). VM range is 2.7–10.8 V.
  • ESP32 3.3 V: use the ESP32 devkit's onboard 3.3 V regulator (fed from USB during development; for standalone use, feed the devkit's 5 V pin from a 5 V LDO/buck tied to the battery).
  • Common GND: battery negative, DRV8833 GND, and ESP32 GND must all connect.

No extra capacitors on the motor supply are required beyond the 100 Β΅F bulk cap β€” the DRV8833 has internal bootstrap circuitry. The original PCB's caps were there because the original brain IC had no such protection.

Software architecture

Previous approach (FT232H)

Server (laptop) β†’ WebSocket β†’ Phone browser β†’ WebUSB/FT232H β†’ H-bridge β†’ Motor

The phone acted as a USB-to-GPIO adapter. This required the phone to stay connected over USB.

New approach (ESP32 WiFi)

Server (laptop) β†’ HTTP POST /motor β†’ ESP32 WiFi β†’ DRV8833 β†’ Motor
Phone: audio input/output, display, camera only

The ESP32 connects to the same WiFi network as the server. Set ESP32_URL=http://<esp32-ip> in the server environment. The server calls the ESP32 directly; the phone no longer participates in motor control.

About USB from phone to ESP32

The FT232H used WebUSB (vendor-specific USB class, supported in Chrome/Edge). The ESP32's built-in USB port enumerates as a CDC-Serial device, which requires the Web Serial API (different from WebUSB). Chrome/Edge on Android support Web Serial, but:

  • iOS Safari supports neither WebUSB nor Web Serial.
  • The existing motor.ts client code speaks the FTDI protocol; it would need a full rewrite.
  • WiFi removes the USB cable entirely and lets the server drive the motor without routing through the phone, which is simpler and more reliable.

WiFi is the recommended path. Web Serial is an option only if a USB tether from phone to ESP32 is specifically required and iOS is not a target.

ESP32 firmware

See esp32/src/main.cpp and esp32/platformio.ini. The firmware uses the Arduino framework targeting ESP32 β€” no Arduino IDE needed.

Tooling: PlatformIO in VS Code

  1. Install the PlatformIO IDE extension in VS Code.
  2. Open the esp32/ folder (File β†’ Open Folder).
  3. PlatformIO detects platformio.ini and downloads the ESP32 toolchain and libraries automatically.
  4. Click Upload (β†’ icon) or run pio run --target upload to flash.
  5. Click Monitor (plug icon) or pio device monitor for Serial output.

Libraries are declared in platformio.ini under lib_deps β€” no manual installs:

  • ArduinoJson 7.x (bblanchon/ArduinoJson)
  • WiFiManager 2.0.x (tzapu/WiFiManager)

If your board is not a generic ESP32 DevKit, change the board value in platformio.ini. See the comment at the top of that file for common alternatives (S3, C3, etc.).

WiFi credentials β€” no hardcoding

Credentials are not in the sketch file and not in git. Instead WiFiManager is used:

  1. First boot (or after a credential reset): ESP32 creates an open AP called PiBot-Setup.
  2. Connect to it from any phone or laptop.
  3. Open http://192.168.4.1 β€” a captive portal lets you pick your SSID and enter the password.
  4. Credentials are saved to ESP32 NVS flash and reused on every subsequent boot.
  5. To reconfigure: hold the BOOT button (GPIO 0) for 3 seconds at startup.

After first-time setup, open Serial Monitor at 115200 baud to read the assigned IP, then set ESP32_URL=http://<ip> in the server environment.

The /motor endpoint blocks for durationMs before responding, so the HTTP response signals completion β€” this matches the existing RPC semantics where the server waits for the tool to finish before the LLM receives the result.

Server integration

Set ESP32_URL=http://<esp32-ip> in the environment. When this is set, the server's motor tool sends commands directly to the ESP32 via HTTP instead of forwarding through the phone's WebSocket connection. The phone WebSocket path remains as a fallback when ESP32_URL is not set.

turn_left_degrees (which normally uses the phone's orientation sensor for closed-loop angle control) falls back to a timed turn_left when routing through the ESP32 β€” the server already pre-calculates durationMs from the requested degrees, so the behaviour degrades gracefully.

Spotify: from broken playback to room-aware music via Music Assistant

The original Spotify integration had two hard problems. Asking the bot to play a playlist from your library mostly failed because the agent called Spotify's public search API, which only returns public playlists β€” your private ones are behind /me/playlists. And there was no way to target a specific room: the bot always played on whatever device was currently selected in the browser UI.

Library playlists

Added a spotify_my_playlists tool that fetches /me/playlists directly. The tool description tells the LLM to always try this first for any playlist request and only fall back to spotify_search with itemType=playlist if nothing matches.

Device targeting β€” the Spotify Web API dead-end

Added spotify_list_devices (calls /me/player/devices) and a deviceId parameter on spotify_play. This works fine for Spotify Connect devices with an active session β€” phones, computers β€” but Google Home and Chromecast Audio only appear in that list when they have the Spotify Cast receiver running. When idle, the Web API cannot see them at all. The phone Spotify app discovers them locally via the Google Cast SDK; the Web API has no such mechanism.

Tried a cascade of HA-based workarounds:

  • media_player.play_media on the Cast entity with media_content_type: music β€” HA launched the receiver (audible connection chime) but Spotify couldn't authenticate it without credentials
  • HA's Spotify integration (media_player.spotify_hasi) β€” play_media is not supported when the entity is idle; select_source needs active playback already running; circular dependency
  • media_content_type: spotify β€” same result as music

Music Assistant

Music Assistant (MA) solves the problem cleanly. It holds a Spotify OAuth token, uses the Google Cast SDK on the HA server (which is on the local network), and handles device wakeup + authentication internally. After installing MA, testing music_assistant.play_media confirmed it could play Spotify URIs on idle Cast devices via HA.

The pibot integration:

SPOTIFY_HA_ROOMS β€” a comma-separated Name:entity_id map in the environment (e.g. KΓΌche:media_player.kuche,Bad:media_player.badezimmer). At startup this is parsed and injected into the spotify_list_devices tool description so the LLM knows the exact entity for each room name without a discovery call.

MA_CONFIG_ENTRY_ID β€” the HA config entry ID for MA, injected into the description as the config_entry_id parameter for music_assistant.search and music_assistant.get_library calls.

HOME_ASSISTANT_ALLOWED_DOMAINS β€” added music_assistant to the allowlist so music_assistant.play_media and music_assistant.transfer_queue are reachable.

Flow for room requests: the LLM skips spotify_list_devices entirely for known rooms and calls homeassistant_call_service directly β€” domain=music_assistant, service=play_media, entity_id=<from map>, data={media_id: <name or spotify uri>, media_type: <type>, enqueue: play}. MA does its own internal search so passing the name the user said is sufficient; no separate Spotify search step. Two tool calls total: get the content, play it.

Flow for no-room requests: unchanged β€” Spotify Web API spotify_play on the active device.

Pause: media_player.media_pause on the MA entity. The service must be named exactly media_pause; pause, stop, media_player_pause all return 400.

Room switching / exclusive playback: music_assistant.transfer_queue moves the queue from one MA player to another, stopping the source device. Combined with a HA automation that pauses all other MA entities when one starts playing, this enforces "last wins" exclusive playback without any code in pibot.

AVR wakeup timeout: the Denon AVR in the Wohnzimmer takes more than 15 seconds to wake from standby. Increased the HA tool request timeout from 15 s to 30 s so the call doesn't abort while the receiver is still powering up β€” the music plays a moment after the bot has confirmed success rather than the bot reporting a false timeout.

Lessons

  • Spotify's Web API device list only reflects devices with an active session. Cast devices on idle are invisible to it regardless of authentication or scopes.
  • HA's Spotify integration (media_player.spotify_*) is a control mirror for existing sessions, not a session initiator β€” play_media requires an active device.
  • Music Assistant is the right abstraction for home speaker playback: it handles Cast SDK, provider auth, and device wakeup in one layer that exposes a simple HA service call.
  • Receiver wakeup time (~20 s) matters for timeout budgets. Fire-and-forget is better than fail-fast here β€” HA accepts the request immediately and MA handles the wakeup asynchronously.