The Robotic Voice Was a Dependency, Not My Code

Aoede's neural voice, Kokoro, started sounding like a malfunctioning droid. Every sentence opened with a garbled, robotic burst, settled into normal speech, then went robotic again at the start of the next one. The synthesized audio was already wrong before it ever reached the playback engine.

I spent a long night chasing it with an AI coding agent, and we went down every plausible road. We captured the raw audio and looked at it as waveforms. We rendered the same buffers offline through a copy of the audio graph to rule out playback. We checked the voice file's shape, read the vocoder's source, and even found the synthesis was non-deterministic from an unseeded random generator. Every theory pointed back at our own code: buffering, the audio graph, PCM conversion, timing. We tested them all and kept coming up empty.

The breakthrough came when we stopped asking "what is wrong with our code?" and started asking "what is different about this environment?" We forced the model onto the CPU instead of the GPU, and the robotic artifact vanished. That was the whole answer: a numerical regression in the GPU/Metal path of a specific dependency version, mlx-swift 0.30.x, on this particular Apple M5 hardware. Pinning back to 0.29.1 fixed it, and the newer 0.31.4 fixes it too. The actual change was a one-line version bump after hours of looking in the wrong place.

Here is the part worth remembering. I ran this with two different frontier models over the night, Opus 4.8 and GPT-5.5, and neither found the cause on its own. They were genuinely useful: fast at writing capture harnesses, patient at trying ideas, good at reading unfamiliar vocoder code. But the move that mattered, deciding to stop debugging the code and start bisecting the environment, came from human collaboration on how to debug, not from the agent. It eventually found and confirmed the root cause, but only because we changed what it was looking for.

That matches how I have come to think about these tools. They are excellent at executing a search and weak at noticing the search is aimed at the wrong thing. Knowing when to abandon a whole class of hypotheses is still a human call. The fix was one line. Getting there was a collaboration.