Aoede's quality voice is Kokoro, a small open neural text-to-speech model. My first integration ran it through FluidAudio's Core ML and Apple Neural Engine pipeline, which is the obvious on-device path and works on most Macs.
It did not work on mine. On this Apple M5, FluidAudio's Kokoro chain crashed on every compute unit I tried: a Metal "JIT not supported" error on the GPU, and a low-level BNNS segfault on both the CPU and the ANE. The same library runs fine on older Apple silicon, so this looked like new hardware getting ahead of the dependency. An older app of mine, built against an older FluidAudio version, still spoke, which told me the model was fine and the runtime was the problem.
So I moved Kokoro to a different runtime: the MLX port, which runs the model through Apple's MLX framework instead of Core ML. That brought its own friction. MLX's Metal shaders only compile under a real Xcode build, not a plain swift build, so the provider has to live in the app target, and the Metal Toolchain has to be installed. But it runs on the M5, and it exposes per-token timestamps. The result was simple: Kokoro worked, the voices sounded better, and the karaoke highlight could follow measured timestamps instead of estimated ones. I kept the FluidAudio work on a branch in case the hardware story changes.
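To make the timestamp point concrete, here is a minimal sketch of how timed tokens can drive a highlight. It is not the MLX port's API: the TimedToken type, the activeTokenIndex function, and the sample timings are all made up for illustration, and it assumes one token per word, sorted by start time, with no overlaps.

```swift
// Hypothetical shape for one timed token; the real runtime's output type may differ.
struct TimedToken {
    let text: String
    let start: Double  // seconds from the start of the audio
    let end: Double    // exclusive end, in seconds
}

/// Returns the index of the token being spoken at `time`,
/// or nil if `time` falls in a gap or outside the audio.
/// Assumes `tokens` is sorted by start time and tokens do not overlap.
func activeTokenIndex(in tokens: [TimedToken], at time: Double) -> Int? {
    var low = 0
    var high = tokens.count - 1
    // Binary search over the sorted, non-overlapping intervals.
    while low <= high {
        let mid = (low + high) / 2
        let token = tokens[mid]
        if time < token.start {
            high = mid - 1
        } else if time >= token.end {
            low = mid + 1
        } else {
            return mid
        }
    }
    return nil
}

// Illustrative timings only, not real model output.
let tokens = [
    TimedToken(text: "Hello", start: 0.00, end: 0.42),
    TimedToken(text: "from", start: 0.42, end: 0.61),
    TimedToken(text: "Aoede", start: 0.70, end: 1.25),
]
```

Called on each playback tick with the current audio position, this returns the word to highlight, or nil during a pause between words. With estimated timings the same lookup still works, but the intervals themselves are guesses, which is where the highlight drifts.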
This switch is also what set up the debugging night in the next update. Moving runtimes solved the crash and quietly introduced a subtler problem.