A Windows desktop app for AI text-to-speech with 13 built-in voices, voice cloning from a 6-second sample, AI audio enhancement, and queue-based batch generation. Runs 100% offline — no cloud, no account, no character caps. Fast mode is free forever. Pro unlocks everything — $12/mo or $89 once, your call.
Kokoro is the "done in half a second" TTS engine — instant speech synthesis, no GPU needed. Chatterbox is the "wait a bit, get something you'd actually play for a client" engine — human-sounding narration with voice cloning. Pick per line, per project, whenever.
Near-instant generation. Sounds clean and consistent — perfect for narration, audiobooks, and anything where you need speed and clarity over emotional range.
Slower, but genuinely human-sounding. Clones any voice from a 6-second reference sample. The one you'd use for a real podcast, ad, or client deliverable.
| | VoxWild Free | VoxWild Pro | ElevenLabs Starter | Murf Creator | PlayHT Creator |
|---|---|---|---|---|---|
| Monthly cost | $0 | $0 after buy | $22 | $29 | $31 |
| 3-year total | $0 | $89 total | $792 | $1,044 | $1,116 |
| Runs offline | Yes | Yes | No | No | No |
| Voice cloning | — | Yes, 6s sample | Yes (≥Creator $22) | No | Yes |
| Voice count | 13 | 13 + unlimited clones | ~120 | ~200+ | ~800 |
| Character limit | unlimited | unlimited | 30k / month | 200k / month | 250k / month |
| Your text goes to | your laptop | your laptop | their servers | their servers | their servers |
| Commercial use | Yes | Yes | Yes | Yes | Yes |
| Open-source models | Yes | Yes | No | No | No |
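If you want to check the math, the 3-year totals are the monthly prices times 36, with the one-time Pro purchase counted once. A quick sketch using only the prices listed in the table (the function and variable names are ours, purely for illustration):

```python
# Prices copied from the comparison table above (USD).
MONTHS = 36  # three years

monthly_plans = {
    "ElevenLabs Starter": 22,
    "Murf Creator": 29,
    "PlayHT Creator": 31,
}


def three_year_total(monthly_price: int, months: int = MONTHS) -> int:
    """Total cost of a subscription billed monthly for `months` months."""
    return monthly_price * months


totals = {name: three_year_total(price) for name, price in monthly_plans.items()}
totals["VoxWild Free"] = 0
totals["VoxWild Pro"] = 89  # one-time purchase, paid once

for name, total in totals.items():
    print(f"{name}: ${total:,}")
```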
VoxWild is a Python desktop app that bundles a small number of open-source AI text-to-speech models. Here's exactly what they are and where they come from, because you shouldn't install a speech synthesis tool that won't tell you.
We're Cookie Studios — a small independent team building desktop software that respects your computer and your wallet.
VoxWild started in early 2026 because the decent AI text-to-speech options all wanted $22+ a month to rent voices that run fine on a laptop. Nobody had shipped a proper desktop wrapper around Kokoro and Chatterbox — two genuinely great open-source speech models. So we built one. Then we kept building, because every time we used it we noticed something else it needed.
We're small enough that support goes straight to a person who knows the code. Email us if something breaks or you want a feature — we reply within a day, usually within an hour on weekdays.
Yes, but we get why you're asking. SmartScreen warns because the installer isn't code-signed yet: certificates cost ~$400/year and Cookie Studios is a small indie shop. The installer is hosted on GitHub Releases (public repo, every commit is visible) and the download link on this page points there directly. You can inspect every line of Python before trusting it. We plan to get the installer signed once revenue justifies the cost.
Yes. The only network requests the app ever makes are: (1) a single HTTPS call to Gumroad when you enter a license key, and (2) a daily check to GitHub for new versions. Your text, your audio, your voice clones — none of it leaves your computer. Ever. No account. No cloud.
Cookie Studios — a small independent team building desktop tools. Support email is above, and a real person replies within a day (usually within an hour on weekdays). We're not VC-funded; revenue comes from users buying the app.
MP3 (configurable bitrate, with ID3 metadata), WAV (16-bit), and SRT subtitle files timed to the audio. FLAC is on the roadmap.
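For the curious, an SRT file is just numbered cues with `HH:MM:SS,mmm --> HH:MM:SS,mmm` timestamps. Here is a minimal sketch of how timed cues could be written out, assuming you already have start and end times in seconds for each line. The helper names `srt_timestamp` and `build_srt` are hypothetical and are not taken from VoxWild's code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) cues."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        # Each cue: sequence number, time range, then the subtitle text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


print(build_srt([(0.0, 1.25, "Hello."), (1.25, 2.0, "World.")]))
```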
No. Paste a novel. The queue splits long text into chunks at sentence boundaries, generates each chunk, and concatenates the audio in order. We've tested it with book-length scripts.
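A rough sketch of what sentence-boundary chunking can look like, in case you're wondering how a novel fits through a speech model. This is an illustration under our own assumptions, not the app's actual code: the function name `split_into_chunks`, the 300-character limit, and the simple punctuation-based sentence split are all hypothetical.

```python
import re


def split_into_chunks(text: str, max_chars: int = 300) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk,
    so no sentence is ever cut in the middle.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two! Three?", max_chars=9))
```

Each chunk would then be synthesized on its own and the resulting audio joined back together in the same order.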
Yes, on every tier including Free, for both engines. Two caveats: (a) Chatterbox output contains an inaudible neural watermark (a Perth watermark, required by the upstream license — doesn't affect audio quality). (b) The EULA requires that you only clone voices you have rights to clone. Beyond that, what you make is yours.
Honest answer: good enough to fool a listener in a podcast context, not good enough to fool the cloned person's spouse. Six seconds is the minimum; longer reference samples (30–60 seconds of clean audio) produce noticeably better clones. Even so, the quality of the source audio matters more than its length: a clean 8s sample beats a noisy 60s one.
14 days, no questions asked, through Gumroad. Reply to your receipt email and ask for a refund.
Fast mode keeps working forever. Natural mode and cloning stop generating new audio. Audio you already generated stays on your disk — it's yours, always.
Yes. A single license activates on up to two machines. If the old one is dead and you need a seat freed, email us and we'll release it the same day.
Windows only right now. Mac is possible eventually — the underlying models run fine on Apple Silicon — but packaging and testing on a second platform is a lot of work for a small team. If you want Mac support, drop us a line so we can gauge demand.
Short-term: more export formats, pronunciation dictionary improvements, better clone management. Longer-term: FLAC, maybe a macOS build, maybe Linux if demand exists. No AI video, no avatars, no chatbot — this app does one thing.