VoxWild — Offline AI Text-to-Speech for Windows

Two engines

Two AI text-to-speech engines, because voice is a trade-off.

Kokoro is the "done in half a second" TTS engine — instant speech synthesis, no GPU needed. Chatterbox is the "wait a bit, get something you'd actually play for a client" engine — human-sounding narration with voice cloning. Pick per line, per project, whenever.

Fast mode

Kokoro TTS

Kokoro 82M · ONNX runtime · CPU

Near-instant generation. Sounds clean and consistent — perfect for narration, audiobooks, and anything where you need speed and clarity over emotional range.

Latency	~0.4s per 30s of audio
Voices	13 built-in (US + UK, M/F)
Model size	82 MB (bundled)
Requires	CPU only · 4 GB RAM min
Cost	Free forever, unlimited

Natural mode

Chatterbox TTS

Chatterbox · PyTorch · CPU/GPU

Slower, but genuinely human-sounding. Clones any voice from a 6-second reference sample. The one you'd use for a real podcast, ad, or client deliverable.

Latency	~4s per 10s of audio (CPU)
Voice cloning	Yes, 6s sample minimum
Model size	~3 GB (first-use download)
Requires	6 GB RAM min · GPU optional
Cost	3 free, then Pro unlocks unlimited
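
To put the two latency figures side by side, here is a rough back-of-the-envelope sketch. It assumes generation time scales linearly with audio length, which the page does not claim, and the function and constant names are ours for illustration only.

```python
# Figures quoted above: seconds of compute per seconds of audio produced.
KOKORO_RATIO = 0.4 / 30      # Fast mode, CPU
CHATTERBOX_RATIO = 4 / 10    # Natural mode, CPU


def estimate_generation_seconds(audio_seconds: float, ratio: float) -> float:
    """Estimated wall-clock seconds to synthesize `audio_seconds` of speech.

    Assumes linear scaling, which real hardware will only approximate.
    """
    return audio_seconds * ratio


one_hour = 60 * 60
fast = estimate_generation_seconds(one_hour, KOKORO_RATIO)
natural = estimate_generation_seconds(one_hour, CHATTERBOX_RATIO)
print(f"Fast mode:    ~{fast:.0f} s for one hour of audio")
print(f"Natural mode: ~{natural / 60:.0f} min for one hour of audio")
```

Under that assumption, an hour of narration is roughly a 48-second job in Fast mode and a 24-minute job in Natural mode on CPU. Treat both as ballpark numbers.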

Pricing

VoxWild vs. ElevenLabs, Murf, and PlayHT.

Three ways to pay, one app. Free if Fast mode text-to-speech is enough. Monthly if you want flexibility. Lifetime if you're in for the long haul.

	VoxWild Free	VoxWild Pro (Lifetime)	ElevenLabs Starter	Murf Creator	PlayHT Creator
Monthly cost	$0	$0 after purchase	$22	$29	$31
3-year total	$0	$89	$792	$1,044	$1,116
Runs offline	Yes	Yes	No	No	No
Voice cloning	—	Yes, 6s sample	Yes (≥Creator $22)	No	Yes
Voice count	13	13 + unlimited clones	~120	~200+	~800
Character limit	unlimited	unlimited	30k / month	200k / month	250k / month
Your text goes to	your laptop	your laptop	their servers	their servers	their servers
Commercial use	Yes	Yes	Yes	Yes	Yes
Open-source models	Yes	Yes	No	No	No

Free

Unlimited Fast mode. 3 free Natural mode uses and 3 free Enhancement uses.

Download

Pro Monthly

$12 /mo

Everything unlimited. Cancel anytime from Gumroad.

[ most people pick this ]

Pro Lifetime

$89 once

Pays for itself in under 8 months of Pro Monthly. All future updates included.

Buy lifetime
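
If you want to check the pricing arithmetic yourself, this short sketch reproduces the 3-year totals from the comparison table and the break-even point between Pro Monthly and Pro Lifetime. The prices are the ones quoted on this page; the variable and function names are illustrative only.

```python
MONTHS = 36  # three years

# Monthly prices as quoted on this page (USD).
monthly_prices = {
    "VoxWild Pro Monthly": 12,
    "ElevenLabs Starter": 22,
    "Murf Creator": 29,
    "PlayHT Creator": 31,
}
LIFETIME_PRICE = 89

three_year_totals = {name: price * MONTHS for name, price in monthly_prices.items()}


def months_to_break_even(lifetime_price: int, monthly_price: int) -> int:
    """First month in which cumulative monthly payments reach the lifetime price."""
    return -(-lifetime_price // monthly_price)  # ceiling division on integers


for name, total in three_year_totals.items():
    print(f"{name}: ${total:,} over {MONTHS} months")

break_even = months_to_break_even(LIFETIME_PRICE, monthly_prices["VoxWild Pro Monthly"])
print(f"Lifetime overtakes Pro Monthly in month {break_even}")
```

$89 divided by $12 is about 7.4, so the eighth monthly payment is the one that pushes the running total past the lifetime price.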

What's inside

Open-source AI speech models. No black boxes.

VoxWild is a Python desktop app that bundles a small number of open-source AI text-to-speech models. Here's exactly what they are and where they come from, because you shouldn't install a speech synthesis tool that won't tell you.

Component	What it does	License
Kokoro TTS	Fast-mode speech synthesis. 82M params, ONNX, runs on CPU.	Apache 2.0
Chatterbox TTS	Natural-mode synthesis and 6-second voice cloning. Made by Resemble AI.	MIT
Resemble Enhance	Optional AI audio enhancement: denoise + dereverb + upsample.	MIT
Perth watermark	Inaudible neural watermark embedded in Chatterbox output. Required by the upstream license.	MIT
PyTorch · ONNX Runtime	Inference runtimes. Everything runs locally through these, with no network calls.	BSD · MIT
CustomTkinter	The desktop UI toolkit. Makes Tk not look like Tk.	MIT

Platform	Windows 10/11 x64
Installer size	~377 MB
Disk after install	~800 MB (Fast mode only) · ~5 GB with Natural mode
Minimum RAM	4 GB (Fast mode) · 6 GB (Natural mode)
GPU	Optional · speeds up Enhancement if CUDA available
Network	Only for license activation, update checks, and the one-time Natural mode model download
Telemetry	None

Who made this

We're Cookie Studios — a small independent team building desktop software that respects your computer and your wallet.

VoxWild started in early 2026 because the decent AI text-to-speech options all wanted $22+ a month to rent voices that run fine on a laptop. Nobody had shipped a proper desktop wrapper around Kokoro and Chatterbox — two genuinely great open-source speech models. So we built one. Then we kept building, because every time we used it we noticed something else it needed.

We're small enough that support goes straight to a person who knows the code. Email us if something breaks or you want a feature — we reply within a day, usually within an hour on weekdays.

— Cookie Studios
cookiestudios.dev@gmail.com

Frequently asked questions

VoxWild FAQ — the ones people actually ask.

— Is this real?

The installer is unsigned. Is it safe?

Yes, but we get why you're asking. SmartScreen warns because code-signing certificates cost ~$400/year and VoxWild is a small indie shop. The installer is hosted on GitHub Releases (public repo, every commit is visible) and the download link on this page points there directly. You can inspect every line of Python before trusting it. We plan to get the exe signed once revenue justifies the cost.

Does it actually run offline?

Yes. The app makes only three kinds of network request: (1) a single HTTPS call to Gumroad when you enter a license key, (2) a daily check to GitHub for new versions, and (3) the one-time download of the Natural mode model (~3 GB) the first time you use it. Your text, your audio, your voice clones: none of it leaves your computer. Ever. No account. No cloud.
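
For the curious, a once-a-day version check needs very little logic. The sketch below is not VoxWild's actual code. It is a hypothetical illustration that assumes release tags look like `v1.4.2`, and it deliberately leaves out the network call itself.

```python
from datetime import datetime, timedelta
from typing import Optional

CHECK_INTERVAL = timedelta(days=1)  # "daily", per the answer above


def update_check_due(last_check: Optional[datetime], now: datetime) -> bool:
    """True if no check has ever run, or the last one was at least a day ago."""
    return last_check is None or now - last_check >= CHECK_INTERVAL


def parse_version(tag: str) -> tuple:
    """Turn a tag such as 'v1.4.2' into (1, 4, 2). The tag format is an assumption."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def is_newer(latest_tag: str, installed_tag: str) -> bool:
    """Compare versions numerically, so 1.10.0 counts as newer than 1.9.3."""
    return parse_version(latest_tag) > parse_version(installed_tag)


now = datetime(2026, 3, 2, 9, 0)
print(update_check_due(datetime(2026, 3, 1, 8, 0), now))
print(is_newer("v1.10.0", "v1.9.3"))
```

Comparing parsed integer tuples instead of raw strings avoids the classic mistake where "1.9.3" sorts after "1.10.0".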

Who's behind it?

Cookie Studios — a small independent team building desktop tools. Support email is above, and a real person replies within a day (usually within an hour on weekdays). We're not VC-funded; revenue comes from users buying the app.

— Will it do what I need?

What can I export?

MP3 (configurable bitrate, with ID3 metadata), WAV (16-bit), and SRT subtitle files timed to the audio. FLAC is on the roadmap.
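
If you have never opened an SRT file, it is just numbered cues with start and end timestamps. Here is a minimal sketch of how cues could be timed from per-sentence audio durations. It assumes each caption plays back to back with no gaps, and the function names are ours, not the app's.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: list[tuple[str, float]]) -> str:
    """Build SRT text from (caption, duration_in_seconds) pairs played back to back."""
    blocks = []
    start = 0.0
    for index, (caption, duration) in enumerate(cues, start=1):
        end = start + duration
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{caption}\n"
        )
        start = end  # the next cue begins where this one ends
    return "\n".join(blocks)


print(build_srt([("Hello there.", 1.25), ("This is a second sentence.", 2.5)]))
```

Note the comma before the milliseconds. SRT uses a comma there, not a period.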

Is there a max text length?

No. Paste a novel. The queue splits long text into chunks at sentence boundaries and concatenates the output. We've tested it with book-length scripts.
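
The split-and-concatenate idea can be sketched in a few lines. This is a simplified illustration, not the app's real queue: the regex sentence splitter is naive, the 400-character limit is an arbitrary assumption, and `synthesize` is a hypothetical stand-in for whichever engine turns a chunk of text into raw audio bytes.

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own oversized chunk.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long_text(text: str, synthesize) -> bytes:
    """Synthesize each chunk in order and join the results.

    Assumes `synthesize` returns headerless audio (for example raw PCM),
    so plain byte concatenation is valid.
    """
    return b"".join(synthesize(chunk) for chunk in split_into_chunks(text))


print(split_into_chunks("One. Two! Three?", max_chars=9))
```

Splitting only at sentence boundaries means no chunk ends mid-sentence, which is what keeps the joins from landing in the middle of a phrase.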

Can I use the audio commercially?

Yes, on every tier including Free, for both engines. Two caveats: (a) Chatterbox output contains an inaudible neural watermark (a Perth watermark, required by the upstream license — doesn't affect audio quality). (b) The EULA requires that you only clone voices you have rights to clone. Beyond that, what you make is yours.

How good is the 6-second voice cloning?

Honest answer: good enough to fool a listener in a podcast context, not good enough to fool the cloned person's spouse. Longer reference samples (30–60 seconds of clean audio) produce noticeably better clones. The quality of the source audio matters more than the length — a clean 8s sample beats a noisy 60s one.

— What if I change my mind?

Refund policy?

14 days, no questions asked, through Gumroad. Reply to your receipt email and ask for a refund.

What if I cancel Pro Monthly?

Fast mode keeps working forever. Natural mode and cloning stop generating new audio. Audio you already generated stays on your disk — it's yours, always.

Can I move my license to a new PC?

Yes. A single license activates on up to two machines. If the old one is dead and you need a seat freed, email us and we'll release it the same day.

What about Mac / Linux?

Windows only right now. Mac is possible eventually — the underlying models run fine on Apple Silicon — but packaging and testing on a second platform is a lot of work for a small team. If you want Mac support, drop us a line so we can gauge demand.

Roadmap?

Short-term: more export formats, pronunciation dictionary improvements, better clone management. Longer-term: FLAC, maybe a macOS build, maybe Linux if demand exists. No AI video, no avatars, no chatbot — this app does one thing.

Offline text-to-speech that doesn't sound like a robot.

Download VoxWild for Windows.