How I built a free Hindi TTS pipeline for YouTube — no paid APIs, no watermarks

Introduction

I run a Hindi storytelling channel — StoryBlazeWorld — and like every faceless channel creator, I needed a voiceover pipeline that didn't cost me ₹5,000/month in API fees. I tried ElevenLabs. Beautiful voice. Empty wallet. 💸

So I spent a week going down every rabbit hole — Coqui XTTS v2, AI4Bharat, Azure Neural, edge-tts — breaking things, fixing things, and occasionally yelling at my terminal. This post is everything I learned, condensed into a stack that actually works.

The problem

Hindi TTS in 2025 is a mess if you want it free + local + high quality. Here's what I ran into:

🔴 AI4Bharat — model name doesn't exist in Coqui's registry. KeyError: 'hi'.
🔴 Coqui XTTS v2 — PyTorch 2.6 broke weights_only loading. Then torchaudio wanted torchcodec. Then 400 token limit per call.
🔴 ElevenLabs — Viraj voice is chef's kiss but the free tier runs out in 10 minutes of audio.
🔴 edge-tts "ReaanNeural" — doesn't exist. NoAudioReceived. Classic.

Every option had a catch. I needed something that was genuinely free, ran on my M2 Mac, produced audio good enough for YouTube, and didn't fight me every step of the way.
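The ReaanNeural failure is easy to catch up front by asking edge-tts which voices it actually offers before hard-coding a name. A minimal sketch, assuming the `list_voices()` coroutine of the edge-tts Python package and the `ShortName`, `Locale` and `Gender` keys in the dicts it returns; `hindi_voices` and `fetch_hindi_voices` are hypothetical helper names, not part of the original script:

```python
import asyncio


def hindi_voices(voices, locale="hi-IN"):
    """Return sorted (ShortName, Gender) pairs for one locale from a list of voice dicts."""
    return sorted(
        (v["ShortName"], v.get("Gender", "?"))
        for v in voices
        if v.get("Locale") == locale
    )


async def fetch_hindi_voices():
    # Lazy import: needs `pip install edge-tts` and a network connection.
    import edge_tts

    return hindi_voices(await edge_tts.list_voices())


# Usage (performs a network request):
# for name, gender in asyncio.run(fetch_hindi_voices()):
#     print(name, gender)
```

If a voice name is not in that list, the request is likely to fail the same way ReaanNeural did.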

The solution — edge-tts + pydub

After all that pain, the winning stack was surprisingly simple:

pip install edge-tts pydub
brew install ffmpeg

edge-tts uses Microsoft's Azure Neural voices through the Edge browser endpoint: completely free, no API key, same neural engine as the paid Azure TTS. The one catch is that synthesis happens on Microsoft's servers, so it needs an internet connection and isn't truly local. For Hindi, hi-IN-MadhurNeural is the only male option, and with the right RATE/PITCH settings it sounds genuinely professional.

VOICE = "hi-IN-MadhurNeural"
RATE = "+8%" # professional pace
PITCH = "-3Hz" # richer, warmer tone
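Wired into an actual call, those settings look roughly like this. A minimal sketch, assuming the `edge_tts.Communicate` class with its `rate` and `pitch` keyword arguments and its `save()` coroutine; the `synthesize` helper and the output filename are illustrative names, not taken from the original script:

```python
import asyncio

VOICE = "hi-IN-MadhurNeural"
RATE = "+8%"    # slightly faster than the default pace
PITCH = "-3Hz"  # slightly lower than the default pitch


async def synthesize(text, out_path, voice=VOICE, rate=RATE, pitch=PITCH):
    """Send one piece of text to edge-tts and write the result as an MP3 file."""
    # Lazy import: needs `pip install edge-tts` and a network connection.
    import edge_tts

    communicate = edge_tts.Communicate(text, voice, rate=rate, pitch=pitch)
    await communicate.save(out_path)
    return out_path


# Usage (performs a network request):
# asyncio.run(synthesize("नमस्ते दोस्तों।", "sentence_000.mp3"))
```

Keeping voice, rate and pitch as defaults on one function means every sentence in a video is generated with identical settings.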

The key insight: instead of passing the whole script in one call, I split on sentence boundaries (। and .), generate each sentence separately, then stitch the clips with pydub, adding 200 ms of silence between sentences and 700 ms between sections. This alone makes the pacing feel far more natural.
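The split-and-stitch idea can be sketched as below. This is not the script from the repo, just one way to do it, and it assumes sections are separated by blank lines and that each sentence has already been rendered to its own MP3. `split_sentences`, `build_plan` and `stitch` are hypothetical names; the pydub calls used are `AudioSegment.empty()`, `AudioSegment.from_file()`, `AudioSegment.silent()` and `export()`.

```python
import re

SENTENCE_PAUSE_MS = 200  # gap between sentences
SECTION_PAUSE_MS = 700   # gap between sections


def split_sentences(section):
    """Split after a danda (।) or full stop, keeping the terminator with its sentence."""
    parts = re.split(r"(?<=[।.])\s+", section.strip())
    return [p for p in parts if p]


def build_plan(script):
    """Turn a script into (sentence, pause_after_ms) pairs.

    Assumption: sections are separated by blank lines.
    """
    plan = []
    for section in re.split(r"\n\s*\n", script.strip()):
        sentences = split_sentences(section)
        for i, sentence in enumerate(sentences):
            is_last = i == len(sentences) - 1
            plan.append((sentence, SECTION_PAUSE_MS if is_last else SENTENCE_PAUSE_MS))
    return plan


def stitch(clip_paths, pauses_ms, out_path="final_audio.mp3"):
    """Join per-sentence MP3 clips, inserting the planned silence after each one."""
    # Lazy import: needs `pip install pydub` and ffmpeg on the PATH.
    from pydub import AudioSegment

    audio = AudioSegment.empty()
    for path, pause in zip(clip_paths, pauses_ms):
        audio += AudioSegment.from_file(path, format="mp3")
        audio += AudioSegment.silent(duration=pause)
    audio.export(out_path, format="mp3")
    return out_path
```

Splitting only where the terminator is followed by whitespace keeps decimals like 3.5 in one piece, at the cost of missing sentences that have no space after the danda.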

Lessons learned

SSML doesn't work in edge-tts the way you expect. Passing XML tags just makes it read them aloud. Use pydub silence instead for pauses — it's more reliable.

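PyTorch 2.6 broke Coqui TTS silently. It flipped the default of torch.load's weights_only argument from False to True, so the XTTS checkpoint no longer loads. The fix is monkey-patching torch.load before importing TTS, but honestly, edge-tts is less friction for most use cases.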
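For anyone who still wants to try Coqui, the monkey-patch can be sketched like this. It is one possible shape rather than the exact patch I used; `allow_full_checkpoints` is a hypothetical helper name. Note that `weights_only=False` lets a checkpoint run arbitrary pickled code, so only do this for model files you trust.

```python
import functools


def allow_full_checkpoints(torch_module):
    """Wrap torch_module.load so weights_only defaults to False again.

    Returns the original function so the patch can be undone later.
    """
    original = torch_module.load

    @functools.wraps(original)
    def patched_load(*args, **kwargs):
        # Only fill in the default; an explicit weights_only=True still wins.
        kwargs.setdefault("weights_only", False)
        return original(*args, **kwargs)

    torch_module.load = patched_load
    return original


# Usage, assuming torch and Coqui TTS are installed:
# import torch
# allow_full_checkpoints(torch)
# from TTS.api import TTS  # import only after the patch is in place
```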

Roman Hindi → Devanagari matters. TTS models trained on Hindi expect Devanagari script. Numbers especially — "76" should be "छिहत्तर" or the pronunciation breaks completely.
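A simple way to handle the number problem is a lookup pass over the script before it reaches the TTS call. A minimal sketch with a deliberately tiny word table; `HINDI_NUMBER_WORDS` and `spell_out_numbers` are hypothetical names, and a real script would need a much fuller table, since Hindi number words are irregular:

```python
import re

# Illustrative table only; extend it with the numbers your scripts actually use.
HINDI_NUMBER_WORDS = {
    "1": "एक",
    "2": "दो",
    "3": "तीन",
    "10": "दस",
    "76": "छिहत्तर",
}


def spell_out_numbers(text, words=HINDI_NUMBER_WORDS):
    """Replace runs of ASCII digits with Devanagari words when the table knows them."""
    def replace(match):
        # Unknown numbers are left untouched so they are easy to spot and add later.
        return words.get(match.group(0), match.group(0))

    return re.sub(r"[0-9]+", replace, text)
```

Leaving unknown numbers as digits, instead of guessing, makes gaps in the table easy to find with a quick search of the processed script.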

Voice enhancement > voice switching. Run your audio through Audacity: bass boost (80–200 Hz), compressor at a -18 dB threshold and 3:1 ratio, and 15% reverb. MadhurNeural sounds studio-grade after this.

Audience doesn't care if it's AI. Millions of Hindi YouTube views are AI-voiced. Good script + good music + good pacing beats human voice with bad content every time.

Conclusion

You don't need to spend money to build a professional Hindi voiceover pipeline. edge-tts + pydub + Audacity gets you 80% of ElevenLabs quality at ₹0/month. The remaining 20% is your script, your music, and your editing.

The full script (segment splitting, auto-merge, final_audio.mp3 generation) is up on GitHub — link below. If you're building a Hindi faceless channel, this is your starting point. 🎙️
