It was the second week of April. I had left my incubation role at a college and was figuring out what to do next. I started visiting the iStart incubation center regularly and exploring what founders were building there.
I started working at a startup on the third floor. The work was mostly WordPress development, so it was straightforward. Through that, I met people from other startups, including a founder from Aguken, a voice AI startup building voice agents for FMCG brands.
He suggested I try fine-tuning Qwen3 TTS for Hindi as an experiment, and I jumped straight in. Since I did not have access to strong GPU resources, I used the Lightning AI credits offered to new users, prepared a Python notebook, ran tests, and generated my first audio sample.
The result felt great. The audio quality was surprisingly strong. After that, I went deeper and ran LoRA fine-tuning experiments on both the Qwen3 0.6B and Qwen3 1.7B TTS models.
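For readers new to LoRA, here is a rough sketch of why it suits limited GPU budgets. Instead of updating a full weight matrix, LoRA trains two small low-rank matrices beside it. The layer size and rank below are hypothetical numbers chosen for illustration, not the settings used in these experiments.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters LoRA adds for that matrix: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical values for illustration only (not taken from the actual runs).
d_in = d_out = 1024
rank = 16

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full matrix : {full:,} trainable parameters")
print(f"LoRA adapter: {lora:,} trainable parameters")
print(f"ratio       : {lora / full:.4f}")
```

With these made-up numbers the adapter trains only a few percent of what a full update of that one matrix would need, which is why the approach fits on modest hardware.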
I started with a smaller run on about 4K rows of the dataset. Training loss stayed around 7-8, but the generated audio was still quite decent. Later, I trained the 0.6B model on the full dataset and then tried quantization to reduce VRAM usage.
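To see why quantization helps with VRAM, here is a back-of-the-envelope sketch of how much memory the weights alone take at different precisions. It makes several assumptions. The parameter counts are the nominal 0.6B and 1.7B from the model names. The list of precisions is generic and not necessarily what was used here. Activations, caches, and quantization metadata such as scales are ignored.

```python
# Approximate storage cost per parameter at common precisions (bytes).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


# Nominal sizes taken from the model names; real parameter counts may differ.
NOMINAL_SIZES = {"0.6B": 0.6e9, "1.7B": 1.7e9}

for name, n in NOMINAL_SIZES.items():
    for precision in BYTES_PER_PARAM:
        gib = weight_memory_gib(n, precision)
        print(f"{name} @ {precision}: ~{gib:.2f} GiB for weights")
```

Each halving of precision halves the weight footprint. Fewer bits per weight also mean coarser values, which is consistent with a small drop in output quality.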
Quantization did reduce resource needs, but it also lowered quality a bit. We eventually published the model on Hugging Face as well.
Here's the link if you want to check it out: Hugging Face Model.