Practical AI Utilities

About this AI module

This Text-to-Speech module is part of the small set of practical tools I maintain for simple media and content work. It turns written text into spoken audio for narration, accessibility, and faster content preparation.

I first built it in an AI-related working environment and later rebuilt it into a more stable version for my own use. It now runs on my physical server and supports guides, product descriptions, short scripts, and task documentation.

It helps reduce repetitive recording work and makes voice output easier to prepare for Reels and YouTube Shorts.

Who should use this module?

  • Creators working on tutorials, explainers, and short social content.
  • Support teams that need quick voice output for guides and onboarding.
  • Students and mobile users who prefer listening instead of reading.

How My Vietnamese Text-to-Speech Module Works on a Flask AI Server

Short description for the article card
This article explains how my Vietnamese Text-to-Speech module works on a Flask-based AI server, from request routing and model selection to waveform generation and MP3 export. It also outlines the current runtime settings, voice generation flow, and a few practical limits in the version I am using now.
Article body

My Text-to-Speech module runs on a Python Flask AI server inside D:\hustmedia\python. The service is exposed on 0.0.0.0:8789, and the main POST /tts route works as a dispatcher between two Vietnamese TTS engines: a local F5-based pipeline and an alternative path based on facebook/mms-tts-vie. The API requires a non-empty text field, reads a servicecode, and falls back to the F5 engine unless the request explicitly asks for the Facebook path. In the current version, the request is processed synchronously and the API returns a success payload rather than streaming the audio directly.
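The dispatch logic above can be sketched as a plain function (the actual route returns this payload from Flask; the exact `servicecode` value that selects the Facebook path is an assumption for illustration):

```python
def route_tts(payload):
    """Sketch of the POST /tts dispatch: require a non-empty text field,
    read servicecode, and default to the F5 engine unless the request
    explicitly asks for the facebook/mms-tts-vie path.
    The 'facebook' servicecode value is an assumption, not the real code."""
    text = (payload.get("text") or "").strip()
    if not text:
        return {"status": "error", "message": "text is required"}
    if payload.get("servicecode") == "facebook":
        return {"status": "success", "engine": "facebook/mms-tts-vie"}
    return {"status": "success", "engine": "F5_vie"}
```

In the real server this function body sits inside the POST /tts handler, and the synthesis call runs synchronously before the success payload is returned.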

The main voice generation path is F5_vie. Before synthesis begins, the module converts numeric strings into Vietnamese words so phone numbers, quantities, and short operational text sound more natural. The server then calls f5-tts_infer-cli with a fixed reference voice file, the F5TTS_Base model, speed 0.5, the vocos vocoder, a local vocabulary file, and the checkpoint model_500000.pt. After inference, the generated WAV file is converted to MP3 with pydub and saved to D:\hustmedia\python\tts\output\output.mp3.
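The digit-to-words step can be illustrated with a minimal sketch. This reads digit strings character by character, which suits phone numbers; the module's real normalizer handles more cases, and this helper name is hypothetical:

```python
import re

# Vietnamese names for the ten digits.
VI_DIGITS = {"0": "không", "1": "một", "2": "hai", "3": "ba", "4": "bốn",
             "5": "năm", "6": "sáu", "7": "bảy", "8": "tám", "9": "chín"}

def digits_to_vietnamese(text):
    """Replace each run of digits with its digit-by-digit Vietnamese reading.
    A sketch only: full number grammar (hundreds, thousands) needs more rules."""
    def spell(match):
        return " ".join(VI_DIGITS[d] for d in match.group(0))
    return re.sub(r"\d+", spell, text)
```

For example, `digits_to_vietnamese("gọi 09")` yields `"gọi không chín"`, which reads far more naturally through a TTS voice than the raw digits.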

Inside the F5 pipeline, the processing flow is more than a simple wrapper call. The CLI loads its config, model backbone, and checkpoint, preprocesses the reference audio by trimming silence and adding about 50 ms of padding, and can infer missing reference text with Whisper. The generation text is then split into batches, normalized, resampled to 24 kHz, and passed through the sampling path before the waveform is decoded by the vocoder. When multiple chunks are produced, they are joined with a default cross-fade of 0.15 seconds to reduce audible breaks.

I also keep a second engine based on facebook/mms-tts-vie. In this path, the tokenizer and model are loaded from Hugging Face, the text is converted into a waveform, and the result is exported through the same output pipeline. This version is useful as an alternative engine, but in the current code it reloads the model on each request, so latency and memory usage can vary more than they would with a persistent in-memory design. Both engines also write to the same output MP3 path, which means concurrent requests will need tighter output isolation in future revisions.
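One way to avoid the per-request reload would be to cache the loaded engine, for example with `functools.lru_cache`. The sketch below uses a stand-in dictionary where the real loader would call `from_pretrained` for the facebook/mms-tts-vie tokenizer and model; this is a possible revision, not the current code:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_engine(name):
    """Load the TTS engine once and reuse it across requests.
    The dict below is a stand-in for the expensive from_pretrained()
    calls that would build the tokenizer and model in the real service."""
    return {"engine": name, "loaded": True}
```

The first call pays the load cost; every later call returns the same resident object, so latency and memory use stay stable across requests.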

Technical configuration snapshot
  • Runtime: Python Flask server on 0.0.0.0:8789
  • Main route: POST /tts
  • Default engine: F5_vie
  • Alternate engine: facebook/mms-tts-vie
  • Model path: F5TTS_Base
  • Vocoder: vocos
  • Checkpoint: model_500000.pt
  • Speed: 0.5
  • Audio resample target: 24 kHz
  • Chunk merge: 0.15 s cross-fade
  • Export: WAV to MP3 via pydub
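The shared-output-path limitation in the snapshot above could be addressed with per-request file names. A minimal sketch, assuming a UUID-based naming scheme (the helper name is hypothetical, and the output directory matches the path described earlier):

```python
import uuid
from pathlib import Path

# Output directory from the current deployment (Windows path on the server).
OUTPUT_DIR = Path(r"D:\hustmedia\python\tts\output")

def unique_output_path(suffix=".mp3"):
    """Return a per-request output file path so concurrent requests
    do not overwrite the single shared output.mp3."""
    return OUTPUT_DIR / f"{uuid.uuid4().hex}{suffix}"
```

The response payload would then carry the generated file name back to the caller instead of assuming a fixed output.mp3.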


Practical Input/Output Examples

  • Input: Product description paragraph. Output: Short narrated audio for social posts.
  • Input: Task instructions. Output: Short voice summary for collaborators.