About this AI module
This Speech-to-Text module is part of the small set of practical tools I maintain for simple media and content work. It turns spoken audio into editable text for faster transcription, note-taking, and content preparation.
I use it where audio is recorded on mobile and later converted into structured text for support, documentation, moderation, or publishing tasks. It helps reduce manual typing and makes spoken updates easier to handle.
It also turns voice recordings into searchable text that can be reviewed, edited, and reused later.
Who should use this module?
- People who need to turn short voice recordings into editable text.
- Teams handling notes, updates, or simple reporting from audio.
- Mobile users who prefer speaking first and editing later.
How My Vietnamese Speech-to-Text Module Works on a Flask AI Server
My Speech-to-Text module runs on the same Python Flask AI server used for the other media utilities in this workflow. The main API entry point is /wav2vec2 (accepting both GET and POST), served by Flask on port 8789. In the current version, the endpoint reads an audio path from form data or query parameters; if no path is provided, it falls back to a default local file. The response is JSON containing the status, the transcript, and the processed file path.
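A minimal sketch of what such an endpoint could look like. The `path` parameter name, the `transcribe_file` helper, and the default filename are assumptions for illustration; the real module wires in the wav2vec2 pipeline described in the following sections.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

DEFAULT_AUDIO = "run.wav"  # assumed fallback when no path is supplied

def transcribe_file(path: str) -> str:
    """Placeholder for the real wav2vec2 transcription pipeline."""
    return ""

@app.route("/wav2vec2", methods=["GET", "POST"])
def wav2vec2():
    # The audio path can arrive via form data (POST) or query parameters (GET).
    path = request.form.get("path") or request.args.get("path") or DEFAULT_AUDIO
    transcript = transcribe_file(path)
    # JSON response with status, transcript, and the processed file path.
    return jsonify({"status": "ok", "transcript": transcript, "file": path})

if __name__ == "__main__":
    app.run(port=8789)
```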
The transcription engine is based on the Hugging Face model khanhld/wav2vec2-base-vietnamese-160h. Both the processor and model are loaded once when the module starts, rather than reloading on every request. The runtime device is selected automatically, using GPU when CUDA is available and CPU otherwise. This keeps repeated requests more stable, although it also means the service keeps a memory footprint while running.
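The load-once pattern and automatic device selection can be sketched as below. The function names `pick_device` and `load_stt` are illustrative, not the module's actual API; the model ID is the one named above.

```python
import torch

MODEL_ID = "khanhld/wav2vec2-base-vietnamese-160h"

def pick_device() -> torch.device:
    """Select GPU when CUDA is available, otherwise fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_stt(model_id: str = MODEL_ID):
    """Load processor and model once at startup, not per request."""
    # Imported lazily so the device helper works without transformers installed.
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
    device = pick_device()
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id).to(device).eval()
    return processor, model, device
```

Calling `load_stt()` at module import time is what keeps repeated requests stable, at the cost of the persistent memory footprint mentioned above.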
For audio processing, the file is loaded with librosa, converted to mono, and resampled to 16 kHz, which matches the model input. A normalized intermediate WAV file can also be written to D:\hustmedia\python\tts\wav2vec2\run.wav for inspection or reuse. Before inference, the waveform is converted to float32, checked to avoid empty input, and normalized by peak amplitude.
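The preprocessing steps can be sketched as follows (function names are illustrative; writing the intermediate WAV to disk is omitted here):

```python
import numpy as np

def normalize_waveform(speech: np.ndarray) -> np.ndarray:
    """Cast to float32, reject empty input, and normalize by peak amplitude."""
    speech = np.asarray(speech, dtype=np.float32)
    if speech.size == 0:
        raise ValueError("empty audio input")
    peak = float(np.max(np.abs(speech)))
    return speech / peak if peak > 0 else speech

def prepare_audio(path: str, target_sr: int = 16000) -> np.ndarray:
    # Imported lazily; only needed when actually reading a file.
    import librosa
    # librosa.load resamples to target_sr and converts to mono.
    speech, _ = librosa.load(path, sr=target_sr, mono=True)
    return normalize_waveform(speech)
```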
Once prepared, the audio is tokenized with sampling_rate=16000 and passed through the model under torch.no_grad(). The output logits are decoded with greedy CTC argmax, then converted into text with batch_decode. In its current form, this module does not use beam search, VAD chunking, or language-model rescoring, so long files are still processed in one pass and may increase latency or memory usage.
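The inference step could look roughly like this, assuming a `Wav2Vec2Processor`/`Wav2Vec2ForCTC` pair loaded as above (the `transcribe` glue function is a sketch, not the module's exact code):

```python
import torch

def greedy_ctc_ids(logits: torch.Tensor) -> torch.Tensor:
    """Greedy CTC step: pick the highest-scoring token for each frame."""
    return torch.argmax(logits, dim=-1)

def transcribe(processor, model, speech, device) -> str:
    # Tokenize at the 16 kHz sampling rate the model expects.
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for inference
        logits = model(inputs.input_values.to(device)).logits
    ids = greedy_ctc_ids(logits)
    # batch_decode collapses repeats and strips CTC blank tokens.
    return processor.batch_decode(ids)[0]
```

Because the whole waveform goes through the model in one call, a long recording translates directly into a long input tensor, which is where the latency and memory cost of skipping chunking shows up.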
This module is mainly intended for practical Vietnamese transcription tasks such as voice notes, support logs, internal updates, and simple content preparation. It is not designed as a full enterprise ASR platform, but as a working in-house component that I built and maintain for my own workflow.
- Server runtime: Flask on port 8789
- Main endpoint: GET/POST /wav2vec2
- STT model: khanhld/wav2vec2-base-vietnamese-160h
- Device selection: automatic GPU / CPU
- Input normalization: mono audio, 16 kHz
- Intermediate WAV path: D:\hustmedia\python\tts\wav2vec2\run.wav
- Decode method: greedy CTC argmax
- Inference mode: torch.no_grad()
- Current limitation: no chunking, no VAD, no beam search
Audio Transcription
Upload audio to convert it into text
Supported formats: MP3, WAV, M4A, OGG