
About this AI module

This Speech-to-Text module is part of the small set of practical tools I maintain for simple media and content work. It turns spoken audio into editable text for faster transcription, note-taking, and content preparation.

I use it where audio is recorded on mobile and later converted into structured text for support, documentation, moderation, or publishing tasks. It helps reduce manual typing and makes spoken updates easier to handle.

It also turns voice recordings into searchable text that can be reviewed, edited, and reused later.

Who should use this module?

  • People who need to turn short voice recordings into editable text.
  • Teams handling notes, updates, or simple reporting from audio.
  • Mobile users who prefer speaking first and editing later.

How My Vietnamese Speech-to-Text Module Works on a Flask AI Server

This article explains how my Speech-to-Text module processes Vietnamese audio on a Flask AI server, from file input and waveform normalization to transcription output. It also outlines the current model choice, API behavior, and practical runtime limits.

My Speech-to-Text module runs on the same Python Flask AI server used for the other media utilities in this workflow. The main API entry point is GET/POST /wav2vec2, served through Flask on port 8789. In the current version, the endpoint reads an audio path from form data or query parameters, and if no path is provided, it falls back to a default local file. The response returns JSON with status, transcript, and the processed file path.
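A minimal sketch of what such an endpoint could look like. The form/query field name `path`, the default file name, and the `transcribe()` helper are illustrative assumptions, not the module's actual identifiers:

```python
# Sketch of the /wav2vec2 endpoint described above. The "path" field name
# and transcribe() helper are assumptions for illustration only.
from flask import Flask, request, jsonify

app = Flask(__name__)
DEFAULT_AUDIO = "run.wav"  # stand-in for the module's default local file

def transcribe(path: str) -> str:
    # Stand-in for the real wav2vec2 inference pipeline.
    return ""

@app.route("/wav2vec2", methods=["GET", "POST"])
def wav2vec2():
    # Read the audio path from form data or query parameters,
    # falling back to a default local file when none is given.
    path = request.form.get("path") or request.args.get("path") or DEFAULT_AUDIO
    transcript = transcribe(path)
    # JSON response with status, transcript, and the processed file path.
    return jsonify({"status": "ok", "transcript": transcript, "file": path})

# app.run(port=8789)  # serve on the port described in the article
```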

The transcription engine is based on the Hugging Face model khanhld/wav2vec2-base-vietnamese-160h. Both the processor and the model are loaded once when the module starts, rather than on every request, and the runtime device is selected automatically: GPU when CUDA is available, CPU otherwise. Loading once keeps per-request latency stable, although it also means the service holds the model in memory for as long as it runs.
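The load-once behavior can be sketched with a cached factory. A counter stands in for the heavy model load so the example stays runnable without downloading anything; the real calls appear only in comments:

```python
# Sketch of the load-once pattern: the heavy processor/model load runs a
# single time, and every later request reuses the cached pair.
from functools import lru_cache

load_calls = 0  # counts how many times the "heavy load" actually runs

@lru_cache(maxsize=1)
def get_engine():
    global load_calls
    load_calls += 1
    # In the real module, roughly:
    #   processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
    #   model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
    #   device = "cuda" if torch.cuda.is_available() else "cpu"
    return ("processor", "model")  # stand-in for the cached pair

# Two simulated requests trigger only one load.
get_engine()
get_engine()
```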

For audio processing, the file is loaded with librosa, converted to mono, and resampled to 16 kHz to match the model's expected input. A normalized intermediate WAV file can also be written to D:\hustmedia\python\tts\wav2vec2\run.wav for inspection or reuse. Before inference, the waveform is cast to float32, checked to reject empty input, and normalized by peak amplitude.
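The preparation steps can be sketched with numpy alone. In the real module the loading and resampling are handled by librosa (e.g. librosa.load(path, sr=16000, mono=True)); here a synthetic stereo buffer stands in for the decoded audio:

```python
# Sketch of the waveform preparation steps: mono down-mix, float32 cast,
# empty-input check, and peak normalization.
import numpy as np

def prepare(wave: np.ndarray) -> np.ndarray:
    # Down-mix (channels, samples) to mono by averaging channels.
    if wave.ndim > 1:
        wave = wave.mean(axis=0)
    # Match the model's expected dtype.
    wave = wave.astype(np.float32)
    # Guard against empty input before inference.
    if wave.size == 0:
        raise ValueError("empty audio input")
    # Peak-normalize so the loudest sample has magnitude 1.0.
    peak = np.abs(wave).max()
    if peak > 0:
        wave = wave / peak
    return wave

# Synthetic two-channel buffer standing in for decoded audio.
stereo = np.array([[0.0, 0.2, -0.4], [0.0, 0.2, -0.4]])
mono = prepare(stereo)
```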

Once prepared, the audio is tokenized with sampling_rate=16000 and passed through the model under torch.no_grad(). The output logits are decoded with greedy CTC argmax, then converted into text with batch_decode. In its current form, this module does not use beam search, VAD chunking, or language-model rescoring, so long files are still processed in one pass and may increase latency or memory usage.

This module is mainly intended for practical Vietnamese transcription tasks such as voice notes, support logs, internal updates, and simple content preparation. It is not designed as a full enterprise ASR platform, but as a working in-house component that I built and maintain for my own workflow.

Technical configuration snapshot
  • Server runtime: Flask on port 8789
  • Main endpoint: GET/POST /wav2vec2
  • STT model: khanhld/wav2vec2-base-vietnamese-160h
  • Device selection: automatic GPU / CPU
  • Input normalization: mono audio, 16 kHz
  • Intermediate WAV path: D:\hustmedia\python\tts\wav2vec2\run.wav
  • Decode method: greedy CTC argmax
  • Inference mode: torch.no_grad()
  • Current limitation: no chunking, no VAD, no beam search

Audio Transcription

Upload audio to convert it into text

Supported formats: MP3, WAV, M4A, OGG


Practical Input/Output Examples

  • Input: Short voice note. Output: Draft text for reporting.
  • Input: Customer call recording. Output: Searchable transcript for support history.