Benchmarking AI Audio Transcription: How Leading Models Handle English and Hungarian Speech

by Miklós Sebők - Rebeka Kiss - Dávid László

Jan 14, 2026

As audio becomes an increasingly common input for AI assistants, a natural question is whether these platforms can also transcribe pre-recorded audio accurately. In this benchmark, we tested five leading generative AI models on identical English and Hungarian audio files to assess their transcription accuracy, language support, and practical usability. Gemini was the clear winner: it was the only model that produced accurate, usable transcripts directly in its chat interface.

Input Audio Files

We prepared two short test recordings to evaluate transcription performance across languages:

English audio (test_eng.m4a, 173 KB) - A 20-second voice memo recorded on iPhone containing: "Hi, this is a quick transcription test. Today is January thirteenth, twenty twenty-six, and I'm recording from my home office. The file name is 'test_take_03,' and the time is 6:18 p.m. If you can hear this clearly, please write it exactly as spoken—commas, numbers, and all."

Hungarian audio (test_hun.m4a, 178 KB) - A 20-second voice memo with the Hungarian equivalent: "Ez egy gyors lejegyzési teszt. Ma 2026. január tizenharmadika van, és az otthoni irodámból készítem a felvételt. A fájlnév „test_take_03”, az idő pedig 18:18. Ha ezt tisztán hallható, kérlek, írd le pontosan úgy, ahogy elhangzott – vesszőkkel, számokkal és mindennel együtt."

Both recordings were designed to test the models' ability to handle:

Date and time formatting
Numbers
Punctuation accuracy
File name references
Natural speech patterns
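
One way to make these criteria checkable is a small script that looks for the expected date, year, time, and file name in a transcript. The sketch below is illustrative only: the `CHECKS` patterns and the `check_transcript` helper are assumptions written for the English script and were not part of the benchmark itself.

```python
import re

# Hypothetical patterns for the English test script (assumptions, not the
# authors' scoring method). Each one accepts a few common written forms.
CHECKS = {
    "date": re.compile(r"january\s+(?:13(?:th)?|thirteenth)", re.IGNORECASE),
    "year": re.compile(r"2026|twenty twenty-six", re.IGNORECASE),
    "time": re.compile(r"6:18\s*p\.?m\.?", re.IGNORECASE),
    "file_name": re.compile(r"test_take_03"),
}


def check_transcript(transcript: str) -> dict[str, bool]:
    """Report, per criterion, whether the expected item appears in the transcript."""
    return {name: bool(pattern.search(transcript)) for name, pattern in CHECKS.items()}


if __name__ == "__main__":
    sample = (
        "Today is January thirteenth, twenty twenty-six. "
        "The file name is test_take_03, and the time is 6:18 p.m."
    )
    for name, found in check_transcript(sample).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Checks of this kind only confirm that specific items survived transcription; they say nothing about the rest of the text, so they complement rather than replace a full comparison against the script.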

Models Tested

We evaluated five leading AI platforms and models:

OpenAI - GPT-5.2
Anthropic - Claude Opus 4.5
Google - Gemini 3 Fast
xAI - Grok 4.1
DeepSeek - V3.2

The same audio files were uploaded to each platform with a straightforward transcription request.

Prompt

For all models that accepted audio input, we used a simple, direct prompt:

Please transcribe this audio file exactly as spoken, including commas, numbers, and all punctuation.

This minimal approach allowed us to assess each model's baseline transcription capabilities without additional guidance or constraints.

Model Outputs

Grok and DeepSeek

Both Grok and DeepSeek failed at the first hurdle: neither platform currently supports audio file uploads. This creates a clear limitation for users seeking audio transcription capabilities within these interfaces.

GPT-5.2 (English)

GPT-5.2 processed the English audio file but produced a severely degraded transcript that bore only slight resemblance to the original:

"this is a creek transcription it is today's jury thursday the twenty twenty six and the recording from my home will fix that for any least that's a cool three hundred and sixty two pm [NOISE] if you can hit is clearly these guys get actually have spoken comments numbers and thought"

The output was riddled with errors:

"creek transcription" instead of "quick transcription"
"jury thursday" instead of "January thirteenth"
"three hundred and sixty two pm" instead of "6:18 p.m."
Multiple nonsensical phrases throughout

The model additionally spent 17 minutes and 41 seconds processing before delivering this result.
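
The errors above can also be summarized as a single number. Below is a minimal sketch of word error rate (WER): the word-level edit distance divided by the number of reference words. WER was not reported in this benchmark. The `normalize` step (lowercasing and replacing punctuation with spaces) is an assumption, and other normalizations would give different values.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    # \w is Unicode-aware, so accented Hungarian letters are kept as well.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return cleaned.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "Hi, this is a quick transcription test."
    hypothesis = "this is a creek transcription"
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.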

GPT-5.2 (Hungarian)

As in our earlier test of GPT-4.5's audio transcription capabilities, GPT-5.2 did not attempt to transcribe the Hungarian file directly. It declined, citing runtime limitations, and instead provided detailed instructions for using the OpenAI API or running the Whisper CLI locally:

whisper test_hun.m4a --language hu --task transcribe --model medium --word_timestamps True --output_format txt

When we followed these instructions and ran Whisper locally, the output was significantly better than the English result:

[00:00.000 --> 00:08.040] Ez egy gyors lejegyzési teszt. Ma 2026. január 13-a van, és az otthani iradámból készítem a felvételt.
[00:08.640 --> 00:20.320] A felnév teszték 03, az idő pedig 18.18. Ha ezt tisztán hallható, kérlek ír le pontosan úgy, ahogy elhangzott. Vesszökkel, számokkal és mindennel együtt.
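
Each segment in this output carries a `[start --> end]` timestamp, which makes it easy to post-process. The sketch below parses such lines into structured segments. It assumes one segment per line and the timestamp layout seen here (with an optional hours field); `Segment`, `to_seconds`, and `parse_segments` are illustrative names, not part of Whisper.

```python
import re
from dataclasses import dataclass

# Matches "[MM:SS.mmm --> MM:SS.mmm] text", with an optional leading "HH:" field.
SEGMENT_RE = re.compile(
    r"\[(?P<start>(?:\d+:)?\d{2}:\d{2}\.\d{3}) --> "
    r"(?P<end>(?:\d+:)?\d{2}:\d{2}\.\d{3})\]\s*(?P<text>.*)"
)


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


def to_seconds(stamp: str) -> float:
    """Convert "MM:SS.mmm" or "HH:MM:SS.mmm" to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_segments(output: str) -> list[Segment]:
    """Parse timestamped lines into Segment objects, skipping other lines."""
    segments = []
    for line in output.splitlines():
        match = SEGMENT_RE.search(line)
        if match:
            segments.append(Segment(
                to_seconds(match["start"]),
                to_seconds(match["end"]),
                match["text"].strip(),
            ))
    return segments


if __name__ == "__main__":
    raw = (
        "[00:00.000 --> 00:08.040] Ez egy gyors lejegyzési teszt.\n"
        "[00:08.640 --> 00:20.320] A felnév teszték 03, az idő pedig 18.18."
    )
    for segment in parse_segments(raw):
        print(f"{segment.start:6.2f}s to {segment.end:6.2f}s: {segment.text}")
```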

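While the Whisper transcript was closer to the original, several errors remained:

"otthani iradámból" instead of "otthoni irodámból"
"felnév teszték 03" instead of "fájlnév test_take_03"
"kérlek ír le" instead of "kérlek, írd le"
"Vesszökkel" instead of "vesszőkkel"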

This workaround required technical knowledge and command-line access, which is unlikely to be practical for most users seeking quick transcriptions.
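
Word-level errors like the ones listed above can be located automatically with Python's standard `difflib` module. The following is a minimal sketch, not the procedure used in this benchmark. The `word_differences` helper is a hypothetical name, and it compares whitespace-separated words without any normalization, so punctuation differences also show up.

```python
import difflib


def word_differences(reference: str, hypothesis: str) -> list[tuple[str, str]]:
    """Return (expected, got) pairs for every span where the word sequences differ."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side marks a pure insertion or deletion.
            differences.append((" ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2])))
    return differences


if __name__ == "__main__":
    reference = "és az otthoni irodámból készítem a felvételt."
    hypothesis = "és az otthani iradámból készítem a felvételt."
    for expected, got in word_differences(reference, hypothesis):
        print(f"expected {expected!r}, got {got!r}")
```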

Claude Opus 4.5

Claude Opus 4.5 declined to transcribe either audio file, stating that network access was disabled in its environment. The model explained:

"I apologize, but I'm unable to transcribe this audio file. The network access in this environment is disabled, which prevents me from installing speech recognition libraries (like Whisper or Vosk) needed to perform automatic speech recognition on the audio."

Instead, Claude offered suggestions for alternative transcription services (Otter.ai, Rev, Google Speech-to-Text) and pointed users toward built-in iOS transcription features. While helpful as a guide, this response provided no actual transcription functionality.

Gemini 3

Gemini 3 delivered the strongest performance across both languages, producing near-perfect transcripts directly in the chat interface with no additional tools required.

English transcription: "This is a quick transcription test. Today is January 13th, 2026 and I'm recording from my home office. The file name is test take 03 and the time is 6:18 PM. If you can hear this clearly, please write it exactly as spoken, commas, numbers, and all."

The output was accurate and well-formatted, apart from the dropped opening "Hi" and a file name rendered without underscores.

Hungarian transcription: "Ez egy gyors lejegyzési teszt. Ma 2026. január 13-a van, és az otthoni irodámból készítem a felvételt. A fájlnév teszt_03, az idő pedig 18:18. Ha ez tisztán hallható, kérlek írd le pontosan úgy, ahogy elhangzott, vesszőkkel, számokkal és mindennel együtt."

The Hungarian output was equally impressive:

Correct date format
Proper punctuation and diacritics
Accurate time representation
Near-verbatim transcription of natural speech

Gemini's Hungarian transcription outperformed even the locally-run Whisper CLI output, while requiring no technical setup whatsoever.

Performance Comparison

Model	English Audio	Hungarian Audio
Gemini 3	✅ Near-perfect	✅ Near-perfect
GPT-5.2	❌ Severely inaccurate	⚠️ CLI workaround only
Claude Opus 4.5	❌ No transcription	❌ No transcription
Grok	❌ No audio support	❌ No audio support
DeepSeek	❌ No audio support	❌ No audio support

Recommendations

This benchmark reveals a striking divide in audio transcription capabilities across leading AI platforms. Gemini 3 emerged as the only model capable of delivering accurate, usable transcriptions directly through its chat interface—handling both English and Hungarian with minimal errors and no technical setup required. For researchers and users seeking reliable transcription, particularly for multilingual projects, Gemini's combination of accuracy and ease of use makes it the clear choice.

The remaining models fell short in various ways: two lacked audio support entirely, one produced a severely inaccurate English transcript and required a CLI workaround for Hungarian that still yielded inferior results, and another could not process audio because network access was disabled in its environment. These limitations underscore that while AI transcription technology has advanced significantly, implementation across platforms remains inconsistent.

Gemini's pattern of strong multilingual support mirrors what we observed in our comparison of GenAI models for OCR on scanned PDFs, where the 2.0 version of this model also demonstrated robust performance alongside specialized tools like Mistral's OCR API.

---

The authors used GPT-5.2 and Gemini 3 to generate and test transcription outputs. Testing was conducted on January 13, 2026.