AI Endpoints - Transcription Audio
AI Endpoints is covered by the OVHcloud AI Endpoints Conditions and the OVHcloud Public Cloud Special Conditions.
Introduction
AI Endpoints is a serverless platform provided by OVHcloud that offers easy access to a selection of world-renowned, pre-trained AI models. The platform is designed to be simple, secure, and intuitive, making it an ideal solution for developers who want to enhance their applications with AI capabilities without extensive AI expertise or concerns about data privacy.
Speech to Text is a powerful feature that enables the conversion of spoken language into written text.
The Speech to Text APIs on AI Endpoints allow you to easily integrate this technology into your applications, enabling you to transcribe audio files with high accuracy. Our endpoints support various audio formats and provide flexible configuration options to suit your specific use cases.
Objective
This documentation provides an overview of the Speech to Text endpoints offered on AI Endpoints.
Visit our Catalog to find out which models are compatible with Audio Analysis.
The examples provided throughout this guide can be used with one of the following environments:
These examples will be using the Whisper-large-v3 model.
Authentication & Rate Limiting
All the examples provided in this guide use anonymous authentication, which makes it simpler to use but may cause rate limiting issues. If you wish to enable authentication using your own token, simply specify your API key within the requests.
Follow the instructions in the AI Endpoints - Getting Started guide for more information on authentication.
Request Body
Parameters Overview
The request body for the audio transcription endpoint is of type multipart/form-data and includes the following fields:
| Parameter | Required | Type | Allowed Values / Format | Default | Description |
|---|---|---|---|---|---|
| file | Yes | binary | mp3, mp4, aac, m4a, wav, flac, ogg, opus, webm, mpeg, mpga | - | The audio file object (not file name) to transcribe. |
| chunking_strategy | No | string/server_vad object/null | - | null | Strategy for dividing the audio into chunks. More details here. |
| diarize | No | boolean/null | true/false | false | Enables speaker separation in the transcript. When set to true, the system separates the audio into segments based on speakers, by adding labels like "Speaker 1" and "Speaker 2", so you can see who said what in conversations such as interviews, meetings, or phone calls. More details here. |
| language | No | string/null | ISO-639-1 format | - | The language parameter specifies the language spoken in the input audio. Providing it can improve transcription accuracy and reduce latency (e.g. en for English, fr for French, de for German, es for Spanish, zh for Chinese, ar for Arabic ...). If not provided, the system will attempt automatic language detection, which may be slightly slower and less accurate in some cases. More details on language compatibility and performance. |
| model | No | string/null | ID of the model to use | - | Specifies the model to use for transcription. Useful when using our unified endpoint. |
| prompt | No | string/null | - | - | Text to guide the model's style, translate the transcript into English, or continue a previous audio segment. The prompt must be written in the same language as the audio. More details about prompt usage here. |
| response_format | No | enum/null | json, text, srt, verbose_json, vtt | verbose_json | Determines how the transcription data is returned. For detailed examples of each output type, visit the Response Formats section. |
| stream | No | boolean/null | true/false | false | If set to true, the model response data will be streamed to the client. Currently not supported for Whisper models. |
| temperature | No | number/null | From 0.0 to 1.0 | 0 | Controls randomness in the output. Higher values make the output more random, while lower values make it more focused and deterministic. |
| timestamp_granularities | No | array/null | ["segment"], ["word"], ["word", "segment"] | ["segment"] | Controls the level of detail in the timestamps provided in the transcription. More details here. |
Example Usage
Now that you know which parameters are available, let’s look at how to put them into practice. Below are sample requests in Python, cURL and JavaScript:
Warning: The diarize parameter is not supported when using the OpenAI client library.
To use diarization, you must make a direct HTTP request using requests or cURL with diarize set to true.
To authenticate with your API key, add an Authorization header:
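To make this concrete, here is a minimal Python sketch using the requests library. The base URL and model name are assumptions for illustration; substitute the values shown for your endpoint in the AI Endpoints catalog.

```python
# Assumed endpoint URL: replace with the value from the AI Endpoints
# catalog for the model you are using.
BASE_URL = "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1"

def transcription_request(api_key=None, language=None, model="whisper-large-v3"):
    """Build the URL, headers, and form fields for a transcription call.
    Without an api_key the request is anonymous and subject to rate limits."""
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    data = {"model": model, "response_format": "verbose_json"}
    if language:
        data["language"] = language  # optional ISO-639-1 hint, e.g. "en"
    return f"{BASE_URL}/audio/transcriptions", headers, data

if __name__ == "__main__":
    import requests  # third-party: pip install requests
    url, headers, data = transcription_request(api_key="your-api-key", language="en")
    with open("meeting.mp3", "rb") as f:  # the audio file object, not its name
        resp = requests.post(url, headers=headers, data=data, files={"file": f})
    resp.raise_for_status()
    print(resp.json()["text"])
```

Omitting the Authorization header falls back to anonymous, rate-limited access.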
Output example
By default, the transcription endpoint returns output in verbose_json format.
This includes detailed metadata such as language, segments, tokens, and diarization information:
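As a sketch of how you might consume that metadata, the snippet below walks the segments of a response. The field names (text, language, duration, segments with start/end timestamps) follow the common OpenAI-compatible verbose_json schema and should be checked against an actual response from your endpoint; the sample dictionary is hand-written, not real endpoint output.

```python
def summarize(transcription: dict) -> str:
    """Build one line per segment from a verbose_json-style response."""
    lines = []
    for seg in transcription.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)

# Hand-written illustrative response (not actual endpoint output):
sample = {
    "language": "en",
    "duration": 3.2,
    "text": "Hello world. Goodbye.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello world."},
        {"start": 1.5, "end": 3.2, "text": " Goodbye."},
    ],
}
print(summarize(sample))
```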
For detailed examples of each available output type, see the Response Formats section.
Parameters Details
While the previous overview gives a quick reference, certain parameters require more context to understand how and when to use them.
Diarization
The diarize parameter enables speaker separation in the generated transcript. When set to true, the system labels different voices as Speaker 0, Speaker 1, etc.
This is useful for meetings, debates, or interviews where multiple people are speaking.
- This parameter is only available with the default verbose_json response format. Using any other format will raise an error.
- diarize is not supported when using the OpenAI client libraries. You must use a direct HTTP request with requests, cURL, or another HTTP client.
Output Example: Transcribing an audio file with diarize enabled:
Request:
Output:
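A direct HTTP request with diarize enabled could be sketched as follows. The endpoint URL is an assumption (check the catalog for your deployment), and note that multipart form values are sent as strings, hence "true":

```python
def diarization_fields(model="whisper-large-v3"):
    """Form fields for a diarized transcription request.
    Multipart form values are strings, so the boolean is sent as "true"."""
    return {
        "model": model,
        "diarize": "true",
        "response_format": "verbose_json",  # diarize requires the default format
    }

if __name__ == "__main__":
    import requests  # third-party: pip install requests
    with open("interview.wav", "rb") as f:
        resp = requests.post(
            # Assumed URL: replace with your endpoint from the catalog
            "https://oai.endpoints.kepler.ai.cloud.ovh.net/v1/audio/transcriptions",
            headers={"Authorization": "Bearer your-api-key"},
            data=diarization_fields(),
            files={"file": f},
        )
    resp.raise_for_status()
    # Segments in the response should carry speaker labels
    for seg in resp.json().get("segments", []):
        print(seg.get("speaker"), seg["text"])
```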
Prompt
The prompt parameter lets you provide extra context to improve transcription. Think of it as giving the model a hint before it starts listening to your audio. This can help when:
- Correcting words or acronyms that are often misrecognized.
- Preserving context if the audio is split into several parts.
- Enforcing punctuation, filler words, or writing style.
- Translating the transcription into English.
The prompt must be written in the same language as the audio. For example, if your audio is in English, your prompt must also be in English.
Examples
If the audio mentions complicated words such as products, companies, technical terms, or people that the model often misspells, you can list them in your prompt:
To directly translate the transcription into English instead of keeping it in the source language, you can pass the special translation token <|translate|> in your prompt:
When processing split audio files, provide the transcript from the previous segment to maintain context and improve accuracy:
If the model skips punctuation in the transcript, use a properly punctuated prompt to encourage correct formatting:
If the model omits filler words when transcribing audio, like "ums" or "like", include them in your prompt:
For languages with multiple writing systems (like simplified vs. traditional Chinese), or to maintain consistent style:
Simplified Chinese:
Traditional Chinese:
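For illustration, the prompt variants described above could be passed as form fields like this. The prompt strings themselves are hand-written examples, not prescribed values:

```python
# Hand-written illustrations of the prompt variants described above.
prompts = {
    # Spelling hints for tricky names and acronyms
    "spelling": "The speakers discuss OVHcloud, Kubernetes, and GPU quotas.",
    # Special token to translate the transcription into English
    "translate": "<|translate|>",
    # Transcript of the previous chunk, to keep context across splits
    "continuation": "...and that concludes the first quarter results.",
    # Properly punctuated text to encourage punctuation in the output
    "punctuation": "Hello, welcome to the meeting. Let's begin, shall we?",
    # Include filler words so the model keeps them
    "fillers": "Umm, so, like, let me think about that.",
}

def with_prompt(form: dict, kind: str) -> dict:
    """Return a copy of the form fields with the chosen prompt attached."""
    return {**form, "prompt": prompts[kind]}

print(with_prompt({"model": "whisper-large-v3"}, "translate"))
```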
Timestamp Granularities
The timestamp_granularities parameter controls the level of time markers included in the transcript. Three options are available:
- ["segment"] (default): timestamps for each segment, providing timing for larger sections of the audio.
- ["word"]: timestamps for each word, providing precise timing for every spoken word.
- ["word", "segment"]: both levels combined.
Generating ["word"] timestamps incurs additional latency.
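As a sketch, array parameters in a multipart form are commonly encoded as repeated fields; the timestamp_granularities[] field name below follows that convention and is an assumption to verify against your endpoint:

```python
def granularity_fields(granularities):
    """Encode the timestamp_granularities array as repeated form fields.
    Returns (name, value) pairs suitable for multipart encoding."""
    allowed = {"segment", "word"}
    unknown = set(granularities) - allowed
    if unknown:
        raise ValueError(f"unknown granularities: {unknown}")
    return [("timestamp_granularities[]", g) for g in granularities]

print(granularity_fields(["word", "segment"]))
```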
Response Formats
The response_format parameter determines how the transcription data is returned. Available formats include:
- verbose_json (default): returns the full transcription with metadata such as segments, tokens, language, duration, and diarization.
- json: returns only the basic transcription data, such as the transcribed text and usage information.
- text: returns only the transcribed text as a plain string.
- srt: not yet supported.
- vtt: not yet supported.
Chunking Strategy
The chunking_strategy parameter controls how the audio file is divided into smaller segments before transcription.
By default, when unset, the audio is processed as a single block.
When set to auto, the system first normalizes audio loudness and then uses voice activity detection (VAD) to automatically split the audio at natural pauses (silence).
You can also provide a server_vad object to manually tweak VAD detection parameters. This lets you control the following parameters:
- prefix_padding_ms: amount of audio to include before the VAD-detected speech (in milliseconds).
- silence_duration_ms: duration of silence used to detect the end of speech (in milliseconds). With shorter values, the model will respond more quickly, but may jump in on short pauses from the user.
- threshold: sensitivity threshold (0.0 to 1.0) for voice activity detection. A higher threshold requires louder audio to activate the model, and thus might perform better in noisy environments.
Example:
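A server_vad object could be built and serialized like this. Encoding the object as a JSON string in the form field is an assumption to check against your endpoint; the default values shown are arbitrary illustrations:

```python
import json

def chunking_strategy(prefix_padding_ms=300, silence_duration_ms=200, threshold=0.5):
    """Serialize a server_vad chunking strategy for the multipart form.
    Default values here are illustrative, not documented defaults."""
    return json.dumps({
        "type": "server_vad",
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
        "threshold": threshold,
    })

# Example form fields combining the strategy with the rest of the request:
form = {
    "model": "whisper-large-v3",
    "chunking_strategy": chunking_strategy(threshold=0.7),
}
print(form["chunking_strategy"])
```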
Endpoint Limitations
Language Compatibility and Performances
Whisper models are compatible with a wide range of languages, supporting approximately 100 in total.
However, transcription quality and speed depend on the language of the input audio. While Whisper v3 models are multilingual, their accuracy varies significantly by language:
- Common languages such as English, French, Spanish, and German generally produce the best results.
- Less common or low-resource languages may yield lower accuracy or longer processing times.
- Regional accents, dialects, or code-switching (switching between multiple languages in the same recording) can reduce accuracy further.
Providing the language parameter explicitly (instead of relying on automatic detection) generally improves both accuracy and latency. Expected format is ISO-639-1 format (e.g. en for English, fr for French, de for German, es for Spanish, zh for Chinese, ar for Arabic ...).
For a detailed performance breakdown by language, see Whisper’s benchmark results. This includes word error rates (WER) and character error rates (CER) across different datasets.
Prompt Length
For Whisper-based models, the prompt parameter only considers the last 224 tokens (approximately the final 200 characters). If your prompt is longer, tokens preceding the last 224 will be ignored.
Parameters Support
- Streaming is not yet supported for audio transcription endpoints. All audio must be uploaded and processed in a single request.
- The srt and vtt response formats are not yet supported. Available response formats can be found here.
Supported Audio Formats, Durations and Sizes
Audio Formats
The API supports multiple audio formats as mentioned before. Ensure your file is in a supported format to be successfully transcribed.
File Size and Duration Limits:
- Authenticated requests (using an API key): up to 2048 MB or 10 800 seconds of audio per request.
- Anonymous requests: up to 10 MB or 60 seconds of audio per request.
Transcribing larger audio files
If your audio file exceeds these limits, you can split it into smaller chunks before sending it to the transcription endpoint.
Try to avoid splitting mid-sentence, as this can cause context to be lost and reduce transcription accuracy. Using compressed audio formats can also help reduce file size.
Example
Splitting Audio with open-source Python pydub library:
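A sketch of that approach is shown below, assuming pydub is installed (pip install pydub; it also requires ffmpeg for most formats). The 55-second chunk length is an arbitrary choice that keeps anonymous requests under the 60-second limit:

```python
def chunk_ranges(duration_ms: int, chunk_ms: int = 55_000):
    """Compute (start, end) millisecond ranges covering the whole file.
    55 s per chunk keeps anonymous requests under the 60 s limit."""
    return [(start, min(start + chunk_ms, duration_ms))
            for start in range(0, duration_ms, chunk_ms)]

def split_audio(path: str, chunk_ms: int = 55_000):
    """Split an audio file into chunk files with pydub."""
    from pydub import AudioSegment  # third-party: pip install pydub (needs ffmpeg)
    audio = AudioSegment.from_file(path)
    names = []
    for i, (start, end) in enumerate(chunk_ranges(len(audio), chunk_ms)):
        name = f"chunk_{i}.mp3"
        audio[start:end].export(name, format="mp3")  # compressed to reduce size
        names.append(name)
    return names

if __name__ == "__main__":
    print(split_audio("long_recording.mp3"))
```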
Repeat this process to create multiple chunks, then transcribe each chunk individually.
OVHcloud makes no guarantees about the usability or security of third-party software such as pydub.
Conclusion
In this guide, we have explained how to use the Speech to Text models available on AI Endpoints. This overview of the feature should help you refine the integration of these models into your own applications.
Go Further
Browse the full AI Endpoints documentation to further understand the main concepts and get started.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.
Feedback
Please send us your questions, feedback, and suggestions to improve the service:
- On the OVHcloud Discord server.