AI Partners Ecosystem - Voxist - Models concept (EN)
Objective
OVHcloud offers different Artificial Intelligence services through its AI Partners Ecosystem. You will benefit from a catalogue of ready-to-use applications provided by our partners, which you can easily deploy according to your needs through AI Deploy.
Voxist is an OVHcloud partner that offers AI services dedicated to Speech and Natural Language Processing. This guide will provide a detailed understanding of how Voxist services work.
To find out more about Voxist billing, launching and capabilities, please refer to this guide.
Introduction
Voxist is a French start-up specializing in Speech Recognition. The platform enables all organizations, from start-ups to large corporations, to perform automatic speech recognition.
Speech recognition is supported through two APIs:
- REST API for asynchronous speech recognition, with support for diarization
- WebSocket API for synchronous speech recognition
Supported languages: French, English, German, Spanish, Portuguese, Italian, Polish
Coming soon: Dutch and Hebrew
Voxist REST API
Speech to text API
Transcription
This endpoint transcribes an audio recording into a JSON object. Configuration parameters can be passed to specify the language and to activate punctuation recognition and diarization.
- URL: `/transcribe`
- Method: `POST`
- Payload: upload configuration parameters and audio files using the `multipart/form-data` Content-Type.
Configuration parameters (`config` form-data parameter)
- `punctuation` [true, false]: enables punctuation in output (default: false)
- `lang` ["fr", "en", "es", "pt", "it", "de", "pl"]: defines the input audio language (default: "en")
- `sample_rate` [8000, 16000]: input audio sample rate in Hz (default: 8000, i.e. 8 kHz)
- `wait` [true, false]: default is true. The request waits for the transcription to be completed before returning. If set to false, the request returns immediately with a job id and an estimate of the completion time.
Configuration example:
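For instance, a `config` value combining the parameters above could look like this:

```json
{
  "punctuation": true,
  "lang": "fr",
  "sample_rate": 16000,
  "wait": true
}
```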
Success Response
- Code: 200 OK
- Response content:
  - `Lexical`: segment transcription
  - `Confidence`: transcription confidence
  - `Start_time`: segment offset in seconds
  - `Duration`: segment duration in seconds
  - `Speaker`: if diarization is active, identifies the speaker with a unique identifier (`SPEAKER_00`, `SPEAKER_01`, ...)
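For illustration, a response for a short two-speaker recording might look like the following (all values here are invented for the example):

```json
[
  {
    "Lexical": "hello and welcome to the show",
    "Confidence": 0.95,
    "Start_time": 0.0,
    "Duration": 2.1,
    "Speaker": "SPEAKER_00"
  },
  {
    "Lexical": "thanks for having me",
    "Confidence": 0.92,
    "Start_time": 2.3,
    "Duration": 1.6,
    "Speaker": "SPEAKER_01"
  }
]
```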
Preparing audio files for transcription
To be successfully transcribed, audio files should be mono (single channel) WAV files. You can use FFmpeg to convert your input audio to the appropriate format with the following steps:
- Installation: First, ensure you have FFmpeg installed on your machine. If not, download and install it from the official website, or use your package manager if you're on Linux.
- Command: Once FFmpeg is installed, you can use the following command to convert an audio file:

```bash
ffmpeg -i input_audio.ext -ac 1 -ar 16000 output_audio.wav
```

Let's break down this command:

- `-i input_audio.ext`: specifies the input audio file named `input_audio.ext`, where `ext` is the file extension (e.g., mp3, aac, flac).
- `-ac 1`: sets the output audio channels to 1 (mono).
- `-ar 16000`: sets the output audio sample rate to 16 kHz.
- `output_audio.wav`: the name of the output WAV file.
- Execute: Replace `input_audio.ext` with the name and extension of your source audio file and execute the command. FFmpeg will then process the file and output the converted WAV file as `output_audio.wav`.
Remember to navigate (using the command line) to the directory containing your audio file or specify the full path in the command to make this work correctly.
Calling the API with curl
- Requirements:
  - Make sure you have curl installed on your system.
  - Optionally, use jq to format JSON output.
  - Set appropriate environment variables in your shell.
- Simple synchronous call:
It will return a JSON array with the following structure:
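A minimal sketch of such a call, assuming an `APP_URL` environment variable pointing at your deployed Voxist app on AI Deploy; note that the name of the audio form-data field (`file` below) is not specified in this excerpt and is an assumption to adapt to your deployment:

```bash
# APP_URL: URL of your deployed Voxist app (set beforehand in your shell).
# The "file" form-data field name is an assumption.
curl -X POST "$APP_URL/transcribe" \
  -F 'config={"punctuation": true, "lang": "en", "sample_rate": 16000, "wait": true}' \
  -F "file=@output_audio.wav" | jq
```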
- Asynchronous with Diarization:
The API will return immediately with:
- `jobid`: a UUID representing the job execution
- `estimated_time`: conservative estimated time for transcription, in seconds. It is strongly recommended to query the job after this time.
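A sketch of the asynchronous variant, with `wait` set to `false` so that the API returns a job id immediately. The parameter enabling diarization is not named in this excerpt, so the `diarization` flag below is hypothetical, as is the `file` field name:

```bash
# wait=false: returns immediately with a jobid and an estimated_time.
# "diarization" and "file" are hypothetical names; adapt to your deployment.
curl -X POST "$APP_URL/transcribe" \
  -F 'config={"punctuation": true, "lang": "en", "sample_rate": 16000, "wait": false, "diarization": true}' \
  -F "file=@output_audio.wav" | jq
```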
Fetch transcription:
It will return a JSON array with the following structure:
Voxist WebSocket API
The WebSocket API performs synchronous (streaming) speech recognition.
- Payload:

The first WebSocket message sent is the configuration:

- `punctuation` [true, false]: enables punctuation in output (default: false)
- `lang` ["fr", "en", "es", "pt", "it", "de", "pl"]: defines the input audio language (default: "en")
- `sample_rate` [8000, 16000]: input audio sample rate in Hz (default: 8000, i.e. 8 kHz)
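For example, a first configuration message could be:

```json
{
  "punctuation": true,
  "lang": "en",
  "sample_rate": 16000
}
```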
Calling the API in Python
- Requirement:

Install the required packages:
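The exact packages are not listed in this excerpt; for an asyncio-based WebSocket client in Python, the `websockets` package is a reasonable assumption:

```bash
# Assumed dependency; adapt if your client code uses another WebSocket library.
pip install websockets
```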
- Code:
Please don't forget to replace:
1. The audio file path - ex: `audios/audio-sample-en.wav`
2. The app host - ex: `98b38dff-1db6-4250-96d9-02512251f247.app.gra.ai.cloud.ovh.net`
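A sketch of what such an `asr-test.py` client could look like. The exact message framing of the Voxist WebSocket protocol is not given in this excerpt, so the binary-chunk streaming, chunk size, and `wss://` URL below are assumptions:

```python
# Hypothetical sketch of a streaming client; protocol details are assumptions.
import asyncio
import json

import websockets  # assumed dependency: pip install websockets

APP_HOST = "98b38dff-1db6-4250-96d9-02512251f247.app.gra.ai.cloud.ovh.net"  # replace with your app host
AUDIO_PATH = "audios/audio-sample-en.wav"  # replace with your audio file path

async def transcribe() -> None:
    async with websockets.connect(f"wss://{APP_HOST}") as ws:
        # First message: the transcription configuration.
        await ws.send(json.dumps({"punctuation": True, "lang": "en", "sample_rate": 16000}))
        # Stream the audio as binary chunks (framing and chunk size are assumptions).
        with open(AUDIO_PATH, "rb") as audio:
            while chunk := audio.read(3200):
                await ws.send(chunk)
        # Print transcription messages as they arrive.
        async for message in ws:
            print(json.loads(message))

asyncio.run(transcribe())
```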
Run the code with `python3 asr-test.py`.
Transcription starts streaming on standard output.
{'Sentence_id': 1, 'Text': 'Our political reporter Jack Fink hosted a Facebook live discussion on Obamacare and he asked the panel if they agree with the president who says let Obamacare implode.', 'Confidence': 1}
{'Sentence_id': 2, 'Text': 'When do we stop being held hostage by the insurance companies?', 'Confidence': 1}
{'Sentence_id': 3, 'Text': "That's the question. That's the real question. When do we stop being held hostage by them? When we move from a for-profit system to a not-for-profit system in our healthcare.", 'Confidence': 1}
{'Sentence_id': 4, 'Text': 'delivery.', 'Confidence': 1}
{'Sentence_id': 5, 'Text': 'being a veteran.', 'Confidence': 1}
{'Sentence_id': 6, 'Text': 'The only example of a single payer in the United States of America is the VA hospital.', 'Confidence': 1}
{'Sentence_id': 7, 'Text': "Ain't way wrong answer.", 'Confidence': 1}
{"Sentence_id": 8, "Text": "It was a great discussion. If you missed it, you can still watch it. It's on our CBS DFW Facebook page.", "Confidence": 1, "Words": null}
Feedback
Please send us your questions, feedback and suggestions to improve the service:
- On the OVHcloud Discord server