Speech-to-Text Service

Convert spoken audio into transcripts. The service exposes the following endpoints:
  • POST /v1/audio/transcribe: streams results back as Server-Sent Events (SSE).
  • WS /v1/audio/transcribe_ws: real-time WebSocket streaming.
  • POST /v1/audio/transcribe_file with GET /v1/audio/transcribe_status/{request_id}: asynchronous batch jobs for files up to 10 minutes (REST only; no SDK helpers yet).
  • POST /v1/audio/transcribe/diarize with GET /v1/audio/transcribe/diarize/status/{job_id}: speaker-labeled transcription for files up to 30 minutes.
  • GET /v1/audio/transcribe/config: supported formats, limits, and defaults to inspect before uploading.

Need an API Key? If you don't have an API key yet, you can create one here: https://playground.induslabs.io/register

Available languages
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Russian (ru)
  • Japanese (ja)
  • Korean (ko)
  • Chinese (zh)
  • Arabic (ar)
  • Hindi (hi)
  • Turkish (tr)
  • Polish (pl)
  • Dutch (nl)
  • Swedish (sv)
  • Danish (da)
  • Norwegian (no)
  • Finnish (fi)
  • Czech (cs)
  • Slovak (sk)
  • Hungarian (hu)
  • Romanian (ro)
  • Bulgarian (bg)
  • Croatian (hr)
  • Slovenian (sl)
  • Estonian (et)
  • Latvian (lv)
  • Lithuanian (lt)
  • Maltese (mt)
  • Irish (ga)
  • Welsh (cy)
  • Icelandic (is)
  • Macedonian (mk)
  • Albanian (sq)
  • Azerbaijani (az)
  • Kazakh (kk)
  • Kyrgyz (ky)
  • Uzbek (uz)
  • Tajik (tg)
  • Amharic (am)
  • Burmese (my)
  • Khmer (km)
  • Lao (lo)
  • Sinhala (si)
  • Nepali (ne)
  • Bengali (bn)
  • Assamese (as)
  • Odia (or)
  • Punjabi (pa)
  • Gujarati (gu)
  • Tamil (ta)
  • Telugu (te)
  • Kannada (kn)
  • Malayalam (ml)
  • Thai (th)
  • Vietnamese (vi)
  • Indonesian (id)
  • Malay (ms)
  • Filipino/Tagalog (tl)

POST /v1/audio/transcribe

Transcribe Audio (Streaming)

Transcribes an uploaded audio file and streams the results back as Server-Sent Events (SSE).

Functionality
  • Accepts an audio file and returns real-time transcription results.
  • Outputs partial, chunk-level, and final transcripts as the audio is processed.
  • Suitable for low-latency transcription where results are streamed back continuously.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file to transcribe.
api_key | string | required | Authentication API key.
language | string | - | Language code (e.g., "en", "hi") to force, instead of relying on automatic detection.
chunk_length_s | number | - | Length of each chunk in seconds (1–30).
stride_s | number | - | Stride between chunks in seconds (1–29).
overlap_words | integer | - | Number of overlapping words for context handling (0–20).
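
The documented ranges for the chunking fields can be checked on the client before uploading. The following is a minimal sketch; the function name `validate_chunking` is hypothetical and not part of any official SDK, and it only encodes the ranges listed in the table above.

```python
def validate_chunking(chunk_length_s=None, stride_s=None, overlap_words=None):
    """Return a list of problems; an empty list means every given value is in range.

    A value of None means the field is omitted and the server default applies.
    """
    problems = []
    if chunk_length_s is not None and not 1 <= chunk_length_s <= 30:
        problems.append("chunk_length_s must be between 1 and 30 seconds")
    if stride_s is not None and not 1 <= stride_s <= 29:
        problems.append("stride_s must be between 1 and 29 seconds")
    if overlap_words is not None:
        if not isinstance(overlap_words, int) or not 0 <= overlap_words <= 20:
            problems.append("overlap_words must be an integer between 0 and 20")
    return problems
```

Calling `validate_chunking(6.0, 5.9, 7)` returns an empty list, so those values can be sent as form fields unchanged.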

Outputs

Status | Type | Description
200 OK | text/event-stream | Returns transcription results in JSON (streamed via SSE).
422 Validation Error | application/json | Validation failure. Inspect detail array.

200 OK (SSE stream)

data: {"type": "partial", "word": "यह", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "partial", "word": "एक", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "partial", "word": "टेस्ट", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "chunk_final", "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।", "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "final", "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।", "audio_duration_seconds": 3.413375, "processing_time_seconds": 1.44447922706604, "first_token_time_seconds": 0.136627197265625, "language_detected": "hi", "request_id": "df3a5974-6b24-4b15-a9d9-7c9df9513306"}
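
A stream like the one above can be consumed line by line. The sketch below assumes the HTTP client exposes the response body as an iterable of text or bytes lines; the helper names `iter_sse_events` and `collect_transcript` are hypothetical. It only handles `data:` lines, which is all the sample stream contains.

```python
import json


def iter_sse_events(lines):
    """Yield one dict per `data:` line; blank lines and other SSE fields are skipped."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue
        yield json.loads(line[len("data:"):].strip())


def collect_transcript(events):
    """Return (final_text, partial_words) gathered from a stream of events."""
    words, final_text = [], None
    for event in events:
        if event.get("type") == "partial":
            words.append(event["word"])  # provisional word, may be revised later
        elif event.get("type") == "final":
            final_text = event["text"]   # authoritative transcript for the request
    return final_text, words
```

Partial words are provisional, so a UI would typically display them as they arrive and replace them with the `chunk_final` or `final` text once it is available.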

WS /v1/audio/transcribe_ws

WebSocket Streaming Transcription

Real-time STT via WebSocket. Supports bidirectional streaming for live audio input.

Key Features
  • Persistent Connection: Maintains open WebSocket for continuous audio streaming.
  • Real-time Results: Receives transcription segments as audio is processed.
  • Low Latency: Optimized for live microphone input and voice applications.
  • Segment Callbacks: Provides word-level and segment-level results via callbacks.
  • Bidirectional: Sends audio chunks and receives transcriptions simultaneously.
  • Noise Cancellation: Optional server-side denoising before inference.
WebSocket Connection

Connect to: wss://voice.induslabs.io/v1/audio/transcribe_ws

Unlike REST endpoints, WebSocket maintains a persistent bidirectional connection for real-time streaming.

Available Models
  • indus-stt-v1: Default model that supports all languages.
  • indus-stt-hi-en: Specialized model for Hindi and English with real-time streaming input/output and very low processing time.

WebSocket Message Flow

Message | Type | Order | Description
URL Query Params | URL | connection | Pass api_key, model, language, streaming as URL query parameters when connecting.
Audio Chunks | binary | continuous | Raw audio data sent as binary WebSocket frames (recommended: 4096 bytes per chunk).
End Signal | binary | last | Send b"__END__" to signal audio stream completion.
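
The framing described in the table can be sketched as a generator that splits a byte buffer into 4096-byte frames and finishes with the end signal. The name `audio_frames` is hypothetical, and the audio encoding itself (sample rate, container) is not covered here because this page does not specify it.

```python
END_SIGNAL = b"__END__"


def audio_frames(audio, chunk_size=4096):
    """Yield binary frames of at most chunk_size bytes, then the end signal."""
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]
    yield END_SIGNAL
```

Each yielded value would be passed to the binary send method of whichever WebSocket client is in use.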

Configuration Parameters

Name | Type | Default | Description
api_key | string | required | Authentication API key passed in URL query string.
model | string | indus-stt-hi-en | Model to use (e.g., "indus-stt-hi-en").
language | string | - | Language name or ISO code (e.g., "english", "hindi", "en", "hi").
streaming | string | "true" | Use "true" for streaming mode (interim results), "false" for non-streaming.
noise_cancellation | string | "false" | Use "true" to enable noise cancellation for cleaner audio in noisy environments. Filters low-frequency rumble, high-frequency hiss, and ambient background noise to reduce hallucinations and improve accuracy.
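
Since configuration travels in the query string, the connection URL can be assembled with the standard library. This is a sketch with a hypothetical helper name, `build_ws_url`; it assumes `noise_cancellation` is passed as a query parameter like the other settings, which the message flow table does not state explicitly.

```python
from urllib.parse import urlencode

WS_ENDPOINT = "wss://voice.induslabs.io/v1/audio/transcribe_ws"


def build_ws_url(api_key, model="indus-stt-hi-en", language=None,
                 streaming=True, noise_cancellation=False):
    """Return the WebSocket URL with the configuration encoded as query parameters."""
    params = {
        "api_key": api_key,
        "model": model,
        # The documented values are the strings "true" and "false".
        "streaming": "true" if streaming else "false",
        # Assumption: sent in the query string like the other settings.
        "noise_cancellation": "true" if noise_cancellation else "false",
    }
    if language:
        params["language"] = language
    return f"{WS_ENDPOINT}?{urlencode(params)}"
```

For example, `build_ws_url("YOUR_API_KEY", language="hi")` produces a URL that can be handed to a WebSocket client's connect call.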

Outputs

Type | Format | Description
chunk_interim | JSON | Interim transcription result during processing (when streaming="true").
chunk_final | JSON | Final transcription for a processed audio chunk.
final | JSON | Complete transcription with full text after all chunks are processed.
error | JSON | Error message if processing fails.

Chunk Interim Response

{
  "type": "chunk_interim",
  "text": "यह एक टेस्ट है"
}

Chunk Final Response

{
  "type": "chunk_final",
  "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।",
  "chunk_index": 1,
  "total_chunks": 1
}

Final Response

{
  "type": "final",
  "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।",
  "audio_duration_seconds": 3.413375,
  "language_detected": "hi",
  "request_id": "df3a5974-6b24-4b15-a9d9-7c9df9513306"
}

Error Response

{
  "type": "error",
  "message": "Invalid API key",
  "code": "AUTH_ERROR"
}
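
The four message types above can be handled with a small dispatcher. The sketch below is one possible shape, not an official client: `TranscriptState` and `handle_message` are hypothetical names, and it assumes each server message arrives as a single JSON text frame.

```python
import json


class TranscriptState:
    """Accumulates what has been received so far on one connection."""

    def __init__(self):
        self.interim = ""   # latest chunk_interim text, replaced on every update
        self.chunks = []    # chunk_final texts in arrival order
        self.final = None   # full transcript from the final message


def handle_message(raw, state):
    """Update state from one JSON message; return True once the stream is done."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "chunk_interim":
        state.interim = msg["text"]
    elif kind == "chunk_final":
        state.chunks.append(msg["text"])
        state.interim = ""
    elif kind == "final":
        state.final = msg["text"]
        return True
    elif kind == "error":
        raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
    return False
```

A receive loop would call `handle_message` for every incoming frame and stop reading once it returns True or raises.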
POST /v1/audio/transcribe_file

Transcribe Audio File (Batch Async)

Launches background transcription for files up to 10 minutes and immediately returns a request_id to poll later.

Functionality
  • Designed for long recordings up to 10 minutes (600 seconds).
  • Returns immediately so your UI can poll the status endpoint or notify the user.
  • Available through the REST API only; SDK helpers are not yet available.
  • Supports optional noise cancellation before inference begins.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file up to 10 minutes (600 seconds).
api_key | string | required | Authentication API key.
model | string | "default" | Use "default", "indus-stt-v1", "hi-en", or "indus-stt-hi-en".
language | string | - | Language hint (ISO code or name).
noise_cancellation | boolean | false | Enable server-side denoising before inference.
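
Because no SDK helper exists for this endpoint, the form can be posted with the standard library alone. The following is a sketch under stated assumptions: `BASE_URL` is guessed from the WebSocket host and is not confirmed by this page, the boolean is sent as the string "true" or "false", and `build_multipart` and `submit_batch` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "https://voice.induslabs.io"  # assumption: same host as the WebSocket endpoint


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode() + str(value).encode() + b"\r\n")
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def submit_batch(path, api_key, model="default", language=None, noise_cancellation=False):
    """POST the file and return the parsed JSON body of the 202 response."""
    fields = {
        "api_key": api_key,
        "model": model,
        "noise_cancellation": "true" if noise_cancellation else "false",
    }
    if language:
        fields["language"] = language
    with open(path, "rb") as fh:
        body, content_type = build_multipart(fields, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        BASE_URL + "/v1/audio/transcribe_file",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The returned dictionary is expected to carry `request_id`, `status_url`, and `poll_interval`, which feed the status endpoint.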

Outputs

Status | Type | Description
202 Accepted | application/json | Returns request_id, duration, estimated_time, and status_url for polling.
400 Bad Request | application/json | Audio rejected (e.g., longer than 10 minutes or invalid format).
401 / 402 | application/json | Authentication failure or insufficient credits.

202 Accepted (Batch Upload)

{
  "request_id": "13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "status": "processing",
  "message": "File uploaded successfully. Processing in background.",
  "duration": 126.42,
  "estimated_time": 18.96,
  "status_url": "/v1/audio/transcribe_status/13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "poll_interval": 5
}
GET /v1/audio/transcribe_status/{request_id}

Get Batch Transcription Status

Polls the progress of a batch job created by /v1/audio/transcribe_file and returns the final transcript when completed.

Functionality
  • Call every poll_interval seconds until the job reports completed or failed.
  • Completed responses include full text, segments, word-level timestamps (if available), and processing metrics.
  • Failed jobs return an error string so that you can surface actionable feedback to users.

Form Fields

Name | Type | Default | Description
request_id | path | required | Identifier returned from /v1/audio/transcribe_file.
api_key | string (query) | required | Same key used when creating the job.

Outputs

Status | Type | Description
200 OK | application/json | Current status plus transcript, segments, and metrics when completed.
404 Not Found | application/json | Unknown or expired request_id.
500 Internal Server Error | application/json | Job failed; see error field for details.

200 OK (Batch Status)

{
  "request_id": "13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "status": "completed",
  "progress_percentage": 100,
  "progress": "3/3",
  "text": "Final transcription text...",
  "segments": [
    {"text": "segment text", "start": 0, "end": 12.5}
  ],
  "processing_time": 98.11,
  "model": "indus-stt-v1"
}
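
The polling behaviour described for this endpoint can be sketched as a loop that is independent of the HTTP client. Here `fetch_status` is a hypothetical callable that performs the GET request and returns the parsed JSON body; `max_wait` is an illustrative safety limit, not a documented value.

```python
import time


def poll_until_done(fetch_status, poll_interval=5, max_wait=900, sleep=time.sleep):
    """Call fetch_status() until the job is completed or failed, or max_wait elapses."""
    waited = 0
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(status.get("error", "transcription failed"))
        if waited >= max_wait:
            raise TimeoutError(f"job still {state!r} after {waited} seconds")
        sleep(poll_interval)
        waited += poll_interval
```

Passing the `poll_interval` value from the upload response keeps the client aligned with the server's suggested cadence. Note that some HTTP clients raise an exception on a 500 response, in which case a failed job surfaces there rather than through the `failed` branch.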
POST /v1/audio/transcribe/diarize

Diarize + Transcribe (Async)

Uploads audio for speaker diarization and per-speaker transcription, returning a job_id for polling.

Functionality
  • Runs speaker diarization before transcription to label speakers.
  • Supports files up to 30 minutes; processing occurs in the background.
  • Use config_json to tune language, noise_cancellation, padding, and concurrency parameters.
  • Returns an estimated_time and status_url so clients can poll without blocking.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file to diarize (max 30 minutes).
api_key | string | required | Authentication API key.
config_json | string | "{}" | Optional JSON string to tune language, models, padding, and concurrency.
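
Since `config_json` is a string field, the settings have to be serialized before they are placed in the form. The sketch below shows that step only. The key names `language` and `noise_cancellation` are assumptions based on the settings mentioned above; this page does not list the exact schema, and `diarize_form_fields` is a hypothetical helper.

```python
import json


def diarize_form_fields(api_key, language=None, noise_cancellation=None):
    """Return the text form fields for a diarization upload (the file is sent separately)."""
    config = {}
    if language is not None:
        config["language"] = language  # assumed key name
    if noise_cancellation is not None:
        config["noise_cancellation"] = noise_cancellation  # assumed key name
    # An empty dict serializes to "{}", which matches the documented default.
    return {"api_key": api_key, "config_json": json.dumps(config)}
```

Serializing with `json.dumps` avoids hand-written JSON, which is a common cause of the malformed config_json error.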

Outputs

Status | Type | Description
200 OK | application/json | Returns job_id, estimated_time, and status_url for polling diarization results.
400 Bad Request | application/json | Invalid audio, malformed config_json, or duration beyond 30 minutes.
401 / 402 | application/json | Authentication failure or insufficient credits.
503 Service Unavailable | application/json | Credit service unavailable.

200 OK (Diarization Upload)

{
  "job_id": "7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "status": "processing",
  "message": "File uploaded successfully. Processing in background.",
  "duration": 312.44,
  "estimated_time": 93.73,
  "status_url": "/v1/audio/transcribe/diarize/status/7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "poll_interval": 5
}
GET /v1/audio/transcribe/diarize/status/{job_id}

Get Diarization Status

Polls the progress of a diarization + transcription job created by /v1/audio/transcribe/diarize.

Functionality
  • Poll every poll_interval seconds until status is completed or failed.
  • Completed responses return per-speaker utterances with start/end timestamps and transcript text.
  • Failed jobs include an error string so you can surface actionable feedback.

Form Fields

Name | Type | Default | Description
job_id | path | required | Identifier returned from /v1/audio/transcribe/diarize.
api_key | string (query) | required | Same key used when creating the job.

Outputs

Status | Type | Description
200 OK | application/json | Current diarization status plus speaker-labeled results when completed.
404 Not Found | application/json | Unknown or expired job_id.
401 Unauthorized | application/json | Invalid API key.
503 Service Unavailable | application/json | Credit service unavailable.

200 OK (Diarization Status)

{
  "job_id": "7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "status": "completed",
  "results": [
    {
      "utterance_index": 1,
      "start_sec": 0.0,
      "end_sec": 14.2,
      "duration_sec": 14.2,
      "speaker": "speaker_0",
      "endpoint": "transcribe_file",
      "text": "Welcome everyone, let's review the agenda.",
      "status": "completed"
    },
    {
      "utterance_index": 2,
      "start_sec": 14.2,
      "end_sec": 29.8,
      "duration_sec": 15.6,
      "speaker": "speaker_1",
      "endpoint": "transcribe_ws",
      "text": "I will cover the launch metrics next.",
      "status": "completed"
    }
  ],
  "processing_time": 108.51,
  "model": "diarization"
}
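
A completed `results` array like the one above can be turned into a readable transcript and per-speaker totals. This is a minimal sketch with hypothetical helper names; it relies only on the `start_sec`, `end_sec`, `duration_sec`, `speaker`, and `text` fields shown in the sample.

```python
from collections import defaultdict


def format_transcript(results):
    """Return one line per utterance, ordered by start time."""
    lines = []
    for utt in sorted(results, key=lambda u: u["start_sec"]):
        lines.append(
            f'[{utt["start_sec"]:.1f}-{utt["end_sec"]:.1f}] {utt["speaker"]}: {utt["text"]}'
        )
    return lines


def speaking_time(results):
    """Return total seconds spoken per speaker label."""
    totals = defaultdict(float)
    for utt in results:
        totals[utt["speaker"]] += utt["duration_sec"]
    return dict(totals)
```

Sorting by `start_sec` keeps the output chronological even if utterances arrive out of order.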
GET /v1/audio/transcribe/config

Retrieve Transcription Configuration

Fetch default parameters, limits, and supported formats before sending audio for transcription.

Functionality
  • Provides current defaults for chunk sizing, stride, and overlap handling.
  • Lists accepted media formats and the maximum upload size in megabytes.
  • Use these values to validate client-side settings and avoid failed uploads.
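
The client-side validation mentioned above might look like the following sketch. `check_upload` is a hypothetical helper that takes the parsed config response; it assumes `max_file_size_mb` means 1024 × 1024 bytes per megabyte and that the file extension reflects the actual format, neither of which this page confirms.

```python
def check_upload(filename, size_bytes, config):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in config["supported_formats"]:
        problems.append(f"unsupported format: {ext or 'none'}")
    # Assumption: one megabyte is 1024 * 1024 bytes.
    max_bytes = config["max_file_size_mb"] * 1024 * 1024
    if size_bytes > max_bytes:
        problems.append(f"file exceeds {config['max_file_size_mb']} MB")
    return problems
```

Running this check before the upload avoids spending a request on a file the server would reject.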

Outputs

Status | Type | Description
200 OK | application/json | Returns defaults, limits, and supported languages/formats for the STT service.
422 Validation Error | application/json | Validation failure. Inspect detail array.

422 Validation Error

{
  "detail": [
    {
      "loc": ["string", 0],
      "msg": "string",
      "type": "string"
    }
  ]
}

200 OK (Config)

{
  "model_id": "openai/whisper-large-v3",
  "supported_formats": ["wav", "mp3", "mp4", "m4a", "flac", "ogg"],
  "max_file_size_mb": 25,
  "hindi_model": {
    "enabled": true,
    "model_id": null
  },
  "defaults": {
    "chunk_length_s": 6.0,
    "stride_s": 5.9,
    "overlap_words": 7
  },
  "limits": {
    "chunk_length_range": [1.0, 30.0],
    "stride_range": [1.0, 29.0],
    "overlap_words_range": [0, 20],
    "timeout_seconds": 30
  },
  "supported_languages": [
    "en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh",
    "ar", "hi", "tr", "pl", "nl", "sv", "da", "no", "fi", "cs",
    "sk", "hu", "ro", "bg", "hr", "sl", "et", "lv", "lt", "mt",
    "ga", "cy", "is", "mk", "sq", "az", "kk", "ky", "uz", "tg",
    "am", "my", "km", "lo", "si", "ne", "bn", "as", "or", "pa",
    "gu", "ta", "te", "kn", "ml", "th", "vi", "id", "ms", "tl"
  ],
  "output_formats": {
    "streaming": "Server-Sent Events (SSE) with real-time partial results",
    "file": "Complete JSON response with final transcript"
  },
  "credit_system": {
    "unit": "1 credit = 1 minute of audio",
    "billing": "Based on actual audio duration, not processing time"
  }
}
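
The `credit_system` entry implies a simple conversion from audio duration to credits. The sketch below applies that ratio directly; whether the service rounds partial minutes up, down, or not at all is not stated on this page, so the hypothetical `estimate_credits` helper returns an unrounded value.

```python
def estimate_credits(audio_duration_seconds):
    """Convert audio duration to credits at 1 credit per minute, without rounding."""
    return audio_duration_seconds / 60.0
```

For the 126.42-second file in the batch upload example, this gives roughly 2.1 credits.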