
Speech-to-Text Service

Convert spoken audio into accurate transcripts using flexible endpoints:

  • POST /v1/audio/transcribe for streaming results via Server-Sent Events (SSE).
  • WS /v1/audio/transcribe_ws for real-time WebSocket streaming.
  • POST /v1/audio/transcribe_file + GET /v1/audio/transcribe_status/{request_id} for web-only batch jobs up to 10 minutes.
  • POST /v1/audio/transcribe/diarize + GET /v1/audio/transcribe/diarize/status/{job_id} for speaker-labeled transcription up to 30 minutes.
  • GET /v1/audio/transcribe/config to inspect supported formats and defaults before uploading.

Need an API Key? If you don't have an API key yet, you can create one here: https://playground.induslabs.io/register

Available languages
  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Russian (ru)
  • Japanese (ja)
  • Korean (ko)
  • Chinese (zh)
  • Arabic (ar)
  • Hindi (hi)
  • Turkish (tr)
  • Polish (pl)
  • Dutch (nl)
  • Swedish (sv)
  • Danish (da)
  • Norwegian (no)
  • Finnish (fi)
  • Czech (cs)
  • Slovak (sk)
  • Hungarian (hu)
  • Romanian (ro)
  • Bulgarian (bg)
  • Croatian (hr)
  • Slovenian (sl)
  • Estonian (et)
  • Latvian (lv)
  • Lithuanian (lt)
  • Maltese (mt)
  • Irish (ga)
  • Welsh (cy)
  • Icelandic (is)
  • Macedonian (mk)
  • Albanian (sq)
  • Azerbaijani (az)
  • Kazakh (kk)
  • Kyrgyz (ky)
  • Uzbek (uz)
  • Tajik (tg)
  • Amharic (am)
  • Burmese (my)
  • Khmer (km)
  • Lao (lo)
  • Sinhala (si)
  • Nepali (ne)
  • Bengali (bn)
  • Assamese (as)
  • Odia (or)
  • Punjabi (pa)
  • Gujarati (gu)
  • Tamil (ta)
  • Telugu (te)
  • Kannada (kn)
  • Malayalam (ml)
  • Thai (th)
  • Vietnamese (vi)
  • Indonesian (id)
  • Malay (ms)
  • Filipino/Tagalog (tl)

Swarmitra (Emotional Models)

Unlock deeper insights with swarmitra-flash and swarmitra-v2. These specialized models go beyond transcription to detect emotional nuance in Hindi and English speech.

Supported Emotions

angry, sad, happy, fear, whisper, laugh, giggle, chuckle, neutral, uhm, surprise, shout, sigh

Sample 1: English (Ghost Story)

I think I saw a ghost in the hallway last night. Whisper It was a dark shadow that just moved across the wall. Laugh Okay, it was probably just the cat, but I didn't sleep for an hour.
model: swarmitra-v2
language: English

Sample 2: English (Conversation)

Whisper Hi Rohit, how are you? I heard you are getting out of India right now. Laugh That was just a joke, Rohit. Angry What are you looking at like this? Is it? Are you sure? No, don't worry.
model: swarmitra-v2
language: English

Sample 3: Hindi (Work Pressure)

Angry मैंने क्लियरली कहा था कि ये टास्क आज ही कंप्लीट होना चाहिए, Shout लेकिन किसी ने सीरियसली नहीं लिया। Uhm एंड, अब डेडलाइन पास है, इसलिए सब पैनिक कर रहे हैं और प्रेशर बहुत बढ़ गया है।
model: swarmitra-v2
language: Hindi
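The inline tags in the samples above can be post-processed on the client. A heuristic sketch that splits a transcript into (emotion, text) spans, assuming tags are emitted as capitalized words exactly as shown in the samples (this is an observation from the examples, not a documented output format):

```python
import re

# The 13 supported emotions listed above; swarmitra appears to emit them
# inline as capitalized words (heuristic, not a documented schema).
EMOTIONS = ["angry", "sad", "happy", "fear", "whisper", "laugh", "giggle",
            "chuckle", "neutral", "uhm", "surprise", "shout", "sigh"]

_TAG = re.compile(r"\b(" + "|".join(e.capitalize() for e in EMOTIONS) + r")\b")

def split_emotion_spans(transcript):
    """Split a swarmitra transcript into (emotion, text) pairs.
    Text before the first tag is labeled 'neutral'."""
    parts = _TAG.split(transcript)  # alternating: text, tag, text, tag, ...
    spans = []
    if parts[0].strip():
        spans.append(("neutral", parts[0].strip()))
    for tag, text in zip(parts[1::2], parts[2::2]):
        if text.strip():
            spans.append((tag.lower(), text.strip()))
    return spans
```

Note that an ordinary capitalized word that happens to match an emotion name (e.g. "Happy" starting a sentence) would be misread as a tag; treat this as a best-effort parse.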
Integration

Use these models seamlessly with existing endpoints:

  • transcribe_ws (Real-time streaming)
  • transcribe_file (Batch processing)

Real-time Usage

import asyncio
import websockets
import json

async def transcribe_ws():
    # Pass API key and parameters in the URL query string
    params = {
        "api_key": "YOUR_API_KEY",
        "model": "swarmitra-v2",
        "language": "hindi",
        "streaming": "false",
        "noise_cancellation": "false"
    }
    query_string = "&".join(f"{k}={v}" for k, v in params.items())
    uri = f"wss://voice.induslabs.io/v1/audio/transcribe_ws?{query_string}"

    async with websockets.connect(uri) as ws:
        # Read and send audio in chunks
        with open("emo_hi.wav", "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)

        # Signal end of audio with __END__ marker
        await ws.send(b"__END__")

        # Receive transcription results
        async for message in ws:
            data = json.loads(message)
            msg_type = data.get("type")

            if msg_type == "chunk_interim":
                print(f"[interim] {data.get('text', '')}")
            elif msg_type == "chunk_final":
                print(f"[chunk] {data.get('text', '')}")
            elif msg_type == "final":
                print(f"\n[final] {data.get('text', '')}")
                break

asyncio.run(transcribe_ws())

Note: The WebSocket endpoint supports audio files up to 30 seconds. For longer audio files, use transcribe_file for batch processing (available in the dropdown on the right).

POST /v1/audio/transcribe

Transcribe Audio (Streaming)

This endpoint is used to transcribe audio files into text with streaming results via Server-Sent Events (SSE).

Functionality
  • Accepts an audio file and returns real-time transcription results.
  • Outputs partial, chunk-level, and final transcripts as the audio is processed.
  • Suitable for low-latency transcription where results are streamed back continuously.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file to transcribe.
api_key | string | required | Authentication API key.
language | string | - | Language code (e.g., "en", "hi") to force a specific language instead of auto-detection.
chunk_length_s | number | - | Length of each chunk in seconds (1–30).
stride_s | number | - | Stride between chunks in seconds (1–29).
overlap_words | integer | - | Number of overlapping words for context handling (0–20).

Outputs

Status | Type | Description
200 OK | text/event-stream | Returns transcription results in JSON (streamed via SSE).
422 Validation Error | application/json | Validation failure. Inspect the detail array.

200 OK (SSE stream)

data: {"type": "partial", "word": "यह", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "partial", "word": "एक", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "partial", "word": "टेस्ट", "provisional": true, "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "chunk_final", "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।", "chunk_start": 0.0, "chunk_end": 3.413375, "chunk_index": 1, "total_chunks": 1}

data: {"type": "final", "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।", "audio_duration_seconds": 3.413375, "processing_time_seconds": 1.44447922706604, "first_token_time_seconds": 0.136627197265625, "language_detected": "hi", "request_id": "df3a5974-6b24-4b15-a9d9-7c9df9513306"}
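A minimal client sketch for consuming this SSE stream, assuming the third-party requests library (any HTTP client that can stream a response line by line works the same way); the event field names match the samples above:

```python
import json
import requests  # third-party; pip install requests

API_URL = "https://voice.induslabs.io/v1/audio/transcribe"

def parse_sse_event(line):
    """Return the JSON payload of a 'data: ...' SSE line, or None for anything else."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

def stream_transcript(audio_path, api_key, language="hi"):
    """POST the file and print partial words as they arrive; return the final event."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            files={"file": f},
            data={"api_key": api_key, "language": language},
            stream=True,  # iterate the SSE body instead of buffering it
        )
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        event = parse_sse_event(raw or "")
        if event is None:
            continue
        if event["type"] == "partial":
            print(event["word"], end=" ", flush=True)
        elif event["type"] == "final":
            print("\n[final]", event["text"])
            return event
```

The returned final event also carries request_id, language_detected, and the timing metrics shown in the sample above.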

WS /v1/audio/transcribe_ws

WebSocket Streaming Transcription

Real-time STT via WebSocket. Supports bidirectional streaming for live audio input.

Key Features
  • Persistent Connection: Maintains open WebSocket for continuous audio streaming.
  • Real-time Results: Receives transcription segments as audio is processed.
  • Low Latency: Optimized for live microphone input and voice applications.
  • Segment Callbacks: Provides word-level and segment-level results via callbacks.
  • Bidirectional: Sends audio chunks and receives transcriptions simultaneously.
  • Noise Cancellation: Optional server-side denoising before inference.
WebSocket Connection

Connect to: wss://voice.induslabs.io/v1/audio/transcribe_ws

Unlike REST endpoints, WebSocket maintains a persistent bidirectional connection for real-time streaming.

Available Models
  • indus-stt-v1: Default model that supports all languages.
  • indus-stt-hi-en: Specialized model for Hindi and English with real-time streaming input/output and very low processing time.

WebSocket Message Flow

Message | Type | Order | Description
URL Query Params | URL | connection | Pass api_key, model, language, streaming as URL query parameters when connecting.
Audio Chunks | binary | continuous | Raw audio data sent as binary WebSocket frames (recommended: 4096 bytes per chunk).
End Signal | binary | last | Send b"__END__" to signal audio stream completion.

Configuration Parameters

Name | Type | Default | Description
api_key | string | required | Authentication API key passed in the URL query string.
model | string | indus-stt-hi-en | Model to use (e.g., "indus-stt-hi-en").
language | string | - | Language name or ISO code (e.g., "english", "hindi", "en", "hi").
streaming | string | "true" | Use "true" for streaming mode (interim results), "false" for non-streaming.
noise_cancellation | string | "false" | Use "true" to enable noise cancellation for cleaner audio in noisy environments. Filters low-frequency rumble, high-frequency hiss, and ambient background noise to reduce hallucinations and improve accuracy.

Outputs

Message Type | Format | Description
chunk_interim | JSON | Interim transcription result during processing (when streaming="true").
chunk_final | JSON | Final transcription for a processed audio chunk.
final | JSON | Complete transcription with full text after all chunks processed.
error | JSON | Error message if processing fails.

Chunk Interim Response

{
  "type": "chunk_interim",
  "text": "यह एक टेस्ट है"
}

Chunk Final Response

{
  "type": "chunk_final",
  "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।",
  "chunk_index": 1,
  "total_chunks": 1
}

Final Response

{
  "type": "final",
  "text": "यह एक टेस्ट है, भाषन से पाट रूपांतरन का परिच्छन।",
  "audio_duration_seconds": 3.413375,
  "language_detected": "hi",
  "request_id": "df3a5974-6b24-4b15-a9d9-7c9df9513306"
}

Error Response

{
  "type": "error",
  "message": "Invalid API key",
  "code": "AUTH_ERROR"
}
POST /v1/audio/transcribe_file

Transcribe Audio File (Batch Async)

Launches background transcription for files up to 10 minutes and immediately returns a request_id to poll later.

Functionality
  • Designed for long recordings up to 10 minutes (600 seconds).
  • Returns immediately so your UI can poll the status endpoint or notify the user.
  • Available via REST on the web — SDK helpers are not yet available.
  • Supports optional noise cancellation before inference begins.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file up to 10 minutes (600 seconds).
api_key | string | required | Authentication API key.
model | string | "default" | Use "default", "indus-stt-v1", "hi-en", or "indus-stt-hi-en".
language | string | - | Language hint (ISO code or name).
noise_cancellation | boolean | false | Enable server-side denoising before inference.

Outputs

Status | Type | Description
202 Accepted | application/json | Returns request_id, duration, estimated_time, and status_url for polling.
400 Bad Request | application/json | Audio rejected (e.g., longer than 10 minutes or invalid format).
401 / 402 | application/json | Authentication failure or insufficient credits.

202 Accepted (Batch Upload)

{
  "request_id": "13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "status": "processing",
  "message": "File uploaded successfully. Processing in background.",
  "duration": 126.42,
  "estimated_time": 18.96,
  "status_url": "/v1/audio/transcribe_status/13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "poll_interval": 5
}
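Submitting a batch job can be sketched as follows, assuming the third-party requests library; the 202 payload shown above carries everything a client needs to start polling:

```python
import requests  # third-party; pip install requests

UPLOAD_URL = "https://voice.induslabs.io/v1/audio/transcribe_file"

def submit_batch_job(audio_path, api_key, model="default", noise_cancellation=False):
    """Upload a file (up to 10 minutes) and return the 202 Accepted payload."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            UPLOAD_URL,
            files={"file": f},
            data={
                "api_key": api_key,
                "model": model,
                "noise_cancellation": str(noise_cancellation).lower(),
            },
        )
    resp.raise_for_status()
    return resp.json()

def poll_plan(job):
    """Extract (status_url, poll_interval) from the upload response."""
    return job["status_url"], float(job.get("poll_interval", 5))
```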
GET /v1/audio/transcribe_status/{request_id}

Get Batch Transcription Status

Polls the progress of a batch job created by /v1/audio/transcribe_file and returns the final transcript when completed.

Functionality
  • Call every poll_interval seconds until the job reports completed or failed.
  • Completed responses include full text, segments, word-level timestamps (if available), and processing metrics.
  • Failed jobs return an error string so that you can surface actionable feedback to users.

Parameters

Name | Type | Default | Description
request_id | path | required | Identifier returned from /v1/audio/transcribe_file.
api_key | string (query) | required | Same key used when creating the job.

Outputs

Status | Type | Description
200 OK | application/json | Current status plus transcript, segments, and metrics when completed.
404 Not Found | application/json | Unknown or expired request_id.
500 Internal Server Error | application/json | Job failed; see error field for details.

200 OK (Batch Status)

{
  "request_id": "13c8b15a-59f9-4cda-a3bb-3bf06f5e2c9b",
  "status": "completed",
  "progress_percentage": 100,
  "progress": "3/3",
  "text": "Final transcription text...",
  "segments": [
    {"text": "segment text", "start": 0, "end": 12.5}
  ],
  "processing_time": 98.11,
  "model": "indus-stt-v1"
}
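The polling behavior described above can be sketched with only the standard library; the host is taken from the WebSocket example earlier and is an assumption for REST calls:

```python
import json
import time
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://voice.induslabs.io"  # assumed host, as in the WebSocket example

def is_terminal(status):
    """Batch jobs end as either 'completed' or 'failed'."""
    return status in ("completed", "failed")

def wait_for_transcript(request_id, api_key, poll_interval=5.0):
    """Poll the status endpoint until the job finishes, then return the payload."""
    url = (f"{BASE_URL}/v1/audio/transcribe_status/"
           f"{quote(request_id)}?api_key={quote(api_key)}")
    while True:
        with urlopen(url) as resp:
            payload = json.load(resp)
        if is_terminal(payload["status"]):
            return payload
        time.sleep(poll_interval)
```

In practice, honor the poll_interval value returned by the upload call rather than hard-coding one.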
POST /v1/audio/transcribe/diarize

Diarize + Transcribe (Async)

Uploads audio for speaker diarization and per-speaker transcription, returning a job_id for polling.

Functionality
  • Runs speaker diarization before transcription to label speakers.
  • Supports files up to 30 minutes; processing occurs in the background.
  • Use config_json to tune language, noise_cancellation, padding, and concurrency parameters.
  • Returns an estimated_time and status_url so clients can poll without blocking.

Form Fields

Name | Type | Default | Description
file | file | required | Audio file to diarize (max 30 minutes).
api_key | string | required | Authentication API key.
config_json | string | "{}" | Optional JSON string to tune language, models, padding, and concurrency.

Outputs

Status | Type | Description
200 OK | application/json | Returns job_id, estimated_time, and status_url for polling diarization results.
400 Bad Request | application/json | Invalid audio, malformed config_json, or duration beyond 30 minutes.
401 / 402 | application/json | Authentication failure or insufficient credits.
503 Service Unavailable | application/json | Credit service unavailable.

200 OK (Diarization Upload)

{
  "job_id": "7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "status": "processing",
  "message": "File uploaded successfully. Processing in background.",
  "duration": 312.44,
  "estimated_time": 93.73,
  "status_url": "/v1/audio/transcribe/diarize/status/7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "poll_interval": 5
}
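An upload sketch for this endpoint, assuming the requests library. The exact keys config_json accepts are not fully documented here; language and noise_cancellation below are assumptions based on the parameter names used elsewhere in this service:

```python
import json
import requests  # third-party; pip install requests

DIARIZE_URL = "https://voice.induslabs.io/v1/audio/transcribe/diarize"

def build_config(language=None, noise_cancellation=False):
    """Serialize optional tuning parameters into the config_json form field.
    The key names here are assumptions, not a documented schema."""
    cfg = {}
    if language:
        cfg["language"] = language
    if noise_cancellation:
        cfg["noise_cancellation"] = True
    return json.dumps(cfg)

def submit_diarization(audio_path, api_key, **config):
    """Upload a file (up to 30 minutes) and return the job_id payload."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            DIARIZE_URL,
            files={"file": f},
            data={"api_key": api_key, "config_json": build_config(**config)},
        )
    resp.raise_for_status()
    return resp.json()
```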
GET /v1/audio/transcribe/diarize/status/{job_id}

Get Diarization Status

Polls the progress of a diarization + transcription job created by /v1/audio/transcribe/diarize.

Functionality
  • Poll every poll_interval seconds until status is completed or failed.
  • Completed responses return per-speaker utterances with start/end timestamps and transcript text.
  • Failed jobs include an error string so you can surface actionable feedback.

Parameters

Name | Type | Default | Description
job_id | path | required | Identifier returned from /v1/audio/transcribe/diarize.
api_key | string (query) | required | Same key used when creating the job.

Outputs

Status | Type | Description
200 OK | application/json | Current diarization status plus speaker-labeled results when completed.
404 Not Found | application/json | Unknown or expired job_id.
401 Unauthorized | application/json | Invalid API key.
503 Service Unavailable | application/json | Credit service unavailable.

200 OK (Diarization Status)

{
  "job_id": "7a5f2f0e4c8d4b2ab6d2c5d0fdc4b5e0",
  "status": "completed",
  "results": [
    {
      "utterance_index": 1,
      "start_sec": 0.0,
      "end_sec": 14.2,
      "duration_sec": 14.2,
      "speaker": "speaker_0",
      "endpoint": "transcribe_file",
      "text": "Welcome everyone, let's review the agenda.",
      "status": "completed"
    },
    {
      "utterance_index": 2,
      "start_sec": 14.2,
      "end_sec": 29.8,
      "duration_sec": 15.6,
      "speaker": "speaker_1",
      "endpoint": "transcribe_ws",
      "text": "I will cover the launch metrics next.",
      "status": "completed"
    }
  ],
  "processing_time": 108.51,
  "model": "diarization"
}
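Once a job reports completed, the results array can be rendered into a readable transcript. A small pure-Python sketch using the fields shown in the sample above:

```python
def format_turns(results):
    """Render the speaker-labeled utterances of a completed diarization job
    as one 'speaker [start-end]: text' line per utterance, in order."""
    lines = []
    for utt in sorted(results, key=lambda u: u["utterance_index"]):
        lines.append(
            f'{utt["speaker"]} [{utt["start_sec"]:.1f}-{utt["end_sec"]:.1f}s]: {utt["text"]}'
        )
    return "\n".join(lines)
```

With the sample payload above, this yields:

speaker_0 [0.0-14.2s]: Welcome everyone, let's review the agenda.
speaker_1 [14.2-29.8s]: I will cover the launch metrics next.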
GET /v1/audio/transcribe/config

Retrieve Transcription Configuration

Fetch default parameters, limits, and supported formats before sending audio for transcription.

Functionality
  • Provides current defaults for chunk sizing, stride, and overlap handling.
  • Lists accepted media formats and the maximum upload size in megabytes.
  • Use these values to validate client-side settings and avoid failed uploads.

Outputs

Status | Type | Description
200 OK | application/json | Returns defaults, limits, and supported languages/formats for the STT service.
422 Validation Error | application/json | Validation failure. Inspect the detail array.

422 Validation Error

{
  "detail": [
    {
      "loc": ["string", 0],
      "msg": "string",
      "type": "string"
    }
  ]
}

200 OK (Config)

{
  "model_id": "openai/whisper-large-v3",
  "supported_formats": ["wav", "mp3", "mp4", "m4a", "flac", "ogg"],
  "max_file_size_mb": 25,
  "hindi_model": {
    "enabled": true,
    "model_id": null
  },
  "defaults": {
    "chunk_length_s": 6.0,
    "stride_s": 5.9,
    "overlap_words": 7
  },
  "limits": {
    "chunk_length_range": [1.0, 30.0],
    "stride_range": [1.0, 29.0],
    "overlap_words_range": [0, 20],
    "timeout_seconds": 30
  },
  "supported_languages": [
    "en", "es", "fr", "de", "it", "pt", "ru", "ja", "ko", "zh",
    "ar", "hi", "tr", "pl", "nl", "sv", "da", "no", "fi", "cs",
    "sk", "hu", "ro", "bg", "hr", "sl", "et", "lv", "lt", "mt",
    "ga", "cy", "is", "mk", "sq", "az", "kk", "ky", "uz", "tg",
    "am", "my", "km", "lo", "si", "ne", "bn", "as", "or", "pa",
    "gu", "ta", "te", "kn", "ml", "th", "vi", "id", "ms", "tl"
  ],
  "output_formats": {
    "streaming": "Server-Sent Events (SSE) with real-time partial results",
    "file": "Complete JSON response with final transcript"
  },
  "credit_system": {
    "unit": "1 credit = 1 minute of audio",
    "billing": "Based on actual audio duration, not processing time"
  }
}
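The advertised limits can be checked client-side before spending an upload. A sketch using only the standard library; the host is assumed from the WebSocket example earlier:

```python
import json
import os
from urllib.request import urlopen

CONFIG_URL = "https://voice.induslabs.io/v1/audio/transcribe/config"  # assumed host

def fetch_config():
    """GET the current defaults, limits, and supported formats."""
    with urlopen(CONFIG_URL) as resp:
        return json.load(resp)

def validate_upload(filename, size_mb, config):
    """Return a list of problems (empty means the file looks acceptable)."""
    problems = []
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in config["supported_formats"]:
        problems.append(f"unsupported format: {ext}")
    if size_mb > config["max_file_size_mb"]:
        problems.append(
            f"file is {size_mb:.1f} MB, limit is {config['max_file_size_mb']} MB"
        )
    return problems
```

For example, validate_upload("take1.wav", os.path.getsize("take1.wav") / 2**20, fetch_config()) rejects the file locally instead of waiting for a 400 from the server.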