## Documentation Index

Fetch the complete documentation index at: https://docs.asteragents.com/llms.txt

Use this file to discover all available pages before exploring further.
## What it does

Transcribes audio or video to text using ElevenLabs Scribe. Works on files already in the conversation (including output from `elevenlabs_text_to_speech`) or any HTTPS-accessible media URL — including cloud storage, YouTube, TikTok, and podcast hosts.

## Key features
- Single `audio_source` param accepts `r2://` conversation attachments or HTTPS URLs — auto-detected by prefix
- Scribe v2 (default) for best-in-class accuracy
- Word-level timestamps returned by default
- Speaker diarization (who spoke when) when `diarize` is on
- Audio event tagging — surfaces `(laughter)`, `(footsteps)`, etc. inline in the transcript
- Auto language detection, or pin a specific ISO-639 code
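The prefix-based auto-detection in the first bullet can be sketched as follows. This is a hypothetical helper for illustration, not part of the tool's API; the two prefixes and their meanings come from the feature list above.

```python
def detect_source_kind(audio_source: str) -> str:
    """Classify an audio_source value by its prefix.

    r2://    -> a file already attached to the conversation
    https:// -> a remotely hosted media file
    """
    if audio_source.startswith("r2://"):
        return "conversation_attachment"
    if audio_source.startswith("https://"):
        return "remote_url"
    raise ValueError(f"unsupported audio_source prefix: {audio_source!r}")
```

Because detection is by prefix alone, callers never need a separate flag to say which kind of source they are passing.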
## Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `audio_source` | string | Yes | Either (a) an `r2://bucket/key` path of an audio file already attached to the thread, or (b) an HTTPS URL to an audio/video file. Supports cloud storage URLs (S3, R2, GCS), YouTube, TikTok, and other HTTPS sources up to 2 GB. |
| `model_id` | enum | No | `scribe_v2` (default, latest) or `scribe_v1` |
| `language_code` | string | No | ISO-639-1 or ISO-639-3 code (e.g. `eng`, `spa`). If omitted, the language is auto-detected. |
| `diarize` | boolean | No | Annotate which speaker is talking (returns `speaker_id` per word). Default: `true` |
| `tag_audio_events` | boolean | No | Tag audio events like `(laughter)`, `(footsteps)` inline. Default: `true` |
| `num_speakers` | integer | No | Expected maximum number of speakers (1–32). Helps diarization when known. |
| `timestamps_granularity` | enum | No | `none`, `word` (default), or `character` |
## Common use cases
### Transcribe a file already attached to the conversation

Pass the attachment's `r2://` path straight through as `audio_source`.
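A minimal sketch of the parameters for this case, using the names from the table above (the bucket/key path is hypothetical):

```python
# A file already attached to the conversation: pass its r2:// path as
# audio_source and keep all other defaults.
request = {
    "audio_source": "r2://conversation-bucket/interview.mp3",  # hypothetical path
}
```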
### Transcribe a public podcast or recording URL
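A minimal sketch for a public HTTPS source (the URL is hypothetical). Pinning `language_code` is optional; leaving it out triggers auto-detection per the parameters table:

```python
# A remotely hosted recording: any HTTPS URL works, auto-detected by prefix.
request = {
    "audio_source": "https://example.com/podcast/episode-42.mp3",  # hypothetical URL
    "language_code": "eng",  # optional: pin the language when it is known
}
```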
### Transcribe a meeting with multiple speakers
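A minimal sketch for a multi-speaker recording (hypothetical URL). `diarize` already defaults to `true`, so only `num_speakers` needs setting when the head-count is known:

```python
# A meeting recording: diarization labels each word with a speaker_id, and
# num_speakers (1-32) helps diarization when the count is known up front.
request = {
    "audio_source": "https://example.com/recordings/standup.mp4",  # hypothetical URL
    "diarize": True,
    "num_speakers": 4,
}
```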
## Response

Returns:

- `text` — the full transcript
- `language_code` / `language_probability` — detected language and confidence
- `speaker_count` — number of distinct speakers identified (when `diarize` is on)
- `word_count` — total words in the transcript
- `words` — per-word objects with text, start/end timestamps, and `speaker_id`
- `source` — a label describing which input path was used
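A sketch of consuming these fields, using a fabricated payload shaped like the list above (the values are illustrative, not real output):

```python
# Illustrative response with the documented shape.
response = {
    "text": "Welcome back.",
    "language_code": "eng",
    "language_probability": 0.98,
    "speaker_count": 1,
    "word_count": 2,
    "words": [
        {"text": "Welcome", "start": 0.8, "end": 1.1, "speaker_id": "speaker_0"},
        {"text": "back.", "start": 1.1, "end": 1.4, "speaker_id": "speaker_0"},
    ],
    "source": "remote_url",
}

# Group the per-word entries by speaker_id for a simple diarized view.
by_speaker: dict[str, list[str]] = {}
for word in response["words"]:
    by_speaker.setdefault(word["speaker_id"], []).append(word["text"])
```

Because timestamps default to word granularity, `words` is also the natural input for building subtitles or seeking within the media.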
