Video & Audio Inputs ↗
noOriginal Documentation
Documentation Index#
Fetch the complete documentation index at: https://docs.fireworks.ai/llms.txt Use this file to discover all available pages before exploring further.
Query multimodal models to process video and audio content directly
Some Omni/multimodal models can process audio and/or video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. Deploy these models using dedicated deployments for production workloads.
Available models#
| Model | Input support | Notes |
|---|---|---|
| Qwen3 Omni 30B A3B Instruct | Video, audio, text | Dedicated deployment required |
| Molmo2-4B | Video, text | Dedicated deployment required |
| Molmo2-8B | Video, text | Dedicated deployment required |
Qwen3 Omni supports native video and audio inputs. Molmo2 models are video-only, so use the same request structure as below, but omit audio_url. Molmo2 models cannot understand audio from videos.
Create a deployment#
Video and audio models require dedicated deployments. Create one using firectl:
firectl deployment create qwen3-omni-30b-a3b-instruct \
--account-id <YOUR_ACCOUNT_ID> \
--min-replica-count 1 \
--max-replica-count 1 \
--deployment-shape qwen3-omni-30b-a3b-instruct-minimal
Make sure to use the predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape for your deployment to work correctly.
Chat Completions API#
Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.
import os
import base64
import requests
# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
video_b64 = base64.b64encode(f.read()).decode("utf-8")
with open("audio.ogg", "rb") as f:
audio_b64 = base64.b64encode(f.read()).decode("utf-8")
# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}
# Request payload
payload = {
"model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
"max_tokens": 1000,
"temperature": 0.3,
"messages": [
{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
{"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
{"type": "text", "text": "Describe what happens in this video."},
],
},
],
}
# Send request
response = requests.post(url, headers=headers, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="curl"></span>
```bash
# Encode your files (run these separately)
VIDEO_B64=$(base64 -i processed_video.mp4)
AUDIO_B64=$(base64 -i audio.ogg)
curl https://api.fireworks.ai/inference/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FIREWORKS_API_KEY" \
-d '{
"model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
"max_tokens": 1000,
"temperature": 0.3,
"messages": [
{
"role": "user",
"content": [
{"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'$VIDEO_B64'"}},
{"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,'$AUDIO_B64'"}},
{"type": "text", "text": "Describe what happens in this video."}
]
}
]
}'
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
## Working with videos
Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.
### Preprocessing video
Extract frames at 1 FPS and downscale to 360p for efficient processing:
```bash
ffmpeg -y -i input_video.mp4 \
-t 60 \
-vf "fps=1,scale=-1:360" \
-c:v libx264 -preset fast \
-an \
processed_video.mp4| Parameter | Description |
|---|---|
-t 60 | Limit to first 60 seconds |
fps=1 | Extract 1 frame per second |
scale=-1:360 | Downscale to 360p height, maintain aspect ratio |
-an | Remove audio track (extracted separately) |
Preprocessing audio#
Extract audio as Opus in an Ogg container for optimal compression:
ffmpeg -y -i input_video.mp4 \
-t 60 \
-vn \
-c:a libopus \
-b:a 24k \
-ar 16000 \
-ac 1 \
audio.ogg| Parameter | Description |
|---|---|
-t 60 | Limit to first 60 seconds |
-vn | Remove video track |
-c:a libopus | Use Opus codec |
-b:a 24k | 24 kbps bitrate |
-ar 16000 | 16 kHz sample rate |
-ac 1 | Mono audio |
Complete preprocessing example#
import subprocess
import tempfile
import base64
import os
def preprocess_video(video_path: str) -> tuple[str, str]:
"""
Preprocess video for optimal model input.
Returns:
Tuple of (video_base64, audio_base64)
"""
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
processed_video_path = tmp_video.name
with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
audio_path = tmp_audio.name
try:
# Process video: 1 FPS, 360p, max 60 seconds
subprocess.run([
"ffmpeg", "-y", "-i", video_path,
"-t", "60",
"-vf", "fps=1,scale=-1:360",
"-c:v", "libx264", "-preset", "fast",
"-an",
processed_video_path
], check=True, capture_output=True)
# Extract audio: Opus/Ogg, mono, 16kHz, 24kbps
subprocess.run([
"ffmpeg", "-y", "-i", video_path,
"-t", "60",
"-vn",
"-c:a", "libopus",
"-b:a", "24k",
"-ar", "16000",
"-ac", "1",
audio_path
], check=True, capture_output=True)
with open(processed_video_path, "rb") as f:
video_b64 = base64.b64encode(f.read()).decode("utf-8")
with open(audio_path, "rb") as f:
audio_b64 = base64.b64encode(f.read()).decode("utf-8")
return video_b64, audio_b64
finally:
os.unlink(processed_video_path)
os.unlink(audio_path)Preprocessing is highly recommended to reduce latency and ensure consistent performance.
Performance considerations#
Tips for optimal throughput:
- Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
- Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
- Limit video duration – Cap at 60 seconds for consistent performance
- Use dedicated deployments – Scale replicas based on your throughput needs
Known limitations#
- Video duration: Maximum 60 seconds recommended for optimal performance
- Supported formats:
.mp4for video,.ogg(Opus) for audio - Base64 size: Total encoded payload should be under 10MB
- Deployment required: Video models are not available on serverless; dedicated deployment required
Related resources#
Interactive notebook for video and audio analysis
Query models with image inputs
Deploy models on dedicated GPUs