Video & Audio Inputs

Summary: Query multimodal models to process video and audio content directly

Some Omni/multimodal models can process audio and/or video inputs directly, enabling video captioning, scene analysis, content understanding, and multimodal question answering. A good example is Qwen3 Omni (qwen3-omni-30b-a3b-instruct), which supports video, audio, and text inputs in a single request. Deploy these models using dedicated deployments for production workloads.

Available models#

Model	Input support	Notes
Qwen3 Omni 30B A3B Instruct	Video, audio, text	Dedicated deployment required
Molmo2-4B	Video, text	Dedicated deployment required
Molmo2-8B	Video, text	Dedicated deployment required

Qwen3 Omni supports native video and audio inputs. Molmo2 models accept video and text only: use the same request structure as below, but omit the audio_url part. Molmo2 models also cannot interpret the audio track embedded in a video.
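
The difference between the two model families comes down to whether an audio part is included in the message content. A minimal sketch of that choice, following the request structure used in the Chat Completions API section; the helper name `build_content` is hypothetical and not part of any Fireworks SDK:

```python
from typing import Optional


def build_content(prompt: str, video_b64: str, audio_b64: Optional[str] = None) -> list[dict]:
    """Assemble the `content` list for a single user message.

    Hypothetical helper: pass audio_b64=None for video-only models
    such as Molmo2 so that no audio_url part is sent.
    """
    parts = [
        {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
    ]
    if audio_b64 is not None:
        parts.append(
            {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}}
        )
    # The text prompt goes last, matching the order used in this guide's examples
    parts.append({"type": "text", "text": prompt})
    return parts
```

Calling `build_content(prompt, video_b64)` without the third argument produces the Molmo2-style content; passing `audio_b64` as well produces the Qwen3 Omni form.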

Create a deployment#

Video and audio models require dedicated deployments. Create one using firectl:

```bash
firectl deployment create qwen3-omni-30b-a3b-instruct \
  --account-id <YOUR_ACCOUNT_ID> \
  --min-replica-count 1 \
  --max-replica-count 1 \
  --deployment-shape qwen3-omni-30b-a3b-instruct-minimal
```

Make sure to use the predefined qwen3-omni-30b-a3b-instruct-minimal deployment shape for your deployment to work correctly.

Chat Completions API#

Provide video and audio as base64-encoded data URLs. The model accepts video_url, audio_url, and text content types.

```python
import os
import base64
import requests

# Load and encode your preprocessed video and audio
with open("processed_video.mp4", "rb") as f:
    video_b64 = base64.b64encode(f.read()).decode("utf-8")

with open("audio.ogg", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

# API configuration
url = "https://api.fireworks.ai/inference/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
}

# Request payload
payload = {
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_b64}"}},
                {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"}},
                {"type": "text", "text": "Describe what happens in this video."},
            ],
        },
    ],
}

# Send request and fail loudly on HTTP errors before reading the body
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

```bash
# Encode your files (run these separately)
# The -i form is the macOS syntax; with GNU coreutils use `base64 -w 0 FILE`
# so the output is not line-wrapped inside the JSON string
VIDEO_B64=$(base64 -i processed_video.mp4)
AUDIO_B64=$(base64 -i audio.ogg)

curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -d '{
    "model": "accounts/<YOUR_ACCOUNT_ID>/models/qwen3-omni-30b-a3b-instruct#accounts/<YOUR_ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    "max_tokens": 1000,
    "temperature": 0.3,
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,'$VIDEO_B64'"}},
          {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,'$AUDIO_B64'"}},
          {"type": "text", "text": "Describe what happens in this video."}
        ]
      }
    ]
  }'
```
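
The `model` value in the requests above joins a model path and a deployment path with `#`. A small sketch of building that string, assuming (as in the examples in this guide) that the same account ID appears in both halves; `deployed_model_id` is a hypothetical helper name, not a Fireworks API:

```python
def deployed_model_id(account_id: str, model: str, deployment_id: str) -> str:
    """Build the `model` string for a dedicated deployment.

    Assumption: the format mirrors the request examples in this guide,
    with one account ID used for both the model and the deployment.
    """
    return (
        f"accounts/{account_id}/models/{model}"
        f"#accounts/{account_id}/deployments/{deployment_id}"
    )


# Example with placeholder values
print(deployed_model_id("my-account", "qwen3-omni-30b-a3b-instruct", "abc123"))
```

Keeping this in one function avoids typos when the same identifier is needed across several requests.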

Working with videos#

Video models perform best with preprocessed inputs that balance quality and token efficiency. Use ffmpeg to optimize your video and audio before sending requests.
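
Since the preprocessing steps in this section shell out to ffmpeg, it can help to check that the binary is available before starting a batch. A minimal sketch using the standard library's `shutil.which`; the helper name `require_tool` is hypothetical:

```python
import shutil


def require_tool(name: str = "ffmpeg") -> str:
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; install it before preprocessing")
    return path
```

Calling `require_tool()` once at startup gives a clear error message instead of a `FileNotFoundError` raised from deep inside a subprocess call.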

Preprocessing video#

Extract frames at 1 FPS and downscale to 360p for efficient processing:

```bash
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vf "fps=1,scale=-1:360" \
  -c:v libx264 -preset fast \
  -an \
  processed_video.mp4
```

Parameter	Description
`-t 60`	Limit to first 60 seconds
`fps=1`	Extract 1 frame per second
`scale=-1:360`	Downscale to 360p height, maintain aspect ratio (libx264 needs an even width; use `scale=-2:360` if encoding fails on an odd width)
`-an`	Remove audio track (extracted separately)

Preprocessing audio#

Extract audio as Opus in an Ogg container for optimal compression:

```bash
ffmpeg -y -i input_video.mp4 \
  -t 60 \
  -vn \
  -c:a libopus \
  -b:a 24k \
  -ar 16000 \
  -ac 1 \
  audio.ogg
```

Parameter	Description
`-t 60`	Limit to first 60 seconds
`-vn`	Remove video track
`-c:a libopus`	Use Opus codec
`-b:a 24k`	24 kbps bitrate
`-ar 16000`	16 kHz sample rate
`-ac 1`	Mono audio
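
To see what these settings mean for request size, the arithmetic can be worked through directly. This is a rough estimate only: it treats 24 kbps as an exact constant rate and ignores Ogg container overhead, neither of which holds precisely in practice. The function names are illustrative.

```python
import math


def estimate_audio_bytes(bitrate_kbps: float = 24, seconds: float = 60) -> int:
    """Approximate encoded audio size in bytes (bitrate * duration / 8)."""
    return math.ceil(bitrate_kbps * 1000 * seconds / 8)


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of data."""
    return 4 * ((n_bytes + 2) // 3)


raw = estimate_audio_bytes()
print(raw, base64_length(raw))  # prints: 180000 240000
```

At these settings a 60-second clip is roughly 180 KB of audio, or about 240 KB once base64-encoded, so the audio is usually a small share of the total payload.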

Complete preprocessing example#

```python
import subprocess
import tempfile
import base64
import os

def preprocess_video(video_path: str) -> tuple[str, str]:
    """
    Preprocess video for optimal model input.

    Returns:
        Tuple of (video_base64, audio_base64)
    """
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_video:
        processed_video_path = tmp_video.name
    with tempfile.NamedTemporaryFile(suffix=".ogg", delete=False) as tmp_audio:
        audio_path = tmp_audio.name

    try:
        # Process video: 1 FPS, 360p, max 60 seconds
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vf", "fps=1,scale=-1:360",
            "-c:v", "libx264", "-preset", "fast",
            "-an",
            processed_video_path
        ], check=True, capture_output=True)

        # Extract audio: Opus/Ogg, mono, 16kHz, 24kbps
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-t", "60",
            "-vn",
            "-c:a", "libopus",
            "-b:a", "24k",
            "-ar", "16000",
            "-ac", "1",
            audio_path
        ], check=True, capture_output=True)

        with open(processed_video_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")

        with open(audio_path, "rb") as f:
            audio_b64 = base64.b64encode(f.read()).decode("utf-8")

        return video_b64, audio_b64

    finally:
        os.unlink(processed_video_path)
        os.unlink(audio_path)
```

Preprocessing is highly recommended to reduce latency and ensure consistent performance.
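
One practical detail when running ffmpeg with `capture_output=True` is that its error output is hidden unless the exception is inspected. A minimal sketch of a wrapper that surfaces stderr on failure; `run_checked` is a hypothetical helper, not part of the example above:

```python
import subprocess


def run_checked(cmd: list[str]) -> None:
    """Run a command quietly, but include its stderr in the error if it fails."""
    try:
        subprocess.run(cmd, check=True, capture_output=True)
    except subprocess.CalledProcessError as exc:
        # stderr is bytes here because text mode was not requested
        stderr = exc.stderr.decode("utf-8", errors="replace")
        raise RuntimeError(
            f"{cmd[0]} exited with code {exc.returncode}:\n{stderr}"
        ) from exc
```

Swapping this in for the bare `subprocess.run` calls would make problems such as a missing audio stream or an unsupported codec visible in the traceback.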

Performance considerations#

Tips for optimal throughput:

Preprocess all videos – 1 FPS at 360p provides good quality with minimal tokens
Extract audio separately – Opus/Ogg at 24kbps offers excellent compression
Limit video duration – Cap at 60 seconds for consistent performance
Use dedicated deployments – Scale replicas based on your throughput needs

Known limitations#

Video duration: Maximum 60 seconds recommended for optimal performance
Supported formats: .mp4 for video, .ogg (Opus) for audio
Base64 size: Total encoded payload should be under 10MB
Deployment required: Video models are not available on serverless; dedicated deployment required
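
The duration and size limits above can be checked before a request is sent. The following is a sketch under stated assumptions: "10MB" is read as 10,000,000 bytes, the limit is applied to the combined base64 length of the video and audio, and the function names are illustrative.

```python
from typing import Optional

MAX_PAYLOAD_BYTES = 10_000_000  # assumption: "10MB" interpreted as 10 * 10**6 bytes
MAX_DURATION_SECONDS = 60       # recommended maximum from the list above


def encoded_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of data."""
    return 4 * ((n_bytes + 2) // 3)


def check_limits(
    video_bytes: int,
    audio_bytes: int = 0,
    duration_seconds: Optional[float] = None,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    total = encoded_size(video_bytes) + encoded_size(audio_bytes)
    if total >= MAX_PAYLOAD_BYTES:
        problems.append(f"encoded payload is {total} bytes, limit is {MAX_PAYLOAD_BYTES}")
    if duration_seconds is not None and duration_seconds > MAX_DURATION_SECONDS:
        problems.append(f"duration is {duration_seconds}s, recommended maximum is {MAX_DURATION_SECONDS}s")
    return problems
```

Because base64 expands data by about a third, a raw video of roughly 7.5 MB already reaches the assumed limit.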

Link last verified June 7, 2026.

Source: Fireworks AI Docs