Vision Inputs

Summary: Fine-tune vision-language models (VLMs) with the Training API using multimodal chat data containing images and text.

Original Documentation

The Training API supports vision-language model (VLM) fine-tuning, allowing you to train models that understand both images and text. This works across all training modes — SFT, DPO, and RL — using the same API primitives and cookbook recipes you already know.

VLM support in the Training API requires a VLM-compatible training shape. See Training Shapes for available shapes.

## What changes for vision

Compared to text-only training, VLM fine-tuning differs in three ways:

| Aspect | Text-only | Vision |
| --- | --- | --- |
| Training shape | Text model shape (e.g. `qwen3-8b-128k`) | VLM shape (e.g. `qwen3-vl-8b-65k`) |
| Tokenizer | Text tokenizer (e.g. `Qwen/Qwen3-8B`) | VLM processor (e.g. `Qwen/Qwen3-VL-8B-Instruct`) |
| Message format | `content` is a string | `content` is an array of `text` and `image_url` objects |

Everything else — loss functions, checkpointing, weight sync, deployment sampling — works identically.

## Dataset format

Vision datasets use the standard OpenAI-compatible chat format. The key difference is that a message's `content` field can be an array of content parts that mixes text and images, as in the examples below.

### Single image

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What objects do you see in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "I can see a red car, a tree, and a blue house."
    }
  ]
}
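
A record in this shape can be assembled from raw image bytes with the standard library alone. The sketch below is illustrative and not part of the Training API: the helper names `to_data_uri` and `make_sft_record` are hypothetical, and the four placeholder bytes stand in for the contents of a real JPEG file.

```python
import base64
import json


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URI with a MIME type prefix."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def make_sft_record(question: str, image_bytes: bytes, answer: str) -> dict:
    """Build one single-image chat record (hypothetical helper)."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": to_data_uri(image_bytes)}},
                ],
            },
            {"role": "assistant", "content": answer},
        ]
    }


# Placeholder bytes (a JPEG header fragment); in practice, read the file in binary mode.
record = make_sft_record(
    "What objects do you see in this image?",
    b"\xff\xd8\xff\xe0",
    "I can see a red car, a tree, and a blue house.",
)
line = json.dumps(record)  # one JSON object per line in a .jsonl dataset
```

Writing one `json.dumps` result per line produces a JSONL file like the `vision_data.jsonl` path used later on this page.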

### Multiple images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Compare these two images"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The first image shows a daytime scene while the second shows the same location at night."
    }
  ]
}

### Multi-turn with images

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this kitchen."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQ..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "This is a modern open-plan kitchen with white cabinets and granite countertops."
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Now compare it with this living room."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4BBB..."}}
      ]
    },
    {
      "role": "assistant",
      "content": "Both spaces share a modern aesthetic with clean lines and neutral colors."
    }
  ]
}

### Image encoding requirements

Images must be base64-encoded with a MIME type prefix. Raw HTTP URLs are not supported in training data.

Correct:

```json
{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
  }
}
```

Incorrect:

```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg"
  }
}
```

Supported image formats: **PNG**, **JPEG/JPG**.
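
These requirements are easy to check before uploading a dataset. The following is a minimal sketch rather than an official validator: the function name `find_image_problems` and the regular expression are assumptions, and accepting the `image/jpg` spelling alongside `image/jpeg` is a guess based on the "JPEG/JPG" wording above.

```python
import re

# Assumed pattern for a PNG or JPEG base64 data URI; not taken from the Fireworks docs.
DATA_URI = re.compile(r"^data:image/(png|jpeg|jpg);base64,")


def find_image_problems(record: dict) -> list[str]:
    """Return a description of every image part that is not a PNG/JPEG data URI."""
    problems = []
    for i, message in enumerate(record.get("messages", [])):
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text message, nothing to check
        for j, part in enumerate(content or []):
            if part.get("type") != "image_url":
                continue
            url = part.get("image_url", {}).get("url", "")
            if not DATA_URI.match(url):
                problems.append(f"messages[{i}].content[{j}]: not a PNG/JPEG base64 data URI")
    return problems


good = {"messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4A=="}},
]}]}
bad = {"messages": [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
]}]}

problems_good = find_image_problems(good)
problems_bad = find_image_problems(bad)
```

The check only inspects the URI prefix; it does not decode the payload or confirm that the bytes form a valid image.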

If your dataset contains image URLs, download and convert them to base64 first. See the [conversion script in the managed VLM fine-tuning guide](/fine-tuning/fine-tuning-vlm#if-your-dataset-contains-image-urls).
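
For orientation, the conversion can be sketched with the standard library. This is not the script from the linked guide: `download`, `sniff_mime`, and `inline_images` are hypothetical names, and detecting the format from the leading magic bytes is one possible approach among several.

```python
import base64
import copy
import urllib.request
from typing import Callable


def download(url: str) -> bytes:
    """Fetch raw bytes over HTTP(S) using only the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def sniff_mime(data: bytes) -> str:
    """Identify PNG or JPEG from the file's leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    raise ValueError("unsupported image format (expected PNG or JPEG)")


def inline_images(record: dict, fetch: Callable[[str], bytes] = download) -> dict:
    """Return a copy of `record` with http(s) image URLs replaced by data URIs."""
    result = copy.deepcopy(record)
    for message in result["messages"]:
        content = message["content"]
        if isinstance(content, str):
            continue  # text-only message
        for part in content:
            if part.get("type") != "image_url":
                continue
            url = part["image_url"]["url"]
            if url.startswith(("http://", "https://")):
                data = fetch(url)
                encoded = base64.b64encode(data).decode("ascii")
                part["image_url"]["url"] = f"data:{sniff_mime(data)};base64,{encoded}"
    return result
```

The `fetch` parameter exists so that the download step can be swapped out, for example for a cached or authenticated client, or for a stub in tests.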

## Cookbook: VLM SFT

The cookbook's `sft_loop` recipe works with vision datasets out of the box. Use a VLM training shape and a VLM tokenizer:

```python
from training.recipes.sft_loop import Config, main
from training.utils import TrainerConfig

cfg = Config(
    log_path="./vlm_sft_logs",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
    dataset="/path/to/vision_data.jsonl",
    tokenizer_model="Qwen/Qwen3-VL-8B-Instruct",
    max_seq_len=4096,
    epochs=1,
    batch_size=4,
    learning_rate=1e-5,
    trainer=TrainerConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-vl-8b-65k",
    ),
)

main(cfg)
```

The recipe handles vision-aware tokenization automatically — image tokens are assigned weight 0.0 (prompt) and text response tokens are assigned weight 1.0 (train).
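
The weighting rule can be pictured as a mask over the token sequence. The sketch below is not the recipe's implementation: `Segment` and `build_weights` are hypothetical names, the token IDs are made up, and treating prompt text the same as image tokens (weight 0.0) is an assumption drawn from the "(prompt)" label above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # "prompt_text", "image", or "response"
    token_ids: list[int]


def build_weights(segments: list[Segment]) -> list[float]:
    """Weight 1.0 for response tokens, 0.0 for prompt text and image tokens."""
    weights: list[float] = []
    for segment in segments:
        value = 1.0 if segment.kind == "response" else 0.0
        weights.extend([value] * len(segment.token_ids))
    return weights


example = [
    Segment("prompt_text", [101, 102]),
    Segment("image", [9000, 9000, 9000]),
    Segment("response", [201, 202]),
]
weights = build_weights(example)  # five zeros followed by two ones
```

Only positions with weight 1.0 contribute to the training loss, so the model is trained to produce the response and not to reproduce the prompt or image placeholders.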

## API-level: VLM training loop

For full control over the training loop, use the API directly with a VLM training shape. The workflow is the same as text-only training, but the tokenizer and shape are VLM-specific:

### 1. Create the managed VLM service

import os
from fireworks.training.sdk import FiretitanServiceClient

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

base_model = "accounts/fireworks/models/qwen3-vl-8b-instruct"
tokenizer_model = "Qwen/Qwen3-VL-8B-Instruct"
shape_id = "accounts/fireworks/trainingShapes/qwen3-vl-8b-65k"

service = FiretitanServiceClient.from_firetitan_config(
    api_key=api_key,
    base_url=base_url,
    base_model=base_model,
    tokenizer_model=tokenizer_model,
    lora_rank=0,
    training_shape_id=shape_id,
    learning_rate=1e-5,
    create_deployment=False,
    cleanup_trainer_on_close=True,
)

### 2. Connect and train

import torch
import tinker
import transformers
from tinker_cookbook.supervised.common import datum_from_model_input_weights

training_client = service.create_training_client(
    base_model=base_model, lora_rank=0,
)

processor = transformers.AutoProcessor.from_pretrained(
    tokenizer_model, trust_remote_code=True,
)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/..."}},
        ],
    },
    {
        "role": "assistant",
        "content": "The image shows a sunset over the ocean.",
    },
]

text = processor.apply_chat_template(conversation, tokenize=False)
full_tokens = processor.tokenizer.encode(text)

prompt_text = processor.apply_chat_template(conversation[:1], tokenize=False)
prompt_len = len(processor.tokenizer.encode(prompt_text))

weights = torch.zeros(len(full_tokens), dtype=torch.float32)
weights[prompt_len:] = 1.0

datum = datum_from_model_input_weights(
    tinker.ModelInput.from_ints(full_tokens),
    weights,
    max_length=4096,
)

def sft_loss(data, logprobs_list):
    total_loss = torch.tensor(0.0)
    n_tokens = 0
    for i, logprobs in enumerate(logprobs_list):
        w = torch.tensor(data[i].loss_fn_inputs["weights"].data, dtype=torch.float32)
        min_len = min(len(logprobs), len(w))
        total_loss = total_loss - torch.dot(logprobs[:min_len].float(), w[:min_len])
        n_tokens += w[:min_len].sum().item()
    return total_loss / max(n_tokens, 1), {"sft_loss": (total_loss / max(n_tokens, 1)).item()}

for step in range(100):
    training_client.forward_backward_custom([datum], sft_loss).result()
    training_client.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    ).result()
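
To make the arithmetic in `sft_loss` concrete, here is a dependency-free worked example for a single sequence. It is a simplified sketch: `weighted_nll` is a hypothetical name, the log-probabilities are invented, and the real function above sums over every datum in the batch before normalizing.

```python
def weighted_nll(logprobs: list[float], weights: list[float]) -> float:
    """Mean negative log-likelihood over the tokens that carry weight."""
    n = min(len(logprobs), len(weights))
    total = -sum(lp * w for lp, w in zip(logprobs[:n], weights[:n]))
    n_tokens = sum(weights[:n])
    return total / max(n_tokens, 1)


# Two prompt tokens (weight 0) followed by two response tokens (weight 1).
loss = weighted_nll([-0.1, -0.2, -0.5, -1.0], [0.0, 0.0, 1.0, 1.0])
# loss == 0.75: (0.5 + 1.0) / 2, the prompt tokens contribute nothing
```

The `max(n_tokens, 1)` guard mirrors the one in `sft_loss` and avoids division by zero when every weight is 0.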

### 3. Save and promote

Checkpointing and weight sync work identically to text-only training:

saved = training_client.save_weights_for_sampler(
    "vlm-final",
    checkpoint_type="base",
).result()

entry = next(
    row for row in service.list_checkpoints(service.trainer_job_id)
    if row["name"].endswith(f"/checkpoints/{saved.path}")
)
model = service.promote_checkpoint(
    name=entry["name"],
    output_model_id="my-vlm-model",
    base_model="accounts/fireworks/models/qwen3-vl-8b-instruct",
)

service.close()

## VLM DPO and RL

Vision inputs also work with DPO and RL training. The dataset format is the same — use multimodal content arrays in your messages:

### DPO with vision

{
  "chosen": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This bar chart shows quarterly revenue growth of 15% year-over-year."}
    ]
  },
  "rejected": {
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this chart."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
        ]
      },
      {"role": "assistant", "content": "This is a chart."}
    ]
  }
}
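
Because both branches repeat the same multimodal prompt, a small builder can keep them in sync. This is a sketch under assumptions: `make_dpo_record` is a hypothetical helper, and the image URL is the truncated placeholder from the example above, not real data.

```python
import copy


def make_dpo_record(prompt_parts: list[dict], chosen: str, rejected: str) -> dict:
    """Pair one multimodal prompt with a preferred and a dispreferred answer."""

    def conversation(answer: str) -> dict:
        # Deep-copy so the two branches never share mutable content parts.
        user_turn = {"role": "user", "content": copy.deepcopy(prompt_parts)}
        return {"messages": [user_turn, {"role": "assistant", "content": answer}]}

    return {"chosen": conversation(chosen), "rejected": conversation(rejected)}


parts = [
    {"type": "text", "text": "Describe this chart."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}},
]
pair = make_dpo_record(
    parts,
    "This bar chart shows quarterly revenue growth of 15% year-over-year.",
    "This is a chart.",
)
```

Building both conversations from one `prompt_parts` list guarantees the chosen and rejected examples differ only in the assistant turn.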

### RL with vision prompts

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Solve the math problem shown in this image. Show your reasoning."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}}
      ]
    }
  ]
}
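
If prompts are being derived from an existing SFT dataset, one way to get this shape is to drop the trailing assistant turn. The helper below is a hypothetical sketch, not part of the cookbook, and it assumes that an RL prompt record should end with a user message as in the example above.

```python
def to_rl_prompt(record: dict) -> dict:
    """Drop trailing assistant turns so the record ends with the prompt."""
    messages = list(record["messages"])
    while messages and messages[-1]["role"] == "assistant":
        messages.pop()
    if not messages:
        raise ValueError("record has no turn left to use as a prompt")
    return {"messages": messages}


sft_record = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Solve the math problem shown in this image."},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBOR..."}},
            ],
        },
        {"role": "assistant", "content": "The answer is 12."},
    ]
}
rl_record = to_rl_prompt(sft_record)
```

The original record is left unchanged because the message list is copied before any turn is removed.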

Use the corresponding cookbook recipes (dpo_loop, rl_loop) with a VLM training shape and tokenizer — the multimodal message handling is automatic.

## Available VLM training shapes

| Model | Shape ID | Context | GPUs |
| --- | --- | --- | --- |
| Qwen3 VL 8B | `accounts/fireworks/trainingShapes/qwen3-vl-8b-65k` | 65k | 4 |

See Training Shapes for the full list and details.

Link last verified June 7, 2026.
Source: Fireworks AI Docs