Supervised Fine Tuning - Vision

Summary: Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets

Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding.

To see all vision models that support fine-tuning, visit the Model Library for vision models.

## Fine-tuning a VLM using LoRA

Vision datasets must be JSONL files in OpenAI-compatible chat format. Each line represents one complete training example.

Dataset Requirements:

  • Format: .jsonl file
  • Minimum examples: 3
  • Maximum examples: 3 million per dataset
  • Images: Must be base64 encoded with proper MIME type prefixes
  • Supported image formats: PNG, JPG, JPEG

Message Schema: Each training example must include a messages array where each message has:

  • role: one of system, user, or assistant
  • content: either a plain text string or an array of text and image objects
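
The requirements and schema above can be checked locally before uploading. The following is a minimal sketch, not an official validator: the function name `validate_example` and the exact data URI prefixes are assumptions derived from the supported formats listed above (JPG and JPEG are assumed to share the `image/jpeg` MIME type).

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}
# Assumed prefixes for the supported formats (PNG, JPG, JPEG).
ALLOWED_PREFIXES = ("data:image/png;base64,", "data:image/jpeg;base64,")


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: must be an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain text content needs no further checks
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a string or an array")
            continue
        for part in content:
            if not isinstance(part, dict) or part.get("type") != "image_url":
                continue  # only image parts are checked here
            image = part.get("image_url")
            url = image.get("url", "") if isinstance(image, dict) else ""
            if not (isinstance(url, str) and url.startswith(ALLOWED_PREFIXES)):
                problems.append(f"message {i}: image is not a base64 data URI")
    return problems
```

Running `validate_example` over every line of a dataset file and printing the non-empty results gives a quick pre-upload check. It does not verify that the base64 payload decodes to a valid image.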

### Basic VLM Dataset Example

    ```json
    {
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful visual assistant that can analyze images and answer questions about them."
        },
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What objects do you see in this image?"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
              }
            }
          ]
        },
        {
          "role": "assistant",
          "content": "I can see a red car, a tree, and a blue house in this image."
        }
      ]
    }
    ```
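
An example like the one above can also be generated programmatically. The sketch below is illustrative only: `build_example` and `to_jsonl_line` are hypothetical helper names, and the example omits the optional system message.

```python
import base64
import json


def build_example(image_bytes: bytes, mime_type: str, question: str, answer: str) -> dict:
    """Build one single-image training example in the chat format shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    },
                ],
            },
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl_line(example: dict) -> str:
    """Serialize one example; json.dumps escapes newlines, so it stays on one line."""
    return json.dumps(example)
```

Writing `to_jsonl_line(example) + "\n"` for each example into a `.jsonl` file yields one training example per line.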

### If your dataset contains image URLs

Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64.

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="❌ Incorrect"></span>
    ```json
        {
          "type": "image_url",
          "image_url": {
            // ❌ Raw HTTP/HTTPS URLs are NOT supported
            "url": "https://example.com/image.jpg"
          }
        }
        ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="✅ Correct"></span>
    ```json
        {
          "type": "image_url",
          "image_url": {
            // ✅ Use data URI with base64 encoding
            // Format: data:image/{format};base64,{base64_encoded_data}
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
          }
        }
        ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

You can use the following script to automatically convert your dataset to the correct format:

<AccordionGroup>
  <Accordion title="Python script to download and encode images to base64">
    **Usage:**

    ```bash
        # Install required dependency
        pip install requests

        # Download the script
        wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py

        # Run the script - will output a new dataset <path_to_your_dataset>_base64.jsonl
        python download_images_and_encode_to_b64.py --input_file <path_to_your_dataset.jsonl>
        ```
  </Accordion>
</AccordionGroup>
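
If your images are local files rather than URLs, the same data URI format can be produced with the Python standard library. This is a minimal sketch under the assumption that the MIME type can be inferred from the file extension. The function name `image_file_to_data_uri` is hypothetical and is not part of the cookbook script.

```python
import base64
import mimetypes
from pathlib import Path

# MIME types assumed for the supported formats (PNG, JPG, JPEG).
SUPPORTED_MIME_TYPES = {"image/png", "image/jpeg"}


def image_file_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI with its MIME prefix."""
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the file extension
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported image type for {path}: {mime_type}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` object.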

### Advanced Dataset Examples

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="Multi-image Conversation"></span>
    ```json
        {
          "messages": [
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": "Compare these two images and tell me the differences"
                },
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
                  }
                },
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
                  }
                }
              ]
            },
            {
              "role": "assistant",
              "content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
            }
          ]
        }
        ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Multi-turn Conversation"></span>
    ```json
        {
          "messages": [
            {
              "role": "system",
              "content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
            },
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": "Can you describe this kitchen layout for me?"
                },
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
                  }
                }
              ]
            },
            {
              "role": "assistant",
              "content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
            },
            {
              "role": "user",
              "content": "Now look at this living room. Do you think the styles would work well together?"
            },
            {
              "role": "assistant",
              "content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
            },
            {
              "role": "user",
              "content": [
                {
                  "type": "text",
                  "text": "Sorry about that! Here's the living room:"
                },
                {
                  "type": "image_url",
                  "image_url": {
                    "url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
                  }
                }
              ]
            },
            {
              "role": "assistant",
              "content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
            }
          ]
        }
        ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

### Try with an Example Dataset

To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `<think></think>` tags for reasoning.

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="Download with curl"></span>
    ```bash
        # Download the example dataset
        curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
        ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="Download with wget"></span>
    ```bash
        # Download the example dataset
        wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
        ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>
  <span class="step-end"></span>

  <span class="step-marker" data-step-title="Upload your VLM dataset"></span>
Upload your prepared JSONL dataset to Fireworks for training:

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
    ```bash
        firectl dataset create my-vlm-dataset /path/to/vlm_training_data.jsonl
        ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="UI"></span>
    Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard.

    
      <img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/dataset.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=406fa721650d41553f3adc5e4d372a68" alt="Dataset creation interface" width="2972" height="2060" data-path="images/fine-tuning/dataset.png" />
    
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="REST API"></span>
    ```javascript
        // Create dataset entry
        const createDatasetPayload = {
          datasetId: "my-vlm-dataset",
          dataset: { userUploaded: {} }
        };

        const response = await fetch(`${BASE_URL}/datasets`, {
          method: "POST",
          headers: {
            "Authorization": `Bearer ${API_KEY}`,
            "Content-Type": "application/json"
          },
          body: JSON.stringify(createDatasetPayload)
        });

        // Upload JSONL file
        const formData = new FormData();
        formData.append("file", fileInput.files[0]);

        const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, {
          method: "POST",
          headers: { "Authorization": `Bearer ${API_KEY}` },
          body: formData
        });
        ```
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

<span class="callout-start" data-callout-type="tip"></span>
  For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details.
<span class="callout-end"></span>
  <span class="step-end"></span>

  <span class="step-marker" data-step-title="Launch VLM fine-tuning job"></span>
Create a supervised fine-tuning job for your VLM:

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
    ```bash
        firectl sftj create \
          --base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
          --dataset my-vlm-dataset \
          --output-model my-custom-vlm \
          --epochs 3
        ```

    For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings).
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="UI"></span>
    1. Navigate to the Fine-tuning tab in the Fireworks console
    2. Click "Create Fine-tuning Job"
    3. Select your VLM base model (Qwen 2.5 VL)
    4. Choose your uploaded dataset
    5. Configure training parameters
    6. Launch the job

    
      <img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/create-sftj.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=a2ea6a163d5d3e83ee7322aa90bb51e6" alt="Fine-tuning job creation interface" width="2970" height="2048" data-path="images/fine-tuning/create-sftj.png" />
    
  <span class="tab-end"></span>
<span class="tab-group-end"></span>

VLM fine-tuning jobs typically take longer than text-only fine-tuning jobs because of the additional image processing. Expect training to take several hours, depending on dataset size and model size.
  <span class="step-end"></span>

  <span class="step-marker" data-step-title="Monitor training progress"></span>
Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning).


  <img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/vlm-sftj.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=e93405e55268af1c2202169c0bff2a39" alt="VLM fine-tuning job in the Fireworks console" width="3802" height="1690" data-path="images/fine-tuning/vlm-sftj.png" />


Monitor key metrics:

* **Training loss**: Should generally decrease over time
* **Evaluation loss**: Monitor for overfitting if using evaluation dataset
* **Training progress**: Epochs completed and estimated time remaining
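
Evaluation loss is only reported when the job has an evaluation dataset. One way to get one is to hold out part of your JSONL file before uploading. The sketch below is an assumption-laden illustration: the function name `split_examples`, the 10% hold-out fraction, and the fixed seed are arbitrary choices, not Fireworks recommendations.

```python
import random


def split_examples(lines, eval_fraction=0.1, seed=0):
    """Shuffle JSONL lines and return (train_lines, eval_lines)."""
    examples = [line for line in lines if line.strip()]  # drop blank lines
    random.Random(seed).shuffle(examples)  # fixed seed keeps the split reproducible
    eval_count = max(1, int(len(examples) * eval_fraction))  # hold out at least one
    return examples[eval_count:], examples[:eval_count]
```

Each returned list can be written to its own `.jsonl` file and uploaded as a separate dataset. How the evaluation dataset is attached to a job is covered in the additional SFT job settings.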

<span class="callout-start" data-callout-type="check"></span>
  Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment.
<span class="callout-end"></span>
  <span class="step-end"></span>

  <span class="step-marker" data-step-title="Deploy your fine-tuned VLM"></span>
Once training is complete, deploy your custom VLM:

<span class="tab-group-start"></span>
  <span class="tab-start" data-tab-title="firectl"></span>
    ```bash
        # Create a deployment for your fine-tuned VLM
        firectl deployment create my-custom-vlm

        # Check deployment status
        firectl deployment get accounts/your-account/deployment/deployment-id
        ```
  <span class="tab-end"></span>

  <span class="tab-start" data-tab-title="UI"></span>
    Deploy from the UI using the `Deploy` dropdown in the fine-tuning job page.

    
      <img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/vlm-sftj-deploy.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=c4d8d33d8ea388c186e078dd781386dc" alt="Deploy dropdown in the fine-tuning job page" width="3802" height="1690" data-path="images/fine-tuning/vlm-sftj-deploy.png" />
    
  <span class="tab-end"></span>
<span class="tab-group-end"></span>
  <span class="step-end"></span>
<span class="steps-end"></span>

## Advanced Configuration

For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide.

<span class="callout-start" data-callout-type="tip"></span>
  Need custom training loops for VLMs? The **Training API** also supports vision-language model fine-tuning with full control over loss functions, training objectives, and evaluation. See [Training API Vision Inputs](/fine-tuning/training-api/vision-inputs) for details.
<span class="callout-end"></span>

## Interactive Tutorials: Fine-tuning VLMs

For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine-tuning cookbooks that demonstrate the complete process, from dataset preparation through model deployment to evaluation.

<span class="card-group-start" data-cols="2"></span>
  <span class="card-start" data-card-title="VLM Fine-tuning Quickstart" data-card-icon="notebook" data-card-href="https://colab.research.google.com/drive/11WpagNa6xKgh1zhr1xh5uIuVtkPPL-qn"></span>
**Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI**
  <span class="card-end"></span>

  <span class="card-start" data-card-title="VLM Fine-tuning + Evals" data-card-icon="notebook" data-card-href="https://huggingface.co/spaces/fireworks-ai/catalog-extract/tree/main/notebooks"></span>
**Fine-tuning a VLM to beat a SOTA closed-source model**
  <span class="card-end"></span>
<span class="card-group-end"></span>

The cookbooks above cover the following:

* Setting up your environment with Fireworks CLI
* Preparing vision datasets in the correct format
* Launching and monitoring VLM fine-tuning jobs
* Testing your fine-tuned model
* Best practices for VLM fine-tuning
* Running inference on serverless VLMs
* Running evals to show performance gains

## Testing Your Fine-tuned VLM

After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:

```python
import openai

client = openai.OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

response = client.chat.completions.create(
    model="accounts/your-account/models/my-custom-vlm",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg"
            },
        },{
            "type": "text",
            "text": "What's in this image?",
        }],
    }]
)
print(response.choices[0].message.content)
```

If you fine-tuned using the example dataset, your model should include `<think></think>` tags in its response.
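
If you want to separate the reasoning from the final answer, the tags can be parsed out of the response text. This is a minimal sketch that assumes the response contains at most one `<think>...</think>` block. The helper name `split_reasoning` is hypothetical, and the exact layout of your model's output may differ.

```python
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response with an optional think block."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()  # no tags: treat the whole text as the answer
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For example, `split_reasoning(response.choices[0].message.content)` would return the reasoning and the answer as two separate strings.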

Link last verified June 7, 2026.
Source: Fireworks AI Docs