# Supervised Fine Tuning - Vision
Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets
Vision-language model (VLM) fine-tuning allows you to adapt pre-trained models that can understand both text and images to your specific use cases. This is particularly valuable for tasks like document analysis, visual question answering, image captioning, and domain-specific visual understanding.
To see all vision models that support fine-tuning, visit the Model Library for vision models.
## Fine-tuning a VLM using LoRA
Vision datasets must be JSONL files in the OpenAI-compatible chat format. Each line represents one complete training example.
**Dataset Requirements:**
- Format: `.jsonl` file
- Minimum examples: 3
- Maximum examples: 3 million per dataset
- Images: Must be base64 encoded with proper MIME type prefixes
- Supported image formats: PNG, JPG, JPEG
**Message Schema:**
Each training example must include a `messages` array where each message has:
- `role`: one of `system`, `user`, or `assistant`
- `content`: an array containing text and image objects, or just text
### Basic VLM Dataset Example
```json
{
"messages": [
{
"role": "system",
"content": "You are a helpful visual assistant that can analyze images and answer questions about them."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects do you see in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
}
}
]
},
{
"role": "assistant",
"content": "I can see a red car, a tree, and a blue house in this image."
}
]
}
```
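The schema rules above can be checked locally before uploading. The following is a minimal sketch, not an official validator: the function name `validate_example` is hypothetical, and the accepted MIME prefixes (`image/png` and `image/jpeg`) are an assumption based on the supported formats listed above.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}
# Assumption: PNG maps to image/png, and JPG/JPEG both map to image/jpeg.
ALLOWED_PREFIXES = ("data:image/png;base64,", "data:image/jpeg;base64,")


def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    messages = example.get("messages") if isinstance(example, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]

    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: must be an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-text content needs no further checks
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a string or an array")
            continue
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                url = (part.get("image_url") or {}).get("url", "")
                if not url.startswith(ALLOWED_PREFIXES):
                    problems.append(f"message {i}: image is not a base64 data URI")
    return problems
```

Calling `validate_example` on each line of a dataset file and printing any non-empty result gives a quick pre-upload check.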
### If your dataset contains image URLs
Images must be base64 encoded with MIME type prefixes. If your dataset contains image URLs, you'll need to download and encode them to base64.
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="❌ Incorrect"></span>
```json
{
"type": "image_url",
"image_url": {
// ❌ Raw HTTP/HTTPS URLs are NOT supported
"url": "https://example.com/image.jpg"
}
}
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="✅ Correct"></span>
```json
{
"type": "image_url",
"image_url": {
// ✅ Use data URI with base64 encoding
// Format: data:image/{format};base64,{base64_encoded_data}
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
}
}
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
You can use the following script to automatically convert your dataset to the correct format:
<AccordionGroup>
<Accordion title="Python script to download and encode images to base64">
**Usage:**
```bash
# Install required dependency
pip install requests
# Download the script
wget https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/utils/download_images_and_encode_to_b64.py
# Run the script - will output a new dataset <path_to_your_dataset>_base64.jsonl
python download_images_and_encode_to_b64.py --input_file <path_to_your_dataset.jsonl>
```
</Accordion>
</AccordionGroup>
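For images that are already on disk, the encoding step itself is short. The sketch below is illustrative and is not the cookbook script: the helper names `bytes_to_data_uri` and `file_to_data_uri` are hypothetical, and the suffix-to-MIME mapping is an assumption based on the supported formats (PNG, JPG, JPEG).

```python
import base64
from pathlib import Path

# Assumption: JPG and JPEG files both use the image/jpeg MIME type.
MIME_BY_SUFFIX = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg"}


def bytes_to_data_uri(data: bytes, suffix: str) -> str:
    """Encode raw image bytes as a data URI for the given file suffix."""
    mime_type = MIME_BY_SUFFIX.get(suffix.lower())
    if mime_type is None:
        raise ValueError(f"unsupported image format: {suffix!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path: str) -> str:
    """Read an image file and return it as a base64 data URI."""
    image_path = Path(path)
    return bytes_to_data_uri(image_path.read_bytes(), image_path.suffix)
```

The returned string can be placed directly in the `url` field of an `image_url` content part.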
### Advanced Dataset Examples
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="Multi-image Conversation"></span>
```json
{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images and tell me the differences"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
}
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4BBBSkZJRg..."
}
}
]
},
{
"role": "assistant",
"content": "The first image shows a daytime scene while the second shows the same location at night. The lighting and shadows are completely different."
}
]
}
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="Multi-turn Conversation"></span>
```json
{
"messages": [
{
"role": "system",
"content": "You are a helpful visual assistant that can analyze images and remember details from previous images in our conversation."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Can you describe this kitchen layout for me?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."
}
}
]
},
{
"role": "assistant",
"content": "This is a modern open-plan kitchen with white cabinets, granite countertops, and stainless steel appliances. The island is positioned in the center with bar seating for 3-4 people. There's a large window above the sink providing natural light."
},
{
"role": "user",
"content": "Now look at this living room. Do you think the styles would work well together?"
},
{
"role": "assistant",
"content": "I'd be happy to help compare the styles! However, I don't see a living room image in your message. Could you please share the living room photo so I can analyze how well it would coordinate with the modern kitchen style we just discussed?"
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Sorry about that! Here's the living room:"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4BBBSkZJRgABAQ..."
}
}
]
},
{
"role": "assistant",
"content": "Perfect! This living room has a complementary modern aesthetic that would work beautifully with the kitchen. Both spaces feature clean lines and a neutral color palette. The living room's contemporary furniture and the kitchen's sleek design would create a cohesive flow in an open floor plan. The warm wood accents in the living room would add nice contrast to the kitchen's cooler tones."
}
]
}
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
### Try with an Example Dataset
To get a feel for how VLM fine-tuning works, you can use an example vision dataset. This is a classification dataset that contains images of food with `<think></think>` tags for reasoning.
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="Download with curl"></span>
```bash
# Download the example dataset
curl -L -o food_reasoning.jsonl https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="Download with wget"></span>
```bash
# Download the example dataset
wget https://huggingface.co/datasets/fireworks-ai/vision-food-reasoning-dataset/resolve/main/food_reasoning.jsonl
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
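Before uploading, it can help to confirm that a downloaded or prepared file falls within the size limits listed in the dataset requirements. This is a rough sketch under that assumption; `summarize_dataset` is a hypothetical helper, not part of any Fireworks tooling.

```python
import json

MIN_EXAMPLES = 3          # minimum stated in the dataset requirements
MAX_EXAMPLES = 3_000_000  # maximum stated in the dataset requirements


def summarize_dataset(lines):
    """Count examples and embedded images in an iterable of JSONL lines."""
    examples = 0
    images = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        example = json.loads(line)
        examples += 1
        for message in example.get("messages", []):
            content = message.get("content")
            if isinstance(content, list):
                images += sum(
                    1
                    for part in content
                    if isinstance(part, dict) and part.get("type") == "image_url"
                )
    return {
        "examples": examples,
        "images": images,
        "size_ok": MIN_EXAMPLES <= examples <= MAX_EXAMPLES,
    }
```

Because the function accepts any iterable of lines, an open file handle for `food_reasoning.jsonl` can be passed to it directly.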
<span class="step-end"></span>
<span class="step-marker" data-step-title="Upload your VLM dataset"></span>
Upload your prepared JSONL dataset to Fireworks for training:
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="firectl"></span>
```bash
firectl dataset create my-vlm-dataset /path/to/vlm_training_data.jsonl
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="UI"></span>
Navigate to the Datasets tab in the Fireworks console, click "Create Dataset", and upload your JSONL file through the wizard.
<img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/dataset.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=406fa721650d41553f3adc5e4d372a68" alt="Dataset creation interface" width="2972" height="2060" data-path="images/fine-tuning/dataset.png" />
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="REST API"></span>
```javascript
// Create dataset entry
const createDatasetPayload = {
datasetId: "my-vlm-dataset",
dataset: { userUploaded: {} }
};
const response = await fetch(`${BASE_URL}/datasets`, {
method: "POST",
headers: {
"Authorization": `Bearer ${API_KEY}`,
"Content-Type": "application/json"
},
body: JSON.stringify(createDatasetPayload)
});
// Upload JSONL file
const formData = new FormData();
formData.append("file", fileInput.files[0]);
const uploadResponse = await fetch(`${BASE_URL}/datasets/my-vlm-dataset:upload`, {
method: "POST",
headers: { "Authorization": `Bearer ${API_KEY}` },
body: formData
});
```
<span class="tab-end"></span>
<span class="tab-group-end"></span>
<span class="callout-start" data-callout-type="tip"></span>
For larger datasets (>500MB), use `firectl` as it handles large uploads more reliably than the web interface. For enhanced data control and security, we also support bring your own bucket (BYOB) configurations. See our [Secure Fine Tuning](/fine-tuning/secure-fine-tuning#gcs-bucket-integration) guide for setup details.
<span class="callout-end"></span>
<span class="step-end"></span>
<span class="step-marker" data-step-title="Launch VLM fine-tuning job"></span>
Create a supervised fine-tuning job for your VLM:
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="firectl"></span>
```bash
firectl sftj create \
--base-model accounts/fireworks/models/qwen2p5-vl-32b-instruct \
--dataset my-vlm-dataset \
--output-model my-custom-vlm \
--epochs 3
```
For additional parameters like learning rates, evaluation datasets, and batch sizes, see [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings).
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="UI"></span>
1. Navigate to the Fine-tuning tab in the Fireworks console
2. Click "Create Fine-tuning Job"
3. Select your VLM base model (Qwen 2.5 VL)
4. Choose your uploaded dataset
5. Configure training parameters
6. Launch the job
<img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/create-sftj.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=a2ea6a163d5d3e83ee7322aa90bb51e6" alt="Fine-tuning job creation interface" width="2970" height="2048" data-path="images/fine-tuning/create-sftj.png" />
<span class="tab-end"></span>
<span class="tab-group-end"></span>
VLM fine-tuning jobs typically take longer than text-only jobs because of the additional image processing. Expect training times of several hours, depending on dataset size and model complexity.
<span class="step-end"></span>
<span class="step-marker" data-step-title="Monitor training progress"></span>
Track your VLM fine-tuning job in the [Fireworks console](https://app.fireworks.ai/dashboard/fine-tuning).
<img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/vlm-sftj.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=e93405e55268af1c2202169c0bff2a39" alt="VLM fine-tuning job in the Fireworks console" width="3802" height="1690" data-path="images/fine-tuning/vlm-sftj.png" />
Monitor key metrics:
* **Training loss**: Should generally decrease over time
* **Evaluation loss**: Monitor for overfitting if using evaluation dataset
* **Training progress**: Epochs completed and estimated time remaining
<span class="callout-start" data-callout-type="check"></span>
Your VLM fine-tuning job is complete when the status shows `COMPLETED` and your custom model is ready for deployment.
<span class="callout-end"></span>
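As an illustration of what "monitor for overfitting" can mean in practice, the sketch below flags a run whose evaluation loss has risen for several consecutive evaluations. It assumes you have copied the evaluation loss values out by hand; the function name, the `patience` parameter, and the sample numbers are all hypothetical and are not part of the Fireworks console or API.

```python
def eval_loss_rising(eval_losses, patience=2):
    """Return True if the last `patience` evaluations each got worse."""
    if len(eval_losses) < patience + 1:
        return False  # not enough history to judge
    recent = eval_losses[-(patience + 1):]
    # Every consecutive pair in the recent window must be strictly increasing.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical evaluation losses recorded after each epoch.
history = [1.00, 0.80, 0.70, 0.75, 0.80]
print(eval_loss_rising(history))
```

A rising evaluation loss alongside a falling training loss is a common sign that fewer epochs or more data may help.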
<span class="step-end"></span>
<span class="step-marker" data-step-title="Deploy your fine-tuned VLM"></span>
Once training is complete, deploy your custom VLM:
<span class="tab-group-start"></span>
<span class="tab-start" data-tab-title="firectl"></span>
```bash
# Create a deployment for your fine-tuned VLM
firectl deployment create my-custom-vlm
# Check deployment status
firectl deployment get accounts/your-account/deployment/deployment-id
```
<span class="tab-end"></span>
<span class="tab-start" data-tab-title="UI"></span>
Deploy from the UI using the `Deploy` dropdown in the fine-tuning job page.
<img src="https://mintcdn.com/fireworksai/XAK4ji8XrlzPoITj/images/fine-tuning/vlm-sftj-deploy.png?fit=max&auto=format&n=XAK4ji8XrlzPoITj&q=85&s=c4d8d33d8ea388c186e078dd781386dc" alt="Deploy dropdown in the fine-tuning job page" width="3802" height="1690" data-path="images/fine-tuning/vlm-sftj-deploy.png" />
<span class="tab-end"></span>
<span class="tab-group-end"></span>
<span class="step-end"></span>
<span class="steps-end"></span>
## Advanced Configuration
For additional fine-tuning parameters and advanced settings like custom learning rates, batch sizes, and optimization options, see the [Additional SFT job settings](/fine-tuning/fine-tuning-models#additional-sft-job-settings) section in our comprehensive fine-tuning guide.
<span class="callout-start" data-callout-type="tip"></span>
Need custom training loops for VLMs? The **Training API** also supports vision-language model fine-tuning with full control over loss functions, training objectives, and evaluation. See [Training API — Vision Inputs](/fine-tuning/training-api/vision-inputs) for details.
<span class="callout-end"></span>
## Interactive Tutorials: Fine-tuning VLMs
For a hands-on, step-by-step walkthrough of VLM fine-tuning, we've created two fine-tuning cookbooks that demonstrate the complete process, from dataset preparation through model deployment to evaluation.
<span class="card-group-start" data-cols="2"></span>
<span class="card-start" data-card-title="VLM Fine-tuning Quickstart" data-card-icon="notebook" data-card-href="https://colab.research.google.com/drive/11WpagNa6xKgh1zhr1xh5uIuVtkPPL-qn"></span>
**Google Colab Notebook: Fine-tune Qwen2.5 VL on Fireworks AI**
<span class="card-end"></span>
<span class="card-start" data-card-title="VLM Fine-tuning + Evals" data-card-icon="notebook" data-card-href="https://huggingface.co/spaces/fireworks-ai/catalog-extract/tree/main/notebooks"></span>
**Finetuning a VLM to beat SOTA closed source model**
<span class="card-end"></span>
<span class="card-group-end"></span>
The cookbooks above cover the following:
* Setting up your environment with Fireworks CLI
* Preparing vision datasets in the correct format
* Launching and monitoring VLM fine-tuning jobs
* Testing your fine-tuned model
* Best practices for VLM fine-tuning
* Running inference on serverless VLMs
* Running evals to show performance gains
## Testing Your Fine-tuned VLM
After deployment, test your fine-tuned VLM using the same API patterns as base VLMs:
```python
import openai
client = openai.OpenAI(
base_url="https://api.fireworks.ai/inference/v1",
api_key="<FIREWORKS_API_KEY>",
)
response = client.chat.completions.create(
model="accounts/your-account/models/my-custom-vlm",
messages=[{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {
"url": "https://raw.githubusercontent.com/fw-ai/cookbook/refs/heads/main/learn/vlm-finetuning/images/icecream.jpeg"
},
},{
"type": "text",
"text": "What's in this image?",
}],
}]
)
print(response.choices[0].message.content)
```
If you fine-tuned using the example dataset, your model should include `<think></think>` tags in its response.
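When the response contains reasoning tags, you may want to separate the reasoning from the final answer. The following is a minimal sketch that assumes a single `<think>...</think>` pair in the response text; `split_reasoning` is a hypothetical helper name.

```python
import re

# Match the first <think>...</think> pair, including newlines inside it.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the matched tag pair.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

Passing `response.choices[0].message.content` to `split_reasoning` yields the two parts as separate strings.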