Vision on AI Knowledge Base

Vision & Multimodal AI

Mon, 01 Jan 0001 00:00:00 +0000

Build applications that understand and generate images, documents, and audio. This path covers vision capabilities across 5 providers, document processing, image generation, multimodal embeddings, and audio — the complete multimodal toolkit.

The key cross-provider insight: each provider has different vision strengths. OpenAI offers the broadest multimodal coverage (vision + generation + audio), Anthropic excels at document understanding, Cohere provides multimodal embeddings for search, and Mistral/Together AI offer cost-effective open-source alternatives. Choosing the right provider per modality can dramatically improve both quality and cost.

Vision

Mon, 01 Jan 0001 00:00:00 +0000

Send images to Claude for analysis, OCR, diagram interpretation, and multimodal reasoning.

Agentic RAG for PDFs with mixed data

Mon, 01 Jan 0001 00:00:00 +0000

This page describes building a powerful, multi-step chatbot with Cohere’s models.

Arxiv Paper Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘ArxivPaperTool’ searches arXiv for papers matching a query and optionally downloads PDFs.

Aya Vision

Mon, 01 Jan 0001 00:00:00 +0000

Understand Cohere Labs groundbreaking multilingual model Aya Vision, a state-of-the-art multimodal language model excelling at multiple tasks.

Basic OCR

Mon, 01 Jan 0001 00:00:00 +0000

Extract text and structured content from PDFs and images with Mistral’s Document AI OCR processor

Build a content builder agent

Mon, 01 Jan 0001 00:00:00 +0000

Build a content writing agent with brand memory, skills, subagents, and image generation

Build a content builder agent

Mon, 01 Jan 0001 00:00:00 +0000

Build a content writing agent with brand memory, skills, subagents, and image generation

Clone and export reports

Mon, 01 Jan 0001 00:00:00 +0000

Export a W&B Report as a PDF or LaTeX.

Cohere's Command A Vision Model

Mon, 01 Jan 0001 00:00:00 +0000

Command A Vision is a powerful visual language model capable of interacting with image inputs. This document contains information about its capabilities.

Computer use

Mon, 01 Jan 0001 00:00:00 +0000

Enable models to interact with computer interfaces — clicking, typing, and navigating applications via screenshots, creating agents that can operate any software.

Connectors Overview

Mon, 01 Jan 0001 00:00:00 +0000

Connectors enable Agents and users to access tools like websearch, code interpreter, image generation, and document library on demand

DALL-E Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘DallETool’ is a powerful tool designed for generating images from textual descriptions.

Dedicated Read Nodes

Mon, 01 Jan 0001 00:00:00 +0000

Dedicated read nodes use provisioned hardware for read operations, providing predictable, low-latency performance at high query volumes.

Deploying Models in Private Environments

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to pull and test Cohere’s container images using a license with Docker and Kubernetes.

Developer quickstart

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use the OpenAI API to generate human-like responses to natural language prompts, analyze images with computer vision, use powerful built-in tools, and more.

Files

Mon, 01 Jan 0001 00:00:00 +0000

Pass images, PDFs, audio, video, and text files to your agents for multimodal processing.

How to build a real-time image generator with Flux and Together AI

Mon, 01 Jan 0001 00:00:00 +0000

How To Build An Open Source NotebookLM: PDF To Podcast

Mon, 01 Jan 0001 00:00:00 +0000

In this guide we will see how to create a podcast like the one below from a PDF input!

Image generation

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to generate or edit images with the OpenAI API and image generation models.

Image generation

Mon, 01 Jan 0001 00:00:00 +0000

Generate and edit images using DALL-E and GPT-4o’s built-in image generation capabilities as tools within the Responses API.

Image Generation

Mon, 01 Jan 0001 00:00:00 +0000

Built-in tool for agents to generate images on demand with detailed output handling and download options

Image Generation

Mon, 01 Jan 0001 00:00:00 +0000

Generate high-quality images from text + image prompts.

Image Generation Prompt iteration

Mon, 01 Jan 0001 00:00:00 +0000

Image Generation with DALL-E

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use DALL-E for AI-powered image generation in your CrewAI projects

Image Generation with Flux2

Mon, 01 Jan 0001 00:00:00 +0000

Deploy a Flux2 image generation model on Together’s managed GPU infrastructure using Dedicated Containers.

Image, Audio, Video & Document Input

Mon, 01 Jan 0001 00:00:00 +0000

Images and vision

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to understand or generate images with the OpenAI API.

Include multimodal content in a prompt

Mon, 01 Jan 0001 00:00:00 +0000

Introduction to Aya Vision

Mon, 01 Jan 0001 00:00:00 +0000

In this notebook, we will explore the capabilities of Aya Vision, which can take text and image inputs to generates text responses.

Log media

Mon, 01 Jan 0001 00:00:00 +0000

Log media returned in your traces, such as images and videos.

Log multimodal traces

Mon, 01 Jan 0001 00:00:00 +0000

Mirror images for your LangSmith installation

Mon, 01 Jan 0001 00:00:00 +0000

Models Benchmarks

Mon, 01 Jan 0001 00:00:00 +0000

Mistral’s benchmarked models excel in reasoning, multilingual tasks, coding, and multimodal capabilities, outperforming competitors in key benchmarks

Models Overview

Mon, 01 Jan 0001 00:00:00 +0000

Mistral offers open and premier models for various tasks, including text, code, audio, and multimodal processing

Multi-modal Messages

Mon, 01 Jan 0001 00:00:00 +0000

Support for multimodal input messages including text, images, audio, video, and documents

Multimodal

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal context for assistants

Mon, 01 Jan 0001 00:00:00 +0000

Process images and charts in PDFs with multimodal assistants.

Multimodal Embeddings

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to work with multimodal data in Chroma collections.

Multimodal Inputs

Mon, 01 Jan 0001 00:00:00 +0000

Use modality-specific user input parts with typed data/url sources in @ag-ui/core

Multimodal Inputs

Mon, 01 Jan 0001 00:00:00 +0000

Use modality-specific user input parts with typed data/url sources in ag_ui.core

Multimodal Metrics Image Coherence

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Editing

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Helpfulness

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Image Reference

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal Metrics Text To Image

Mon, 01 Jan 0001 00:00:00 +0000

OCR Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘OCRTool’ extracts text from local images or image URLs using an LLM with vision.

OpenAI CLI

Mon, 01 Jan 0001 00:00:00 +0000

Install and use the generated openai command-line tool for Responses, structured outputs, images, speech, and shell workflows.

Overview

Mon, 01 Jan 0001 00:00:00 +0000

Leverage AI services, generate images, process vision, and build intelligent systems

Part 5. Audio, Images, and Video

Mon, 01 Jan 0001 00:00:00 +0000

PDF Extractor with Native Multi Step Tool Use

Mon, 01 Jan 0001 00:00:00 +0000

This page describes how to create an AI agent able to extract information from PDFs.

PDF RAG Search

Mon, 01 Jan 0001 00:00:00 +0000

The ‘PDFSearchTool’ is designed to search PDF files and return the most relevant results.

Pdf Support

Mon, 01 Jan 0001 00:00:00 +0000

This is the reference for sending PDFs to Claude, and it matters because Claude processes both the extracted text and the rendered page images, letting it reason over tables, charts, and scanned layouts that plain text extraction would lose. Pay close attention to the page and size limits and to token accounting, since each page consumes both text and image tokens and costs add up fast. A common pitfall is assuming PDFs are as cheap as text. Compared with Mistral’s and OpenAI’s image inputs the differentiator is native multi-page document handling; read the token-counting page alongside this to estimate cost.

PDF Text Writing Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘PDFTextWritingTool’ writes text to specific positions in a PDF, supporting custom fonts.

Playground

Mon, 01 Jan 0001 00:00:00 +0000

Guide to using Together AI’s web playground for interactive AI model inference across chat, image, video, audio, and transcribe models.

Quickstart: Flux Kontext

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use Flux’s new in-context image generation models

Quickstart: FLUX.2

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use FLUX.2, the next generation image model with advanced prompting capabilities

Quickstart: How to do OCR

Mon, 01 Jan 0001 00:00:00 +0000

A step by step guide on how to do OCR with Together AI’s vision models with structured outputs

Realtime API

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to build low-latency, multimodal LLM applications with the Realtime API.

Run an evaluation with multimodal content

Mon, 01 Jan 0001 00:00:00 +0000

Sandbox templates

Mon, 01 Jan 0001 00:00:00 +0000

Define container images, resource limits, and configuration for sandboxes using templates.

Sandbox warm pools

Mon, 01 Jan 0001 00:00:00 +0000

Pre-provision sandboxes for faster execution with automatic replenishment.

Serverless Pricing

Mon, 01 Jan 0001 00:00:00 +0000

Per-token serverless pricing for text, vision, and embedding models, including Priority and Fast serving paths

Supervised Fine Tuning - Vision

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to fine-tune vision-language models on Fireworks AI with image and text datasets

Text & Vision Fine-tuning

Mon, 01 Jan 0001 00:00:00 +0000

Fine-tune Mistral’s text and vision models with custom datasets in JSONL format for domain-specific or conversational improvements

Together AI Skills

Mon, 01 Jan 0001 00:00:00 +0000

Give your AI coding agent deep knowledge of the Together AI platform with ready-made skills for inference, training, images, video, audio, and infrastructure.

Trace and Evaluate a Computer Vision Pipeline with Weave

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use trace and evaluate a computer vision pipeline with weave with W&B Weave

Unlocking the Power of Multimodal Embeddings

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal embeddings convert text and images into embeddings for search and classification (API v2).

Using Cohere's Models to Work with Image Inputs

Mon, 01 Jan 0001 00:00:00 +0000

This page describes how a Cohere large language model works with image inputs. It covers passing images with the API, limitations, and best practices.

Using Multimodal Agents

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to enable and use multimodal capabilities in your agents for processing images and other non-text content within the CrewAI framework.

Video & Audio Inputs

Mon, 01 Jan 0001 00:00:00 +0000

Query multimodal models to process video and audio content directly

Video Generation

Mon, 01 Jan 0001 00:00:00 +0000

Generate high-quality videos from text and image prompts.

Vision

Mon, 01 Jan 0001 00:00:00 +0000

Multimodal AI models analyze images and text for insights, supporting use cases like OCR, chart understanding, and receipt transcription

Vision fine-tuning

Mon, 01 Jan 0001 00:00:00 +0000

Fine-tune models for better image understanding.

Vision Inputs

Mon, 01 Jan 0001 00:00:00 +0000

Fine-tune vision-language models (VLMs) with the Training API using multimodal chat data containing images and text.

Vision LLMs

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to use the vision models supported by Together AI.

Vision Models

Mon, 01 Jan 0001 00:00:00 +0000

Query vision-language models to analyze images and visual content

Vision Tool

Mon, 01 Jan 0001 00:00:00 +0000

The ‘VisionTool’ is designed to extract text from images.

Vision-Language Fine-tuning

Mon, 01 Jan 0001 00:00:00 +0000

Learn how to fine-tune Vision-Language Models (VLMs) on image+text data using Together AI.