Multi-modal Messages

Summary: Support for multimodal input messages including text, images, audio, video, and documents

Original Documentation

Multi-modal Messages Proposal#

Summary#

Problem Statement#

The current AG-UI protocol supports only text-based user messages. As LLMs increasingly accept multimodal inputs (images, audio, video, and files), the protocol needs to evolve to carry these richer input types.

Motivation#

Evolve AG-UI to support multimodal input messages without breaking existing apps. Inputs may include text, images, audio, video, and documents. Each modality is represented as a distinct, typed content part with a clear source discriminator (data for inline base64, url for references), making it straightforward to map to any LLM provider’s API.

Status#

Detailed Specification#

Overview#

Extend the UserMessage content property to be either a string or an array of InputContentPart objects. Text has its own part type, and each non-text modality (image, audio, video, document) has a dedicated part type with a typed source that is either inline data or a url reference. Because every part is explicitly typed, mapping content parts to an LLM provider’s API reduces to dispatching on type and source.type.

/**
 * Supported input modality types for multimodal content.
 */
type Modality = "text" | "image" | "audio" | "video" | "document"

// ── Source types ──────────────────────────────────────────────

interface InputContentDataSource {
  /** Indicates this is inline data content. */
  type: "data"
  /** The base64-encoded content value. */
  value: string
  /** MIME type of the content (e.g., "image/png", "audio/wav"). Required. */
  mimeType: string
}

interface InputContentUrlSource {
  /** Indicates this is URL-referenced content. */
  type: "url"
  /** HTTP(S) URL or data URI pointing to the content. */
  value: string
  /** Optional MIME type hint for when it can't be inferred from the URL. */
  mimeType?: string
}

type InputContentSource = InputContentDataSource | InputContentUrlSource

// ── Content part types ────────────────────────────────────────

interface TextInputPart {
  type: "text"
  /** The text content. */
  text: string
}

interface ImageInputPart<TMetadata = unknown> {
  type: "image"
  /** Source of the image content. */
  source: InputContentSource
  /** Provider-specific metadata (e.g., OpenAI detail: "auto" | "low" | "high"). */
  metadata?: TMetadata
}

interface AudioInputPart<TMetadata = unknown> {
  type: "audio"
  /** Source of the audio content. */
  source: InputContentSource
  /** Provider-specific metadata (e.g., format, sample rate). */
  metadata?: TMetadata
}

interface VideoInputPart<TMetadata = unknown> {
  type: "video"
  /** Source of the video content. */
  source: InputContentSource
  /** Provider-specific metadata (e.g., duration, resolution). */
  metadata?: TMetadata
}

interface DocumentInputPart<TMetadata = unknown> {
  type: "document"
  /** Source of the document content. */
  source: InputContentSource
  /** Provider-specific metadata (e.g., Anthropic media_type for PDFs). */
  metadata?: TMetadata
}

type InputContentPart =
  | TextInputPart
  | ImageInputPart
  | AudioInputPart
  | VideoInputPart
  | DocumentInputPart

// ── Updated UserMessage ───────────────────────────────────────

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContentPart[]
  name?: string
}

Modality Type#

The Modality type enumerates the supported content modalities:

| Value | Description |
| --- | --- |
| "text" | Plain text content |
| "image" | Image content (JPEG, PNG, GIF, WebP, etc.) |
| "audio" | Audio content (WAV, MP3, OGG, etc.) |
| "video" | Video content (MP4, WebM, etc.) |
| "document" | Document content (PDF, DOCX, XLSX, etc.) |
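The formats listed in this table suggest a simple consistency check between a part's declared type and the mimeType on its source. The sketch below is one possible check and is not part of the proposal: the prefix rules and the name mimeMatchesModality are assumptions made for illustration, and treating every application/* or text/* type as a document is deliberately loose.

```typescript
type Modality = "text" | "image" | "audio" | "video" | "document"

// Hypothetical rule table: MIME prefixes accepted for each non-text modality.
// These prefixes are an assumption for illustration, not part of the spec.
const ACCEPTED_PREFIXES: Record<Exclude<Modality, "text">, string[]> = {
  image: ["image/"],
  audio: ["audio/"],
  video: ["video/"],
  document: ["application/", "text/"],
}

/** Returns true when mimeType looks plausible for the given modality. */
function mimeMatchesModality(
  modality: Exclude<Modality, "text">,
  mimeType: string,
): boolean {
  // MIME types are case-insensitive, so compare in lower case.
  const normalized = mimeType.trim().toLowerCase()
  return ACCEPTED_PREFIXES[modality].some((prefix) => normalized.startsWith(prefix))
}
```

A stricter deployment could replace the prefixes with an explicit allow-list of full MIME types per modality.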

Source Types#

Every non-text content part carries a source property that describes how the content is delivered. The source is a discriminated union with two variants:

InputContentDataSource#

Inline base64-encoded content.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | "data" | Yes | Discriminator for inline data |
| value | string | Yes | Base64-encoded content |
| mimeType | string | Yes | MIME type (required to ensure correct handling) |

InputContentUrlSource#

URL-referenced content.

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| type | "url" | Yes | Discriminator for URL reference |
| value | string | Yes | HTTP(S) URL or data URI |
| mimeType | string? | No | Optional MIME type hint |
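To illustrate how the type discriminator narrows the union, the following sketch resolves either source variant to a single URL string. The helper name sourceToUrl is hypothetical, and encoding inline data as a base64 data URI is an assumption about what a consumer might want, not something the proposal prescribes.

```typescript
interface InputContentDataSource {
  type: "data"
  value: string
  mimeType: string
}

interface InputContentUrlSource {
  type: "url"
  value: string
  mimeType?: string
}

type InputContentSource = InputContentDataSource | InputContentUrlSource

/**
 * Hypothetical helper: resolve either source variant to one URL string.
 * Inline data becomes a base64 data URI; URL sources pass through unchanged.
 */
function sourceToUrl(source: InputContentSource): string {
  switch (source.type) {
    case "data":
      // Narrowed to InputContentDataSource, so mimeType is guaranteed here.
      return `data:${source.mimeType};base64,${source.value}`
    case "url":
      // Narrowed to InputContentUrlSource; mimeType may be undefined.
      return source.value
    default: {
      // Compile-time exhaustiveness check: this stops type-checking if a
      // new source variant is added without a matching case.
      const unreachable: never = source
      throw new Error(`Unknown source: ${JSON.stringify(unreachable)}`)
    }
  }
}
```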

Content Part Types#

TextInputPart#

Represents plain text content within a multimodal message.

| Property | Type | Description |
| --- | --- | --- |
| type | "text" | Identifies this as text content |
| text | string | The text content |

ImageInputPart#

Represents image content. Maps directly to provider image inputs (e.g., OpenAI vision, Anthropic image blocks).

| Property | Type | Description |
| --- | --- | --- |
| type | "image" | Identifies this as image content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., OpenAI detail level) |
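As one illustration of such a mapping, the sketch below converts an ImageInputPart into a content part shaped like the image_url entries accepted by OpenAI's Chat Completions API. The OpenAIImagePart type is declared locally as an approximation of that shape, and toOpenAIImagePart is a hypothetical name; the provider's current API reference should be checked before relying on either.

```typescript
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string }

interface ImageInputPart<TMetadata = unknown> {
  type: "image"
  source: InputContentSource
  metadata?: TMetadata
}

// Assumed metadata shape, taken from the OpenAI example in this proposal.
type OpenAIImageMetadata = { detail?: "auto" | "low" | "high" }

// Local approximation of an OpenAI-style image content part (assumption).
interface OpenAIImagePart {
  type: "image_url"
  image_url: { url: string; detail?: "auto" | "low" | "high" }
}

function toOpenAIImagePart(part: ImageInputPart<OpenAIImageMetadata>): OpenAIImagePart {
  // Inline data is sent as a base64 data URI; URL sources pass through.
  const url =
    part.source.type === "data"
      ? `data:${part.source.mimeType};base64,${part.source.value}`
      : part.source.value
  const detail = part.metadata?.detail
  // Only include detail when the caller supplied it.
  return {
    type: "image_url",
    image_url: detail === undefined ? { url } : { url, detail },
  }
}
```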

AudioInputPart#

Represents audio content.

| Property | Type | Description |
| --- | --- | --- |
| type | "audio" | Identifies this as audio content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., format, sample rate) |

VideoInputPart#

Represents video content.

| Property | Type | Description |
| --- | --- | --- |
| type | "video" | Identifies this as video content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., duration, resolution) |

DocumentInputPart#

Represents document content such as PDFs, Word documents, or spreadsheets.

| Property | Type | Description |
| --- | --- | --- |
| type | "document" | Identifies this as document content |
| source | InputContentSource | Either inline data or URL reference |
| metadata | TMetadata? | Provider-specific metadata (e.g., Anthropic media_type) |

Provider Metadata#

The generic metadata field on each content part allows provider-specific information to flow through the protocol without polluting the core schema. Examples:

  • OpenAI: ImageInputPart<{ detail: 'auto' | 'low' | 'high' }>
  • Anthropic: DocumentInputPart<{ media_type: 'application/pdf' }>
  • Custom: Any provider can define its own metadata shape
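The two sides of the generic parameter can be sketched as follows: binding TMetadata gives compile-time checking where the provider is known, while the default of unknown forces a runtime check where it is not. The AnthropicDocumentMetadata alias and the readMediaType helper are hypothetical names introduced only for this example.

```typescript
interface DocumentInputPart<TMetadata = unknown> {
  type: "document"
  source:
    | { type: "data"; value: string; mimeType: string }
    | { type: "url"; value: string; mimeType?: string }
  metadata?: TMetadata
}

// Metadata shape from the Anthropic example in the list above.
type AnthropicDocumentMetadata = { media_type: "application/pdf" }

// Binding TMetadata gives compile-time checking of the metadata field.
const pdfPart: DocumentInputPart<AnthropicDocumentMetadata> = {
  type: "document",
  source: { type: "url", value: "https://example.com/design-spec.pdf" },
  metadata: { media_type: "application/pdf" },
}

// With the default TMetadata = unknown, metadata must be narrowed before use.
// readMediaType is a hypothetical helper, not part of the proposal.
function readMediaType(part: DocumentInputPart): string | undefined {
  const meta = part.metadata
  if (typeof meta === "object" && meta !== null && "media_type" in meta) {
    const value = (meta as { media_type: unknown }).media_type
    return typeof value === "string" ? value : undefined
  }
  return undefined
}
```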

Implementation Examples#

Simple Text Message (Backward Compatible)#

{
  "id": "msg-001",
  "role": "user",
  "content": "What's in this image?"
}
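Because content may still be a plain string, code that consumes messages has to handle both shapes. One way to do that, sketched here under the assumption that a string is equivalent to a single text part, is to normalize every message to an array before further processing. The name normalizeContent is hypothetical, and the non-text part types are abbreviated to keep the example short.

```typescript
interface TextInputPart {
  type: "text"
  text: string
}

// Only the text part is spelled out; the other part types from the
// specification are abbreviated here to keep the sketch short.
type InputContentPart =
  | TextInputPart
  | { type: "image" | "audio" | "video" | "document"; source: unknown; metadata?: unknown }

type UserMessage = {
  id: string
  role: "user"
  content: string | InputContentPart[]
  name?: string
}

/** Hypothetical helper: always return content as an array of parts. */
function normalizeContent(message: UserMessage): InputContentPart[] {
  return typeof message.content === "string"
    ? [{ type: "text", text: message.content }]
    : message.content
}
```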

Image with Inline Data#

{
  "id": "msg-002",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image",
      "source": {
        "type": "data",
        "value": "/9j/4AAQSkZJRg...",
        "mimeType": "image/jpeg"
      }
    }
  ]
}

Image with URL Reference#

{
  "id": "msg-003",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/photo.png"
      },
      "metadata": {
        "detail": "high"
      }
    }
  ]
}

Multiple Images with Question#

{
  "id": "msg-004",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What are the differences between these images?"
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/image1.png",
        "mimeType": "image/png"
      }
    },
    {
      "type": "image",
      "source": {
        "type": "url",
        "value": "https://example.com/image2.png",
        "mimeType": "image/png"
      }
    }
  ]
}

Audio Transcription Request#

{
  "id": "msg-005",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Please transcribe this audio recording"
    },
    {
      "type": "audio",
      "source": {
        "type": "url",
        "value": "https://example.com/meeting-recording.wav",
        "mimeType": "audio/wav"
      }
    }
  ]
}

Document Analysis#

{
  "id": "msg-006",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Summarize the key points from this PDF"
    },
    {
      "type": "document",
      "source": {
        "type": "url",
        "value": "https://example.com/reports/q4-2024.pdf",
        "mimeType": "application/pdf"
      }
    }
  ]
}

Video Analysis#

{
  "id": "msg-007",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Describe what happens in this video"
    },
    {
      "type": "video",
      "source": {
        "type": "url",
        "value": "https://example.com/demo.mp4",
        "mimeType": "video/mp4"
      },
      "metadata": {
        "duration": 120
      }
    }
  ]
}

Mixed Modalities#

{
  "id": "msg-008",
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Compare the screenshot with the design spec"
    },
    {
      "type": "image",
      "source": {
        "type": "data",
        "value": "iVBORw0KGgo...",
        "mimeType": "image/png"
      }
    },
    {
      "type": "document",
      "source": {
        "type": "url",
        "value": "https://example.com/design-spec.pdf",
        "mimeType": "application/pdf"
      }
    }
  ]
}

Implementation Considerations#

Client SDK Changes#

TypeScript SDK:

  • New Modality type and all InputContentPart types in @ag-ui/core
  • InputContentSource, InputContentDataSource, InputContentUrlSource types
  • Updated UserMessage with content: string | InputContentPart[]
  • Helper methods for constructing typed content parts
  • Provider-specific metadata generics on each content part type
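The helper methods mentioned in this list are not specified by the proposal. A minimal sketch of what they could look like follows; the function names textPart, imageFromUrl, and imageFromBase64 and their signatures are assumptions, not the SDK's actual API.

```typescript
interface InputContentDataSource {
  type: "data"
  value: string
  mimeType: string
}

interface InputContentUrlSource {
  type: "url"
  value: string
  mimeType?: string
}

type InputContentSource = InputContentDataSource | InputContentUrlSource

interface TextInputPart {
  type: "text"
  text: string
}

interface ImageInputPart<TMetadata = unknown> {
  type: "image"
  source: InputContentSource
  metadata?: TMetadata
}

// Hypothetical constructor helpers; names and signatures are assumptions.
function textPart(text: string): TextInputPart {
  return { type: "text", text }
}

function imageFromUrl<TMetadata = unknown>(
  url: string,
  options: { mimeType?: string; metadata?: TMetadata } = {},
): ImageInputPart<TMetadata> {
  const source: InputContentUrlSource = { type: "url", value: url }
  // Optional fields are set only when provided, so they stay absent otherwise.
  if (options.mimeType !== undefined) source.mimeType = options.mimeType
  const part: ImageInputPart<TMetadata> = { type: "image", source }
  if (options.metadata !== undefined) part.metadata = options.metadata
  return part
}

function imageFromBase64(base64: string, mimeType: string): ImageInputPart {
  // mimeType is mandatory for inline data, mirroring InputContentDataSource.
  return { type: "image", source: { type: "data", value: base64, mimeType } }
}
```

Requiring mimeType as a positional argument for inline data, while keeping it optional for URLs, mirrors the required and optional fields of the two source types.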

Python SDK:

  • Pydantic models for each content part type (TextInputPart, ImageInputPart, etc.)
  • InputContentSource discriminated union
  • Updated UserMessage model
  • Provider-specific metadata support via generics

Framework Integration#

Frameworks need to:

  • Parse typed InputContentPart parts and dispatch on part.type
  • Map content parts to provider-specific formats (the typed structure makes this straightforward)
  • Use source.type to determine whether to send inline data or a URL to the provider
  • Forward metadata to providers that support it
  • Handle fallbacks for models that don’t support certain modalities
  • Validate that mimeType is appropriate for the declared content part type
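The dispatch and fallback steps in this list can be sketched as a single pass over the content parts. The policy shown, replacing an unsupported part with a short text note, is only one possible fallback and is an assumption, as are the names applyModalityFallback and MediaModality; the part types are abbreviated from the specification.

```typescript
type InputContentSource =
  | { type: "data"; value: string; mimeType: string }
  | { type: "url"; value: string; mimeType?: string }

type MediaModality = "image" | "audio" | "video" | "document"

// Abbreviated form of the part types defined in the specification.
type InputContentPart =
  | { type: "text"; text: string }
  | { type: MediaModality; source: InputContentSource; metadata?: unknown }

/**
 * Hypothetical fallback policy: keep parts the target model supports and
 * replace the rest with a text note so the model knows something was omitted.
 */
function applyModalityFallback(
  parts: InputContentPart[],
  supported: ReadonlySet<MediaModality>,
): InputContentPart[] {
  return parts.map((part): InputContentPart => {
    // Dispatch on part.type: text always passes, media only if supported.
    if (part.type === "text" || supported.has(part.type)) return part
    // Use source.type to describe where the omitted content came from.
    const origin = part.source.type === "url" ? part.source.value : "inline data"
    return {
      type: "text",
      text: `[${part.type} omitted: not supported by this model (${origin})]`,
    }
  })
}
```

Other reasonable policies include rejecting the message outright or routing it to a different model; which one fits depends on the framework.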

Use Cases#

Visual Question Answering#

Users can upload images (ImageInputPart) and ask questions about them.

Document Processing#

Upload PDFs, Word documents, or spreadsheets (DocumentInputPart) for analysis.

Audio Transcription and Analysis#

Process voice recordings, podcasts, or meeting audio (AudioInputPart).

Video Understanding#

Analyze video content (VideoInputPart) for summaries, descriptions, or content moderation.

Multi-modal Comparison#

Compare multiple images, documents, or mixed media using different content part types in a single message.

Screenshot Analysis#

Share screenshots (ImageInputPart) for UI/UX feedback or debugging assistance.

Testing Strategy#

  • Unit tests for each InputContentPart type and InputContentSource variant
  • Validate that the source.type discriminator correctly narrows the union
  • Integration tests with multimodal LLMs (OpenAI, Anthropic, Google)
  • Backward compatibility tests with plain string content
  • Verify metadata passthrough for provider-specific fields
  • Performance tests for large base64 payloads in InputContentDataSource
  • Security tests for URL validation and content sanitization
  • Type-safety tests ensuring generic TMetadata works across SDKs
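For the URL validation item, a security test needs a concrete rule to exercise. The sketch below assumes a scheme allow-list of http, https, and data, matching the values InputContentUrlSource is documented to accept; the function name hasAllowedScheme and the choice to check only the scheme are assumptions for illustration.

```typescript
// Hypothetical allow-list; which schemes to accept is a deployment decision.
const ALLOWED_SCHEMES: ReadonlySet<string> = new Set(["http", "https", "data"])

/**
 * Returns true when value starts with an allowed URI scheme.
 * The pattern follows the RFC 3986 scheme syntax: a letter followed by
 * letters, digits, "+", "-", or ".", terminated by ":".
 */
function hasAllowedScheme(value: string): boolean {
  const scheme = /^([A-Za-z][A-Za-z0-9+.-]*):/.exec(value.trim())?.[1]
  // Schemes are case-insensitive, so compare in lower case.
  return scheme !== undefined && ALLOWED_SCHEMES.has(scheme.toLowerCase())
}
```

This checks the scheme only. A production validator would likely also parse the full URL and, for server-side fetching, restrict the hosts that may be contacted.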

References#

Link last verified June 7, 2026.
Source: AG-UI Protocol
Link last verified: 2026-02-26