S3 Sync
Sync files from Amazon S3 into Chroma Cloud.
S3 Sync lets you connect an Amazon S3 bucket to Chroma Cloud and sync files into collections. It supports documents (PDFs, Office files, images, ebooks), code, and plain text. Collections are created automatically if they don’t already exist.
S3 Sync is designed for append-only workloads: it indexes new files but does not handle updates or deletes. Re-syncing the same object key indexes a new copy. Creating a source does not automatically sync files already in the bucket; each file must be synced individually via an invocation. To sync new uploads automatically, configure Auto-Sync.
The Sync API uses your Chroma Cloud API key for authentication. See the Sync API Reference for all endpoints.
Walkthrough#
Creating an S3 Source via the Dashboard#
- Navigate to a database in Chroma Cloud and select Sync from the menu.
- Click Create and select S3 as the source type.
- Enter your AWS credentials, AWS region, and bucket name.
- Configure a collection name and optional path prefix to limit which keys can be synced.
- Click Sync and enter an S3 object key to index.
S3 Source Configuration#
| Parameter | Required | Description |
|---|---|---|
| bucket_name | Yes | S3 bucket name. |
| region | Yes | AWS region of the bucket. |
| collection_name | Yes | Default target collection name for synced data. |
| aws_credential_id | Yes | ID of AWS credentials created in the Chroma dashboard. |
| path_prefix | No | Limits which S3 keys can be synced. Only keys starting with this prefix are allowed. Useful for multi-tenant setups. |
| auto_sync | No | Auto-sync mode: none (default), direct, or metadata. Configured by Chroma during Auto-Sync setup. |
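As a rough illustration of the table above, the following sketch models a source configuration locally and checks the required fields and the allowed auto_sync values. The class name `S3SourceConfig` and its `to_dict` method are hypothetical helpers, not part of any Chroma SDK, and whether the Sync API accepts exactly this JSON shape is an assumption; see the Sync API Reference for the actual request format.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Values listed for auto_sync in the table above.
AUTO_SYNC_MODES = {"none", "direct", "metadata"}


@dataclass
class S3SourceConfig:
    """Hypothetical local model of the source parameters (not a Chroma SDK class)."""

    bucket_name: str
    region: str
    collection_name: str
    aws_credential_id: str
    path_prefix: Optional[str] = None
    auto_sync: str = "none"

    def __post_init__(self) -> None:
        # All four required parameters must be non-empty.
        for name in ("bucket_name", "region", "collection_name", "aws_credential_id"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        if self.auto_sync not in AUTO_SYNC_MODES:
            raise ValueError(f"auto_sync must be one of {sorted(AUTO_SYNC_MODES)}")

    def to_dict(self) -> dict:
        # Drop unset optional fields so only configured values remain.
        return {k: v for k, v in asdict(self).items() if v is not None}


config = S3SourceConfig(
    bucket_name="my-bucket",
    region="us-east-1",
    collection_name="docs",
    aws_credential_id="cred-123",  # placeholder ID
    path_prefix="tenant-a/",
)
print(config.to_dict())
```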
S3 Invocation Parameters#
| Parameter | Required | Description |
|---|---|---|
| object_key | Yes | Full S3 object key to sync. This is always relative to the bucket root, even if a path_prefix is configured on the source. The key must start with the path_prefix or the invocation will be rejected. |
| custom_id | No | Custom document ID (max 120 bytes). Chunk IDs become custom_id-{chunk} instead of sha256(object_key)-{chunk}. Stored as custom_id metadata on each chunk. |
| metadata | No | Additional metadata merged with standard chunk metadata. Values can be scalars (string, number, boolean, or null) or homogeneous arrays of scalars (e.g. ["action", "comedy"]). |
| target_collection_name | No | Overrides the source’s collection_name. Collection is created if it doesn’t exist. |
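The constraints in this table can be checked client-side before an invocation is sent. The sketch below is a minimal example of such a check: `build_invocation` and `validate_metadata` are hypothetical helper names, the HTTP call itself is deliberately omitted (see the Sync API Reference), and a few interpretations are assumptions, namely that integers and floats count as the same "number" kind and that an empty array is acceptable.

```python
from typing import Any, Optional

MAX_CUSTOM_ID_BYTES = 120  # limit stated in the table above


def scalar_kind(value: Any) -> Optional[str]:
    """Return the JSON scalar kind of a value, or None if it is not a scalar."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"  # assumption: int and float are the same kind
    if isinstance(value, str):
        return "string"
    return None


def validate_metadata(metadata: dict) -> None:
    """Raise ValueError unless values are scalars or homogeneous arrays of scalars."""
    for key, value in metadata.items():
        if isinstance(value, list):
            kinds = {scalar_kind(item) for item in value}
            if None in kinds or len(kinds) > 1:
                raise ValueError(f"metadata[{key!r}] must be a homogeneous array of scalars")
        elif scalar_kind(value) is None:
            raise ValueError(f"metadata[{key!r}] must be a scalar or an array of scalars")


def build_invocation(
    object_key: str,
    path_prefix: Optional[str] = None,
    custom_id: Optional[str] = None,
    metadata: Optional[dict] = None,
    target_collection_name: Optional[str] = None,
) -> dict:
    """Build an invocation body locally; sending it is out of scope for this sketch."""
    if path_prefix and not object_key.startswith(path_prefix):
        raise ValueError("object_key must start with the source's path_prefix")
    if custom_id is not None and len(custom_id.encode("utf-8")) > MAX_CUSTOM_ID_BYTES:
        raise ValueError("custom_id exceeds 120 bytes")
    body: dict = {"object_key": object_key}
    if custom_id is not None:
        body["custom_id"] = custom_id
    if metadata:
        validate_metadata(metadata)
        body["metadata"] = metadata
    if target_collection_name:
        body["target_collection_name"] = target_collection_name
    return body


print(build_invocation("docs/report.pdf", path_prefix="docs/", metadata={"tags": ["action", "comedy"]}))
```

Note that the custom_id limit is measured in bytes, so the check encodes the string as UTF-8 rather than counting characters.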
Supported File Types#
File types are detected by filename suffix.
Document Types#
Document files are converted to markdown and incur a $0.01/page extraction fee. Tables, headings, and structure are preserved. Images within documents get text descriptions extracted, but the images themselves are not stored.
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .doc, .docx, .odt |
| Spreadsheets | .xls, .xlsx, .xlsm, .xltx, .csv, .ods |
| Presentations | .ppt, .pptx, .odp |
| HTML | .html |
| Ebooks | .epub |
| Images | .png, .jpg, .jpeg, .webp, .gif, .tiff, .tif |
Other Files#
All other files must contain valid UTF-8 text. Non-UTF-8 files will fail.
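Because detection is suffix-based and everything else must be valid UTF-8, a file can be classified locally before it is synced. The following sketch shows one way to do that. The function names are hypothetical, and treating suffixes case-insensitively is an assumption, since the text above does not say how uppercase extensions are handled.

```python
from pathlib import PurePosixPath

# Extensions from the Document Types table above.
DOCUMENT_EXTENSIONS = {
    ".pdf",
    ".doc", ".docx", ".odt",
    ".xls", ".xlsx", ".xlsm", ".xltx", ".csv", ".ods",
    ".ppt", ".pptx", ".odp",
    ".html",
    ".epub",
    ".png", ".jpg", ".jpeg", ".webp", ".gif", ".tiff", ".tif",
}


def is_document_type(object_key: str) -> bool:
    """True if the key's final suffix is one of the document extensions."""
    # Assumption: matching is case-insensitive.
    return PurePosixPath(object_key).suffix.lower() in DOCUMENT_EXTENSIONS


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify(object_key: str, data: bytes) -> str:
    """Return 'document', 'text', or 'unsupported' for a candidate file."""
    if is_document_type(object_key):
        return "document"
    return "text" if is_valid_utf8(data) else "unsupported"


print(classify("src/main.rs", b"fn main() {}"))
```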
Limits#
- Region: Currently available for databases in the AWS us-east-1 region only.
- Maximum file size: 200 MB per file.
- Maximum document pages: 7,000 pages per document. Documents exceeding this limit will fail.
Contact support@trychroma.com if you need these limits raised.
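These limits, together with the $0.01/page extraction fee stated under Document Types, lend themselves to a quick pre-flight check. The sketch below is illustrative only: the helper names are hypothetical, and interpreting 200 MB as 200,000,000 bytes is an assumption (the stricter of the two common readings).

```python
from decimal import Decimal
from typing import List, Optional

MAX_FILE_BYTES = 200 * 1000 * 1000  # assumption: 1 MB = 1,000,000 bytes
MAX_DOCUMENT_PAGES = 7000
FEE_PER_PAGE = Decimal("0.01")  # USD, from the Document Types section


def check_limits(size_bytes: int, pages: Optional[int] = None) -> List[str]:
    """Return a list of limit violations; an empty list means the file passes."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds the 200 MB size limit")
    if pages is not None and pages > MAX_DOCUMENT_PAGES:
        problems.append("document exceeds the 7,000 page limit")
    return problems


def estimate_extraction_fee(pages: int) -> Decimal:
    """Estimated extraction fee in USD for a document with the given page count."""
    return FEE_PER_PAGE * pages


print(check_limits(10_000_000, pages=120))
print(estimate_extraction_fee(250))
```

Decimal is used instead of float so that per-page fees add up without binary rounding noise.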
Chunking#
Files are chunked using a three-stage pipeline:
- Tree-sitter syntax-aware chunking — if the file extension maps to a known programming language, chunking respects function boundaries, class definitions, and code structure.
- Tree-sitter markdown chunking — if the content is markdown (e.g. from document extraction), chunking respects headings, sections, and paragraph boundaries.
- Line-based chunking — fallback for other text content (max 10 lines, max 4096 bytes per chunk).
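To make the fallback stage concrete, here is a minimal sketch of line-based chunking under the stated bounds of 10 lines and 4096 bytes per chunk. It is not Chroma's implementation. In particular, how a single line longer than 4096 bytes is handled is not described above, so splitting such a line at character boundaries is an assumption of this sketch.

```python
from typing import List


def split_oversized(line: str, max_bytes: int) -> List[str]:
    """Split one line into pieces of at most max_bytes UTF-8 bytes each."""
    if len(line.encode("utf-8")) <= max_bytes:
        return [line]
    pieces, buf, buf_bytes = [], [], 0
    for ch in line:
        n = len(ch.encode("utf-8"))
        if buf and buf_bytes + n > max_bytes:
            pieces.append("".join(buf))
            buf, buf_bytes = [], 0
        buf.append(ch)
        buf_bytes += n
    if buf:
        pieces.append("".join(buf))
    return pieces


def line_chunks(text: str, max_lines: int = 10, max_bytes: int = 4096) -> List[str]:
    """Group lines into chunks bounded by both a line count and a byte size."""
    chunks: List[str] = []
    current: List[str] = []
    current_bytes = 0
    for line in text.splitlines(keepends=True):
        for piece in split_oversized(line, max_bytes):
            size = len(piece.encode("utf-8"))
            # Close the current chunk if adding this piece would break either bound.
            if current and (len(current) >= max_lines or current_bytes + size > max_bytes):
                chunks.append("".join(current))
                current, current_bytes = [], 0
            current.append(piece)
            current_bytes += size
    if current:
        chunks.append("".join(current))
    return chunks


sample = "".join(f"line {i}\n" for i in range(25))
print([chunk.count("\n") for chunk in line_chunks(sample)])
```

Concatenating the returned chunks reproduces the input text, since line endings are kept and nothing is dropped.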
Auto-Sync#
Auto-sync lets S3 file uploads automatically trigger indexing without manual API calls.
Setup#
Chroma runs one SQS queue per AWS region. To enable auto-sync:
- Contact Chroma at support@trychroma.com with your AWS region.
- Chroma will provide the SQS queue ARN for your region.
- Configure S3 Event Notifications on your bucket to send s3:ObjectCreated:* events to that queue.
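The last step can be done in the AWS console or programmatically. The sketch below uses boto3's `put_bucket_notification_configuration` call as one possible approach. The queue ARN shown is a placeholder (use the one Chroma provides), the helper names are hypothetical, and the optional prefix filter is an addition of this sketch rather than something the setup steps above require.

```python
from typing import Optional


def build_notification_config(queue_arn: str, path_prefix: Optional[str] = None) -> dict:
    """Build an S3 notification configuration that sends object-created events to SQS."""
    queue_config: dict = {
        "QueueArn": queue_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if path_prefix:
        # Optional (assumption): only notify for keys under the given prefix.
        queue_config["Filter"] = {
            "Key": {"FilterRules": [{"Name": "prefix", "Value": path_prefix}]}
        }
    return {"QueueConfigurations": [queue_config]}


def apply_notification_config(bucket_name: str, queue_arn: str, path_prefix: Optional[str] = None) -> None:
    """Apply the configuration to a bucket. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration=build_notification_config(queue_arn, path_prefix),
    )


# Placeholder ARN for illustration only.
print(build_notification_config("arn:aws:sqs:us-east-1:123456789012:example-queue", "docs/"))
```

Be aware that this S3 call replaces the bucket's entire notification configuration, so any existing notification rules need to be included in the same request.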
Direct Mode#
When Chroma configures your source for direct mode (auto_sync: "direct"), every file upload to your bucket triggers indexing of that file. This is the simplest setup when filenames are stable identifiers. If a .meta.json file is uploaded, it is handled the same way as in metadata mode.
Metadata Mode#
When Chroma configures your source for metadata mode (auto_sync: "metadata"), only .meta.json file uploads trigger indexing. This gives you low-level control over each file’s document ID, additional metadata, and target collection. It also lets you choose which files to index — only files referenced by a .meta.json are processed.
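The difference between the two modes comes down to which uploads trigger indexing. The sketch below restates that decision as a small function. The function name and the returned labels are hypothetical and only mirror the behavior described above; they are not values used by the Sync API.

```python
META_SUFFIX = ".meta.json"


def upload_action(object_key: str, auto_sync: str) -> str:
    """Describe what an upload triggers for a given auto_sync mode.

    Returns one of 'ignore', 'index_file', or 'index_from_metadata'.
    """
    is_meta = object_key.endswith(META_SUFFIX)
    if auto_sync == "none":
        return "ignore"  # no auto-sync: uploads are only indexed via manual invocations
    if auto_sync == "direct":
        # Every upload is indexed; metadata files are handled as in metadata mode.
        return "index_from_metadata" if is_meta else "index_file"
    if auto_sync == "metadata":
        # Only metadata files trigger indexing of the file they reference.
        return "index_from_metadata" if is_meta else "ignore"
    raise ValueError(f"unknown auto_sync mode: {auto_sync!r}")


for mode in ("direct", "metadata"):
    print(mode, upload_action("docs/report.pdf", mode), upload_action("docs/report.meta.json", mode))
```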
Metadata File Format#
A metadata file is any file with a .meta.json suffix. It can have any name and be in any folder, as long as it falls within the source’s path_prefix (if one is configured).
{
  "version": "chroma-v1",
  "id": "unique-document-id",
  "path": "path/to/document.pdf",
  "target_collection_name": "my-collection",
  "metadata": {
    "author": "Jane Doe",
    "year": 2024,
    "tags": ["quarterly", "finance"]
  }
}

| Field | Required | Description |
|---|---|---|
| version | Yes | Must be "chroma-v1". |
| id | Yes | Custom ID for the document in Chroma. |
| path | Yes | Full S3 object key of the document to index. |
| target_collection_name | No | Overrides the target collection (created if it doesn’t exist). |
| metadata | No | Additional metadata. Values can be scalars (string, number, boolean, or null) or homogeneous arrays of scalars. |
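Metadata files are plain JSON, so they are easy to generate alongside the documents they describe. The sketch below builds one from the fields in the table. Both helper names are hypothetical, and placing the metadata file next to the document with a matching base name is just one convention, since the file can have any name.

```python
import json
import posixpath
from typing import Optional


def build_meta_json(
    doc_id: str,
    path: str,
    target_collection_name: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> str:
    """Serialize a metadata file body with the required and optional fields."""
    if not doc_id or not path:
        raise ValueError("id and path are required")
    body: dict = {"version": "chroma-v1", "id": doc_id, "path": path}
    if target_collection_name:
        body["target_collection_name"] = target_collection_name
    if metadata:
        body["metadata"] = metadata
    return json.dumps(body, indent=2)


def sibling_meta_key(document_key: str) -> str:
    """Suggest a metadata key beside the document, e.g. docs/report.meta.json."""
    stem, _extension = posixpath.splitext(document_key)
    return stem + ".meta.json"


print(sibling_meta_key("docs/report.pdf"))
print(build_meta_json("unique-document-id", "docs/report.pdf", metadata={"year": 2024}))
```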
Example Workflow#
# Upload document
aws s3 cp report.pdf s3://my-bucket/docs/report.pdf
# Upload metadata file to trigger indexing
aws s3 cp report.meta.json s3://my-bucket/docs/report.meta.json

Multi-Tenant Buckets#
S3 Sync supports multi-tenant setups where a single bucket serves multiple tenants.
Path prefixes restrict which S3 keys a source can sync. When a path_prefix is configured, only objects whose key starts with that prefix can be synced — invocations for keys outside the prefix will be rejected. Create one source per tenant with a distinct prefix (e.g. tenant-a/, tenant-b/) to enforce isolation within a shared bucket.
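One detail worth illustrating is the trailing slash in the example prefixes. Assuming the check is a plain "key starts with prefix" string comparison, as the wording above suggests, a prefix without a trailing slash would also match keys belonging to a similarly named tenant. The helper names below are hypothetical.

```python
def key_allowed(object_key: str, path_prefix: str) -> bool:
    """Assumed rule: a key is allowed if it starts with the source's path_prefix."""
    return object_key.startswith(path_prefix)


def tenant_prefix(tenant_id: str) -> str:
    """Normalize a tenant ID into a prefix that ends with exactly one slash."""
    return tenant_id.strip("/") + "/"


# Without the trailing slash, "tenant-a" also matches keys under "tenant-ab/".
print(key_allowed("tenant-ab/report.pdf", "tenant-a"))
print(key_allowed("tenant-ab/report.pdf", tenant_prefix("tenant-a")))
print(key_allowed("tenant-a/report.pdf", tenant_prefix("tenant-a")))
```

Ending each tenant's prefix with a delimiter keeps tenants whose names share a common beginning from overlapping.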
Metadata files offer another approach to multi-tenancy. In metadata mode, each .meta.json file can specify a target_collection_name, routing different files to different collections. This lets you partition data per tenant at the collection level without needing separate sources or path prefixes.