Configure Collections


Summary: Learn how to configure Chroma collection index settings and embedding functions.


Chroma collections have a <code>configuration</code> that determines how their embeddings index is constructed and used. We use default values for these index configurations that should give you great performance for most use cases out of the box.
The <a href=../embeddings/embedding-functions>embedding function</a> you choose to use in your collection also affects its index construction, and is included in the configuration.
When you create a collection, you can customize these index configuration values for different data, accuracy, and performance requirements. Some query-time configurations can also be customized after the collection’s creation using the <code>.modify</code> function.
<h2 id=hnsw-index-configuration>HNSW Index Configuration<a class=anchor href=#hnsw-index-configuration>#</a></h2>
In Single Node Chroma collections, we use an HNSW (Hierarchical Navigable Small World) index to perform approximate nearest neighbor (ANN) search.
<accordion title="What is an HNSW index?">An HNSW (Hierarchical Navigable Small World) index is a graph-based data structure designed for efficient approximate nearest neighbor search in high-dimensional vector spaces. It works by constructing a multi-layered graph where each layer contains a subset of the data points, with higher layers being sparser and serving as "highways" for faster navigation.
The algorithm builds connections between nearby points at each layer, creating "small-world" properties that allow for efficient search. During search, the algorithm starts at the top layer and navigates toward the query point in the embedding space, then moves down through successive layers, refining the search at each level until it finds the final nearest neighbors.</accordion>
The HNSW index parameters include:
<ul><li><code>space</code> defines the distance function of the embedding space, and hence how similarity is defined. The default is <code>l2</code> (squared L2 norm), and other possible values are <code>cosine</code> (cosine similarity) and <code>ip</code> (inner product).</li></ul>
<table><thead><tr><th>Distance</th><th style=text-align:center>Parameter</th><th style=text-align:right>Equation</th><th style=text-align:center>Intuition</th></tr></thead><tbody>
<tr><td>Squared L2</td><td style=text-align:center><code>l2</code></td><td style=text-align:right>$d = \sum\left(A_i-B_i\right)^2$</td><td style=text-align:center>Measures absolute geometric distance between vectors, making it suitable when you want true spatial proximity.</td></tr>
<tr><td>Inner product</td><td style=text-align:center><code>ip</code></td><td style=text-align:right>$d = 1.0 - \sum\left(A_i \times B_i\right)$</td><td style=text-align:center>Focuses on vector alignment and magnitude, often used for recommendation systems where larger values indicate stronger preferences.</td></tr>
<tr><td>Cosine similarity</td><td style=text-align:center><code>cosine</code></td><td style=text-align:right>$d = 1.0 - \frac{\sum\left(A_i \times B_i\right)}{\sqrt{\sum\left(A_i^2\right)} \cdot \sqrt{\sum\left(B_i^2\right)}}$</td><td style=text-align:center>Measures only the angle between vectors (ignoring magnitude), making it ideal for text embeddings or cases where you care about direction rather than scale.</td></tr>
</tbody></table>
You should make sure that the <code>space</code> you choose is 
supported by your collection’s embedding function. Every Chroma embedding function specifies its default space and a list of supported spaces. <ul><li><code>ef_construction</code> determines the size of the candidate list used to select neighbors during index creation. A higher value improves index quality at the cost of more memory and time, while a lower value speeds up construction with reduced accuracy. The default value is <code>100</code>.</li><li><code>ef_search</code> determines the size of the dynamic candidate list used while searching for the nearest neighbors. A higher value improves recall and accuracy by exploring more potential neighbors but increases query time and computational cost, while a lower value results in faster but less accurate searches. The default value is <code>100</code>. This field can be modified after creation.</li><li><code>max_neighbors</code> is the maximum number of neighbors (connections) that each node in the graph can have during the construction of the index. A higher value results in a denser graph, leading to better recall and accuracy during searches but increases memory usage and construction time. A lower value creates a sparser graph, reducing memory usage and construction time but at the cost of lower search accuracy and recall. The default value is <code>16</code>.</li><li><code>num_threads</code> specifies the number of threads to use during index construction or search operations. The default value is <code>multiprocessing.cpu_count()</code> (available CPU cores). This field can be modified after creation.</li><li><code>batch_size</code> controls the number of vectors to process in each batch during index operations. The default value is <code>100</code>. This field can be modified after creation.</li><li><code>sync_threshold</code> determines when to synchronize the index with persistent storage. The default value is <code>1000</code>. 
This field can be modified after creation.</li><li><code>resize_factor</code> controls how much the index grows when it needs to be resized. The default value is <code>1.2</code>. This field can be modified after creation.</li></ul>
For example, here we create a collection with customized values for <code>space</code> and <code>ef_construction</code>:
<div class=highlight><pre tabindex=0 style=background-color:#f7f7f7;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-python data-lang=python>import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# An in-memory client; substitute your own client if you already have one.
client = chromadb.Client()

collection = client.create_collection(
    name="my-collection",
    embedding_function=OpenAIEmbeddingFunction(model_name="text-embedding-3-small"),
    configuration={
        "hnsw": {
            "space": "cosine",
            "ef_construction": 200
        }
    }
)
</code></pre></div>
<div class=highlight><pre tabindex=0 style=background-color:#f7f7f7;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-typescript data-lang=typescript>import { ChromaClient } from "chromadb";
import { OpenAIEmbeddingFunction } from "@chroma-core/openai";

const client = new ChromaClient();

const collection = await client.createCollection({
  name: "my-collection",
  embeddingFunction: new OpenAIEmbeddingFunction({
    modelName: "text-embedding-3-small",
  }),
  configuration: {
    hnsw: {
      space: "cosine",
      ef_construction: 200,
    },
  },
});
</code></pre></div>
<h3 id=fine-tuning-hnsw-parameters>Fine-Tuning HNSW Parameters<a class=anchor href=#fine-tuning-hnsw-parameters>#</a></h3>
In the context of approximate nearest neighbor search, <strong>recall</strong> is the fraction of the true nearest neighbors that were retrieved. Increasing <code>ef_search</code> normally improves recall, but slows down queries. Similarly, increasing <code>ef_construction</code> improves recall, but increases memory usage and runtime when creating the index.
Choosing the right values for your HNSW parameters depends on your data, embedding function, and requirements for recall and performance. You may need to experiment with different construction and search values to find the ones that meet your requirements.
For example, for a dataset with 50,000 embeddings of 2048 dimensions, generated by
<div class=highlight><pre tabindex=0 style=background-color:#f7f7f7;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none><code class=language-python data-lang=python>import numpy as np

embeddings = np.random.randn(50000, 2048).astype(np.float32).tolist()
</code></pre></div>
we set up two Chroma collections:
<ul><li>The first is configured with <code>ef_search: 10</code>. When querying using a specific embedding from the set (with <code>id = 1</code>), the query takes <code>0.00529</code> seconds, and we get back embeddings with distances:</li></ul>
<pre><code>[3629.019775390625, 3666.576904296875, 3684.57080078125]</code></pre>
<ul><li>The second collection is configured with <code>ef_search: 100</code> and <code>ef_construction: 1000</code>. When issuing the same query, this time it takes <code>0.00753</code> seconds (about 42% slower), but with better results as measured by their distance:</li></ul>
<pre tabindex=0><code>[0.0, 3620.593994140625, 3623.275390625]

In this example, when querying with the test embedding (`id=1`), the first collection failed to find the embedding itself, despite it being in the collection (where it should have appeared as a result with a distance of `0.0`). The second collection, while slightly slower, found the query embedding itself (shown by the `0.0` distance) and returned closer neighbors overall, demonstrating better accuracy at the cost of query latency.

## SPANN Index Configuration

In Distributed Chroma and Chroma Cloud collections, we use a SPANN (Spacial Approximate Nearest Neighbors) index to perform approximate nearest neighbor (ANN) search.

Video: [SPANN Video](https://www.youtube.com/embed/1QdwYWd3S1g)

<Accordion title="What is a SPANN index?">
A SPANN index is a data structure used to efficiently find approximate nearest neighbors in large sets of high-dimensional vectors. It works by dividing the set into broad clusters (so we can ignore most of the data during search) and then building efficient, smaller indexes within each cluster for fast local lookups.
This two-level approach helps reduce both memory use and search time, making it practical to search billions of vectors, even when they are stored on hard drives or on separate machines in a distributed system.
</Accordion>

We currently don't allow customization or modification of SPANN configuration. If you set these values, they will be ignored by the server.

The SPANN index parameters include:

* `space` defines the distance function of the embedding space, and hence how similarity is defined. The default is `l2` (squared L2 norm), and other possible values are `cosine` (cosine similarity) and `ip` (inner product).

| Distance | Parameter | Equation | Intuition |
| ----------------- | :-------: | ---: | :---: |
| Squared L2 | `l2` | $d = \sum\left(A_i-B_i\right)^2$ | Measures absolute geometric distance between vectors, making it suitable when you want true spatial proximity. |
| Inner product | `ip` | $d = 1.0 - \sum\left(A_i \times B_i\right)$ | Focuses on vector alignment and magnitude, often used for recommendation systems where larger values indicate stronger preferences. |
| Cosine similarity | `cosine` | $d = 1.0 - \frac{\sum\left(A_i \times B_i\right)}{\sqrt{\sum\left(A_i^2\right)} \cdot \sqrt{\sum\left(B_i^2\right)}}$ | Measures only the angle between vectors (ignoring magnitude), making it ideal for text embeddings or cases where you care about direction rather than scale. |

* `search_nprobe` is the number of centers that are probed for a query. Higher values give more accurate results, but the query response time also increases as `search_nprobe` increases. Recommended values are 64 or 128. We don't allow setting a value higher than 128 today. The default value is 64.
* `write_nprobe` is the same as `search_nprobe` but for the index construction phase. It is the number of centers searched when appending or reassigning a point. It has the same limits as `search_nprobe`. The default value is 64.
* `ef_construction` determines the size of the candidate list used to select neighbors during index creation. A higher value improves index quality at the cost of more memory and time, while a lower value speeds up construction with reduced accuracy. The default value is 200.
* `ef_search` determines the size of the dynamic candidate list used while searching for the nearest neighbors. A higher value improves recall and accuracy by exploring more potential neighbors but increases query time and computational cost, while a lower value results in faster but less accurate searches. The default value is 200.
* `max_neighbors` defines the maximum number of neighbors for a node. The default value is 64.
* `reassign_neighbor_count` is the number of closest neighboring clusters of a split cluster whose points are considered for reassignment. The default value is 64.

## Embedding Function Configuration

The embedding function you choose when creating a collection, along with the parameters you instantiate it with, is persisted in the collection's configuration. This allows us to reconstruct it correctly when you use the collection across different clients.
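
As a rough illustration of what persisting an embedding function involves, the sketch below stores a function's name and constructor parameters as plain data and rebuilds an equivalent object from them. Every name in it (`ToyEmbeddingFunction`, `REGISTRY`, `to_config`, `from_config`) is hypothetical and written only for this example; it is not Chroma's implementation, just a way to see why both the function and its parameters need to be recorded.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ToyEmbeddingFunction:
    """Hypothetical stand-in for an embedding function (not a Chroma class)."""
    model_name: str
    truncate: str = "NONE"

    def __call__(self, texts):
        # Fake 2-dimensional "embedding": character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


# Hypothetical registry mapping a stored name to a constructor.
REGISTRY = {"toy": ToyEmbeddingFunction}


def to_config(name, ef):
    """Serialize the function's name and parameters to a JSON string."""
    return json.dumps({"name": name, "params": asdict(ef)})


def from_config(raw):
    """Rebuild an equivalent function from a stored JSON string."""
    stored = json.loads(raw)
    return REGISTRY[stored["name"]](**stored["params"])


original = ToyEmbeddingFunction(model_name="toy-small")
stored = to_config("toy", original)   # what would be saved with the collection
rebuilt = from_config(stored)         # what another client would reconstruct
```

Because only the name and parameters are stored, anything the function needs at runtime that is not part of them, such as an API key, still has to be available wherever the function is rebuilt.
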
You can set your embedding function as an argument to the "create" methods, or directly in the configuration.

Install the `openai` and `cohere` packages:

```bash
pip install openai cohere
```

```bash
poetry add openai cohere
```

```bash
uv pip install openai cohere
```

Creating collections with embedding function and custom configuration:

```python
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction, CohereEmbeddingFunction

# An in-memory client; substitute your own client if you already have one.
client = chromadb.Client()

# Using the `embedding_function` argument
openai_collection = client.create_collection(
    name="my_openai_collection",
    embedding_function=OpenAIEmbeddingFunction(
        model_name="text-embedding-3-small"
    ),
    configuration={"hnsw": {"space": "cosine"}}
)

# Setting `embedding_function` in the collection's `configuration`
cohere_collection = client.get_or_create_collection(
    name="my_cohere_collection",
    configuration={
        "embedding_function": CohereEmbeddingFunction(
            model_name="embed-english-light-v2.0",
            truncate="NONE"
        ),
        "hnsw": {"space": "cosine"}
    }
)
```

**Note:** Many embedding functions require API keys to interface with third-party embedding providers. The Chroma embedding functions will automatically look for the standard environment variable used to store a provider's API key. For example, the Chroma `OpenAIEmbeddingFunction` will set its `api_key` argument to the value of the `OPENAI_API_KEY` environment variable if it is set.

If your API key is stored in an environment variable with a non-standard name, you can configure your embedding function to use your custom environment variable by setting the `api_key_env_var` argument. In order for the embedding function to operate correctly, you will have to set this variable in every environment where you use your collection.
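
The lookup described in the note can be pictured with a small sketch. `resolve_api_key` is a hypothetical helper written for illustration, not a Chroma function, and the rule that an explicitly passed key wins over the environment is an assumption made here rather than something the note states.

```python
import os


def resolve_api_key(api_key=None, api_key_env_var="OPENAI_API_KEY"):
    """Hypothetical helper: explicit key first (an assumption), else the env var."""
    if api_key is not None:
        return api_key
    value = os.environ.get(api_key_env_var)
    if value is None:
        # Fail early with the variable name so the missing setting is obvious.
        raise ValueError(f"Set the {api_key_env_var} environment variable")
    return value


# Placeholder value for the example; a real key would be set outside the code.
os.environ["MY_CUSTOM_COHERE_API_KEY"] = "example-key"
key = resolve_api_key(api_key_env_var="MY_CUSTOM_COHERE_API_KEY")
```

Storing the variable's name rather than the key itself keeps the secret out of the saved configuration, which is why the variable has to exist in each environment that opens the collection.
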
For example:

```python
cohere_ef = CohereEmbeddingFunction(
    api_key_env_var="MY_CUSTOM_COHERE_API_KEY",
    model_name="embed-english-light-v2.0",
    truncate="NONE",
)
```

Install the `@chroma-core/openai` and `@chroma-core/cohere` packages:

```bash
npm install @chroma-core/openai @chroma-core/cohere
```

```bash
pnpm add @chroma-core/openai @chroma-core/cohere
```

```bash
bun add @chroma-core/openai @chroma-core/cohere
```

```bash
yarn add @chroma-core/openai @chroma-core/cohere
```

Creating collections with embedding function and custom configuration:

```typescript
import { OpenAIEmbeddingFunction } from "@chroma-core/openai";
import { CohereEmbeddingFunction } from "@chroma-core/cohere";

// Using the `embeddingFunction` argument
const openAICollection = await client.createCollection({
  name: "my_openai_collection",
  embeddingFunction: new OpenAIEmbeddingFunction({
    modelName: "text-embedding-3-small",
  }),
  configuration: { hnsw: { space: "cosine" } },
});

// Setting `embeddingFunction` in the collection's `configuration`
const cohereCollection = await client.getOrCreateCollection({
  name: "my_cohere_collection",
  configuration: {
    embeddingFunction: new CohereEmbeddingFunction({
      modelName: "embed-english-light-v2.0",
      truncate: "NONE",
    }),
    hnsw: { space: "cosine" },
  },
});
```

**Note:** Many embedding functions require API keys to interface with third-party embedding providers. The Chroma embedding functions will automatically look for the standard environment variable used to store a provider's API key. For example, the Chroma `OpenAIEmbeddingFunction` will set its `api_key` argument to the value of the `OPENAI_API_KEY` environment variable if it is set.

If your API key is stored in an environment variable with a non-standard name, you can configure your embedding function to use your custom environment variable by setting the `apiKeyEnvVar` argument.
In order for the embedding function to operate correctly, you will have to set this variable in every environment where you use your collection.

```typescript
const cohereEF = new CohereEmbeddingFunction({
  apiKeyEnvVar: "MY_CUSTOM_COHERE_API_KEY",
  modelName: "embed-english-light-v2.0",
  truncate: "NONE",
});
```
</code></pre></div><div class=artifact-freshness>Link last verified June 7, 2026. <a href=https://docs.trychroma.com/docs/collections/configure.md target=_blank rel=noopener>View original ↗</a></div><div class=artifact-source-label>Source: Chroma Docs</div><div class=artifact-verified>Link last verified: 2026-03-04</div></article></div></main></body></html>