Basic Semantic Search with Cohere Models

Summary: This page describes how to do basic semantic search with Cohere's models.

Original Documentation


Language models give computers the ability to search by meaning and go beyond searching by matching keywords. This capability is called semantic search.

Searching an archive using sentence embeddings

In this notebook, we’ll build a simple semantic search engine. The applications of semantic search go beyond building a web search engine: it can power a private search engine for internal documents or records, or features like StackOverflow’s “similar questions”. We’ll work through the following steps:

Get the archive of questions
Embed the archive
Search using an index and nearest neighbor search
Visualize the archive based on the embeddings

If you’re running an older version of the SDK, you might need to upgrade it like so:

#!pip install --upgrade cohere

Get your Cohere API key by signing up for a Cohere account, then paste it into the setup cell below.

1. Getting Set Up

#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

You’ll need your API key for this next cell. Sign up to Cohere and get one if you haven’t yet.

model_name = "embed-v4.0"
api_key = ""
input_type_embed = "search_document"

co = cohere.Client(api_key)
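
Rather than pasting the key into the notebook, one option is to read it from an environment variable before creating the client. The sketch below is only an illustration: the variable name `COHERE_API_KEY` is an assumption, not something this notebook defines.

```python
import os

# COHERE_API_KEY is an assumed variable name; use whatever name you exported.
api_key = os.environ.get("COHERE_API_KEY", "")

if not api_key:
    # Fall back to pasting the key manually, as the notebook does.
    print("COHERE_API_KEY is not set; paste your key into api_key instead.")
```

The resulting `api_key` can then be passed to `cohere.Client(api_key)` exactly as in the cell above.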

2. Get The Archive of Questions

We’ll use the trec dataset, which is made up of questions and their categories, and keep the first 1,000 questions.

dataset = load_dataset("trec", split="train")

df = pd.DataFrame(dataset)[:1000]

df.head(10)

|   | label-coarse | label-fine | text |
|---|---|---|---|
| 0 | 0 | 0 | How did serfdom develop in and then leave Russia ? |
| 1 | 1 | 1 | What films featured the character Popeye Doyle ? |
| 2 | 0 | 0 | How can I find a list of celebrities ' real names ? |
| 3 | 1 | 2 | What fowl grabs the spotlight after the Chinese Year of the Monkey ? |
| 4 | 2 | 3 | What is the full form of .com ? |
| 5 | 3 | 4 | What contemptible scoundrel stole the cork from my lunch ? |
| 6 | 3 | 5 | What team did baseball 's St. Louis Browns become ? |
| 7 | 3 | 6 | What is the oldest profession ? |
| 8 | 0 | 7 | What are liver enzymes ? |
| 9 | 3 | 4 | Name the scar-faced bounty hunter of The Old West . |

2. Embed the archive

The next step is to embed the text of the questions.

Getting a thousand embeddings of this length should take about fifteen seconds.

embeds = co.embed(texts=list(df['text']),
                  model=model_name,
                  input_type=input_type_embed).embeddings

embeds = np.array(embeds)
embeds.shape

(1000, 4096)
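
For a larger archive, it may help to send the texts in batches rather than in a single call. The sketch below shows only the batching step. The helper name `batched` is hypothetical, and the default size of 96 is an assumption about the per-request limit that should be checked against the current API reference (the SDK may also batch for you).

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int = 96) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each.

    `size=96` is an assumed per-request limit, not a value from this notebook.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Toy stand-in for list(df['text']).
questions = [f"question {i}" for i in range(200)]
batch_sizes = [len(batch) for batch in batched(questions)]
print(batch_sizes)  # [96, 96, 8]
```

Each batch could then be passed to `co.embed` with the same `model` and `input_type` as above, and the returned embeddings appended to one list before converting it to a NumPy array.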

3. Search using an index and nearest neighbor search

Building the search index from the embeddings

Let’s now use Annoy to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include Faiss, ScaNN, and PyNNDescent).

After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).

search_index = AnnoyIndex(embeds.shape[1], 'angular')
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

True
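
The index above uses Annoy's 'angular' metric, which the Annoy documentation describes as the Euclidean distance between normalized vectors, equal to sqrt(2(1 - cos(u, v))). The sketch below, in plain Python on toy vectors, illustrates that relationship and how a reported distance can be mapped back to a cosine similarity. The function names are hypothetical helpers, not part of Annoy.

```python
import math


def angular_distance(u, v):
    """Euclidean distance between the unit-normalized versions of u and v."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


same = angular_distance([3.0, 4.0], [6.0, 8.0])        # same direction
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])  # right angle
print(same, orthogonal)
print(angular_to_cosine(orthogonal))
```

Under this definition, distances range from 0 (same direction) to 2 (opposite direction), and a distance of 0.904278 corresponds to a cosine similarity of roughly 0.59.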

3.1. Find the neighbors of an example from the dataset

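If we’re only interested in measuring the distance between the questions in the dataset (no outside queries), a simple way is to calculate the distance between every pair of embeddings we have. Since we’ve already built an index, we can instead look up an existing question’s nearest neighbors directly by its item id.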

example_id = 92

similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results

Question:'What are bear and bull markets ?'
Nearest neighbors:

|   | texts | distance |
|---|---|---|
| 614 | What animals do you find in the stock market ? | 0.904278 |
| 137 | What are equity securities ? | 0.992819 |
| 513 | What do economists do ? | 1.066583 |
| 307 | What does NASDAQ stand for ? | 1.080738 |
| 363 | What does it mean `` Rupee Depreciates '' ? | 1.086724 |
| 932 | Why did the world enter a global depression in 1929 ? | 1.099370 |
| 547 | Where can stocks be traded on-line ? | 1.105368 |
| 922 | What is the difference between a median and a mean ? | 1.141870 |
| 601 | What is `` the bear of beers '' ? | 1.154140 |
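
Section 3.1 mentions computing the distance between every pair of embeddings as a simple option when there are no outside queries. The following is a minimal brute-force sketch of that idea in plain Python on toy vectors. The function names and the toy data are hypothetical, and for the real `embeds` array a vectorized routine such as scikit-learn's `cosine_similarity` (imported earlier) would be the more practical choice.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return an n x n list of lists with the cosine similarity of every pair."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sims[i][j] = dot / (norms[i] * norms[j])
    return sims


def nearest_neighbors(sims, item_id, k):
    """Ids of the k items most similar to item_id, excluding the item itself."""
    order = sorted(range(len(sims)), key=lambda j: sims[item_id][j], reverse=True)
    return [j for j in order if j != item_id][:k]


# Toy 2-D "embeddings"; real embeddings have many more dimensions.
toy_embeds = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
sims = cosine_similarity_matrix(toy_embeds)
neighbors = nearest_neighbors(sims, item_id=0, k=2)
print(neighbors)  # [1, 3]
```

This approach is exact but compares every pair, so its cost grows quadratically with the number of texts, which is why an approximate index is used for larger archives.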

3.2. Find the neighbors of a user query

We’re not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.

query = "What is the tallest mountain in the world?"
input_type_query = "search_query"

query_embed = co.embed(texts=[query],
                  model=model_name,
                  input_type=input_type_query).embeddings

similar_item_ids = search_index.get_nns_by_vector(query_embed[0],10,
                                                include_distances=True)
query_results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]})


print(f"Query:'{query}'\nNearest neighbors:")
print(query_results) # NOTE: Your results might look slightly different to ours.

Query:'What is the tallest mountain in the world?'
Nearest neighbors:

|   | texts | distance |
|---|---|---|
| 236 | What is the name of the tallest mountain in the world ? | 0.447309 |
| 670 | What is the highest mountain in the world ? | 0.552254 |
| 412 | What was the highest mountain on earth before Mount Everest was discovered ? | 0.801252 |
| 907 | What mountain range is traversed by the highest railroad in the world ? | 0.929516 |
| 435 | What is the highest peak in Africa ? | 0.930806 |
| 109 | Where is the highest point in Japan ? | 0.977315 |
| 901 | What 's the longest river in the world ? | 1.064209 |
| 114 | What is the largest snake in the world ? | 1.076390 |
| 962 | What 's the second-largest island in the world ? | 1.088034 |
| 27 | What is the highest waterfall in the United States ? | 1.091145 |
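
The embed-then-lookup steps in this section can be wrapped in one reusable function. The sketch below is an illustration under assumptions: `search` and `ToyIndex` are hypothetical names, and `ToyIndex` is only a stand-in that mimics the `get_nns_by_vector` call used above (with plain Euclidean distance) so the example runs without Annoy or an API key.

```python
import math


def search(query, embed_fn, index, texts, k=10):
    """Embed the query, look up its k nearest neighbors, return (text, distance) pairs."""
    query_vector = embed_fn(query)
    ids, distances = index.get_nns_by_vector(query_vector, k, include_distances=True)
    return [(texts[i], d) for i, d in zip(ids, distances)]


class ToyIndex:
    """Brute-force stand-in exposing the same lookup method as the Annoy index."""

    def __init__(self, vectors):
        self.vectors = vectors

    def get_nns_by_vector(self, vector, n, include_distances=False):
        scored = sorted((math.dist(vector, v), i) for i, v in enumerate(self.vectors))[:n]
        ids = [i for _, i in scored]
        if include_distances:
            return ids, [d for d, _ in scored]
        return ids


toy_texts = ["mountains", "rivers", "snakes"]
toy_index = ToyIndex([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = search("tallest peak", lambda q: [0.9, 0.1], toy_index, toy_texts, k=2)
print(hits)
```

With the notebook's objects, `embed_fn` would wrap the `co.embed` call with `input_type="search_query"` and return its first embedding, `index` would be `search_index`, and `texts` would be `list(df['text'])`.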

4. Visualizing the archive

Finally, let’s plot out all the questions onto a 2D chart so you’re able to visualize the semantic similarities of this dataset!

#@title Plot the archive {display-mode: "form"}

reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
df_explore = pd.DataFrame(data={'text': df['text']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()

Hover over the points to read the text. Do you see patterns in the clustered points, such as similar questions or questions about similar topics?

This concludes this introductory guide to semantic search using sentence embeddings. As you continue down the path of building a search product, additional considerations arise, such as dealing with long texts or fine-tuning the embeddings for a specific use case.
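
As one illustration of the long-text consideration, a common approach is to split a document into overlapping chunks and embed each chunk separately. The sketch below is a simple word-window splitter. The function name and the default sizes are assumptions chosen for illustration, not recommendations from this guide.

```python
def chunk_words(text, max_words=100, overlap=20):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words < 1 or not 0 <= overlap < max_words:
        raise ValueError("need max_words >= 1 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, max_words=4, overlap=1)
print(chunks)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk could then be embedded and indexed like the short questions in this notebook, keeping a mapping from chunk id back to the source document.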

We can’t wait to see what you start building! Share your projects or find support on Discord.

Link last verified June 7, 2026. View original ↗

Source: Cohere Docs

Link last verified: 2026-02-26