Code Embeddings
Embeddings are at the core of multiple enterprise use cases, such as retrieval systems, clustering, code analytics, classification, and a variety of search applications. With code embeddings, you can embed code databases and repositories, and power coding assistants with state-of-the-art retrieval capabilities.
How to Generate Embeddings
To generate code embeddings using Mistral AI’s embeddings API, we can make a request to the API endpoint and specify the embedding model codestral-embed, along with providing a list of input texts. The API will then return the corresponding embeddings as numerical vectors, which can be used for further analysis or processing in NLP applications.
We also provide output_dtype and output_dimension parameters that allow you to control the type and dimensional size of your embeddings.
output_dtype allows you to select the precision and format of the embeddings, enabling you to obtain embeddings with your desired level of numerical accuracy and representation.
The accepted dtypes are:
- float (default): A list of 32-bit (4-byte) single-precision floating-point numbers. Provides the highest precision and retrieval accuracy.
- int8: A list of 8-bit (1-byte) integers ranging from -128 to 127.
- uint8: A list of 8-bit (1-byte) integers ranging from 0 to 255.
- binary: A list of 8-bit integers that represent bit-packed, quantized single-bit embedding values using the int8 type. The length of the returned list of integers is 1/8 of output_dimension. This type uses the offset binary method.
- ubinary: Similar to binary, but uses the uint8 type for bit-packed, quantized single-bit embedding values.
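To make the bit-packed formats more concrete, here is a minimal sketch of how eight single-bit values could fit into one integer, and how the int8 (binary) and uint8 (ubinary) views relate. The helper names `pack_bits` and `unpack_bits` are hypothetical, and thresholding at zero, the big-endian bit order, and reading "offset binary" as "unsigned byte minus 128" are assumptions made for illustration, not a documented description of how the API quantizes.

```python
def pack_bits(values, signed=False):
    """Quantize floats to single bits and pack 8 bits per integer.

    Assumptions (illustrative only): a value > 0 maps to bit 1, and the
    first value of each group of 8 becomes the most significant bit.
    """
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    packed = []
    for start in range(0, len(values), 8):
        byte = 0
        for v in values[start:start + 8]:
            byte = (byte << 1) | (1 if v > 0 else 0)
        # Offset binary (assumed): shift 0..255 into the int8 range -128..127.
        packed.append(byte - 128 if signed else byte)
    return packed


def unpack_bits(packed, signed=False):
    """Recover the list of 0/1 values from a bit-packed list."""
    bits = []
    for value in packed:
        byte = value + 128 if signed else value
        bits.extend((byte >> shift) & 1 for shift in range(7, -1, -1))
    return bits


# A made-up 16-dimensional vector, not real model output.
toy = [0.3, -0.1, 0.7, -0.2, -0.5, 0.9, 0.1, -0.4,
       -0.3, 0.2, 0.2, -0.8, 0.6, -0.1, -0.9, 0.4]

ubinary = pack_bits(toy)               # uint8-style values
binary = pack_bits(toy, signed=True)   # int8-style values

print(ubinary, binary)  # 16 dimensions pack into 2 integers
```

Under these assumptions, both formats carry the same bits. Only the integer range used to store each packed byte differs.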
output_dimension allows you to select the size of the embedding. It defaults to 1536 and has a maximum value of 3072.
For any integer target dimension n, you can choose to retain the first n dimensions. These dimensions are ordered by relevance, and the first n are selected for a smooth trade-off between quality and cost.
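As a rough illustration of "retain the first n dimensions", the sketch below truncates a vector locally. The function `truncate_embedding` is hypothetical. Whether the API re-normalizes after truncation is not stated here, so the `renormalize` option is an assumption, and setting output_dimension in the request remains the documented way to obtain smaller embeddings.

```python
import math


def truncate_embedding(vector, n, renormalize=True):
    """Keep the first n values of a vector, optionally rescaling to unit length.

    Hypothetical client-side helper; re-normalization is an assumption.
    """
    if not 1 <= n <= len(vector):
        raise ValueError("n must be between 1 and len(vector)")
    head = list(vector[:n])
    if not renormalize:
        return head
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Made-up numbers for illustration, not real model output.
full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)
print(short)
```

Rescaling keeps the truncated vectors at unit length, so distance and dot-product comparisons between them stay on a common scale.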
Below is an example of how to use the embeddings API to generate embeddings for a list of code-related input texts.
import os
from mistralai import Mistral

api_key = os.environ["MISTRAL_API_KEY"]
model = "codestral-embed"

client = Mistral(api_key=api_key)

embeddings_batch_response = client.embeddings.create(
    model=model,
    # output_dtype="binary",
    # output_dimension=512,
    inputs=[
        "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. Example 1: Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1]. Example 2: Input: nums = [3,2,4], target = 6 Output: [1,2] Example 3: Input: nums = [3,3], target = 6 Output: [0,1] Constraints: 2 <= nums.length <= 104 -109 <= nums[i] <= 109 -109 <= target <= 109 Only one valid answer exists.",
        "class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]: d = {} for i, x in enumerate(nums): if (y := target - x) in d: return [d[y], i] d[x] = i"
    ],
)

import { Mistral } from "@mistralai/mistralai";

const apiKey = process.env.MISTRAL_API_KEY;
const model = "codestral-embed";

const client = new Mistral({ apiKey: apiKey });

async function getEmbeddings() {
  const embeddingsBatchResponse = await client.embeddings.create({
    model: model,
    // output_dtype: "binary",
    // output_dimension: 512,
    inputs: [
      "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. Example 1: Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1]. Example 2: Input: nums = [3,2,4], target = 6 Output: [1,2] Example 3: Input: nums = [3,3], target = 6 Output: [0,1] Constraints: 2 <= nums.length <= 104 -109 <= nums[i] <= 109 -109 <= target <= 109 Only one valid answer exists.",
      "class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]: d = {} for i, x in enumerate(nums): if (y := target - x) in d: return [d[y], i] d[x] = i"
    ],
  });
}

// Call the async function
getEmbeddings().catch(console.error);

problem_description="Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. Example 1: Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1]. Example 2: Input: nums = [3,2,4], target = 6 Output: [1,2] Example 3: Input: nums = [3,3], target = 6 Output: [0,1] Constraints: 2 <= nums.length <= 104 -109 <= nums[i] <= 109 -109 <= target <= 109 Only one valid answer exists."
solution="class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]: d = {} for i, x in enumerate(nums): if (y := target - x) in d: return [d[y], i] d[x] = i"
curl -X POST "https://api.mistral.ai/v1/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${API_KEY}" \
-d '{"model": "codestral-embed", "output_dimension": 10, "output_dtype": "binary", "input": ["'"$problem_description"'", "'"$solution"'"]}' \
-o embedding.json

{
  "id": "e49d725673554aa480eb639cfc3bd7b1",
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.03143310546875,
        -0.001312255859375,
        -0.048126220703125,
        0.18017578125,
        ...
        -0.0146026611328125,
        0.0020160675048828125,
        -0.00493621826171875,
        0.0023822784423828125
      ],
      "index": 0
    },
    {
      "object": "embedding",
      "embedding": [
        -0.0616455078125,
        -0.1959228515625,
        0.060791015625,
        0.206298828125,
        ...
        -0.0045013427734375,
        0.002071380615234375,
        -0.003078460693359375,
        0.004718780517578125
      ],
      "index": 1
    }
  ],
  "model": "codestral-embed",
  "usage": {
    "prompt_audio_seconds": null,
    "prompt_tokens": 263,
    "total_tokens": 263,
    "completion_tokens": 0,
    "request_count": null,
    "prompt_token_details": null
  }
}

Let’s take a look at the length of the first embedding:
len(embeddings_batch_response.data[0].embedding)

It returns 1536, which means that our embedding dimension is 1536. The codestral-embed model generates embedding vectors of up to 3072 dimensions for each text string, regardless of the text length; you can reduce the dimension using output_dimension if needed. It’s worth noting that while higher-dimensional embeddings can better capture text information and improve the performance of NLP tasks, they require more resources and may increase latency and memory usage when storing and processing them. Consider this trade-off between performance and computational resources when designing NLP systems that rely on text embeddings.
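To put the storage side of this trade-off in numbers, the sketch below estimates the raw size of one embedding from the byte widths listed for each output_dtype (4 bytes per float value, 1 byte per int8 or uint8 value, and 1/8 of output_dimension integers for the bit-packed types). The helper `embedding_bytes` is hypothetical and counts only the vector payload, ignoring any index or serialization overhead.

```python
# Bytes per stored value for the non-packed dtypes described above.
BYTES_PER_VALUE = {"float": 4, "int8": 1, "uint8": 1}


def embedding_bytes(output_dimension, output_dtype="float"):
    """Estimate the raw payload size, in bytes, of a single embedding."""
    if output_dtype in ("binary", "ubinary"):
        # Bit-packed: one 8-bit integer holds 8 dimensions.
        return output_dimension // 8
    return output_dimension * BYTES_PER_VALUE[output_dtype]


for dim in (512, 1536, 3072):
    for dtype in ("float", "int8", "binary"):
        print(dim, dtype, embedding_bytes(dim, dtype), "bytes")
```

For example, a default 1536-dimensional float embedding takes 6144 bytes, while the same dimension in a bit-packed format takes 192 bytes.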
Below you will find some examples of how to use the Codestral Embeddings API and different use cases.
Let’s take a look at a simple example. To simplify working with text embeddings, we can wrap the embedding API in this function:
from sklearn.metrics.pairwise import euclidean_distances
from datasets import load_dataset

def get_code_embedding(inputs):
    # Embed the inputs and return the vector of the first input only
    embeddings_batch_response = client.embeddings.create(
        model=model,
        inputs=inputs
    )
    return embeddings_batch_response.data[0].embedding

Suppose we have two code snippets: one solving two sum and the other solving reverse integer. We want to find how similar each code snippet is to the reference code, palindrome number. We can see that the distance between the reference embedding and the reverse integer embedding is smaller than the distance between the reference embedding and the two sum embedding.
Inputs:
{
  "code_snippets": {
    "two_sum_solution": "classSolution:deftwoSum(self,nums:List[int],target:int)->List[int]:d={}fori,xinenumerate(nums):if(y:=target-x)ind:return[d[y],i]d[x]=i",
    "reverse_integer_solution": "classSolution:defreverse(self,x:int)->int:ans=0mi,mx=-(2**31),2**31-1whilex:ifans<mi//10+1orans>mx//10:return0y=x%10ifx<0andy>0:y-=10a"
  },
  "reference_code_snippet": "classSolution:defisPalindrome(self,x:int)->bool:ifx<0or(xandx%10==0):returnFalsey=0whiley<x:y=y*10+x%10x//=10returnxin(y,y//10)"
}

dataset = load_dataset("newfacade/LeetCodeDataset")
two_sum_solution = dataset["train"][0]["completion"]
reverse_integer_solution = dataset["train"][6]["completion"]
palindrome_number_solution = dataset["train"][8]["completion"]

def remove_whitespace(code):
    # Strip newlines, tabs, and spaces so each snippet becomes one compact string
    return code.replace("\n", "").replace("\t", "").replace(" ", "")

two_sum_solution_clean = remove_whitespace(two_sum_solution)
reverse_integer_solution_clean = remove_whitespace(reverse_integer_solution)
palindrome_number_solution_clean = remove_whitespace(palindrome_number_solution)

code_snippets = [
    two_sum_solution_clean,
    reverse_integer_solution_clean
]

embeddings = [get_code_embedding([t]) for t in code_snippets]

# Use the whitespace-stripped reference so it matches the Inputs listing
reference_code_snippet = palindrome_number_solution_clean
reference_embedding = get_code_embedding([reference_code_snippet])

for t, e in zip(code_snippets, embeddings):
    distance = euclidean_distances([e], [reference_embedding])
    print(t, distance)

classSolution:deftwoSum(self,nums:List[int],target:int)->List[int]:d={}fori,xinenumerate(nums):if(y:=target-x)ind:return[d[y],i]d[x]=i [[0.909916]]
classSolution:defreverse(self,x:int)->int:ans=0mi,mx=-(2**31),2**31-1whilex:ifans<mi//10+1orans>mx//10:return0y=x%10ifx<0andy>0:y-=10ans=ans*10+yx=(x-y)//10returnans [[0.64201937]]

In our example above, we used the Euclidean distance to measure the distance between embedding vectors (note that since Mistral AI embeddings have norm 1, cosine similarity, dot product, and Euclidean distance all produce the same ranking).
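The equivalence between these measures for norm-1 vectors can be checked numerically. For unit vectors a and b, cosine similarity equals the dot product, and the squared Euclidean distance satisfies ||a - b||^2 = 2 - 2(a · b), so a smaller distance always corresponds to a larger similarity. The sketch below uses small made-up vectors rather than real embeddings, and the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors for illustration only.
reference = normalize([1.0, 2.0, 3.0])
candidates = {
    "close": normalize([1.0, 2.0, 2.5]),
    "far": normalize([3.0, -1.0, 0.5]),
}

for name, vec in candidates.items():
    d = euclidean(reference, vec)
    s = dot(reference, vec)
    # For unit vectors, d**2 and 2 - 2*s agree up to floating-point error.
    print(name, round(d * d, 6), round(2 - 2 * s, 6))

# Ranking by ascending distance matches ranking by descending similarity.
by_distance = sorted(candidates, key=lambda k: euclidean(reference, candidates[k]))
by_similarity = sorted(candidates, key=lambda k: -dot(reference, candidates[k]))
```

Because the rankings agree, the choice between these measures for unit-length embeddings can be made on convenience or library support.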
We wrote a function get_embeddings_by_chunks that splits data into chunks and then sends each chunk to the Mistral AI Embeddings API to get the embeddings. Then we saved the embeddings as a new column in the dataframe. Note that the API will provide auto-chunking in the future, so that users don’t need to manually split the data into chunks before sending it.
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/mistralai/cookbook/main/data/LeetCodeTSNE.csv"
)

def get_embeddings_by_chunks(data, chunk_size):
    # Split the inputs into fixed-size chunks and send one request per chunk
    chunks = [data[x : x + chunk_size] for x in range(0, len(data), chunk_size)]
    embeddings_response = [
        client.embeddings.create(model=model, inputs=c) for c in chunks
    ]
    # Flatten the per-chunk responses into a single list of vectors
    return [d.embedding for e in embeddings_response for d in e.data]

df["embeddings"] = get_embeddings_by_chunks(df["Code"].tolist(), 50)
display(df.head())
We mentioned previously that our embeddings have 1536 dimensions, which makes them impossible to visualize directly. Thus, in order to visualize our embeddings, we can use a dimensionality reduction technique such as t-SNE to project our embeddings into a lower-dimensional space that is easier to visualize.
In this example, we transform our embeddings to 2 dimensions and create a 2D scatter plot showing the relationships among embeddings of different problems.
import seaborn as sns
from sklearn.manifold import TSNE
import numpy as np
tsne = TSNE(n_components=2, random_state=0).fit_transform(np.array(df['embeddings'].to_list()))
ax = sns.scatterplot(x=tsne[:, 0], y=tsne[:, 1], hue=np.array(df['Name'].to_list()))
sns.move_legend(ax, 'upper left', bbox_to_anchor=(1, 1))
For more information and guides on how to use our embeddings SDK, we have the following cookbooks: