Databricks ↗

Summary: Using Databricks and Pinecone to create and index vector embeddings at scale

Original Documentation




Databricks is a unified analytics platform built on Apache Spark. Spark's primary advantage is its ability to distribute workloads across a cluster of machines. You can scale a cluster horizontally by adding machines, or vertically by increasing the number of cores on each machine, to handle computationally intensive tasks like vector embedding, where parallelization can save many hours of computation time. Running Spark on GPUs can improve results further by combining fast GPU computation with parallelization across the cluster.

Efficiently create, ingest, and update vector embeddings at scale with Databricks and Pinecone.

Setup guide#

In this guide, you’ll create embeddings based on the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face, but the approach demonstrated here should work with any other model and dataset.

Before you begin#

Ensure you have the following:

1. Install the Spark-Pinecone connector#

Install the Spark-Pinecone connector as a library.

Configure the library as follows:

1. Select File path/S3 as the Library Source.
2. Enter the S3 URI for the Pinecone assembly JAR file:
```
s3://pinecone-jars/1.1.0/spark-pinecone-uberjar.jar
```

<span class="callout-start" data-callout-type="note"></span>
  Databricks platform users must use the Pinecone assembly JAR listed above to ensure that the proper dependencies are installed.
<span class="callout-end"></span>

3. Click Install.

Alternatively, install the connector from a JAR file that you upload to your workspace. Configure the library as follows:
1. Download the Pinecone assembly JAR file.
2. Select Workspace as the Library Source.
3. Upload the JAR file.
4. Click Install.

2. Load the dataset into partitions#

As your example dataset, use a collection of news articles from Hugging Face’s datasets library:

Create a new notebook attached to your cluster.

Install dependencies:

```
pip install datasets transformers pinecone torch
```

Load the dataset:

```
from datasets import load_dataset

# News articles; each record has a "document" field, which the later steps embed
dataset_name = "allenai/multinews_sparse_max"
dataset = load_dataset(dataset_name, split="train")
```

Convert the dataset from the Hugging Face format and repartition it:
```
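# Write through the /dbfs mount so that Spark can read the file from DBFS at /tmp
dataset.to_parquet("/dbfs/tmp/dataset_parquet.pq")
num_workers = 10  # number of partitions to split the data into
dataset_df = spark.read.parquet("/tmp/dataset_parquet.pq").repartition(num_workers)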
```
Once the repartition is complete, you get back a DataFrame, a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a dataframe in R or Python, but with richer optimizations under the hood. After repartitioning, each partition holds a roughly equal share of the original data.
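To see why the partitions end up close to equal in size, here is a minimal plain-Python sketch of round-robin assignment, which is the strategy Spark uses when `repartition` is called with only a partition count. The function name and the sample rows are hypothetical, and this is an illustration rather than Spark's actual implementation.

```python
def round_robin_partition(rows, num_partitions):
    """Assign rows to partitions in turn (illustrative stand-in for repartition(n))."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


# Hypothetical stand-in for the dataset: 1003 article identifiers
rows = [f"article-{i}" for i in range(1003)]
partitions = round_robin_partition(rows, num_partitions=10)
sizes = [len(partition) for partition in partitions]
print(sizes)  # partition sizes differ by at most one row
```

In a notebook, you can inspect the real partition count with `dataset_df.rdd.getNumPartitions()`.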

The dataset doesn’t have identifiers associated with each document, so add them:

```
from pyspark.sql.types import StringType
from pyspark.sql.functions import monotonically_increasing_id

dataset_df = dataset_df.withColumn("id", monotonically_increasing_id().cast(StringType()))
```

As its name suggests, withColumn adds a column to the dataframe. Here, the column contains a unique, monotonically increasing identifier (not necessarily consecutive) that you cast to a string.
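The identifiers are unique but not consecutive because, according to the Spark documentation, the current implementation places the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The following plain-Python sketch mimics that layout. The helper name is hypothetical, and the layout is an implementation detail that may change between Spark versions.

```python
def sketch_monotonic_id(partition_index, row_index):
    """Mimic the documented layout: partition ID in the upper bits, row counter in the lower 33 bits."""
    return (partition_index << 33) + row_index


# Two partitions with three rows each, cast to strings as in the guide
ids = [str(sketch_monotonic_id(p, r)) for p in range(2) for r in range(3)]
print(ids)
# ['0', '1', '2', '8589934592', '8589934593', '8589934594']
```

The jump between the third and fourth identifiers marks the boundary between partitions, which is harmless here because Pinecone only requires that record IDs be unique strings.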

3. Create the vector embeddings#

Create a UDF (User-Defined Function) to create the embeddings, using the AutoTokenizer and AutoModel classes from the Hugging Face transformers library:

```
from transformers import AutoTokenizer, AutoModel

def create_embeddings(partitionData):
    # Load the tokenizer and model once per partition rather than once per row
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    for row in partitionData:
        document = str(row.document)
        inputs = tokenizer(document, padding=True, truncation=True, return_tensors="pt", max_length=512)
        result = model(**inputs)
        # Use the hidden state of the first token as the document embedding
        embeddings = result.last_hidden_state[:, 0, :].cpu().detach().numpy()
        lst = embeddings.flatten().tolist()
        # Emit one row per document: id, values, namespace, metadata (JSON string), sparse_values
        yield [row.id, lst, "", "{}", None]
```

Apply the UDF to the data:
```
embeddings = dataset_df.rdd.mapPartitions(create_embeddings)  
```
A dataframe in Spark is a higher-level abstraction built on top of a more fundamental building block called a resilient distributed dataset (RDD). Here, you use the mapPartitions function, which provides finer control over the execution of the UDF by explicitly applying it to each partition of the RDD.
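The practical benefit of working per partition is that expensive setup, such as loading a model, runs once per partition instead of once per row. The plain-Python sketch below imitates that behavior without Spark. The `map_partitions` helper, the `load_log` list, and the placeholder "embedding" are all hypothetical stand-ins for illustration.

```python
def make_embedder(load_log):
    """Build a partition function that records each (simulated) model load."""
    def embed_partition(partition_rows):
        load_log.append("model loaded")  # stands in for from_pretrained(...)
        for row in partition_rows:
            # Placeholder embedding: a one-element vector based on text length
            yield [row["id"], [float(len(row["document"]))]]
    return embed_partition


def map_partitions(partitions, func):
    """Apply func to an iterator over each partition, as mapPartitions does."""
    for partition in partitions:
        yield from func(iter(partition))


partitions = [
    [{"id": "0", "document": "first article"}, {"id": "1", "document": "second"}],
    [{"id": "2", "document": "third article text"}],
]
load_log = []
results = list(map_partitions(partitions, make_embedder(load_log)))
print(len(results), "rows embedded with", len(load_log), "model loads")
```

Three rows are processed with only two simulated model loads, one per partition.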

Convert the resulting RDD back into a dataframe with the schema required by Pinecone:

```
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, LongType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("values", ArrayType(FloatType()), True),
    StructField("namespace", StringType(), True),
    StructField("metadata", StringType(), True),  # JSON-encoded string
    StructField("sparse_values", StructType([
        StructField("indices", ArrayType(LongType(), False), False),
        StructField("values", ArrayType(FloatType(), False), False)
    ]), True)
])
embeddings_df = spark.createDataFrame(data=embeddings, schema=schema)
```
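Before writing a large batch, it can help to spot-check a few rows against the shape this schema expects. The following is a minimal sketch in plain Python, not part of the connector. The `validate_row` helper is hypothetical, and the default dimension of 384 assumes the all-MiniLM-L6-v2 model used in this guide.

```python
import json

EXPECTED_DIMENSION = 384  # output size of all-MiniLM-L6-v2 (assumed model)


def validate_row(row, dimension=EXPECTED_DIMENSION):
    """Return a list of problems in one [id, values, namespace, metadata, sparse_values] row."""
    problems = []
    row_id, values, namespace, metadata, sparse_values = row
    if not isinstance(row_id, str) or not row_id:
        problems.append("id must be a non-empty string")
    if len(values) != dimension:
        problems.append(f"expected {dimension} values, got {len(values)}")
    if not isinstance(namespace, str):
        problems.append("namespace must be a string")
    try:
        json.loads(metadata)  # metadata is stored as a JSON-encoded string
    except (TypeError, ValueError):
        problems.append("metadata must be a JSON string")
    if sparse_values is not None:
        if len(sparse_values["indices"]) != len(sparse_values["values"]):
            problems.append("sparse indices and values must have equal length")
    return problems


good_row = ["42", [0.1] * 384, "", json.dumps({"source": "news"}), None]
bad_row = ["", [0.1] * 5, "", None, None]
print(validate_row(good_row))  # []
print(validate_row(bad_row))
```

In a notebook, a check like this could be run on a small sample, for example the rows returned by `embeddings.take(5)`.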

4. Save the embeddings in Pinecone#

Initialize the connection to Pinecone:

```
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")
```
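Rather than pasting the key into the notebook, you may prefer to read it from the environment. The sketch below is one way to do that with the standard library. The `load_api_key` helper and the `PINECONE_API_KEY` variable name are assumptions made for illustration, not something this guide prescribes.

```python
import os


def load_api_key(env=None, name="PINECONE_API_KEY"):
    """Return the API key from the environment, failing loudly if it is missing."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before connecting")
    return key


# Demonstration with an explicit mapping instead of the real environment
api_key = load_api_key(env={"PINECONE_API_KEY": "demo-key"})
print(api_key)
```

The returned value can then be passed as `Pinecone(api_key=api_key)`.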

Create an index for your embeddings:

```
pc.create_index(
    name="news",
    dimension=384,  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
```

Use the Spark-Pinecone connector to save the embeddings to your index:

```
api_key = "YOUR_API_KEY"
index_name = "news"  # the index created in the previous step

(
    embeddings_df.write
    .option("pinecone.apiKey", api_key)
    .option("pinecone.indexName", index_name)
    .format("io.pinecone.spark.pinecone.Pinecone")
    .mode("append")
    .save()
)
```

The process of writing the embeddings to Pinecone should take approximately 15 seconds. When it completes, you’ll see the following:

```
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@41638051
pineconeOptions: scala.collection.immutable.Map[String,String] = Map(pinecone.apiKey -><YOUR API KEY>, pinecone.indexName -> "news")
```

This means the process was completed successfully and the embeddings have been stored in Pinecone.

Perform a similarity search using the embeddings you loaded into Pinecone by providing a set of vector values or a vector ID. The query endpoint will return the IDs of the most similar records in the index, along with their similarity scores:
```
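    index = pc.Index("news")

    index.query(
        namespace="",        # the rows in this guide were written with an empty namespace
        vector=[0.3] * 384,  # all-MiniLM-L6-v2 embeddings have 384 dimensions
        top_k=3,
        include_values=True
    )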
```
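To make the similarity scores concrete, here is a minimal brute-force sketch of cosine-similarity ranking in plain Python. It only illustrates what `metric="cosine"` and `top_k` mean. It is not how Pinecone performs search internally, and the function names and two-dimensional sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, records, k):
    """Score every record against the query and keep the k highest scores."""
    scored = [(record_id, cosine_similarity(query, vector))
              for record_id, vector in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Tiny two-dimensional example; real embeddings in this guide have 384 dimensions
records = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
matches = top_k([1.0, 0.1], records, k=3)
for record_id, score in matches:
    print(record_id, round(score, 3))
```

Record `a` ranks first because it points in nearly the same direction as the query, while `d`, which points the opposite way, is excluded from the top three.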
If you want to make a query with a text string (e.g., "Summarize this article"), use the search endpoint via integrated inference.

Link last verified June 7, 2026.

Source: Pinecone Docs

Link last verified: 2026-03-04