Text Classification Using Embeddings
This notebook shows how to build a classifier using Cohere’s embeddings.

The example classification task here will be sentiment analysis of film reviews. We’ll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).
We’ll go through the following steps:
- Get the dataset
- Get the embeddings of the reviews (for both the training set and the test set)
- Train a classifier using the training set
- Evaluate the performance of the classifier on the testing set
If you’re running an older version of the SDK you’ll want to upgrade it, like this:
#!pip install --upgrade cohere

1. Get the dataset
import cohere
from sklearn.model_selection import train_test_split
import pandas as pd
pd.set_option('display.max_colwidth', None)
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
# Preview the first five rows (column 0: review text, column 1: label)
df.head()
We’ll only use a subset of the dataset in this example. Since this is a toy example, we sample 500 reviews, split them into training and test sets, and keep the first 95 examples of each. You’ll want to increase these numbers to get better performance and a more reliable evaluation.
The train_test_split function splits arrays or matrices into random train and test subsets.
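To make the behavior concrete, here is a minimal standard-library sketch of a random 75/25 split. It is an illustration only, not scikit-learn's implementation; the helper name `simple_split` and the toy reviews are made up for this example.

```python
import random

def simple_split(items, labels, test_size=0.25, seed=0):
    """Shuffle paired items/labels and split them into train and test parts.

    Hypothetical helper for illustration; sklearn's train_test_split
    offers more options (stratification, several arrays at once, etc.).
    """
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = max(1, round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Toy data: 1 = positive review, 0 = negative review.
reviews = ["great film", "dull plot", "loved it", "too long",
           "a triumph", "forgettable", "charming", "a mess"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

x_train, x_test, y_train, y_test = simple_split(reviews, labels)
print(len(x_train), len(x_test))  # 6 2
```

Fixing the seed plays the same role as random_state=0 in the notebook code: the same rows land in the same subset on every run.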
# Sample 500 rows, then split them 75/25 into training and test sets
num_examples = 500
df_sample = df.sample(num_examples)
sentences_train, sentences_test, labels_train, labels_test = train_test_split(
    list(df_sample[0]), list(df_sample[1]), test_size=0.25, random_state=0)
# Keep only the first 95 examples of each split
sentences_train = sentences_train[:95]
sentences_test = sentences_test[:95]
labels_train = labels_train[:95]
labels_test = labels_test[:95]

2. Set up the Cohere client and get the embeddings of the reviews
We’re now ready to retrieve the embeddings from the API. You’ll need your API key for this next cell. Sign up to Cohere and get one if you haven’t yet.
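If you would rather not paste the key into the notebook, one option is to read it from an environment variable. This is a minimal sketch: the helper `load_api_key` and the variable name `COHERE_API_KEY` are assumptions chosen for illustration, so use whatever name your own setup defines.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in the environment, or raise a clear error.

    var_name is a hypothetical variable name, not one the notebook defines.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With the variable exported in your shell, `api_key = load_api_key()` can replace the empty string assigned to `api_key` in the cell that follows.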
model_name = "embed-v4.0"
api_key = ""
input_type = "classification"
co = cohere.Client(api_key)
embeddings_train = co.embed(texts=sentences_train,
model=model_name,
input_type=input_type
).embeddings
embeddings_test = co.embed(texts=sentences_test,
model=model_name,
input_type=input_type
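).embeddings
Note that the arguments are passed in the order texts, model, input_type. Python keyword arguments are normally order-independent, but if you get an error after putting input_type before model, restore the order shown here.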
We now have two sets of embeddings: embeddings_train contains the embeddings of the training sentences, while embeddings_test contains the embeddings of the testing sentences.
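If you increase the number of examples, you may need to send the texts in several requests, since embedding endpoints commonly cap the number of texts per call. The sketch below shows one way to batch the calls. The helper `embed_in_batches`, the `batch_size` of 96, and the stand-in `fake_embed` function are assumptions for illustration; check the API reference for the actual limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=96):
    """Call embed_fn on consecutive slices of texts and concatenate the results.

    embed_fn takes a list of strings and returns one vector per string.
    batch_size=96 is an assumed limit, not a value taken from this notebook.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

def fake_embed(batch):
    """Stand-in for a real API call: one 2-dimensional dummy vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

texts = [f"review number {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed)
print(len(vectors), len(vectors[0]))  # 250 2
```

With the real client, `embed_fn` could be `lambda batch: co.embed(texts=batch, model=model_name, input_type=input_type).embeddings`, which mirrors the call used in this step.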
Curious what an embedding looks like? We can print it:
print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0][:10]}")
Review text: the script was reportedly rewritten a dozen times either 11 times too many or else too few
Embedding vector: [1.1531117, -0.8543223, -1.2496399, -0.28317127, -0.75870246, 0.5373464, 0.63233083, 0.5766576, 1.8336298, 0.44203663]

3. Train a classifier using the training set
Now that we have the embeddings, we can train our classifier. We’ll use an SVM from sklearn.
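The classifier in this step is a pipeline that standardizes each embedding dimension before fitting the SVM. As a rough standard-library illustration of that preprocessing (not scikit-learn's code), the sketch below rescales each column of a toy matrix to zero mean and unit variance, using statistics learned from the training rows only. The helper names and the toy numbers are made up.

```python
import statistics

def fit_scaler(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Two features on very different scales.
train = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
test = [[2.0, 500.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # reuse the training statistics

print(train_scaled[1])  # [0.0, 0.0]
print(test_scaled)
```

Reusing the training statistics on the test rows is the reason for wrapping the scaler and the SVM in one pipeline: the test set never influences the scaling.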
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))
svm_classifier.fit(embeddings_train, labels_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(class_weight='balanced'))])

4. Evaluate the performance of the classifier on the testing set
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy is {100*score}%!")
Validation accuracy is 91.2%!
You may get a slightly different number when you run this code, since df.sample draws a different random subset of reviews on each run.
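For a classifier, score reports accuracy: the fraction of test examples whose predicted label matches the true label. The sketch below computes that by hand, along with per-class counts that show which kind of review gets misclassified. The labels and predictions are made-up values, not results from this notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Made-up labels: 0 = negative review, 1 = positive review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy: {accuracy(y_true, y_pred):.2f}")  # Accuracy: 0.75
print(confusion_counts(y_true, y_pred))
```

In the notebook, `y_pred` would come from `svm_classifier.predict(embeddings_test)` and `y_true` would be `labels_test`.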
This was a small-scale example, meant as a proof of concept and designed to illustrate how you can build a custom classifier quickly using a small amount of labelled data and Cohere’s embeddings. Increase the number of training examples to achieve better performance on this task.