A quick post about Mistral's embedding capabilities and their use cases, based on Mistral's embedding tutorial. We'll cover:
- Embedding sentences.
- Distance measures.
- Paraphrase detection.
- Batch processing.
- Visualization with t-SNE.
- Classification.
- Clustering.
Setup
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
import os
api_key = os.environ["MISTRAL_API_KEY"]
client = MistralClient(api_key=api_key)
Embed Sentences
The embeddings are 1024-dimensional vectors:
embeddings_batch_response = client.embeddings(
    model="mistral-embed",
    input=["Embed this sentence.", "As well as this one."],
)
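To confirm the dimensionality, we can inspect the first embedding in the response (a minimal sanity check, assuming the response shape used throughout this post):
# Each entry in .data holds one embedding vector
first_embedding = embeddings_batch_response.data[0].embedding
print(len(first_embedding))  # 1024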
Distance Measure
Set up a helper function that embeds a single text:
from sklearn.metrics.pairwise import euclidean_distances

def get_text_embedding(input):
    embeddings_batch_response = client.embeddings(
        model="mistral-embed",
        input=input
    )
    return embeddings_batch_response.data[0].embedding
Embed candidate sentences, one about a cat, the other about books:
sentences = [
    "A home without a cat — and a well-fed, well-petted and properly revered cat — may be a perfect home, perhaps, but how can it prove title?",
    "I think books are like people, in the sense that they’ll turn up in your life when you most need them"
]
embeddings = [get_text_embedding(t) for t in sentences]
Let’s define a reference sentence, which is about books:
reference_sentence = "Books are mirrors: You only see in them what you already have inside you"
reference_embedding = get_text_embedding(reference_sentence)
And let’s see the distance between the reference sentence and candidate sentences:
for t, e in zip(sentences, embeddings):
    distance = euclidean_distances([e], [reference_embedding])
    print(t, distance)
In the response, we can see that the distance between the reference sentence (about books) and the book candidate sentence is ~0.58, while the distance between the reference sentence and the cat sentence is ~0.80; the smaller distance indicates greater semantic similarity:
A home without a cat — and a well-fed, well-petted and properly revered cat — may be a perfect home, perhaps, but how can it prove title? [[0.80094257]]
I think books are like people, in the sense that they’ll turn up in your life when you most need them [[0.58162089]]
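Euclidean distance is only one choice of measure; cosine similarity is another common option for embeddings. A quick sketch (not part of the original tutorial), reusing the embeddings from above:
from sklearn.metrics.pairwise import cosine_similarity

# Higher cosine similarity means more semantically similar
for t, e in zip(sentences, embeddings):
    similarity = cosine_similarity([e], [reference_embedding])
    print(t, similarity)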
Paraphrase Detection
Let’s look at a common use case for distance measures: paraphrase detection. Given a list of sentences, we want to find out whether any two of them are paraphrases of each other. If the distance between two sentence embeddings is small, the sentences might be semantically similar and could be paraphrases.
Sentences and their embeddings:
sentences = [
    'Have a safe happy Memorial Day weekend everyone',
    'To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend',
    'Where can I find the best cheese?'
]
sentence_embeddings = [get_text_embedding(t) for t in sentences]
Create all pairs of sentences, along with the corresponding pairs of embeddings:
import itertools
sentence_embeddings_pairs = list(itertools.combinations(sentence_embeddings, 2))
sentence_pairs = list(itertools.combinations(sentences, 2))
Display results:
for s, e in zip(sentence_pairs, sentence_embeddings_pairs):
    print(s, euclidean_distances([e[0]], [e[1]]))
From the results below, we can see that the two sentences about Memorial Day weekend have the smallest distance (~0.54), and are therefore the pair most likely to be paraphrases of each other:
('Have a safe happy Memorial Day weekend everyone', 'To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend') [[0.54326686]]
('Have a safe happy Memorial Day weekend everyone', 'Where can I find the best cheese?') [[0.92573978]]
('To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend', 'Where can I find the best cheese?') [[0.9114184]]
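To turn this into an automatic detector, one option is to flag pairs whose distance falls below a threshold. A minimal sketch; the 0.6 cutoff is an assumption and would need tuning on labeled paraphrase pairs:
# Hypothetical cutoff; tune on labeled paraphrase pairs in practice
THRESHOLD = 0.6
for s, e in zip(sentence_pairs, sentence_embeddings_pairs):
    if euclidean_distances([e[0]], [e[1]])[0][0] < THRESHOLD:
        print("Possible paraphrase:", s)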
Batch Processing
The Mistral embeddings API can process text in batches for speed and efficiency. Here we use the Symptom2Disease dataset from Kaggle, which has 1,200 rows and 2 columns:
- label – disease category.
- text – describes the symptoms associated with that disease.
Download the data:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mistralai/cookbook/main/data/Symptom2Disease.csv", index_col=0)
Define batch processing:
def get_embeddings_by_chunks(data, chunk_size):
    chunks = [data[x:x + chunk_size] for x in range(0, len(data), chunk_size)]
    embeddings_response = [client.embeddings(model="mistral-embed", input=c) for c in chunks]
    # Flatten the per-chunk responses into a single list of embeddings
    embeddings = []
    for response in embeddings_response:
        for d in response.data:
            embeddings.append(d.embedding)
    return embeddings
Batch processing, with 50 rows at a time:
df['embeddings'] = get_embeddings_by_chunks(df['text'].tolist(), 50)
df.head()
Response:
   label      text                                              embeddings
0  Psoriasis  I have been experiencing a skin rash on my arm…  [-0.036102294921875, 0.041351318359375, 0.0734…
1  Psoriasis  My skin has been peeling, especially on my kne…  [-0.05364990234375, 0.05224609375, 0.073791503…
2  Psoriasis  I have been experiencing joint pain in my fing…  [-0.035400390625, 0.026275634765625, 0.0360107…
3  Psoriasis  There is a silver like dusting on my skin, esp…  [-0.035980224609375, 0.057037353515625, 0.0528…
4  Psoriasis  My nails have small dents or pits in them, and…  [-0.02471923828125, 0.039337158203125, 0.04772…
Symptom2Disease dataset with embeddings.
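Since these embeddings cost API calls, it can be worth caching the result; a simple sketch using pandas' pickle support (the file name here is just an example):
# Save the dataframe, embeddings included, so the API calls need not be repeated
df.to_pickle("symptom2disease_with_embeddings.pkl")
# Later: df = pd.read_pickle("symptom2disease_with_embeddings.pkl")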
Visualization with t-SNE
We can visualize embeddings by projecting the 1,024-dimensional vectors down to two dimensions using techniques such as t-SNE:
import seaborn as sns
from sklearn.manifold import TSNE
import numpy as np
tsne = TSNE(n_components=2, random_state=0).fit_transform(np.array(df['embeddings'].to_list()))
ax = sns.scatterplot(x=tsne[:, 0], y=tsne[:, 1], hue=np.array(df['label'].to_list()))
sns.move_legend(ax, 'upper left', bbox_to_anchor=(1, 1))
Produced visualization:
[t-SNE scatter plot of the embedded symptom descriptions, colored by disease label]
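t-SNE can be slow on larger datasets; PCA is a cheap first look (a sketch using scikit-learn, not part of the original tutorial):
from sklearn.decomposition import PCA

# Linear projection to 2D; faster than t-SNE but often less well separated
pca = PCA(n_components=2).fit_transform(np.array(df['embeddings'].to_list()))
ax = sns.scatterplot(x=pca[:, 0], y=pca[:, 1], hue=np.array(df['label'].to_list()))
sns.move_legend(ax, 'upper left', bbox_to_anchor=(1, 1))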
Classification
Embeddings can also be used to train a classifier to predict the labels. First, let’s do a simple train/test split; for simplicity, we’ll skip cross-validation and the like.
# Create a train / test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(df['embeddings'], df["label"], test_size=0.2)
Standardize features:
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_x = scaler.fit_transform(train_x.to_list())
test_x = scaler.transform(test_x.to_list())
Train a classifier and compute the test accuracy:
# Train a classifier and compute the test accuracy
from sklearn.linear_model import LogisticRegression
# For a real problem, C should be properly cross validated and the confusion matrix analyzed
clf = LogisticRegression(random_state=0, C=1.0, max_iter=500).fit(train_x, train_y.to_list())
print(f"Precision: {100*np.mean(clf.predict(test_x) == test_y.to_list()):.2f}%")
Precision: 97.50%
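For actual per-class precision and recall, scikit-learn's classification_report gives a fuller picture (a quick sketch):
from sklearn.metrics import classification_report

# Precision, recall, and F1 for each of the disease labels
print(classification_report(test_y.to_list(), clf.predict(test_x)))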
Let’s test with an example:
# Classify a single example
text = "I've been experiencing frequent headaches and vision problems."
clf.predict([get_text_embedding(text)]).item()
'Migraine'
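Since logistic regression is probabilistic, we can also look at the model's top guesses and their confidence (a sketch; showing the top 3 is an arbitrary choice):
# Rank the classes by predicted probability and show the top 3
probs = clf.predict_proba([get_text_embedding(text)])[0]
for i in np.argsort(probs)[::-1][:3]:
    print(clf.classes_[i], f"{probs[i]:.3f}")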
Clustering
When labels aren’t available, one can turn to clustering. If we know the number of clusters ahead of time, we can simply run KMeans with that number; otherwise, the elbow method, for example, can help estimate a likely number of clusters (see the sketch after the code below). Here, let’s assume we already know that there are 24 clusters:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=24, max_iter=1000)
model.fit(df['embeddings'].to_list())
df["cluster"] = model.labels_
And let’s take a look at some examples from one cluster:
print(*df[df.cluster==23].text.head(3), sep='\n')
From these examples, it looks like they all concern the same topic, a skin rash:
I've been having a really bad rash on my skin lately. It's full of pus-filled pimples and blackheads. My skin has also been scurring a lot.
I've just developed a severe rash on my skin. It's clogged with pus-filled pimples and blackheads. My skin has also been quite sensitive.
My skin has been breaking out in a terrible rash lately. Blackheads and pus-filled pimples abound on it. Additionally, my skin has been scurring a lot.
That’s all the fun for now!