A quick post about Mistral's embedding capabilities and their use cases, based on Mistral's embedding tutorial. We'll cover:
- Embedding sentences.
- Distance measures.
- Paraphrase detection.
- Batch processing.
- Visualization with t-SNE.
- Classification.
- Clustering.
Setup
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
import os
api_key = os.environ["MISTRAL_API_KEY"]
client = MistralClient(api_key=api_key)
Embed Sentences
The embeddings are 1024-dimensional vectors:
embeddings_batch_response = client.embeddings(
    model="mistral-embed",
    input=["Embed this sentence.", "As well as this one."],
)
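To confirm the dimensionality, we can inspect the first embedding in the response (a minimal sanity check, assuming the response shape used throughout this post):
# Each entry in .data holds one embedding vector
first_embedding = embeddings_batch_response.data[0].embedding
print(len(first_embedding))  # 1024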
Distance Measure
Set up a helper function that embeds a single text:
from sklearn.metrics.pairwise import euclidean_distances

def get_text_embedding(input):
    embeddings_batch_response = client.embeddings(
        model="mistral-embed",
        input=input
    )
    return embeddings_batch_response.data[0].embedding
Embed candidate sentences, one about a cat, the other about books:
sentences = [
    "A home without a cat — and a well-fed, well-petted and properly revered cat — may be a perfect home, perhaps, but how can it prove title?",
    "I think books are like people, in the sense that they’ll turn up in your life when you most need them"
]
embeddings = [get_text_embedding(t) for t in sentences]
Let’s define a reference sentence, which is about books:
reference_sentence = "Books are mirrors: You only see in them what you already have inside you"
reference_embedding = get_text_embedding(reference_sentence)
And let’s see the distance between the reference sentence and candidate sentences:
for t, e in zip(sentences, embeddings):
    distance = euclidean_distances([e], [reference_embedding])
    print(t, distance)
In the response, we can see that the distance between the reference sentence (about books) and the book candidate sentence is ~0.58, while the distance between the reference sentence and the cat sentence is ~0.80; the smaller distance indicates greater semantic similarity:
A home without a cat — and a well-fed, well-petted and properly revered cat — may be a perfect home, perhaps, but how can it prove title? [[0.80094257]]
I think books are like people, in the sense that they’ll turn up in your life when you most need them [[0.58162089]]
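Euclidean distance is only one choice of measure; cosine similarity is another common option for embeddings. A quick sketch (not part of the original tutorial), reusing the embeddings from above:
from sklearn.metrics.pairwise import cosine_similarity

# Higher cosine similarity means more semantically similar
for t, e in zip(sentences, embeddings):
    similarity = cosine_similarity([e], [reference_embedding])
    print(t, similarity)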
Paraphrase Detection
Let’s look at a common use case for distance measures: paraphrase detection. Given a list of sentences, we want to find out whether any two of them are paraphrases of each other. If the distance between two sentence embeddings is small, the sentences might be semantically similar and could be paraphrases.
Sentences and their embeddings:
sentences = [
    'Have a safe happy Memorial Day weekend everyone',
    'To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend',
    'Where can I find the best cheese?'
]
sentence_embeddings = [get_text_embedding(t) for t in sentences]
Create all pairs of sentences, along with the corresponding pairs of embeddings:
import itertools
sentence_embeddings_pairs = list(itertools.combinations(sentence_embeddings, 2))
sentence_pairs = list(itertools.combinations(sentences, 2))
Display results:
for s, e in zip(sentence_pairs, sentence_embeddings_pairs):
    print(s, euclidean_distances([e[0]], [e[1]]))
From the results below, we can see that the two sentences about Memorial Day weekend have the smallest distance (~0.54), and are therefore the pair most likely to be paraphrases of each other:
('Have a safe happy Memorial Day weekend everyone', 'To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend') [[0.54326686]]
('Have a safe happy Memorial Day weekend everyone', 'Where can I find the best cheese?') [[0.92573978]]
('To all our friends at Whatsit Productions Films enjoy a safe happy Memorial Day weekend', 'Where can I find the best cheese?') [[0.9114184]]
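To turn this into an automatic detector, one option is to flag pairs whose distance falls below a threshold. A minimal sketch; the 0.6 cutoff is an assumption and would need tuning on labeled paraphrase pairs:
# Hypothetical cutoff; tune on labeled paraphrase pairs in practice
THRESHOLD = 0.6
for s, e in zip(sentence_pairs, sentence_embeddings_pairs):
    if euclidean_distances([e[0]], [e[1]])[0][0] < THRESHOLD:
        print("Possible paraphrase:", s)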
Batch Processing
The Mistral embeddings API can process text in batches for speed and efficiency. Here we use the Symptom2Disease dataset from Kaggle, which has 1,200 rows and 2 columns:
- label – disease category.
- text – describes the symptoms associated with that disease.
Download the data:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mistralai/cookbook/main/data/Symptom2Disease.csv", index_col=0)
Define batch processing:
def get_embeddings_by_chunks(data, chunk_size):
    chunks = [data[x:x + chunk_size] for x in range(0, len(data), chunk_size)]
    embeddings_response = [client.embeddings(model="mistral-embed", input=c) for c in chunks]
    # Flatten the per-chunk responses into a single list of embeddings
    embeddings = []
    for response in embeddings_response:
        for d in response.data:
            embeddings.append(d.embedding)
    return embeddings
Batch processing, with 50 rows at a time:
df['embeddings'] = get_embeddings_by_chunks(df['text'].tolist(), 50)
df.head()
Response:
   label      text                                              embeddings
0  Psoriasis  I have been experiencing a skin rash on my arm…  [-0.036102294921875, 0.041351318359375, 0.0734…
1  Psoriasis  My skin has been peeling, especially on my kne…  [-0.05364990234375, 0.05224609375, 0.073791503…
2  Psoriasis  I have been experiencing joint pain in my fing…  [-0.035400390625, 0.026275634765625, 0.0360107…
3  Psoriasis  There is a silver like dusting on my skin, esp…  [-0.035980224609375, 0.057037353515625, 0.0528…
4  Psoriasis  My nails have small dents or pits in them, and…  [-0.02471923828125, 0.039337158203125, 0.04772…
Symptom2Disease dataset with embeddings.
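Since these embeddings cost API calls, it can be worth caching the result; a simple sketch using pandas' pickle support (the file name here is just an example):
# Save the dataframe, embeddings included, so the API calls need not be repeated
df.to_pickle("symptom2disease_with_embeddings.pkl")
# Later: df = pd.read_pickle("symptom2disease_with_embeddings.pkl")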
Visualization with t-SNE
We can visualize embeddings by projecting the 1,024-dimensional vectors down to two dimensions using techniques such as t-SNE:
import seaborn as sns
from sklearn.manifold import TSNE
import numpy as np
tsne = TSNE(n_components=2, random_state=0).fit_transform(np.array(df['embeddings'].to_list()))
ax = sns.scatterplot(x=tsne[:, 0], y=tsne[:, 1], hue=np.array(df['label'].to_list()))
sns.move_legend(ax, 'upper left', bbox_to_anchor=(1, 1))
Produced visualization:
[t-SNE scatter plot of the embedded symptom descriptions, colored by disease label]
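t-SNE can be slow on larger datasets; PCA is a cheap first look (a sketch using scikit-learn, not part of the original tutorial):
from sklearn.decomposition import PCA

# Linear projection to 2D; faster than t-SNE but often less well separated
pca = PCA(n_components=2).fit_transform(np.array(df['embeddings'].to_list()))
ax = sns.scatterplot(x=pca[:, 0], y=pca[:, 1], hue=np.array(df['label'].to_list()))
sns.move_legend(ax, 'upper left', bbox_to_anchor=(1, 1))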
Classification
Embeddings can also be used to train a classifier to predict the labels. First, let’s do a simple train/test split; for simplicity, we’ll skip cross-validation and the like.
# Create a train / test split
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(df['embeddings'], df["label"], test_size=0.2)
Standardize features:
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_x = scaler.fit_transform(train_x.to_list())
test_x = scaler.transform(test_x.to_list())
Train a classifier and compute the test accuracy:
# Train a classifier and compute the test accuracy
from sklearn.linear_model import LogisticRegression
# For a real problem, C should be properly cross validated and the confusion matrix analyzed
clf = LogisticRegression(random_state=0, C=1.0, max_iter=500).fit(train_x, train_y.to_list())
print(f"Precision: {100*np.mean(clf.predict(test_x) == test_y.to_list()):.2f}%")
Precision: 97.50%
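For actual per-class precision and recall, scikit-learn's classification_report gives a fuller picture (a quick sketch):
from sklearn.metrics import classification_report

# Precision, recall, and F1 for each of the disease labels
print(classification_report(test_y.to_list(), clf.predict(test_x)))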
Let’s test with an example:
# Classify a single example
text = "I've been experiencing frequent headaches and vision problems."
clf.predict([get_text_embedding(text)]).item()
'Migraine'
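Since logistic regression is probabilistic, we can also look at the model's top guesses and their confidence (a sketch; showing the top 3 is an arbitrary choice):
# Rank the classes by predicted probability and show the top 3
probs = clf.predict_proba([get_text_embedding(text)])[0]
for i in np.argsort(probs)[::-1][:3]:
    print(clf.classes_[i], f"{probs[i]:.3f}")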
Clustering
When labels aren’t available, one can turn to clustering. If we know the number of clusters ahead of time, we can simply run KMeans with that number; otherwise, the elbow method, for example, can help estimate a likely number of clusters (see the sketch after the code below). Here, let’s assume we already know that there are 24 clusters:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=24, max_iter=1000)
model.fit(df['embeddings'].to_list())
df["cluster"] = model.labels_
And let’s take a look at some examples from one cluster:
print(*df[df.cluster==23].text.head(3), sep='\n')
From these examples, it looks like they all concern the same topic, a skin rash:
I've been having a really bad rash on my skin lately. It's full of pus-filled pimples and blackheads. My skin has also been scurring a lot.
I've just developed a severe rash on my skin. It's clogged with pus-filled pimples and blackheads. My skin has also been quite sensitive.
My skin has been breaking out in a terrible rash lately. Blackheads and pus-filled pimples abound on it. Additionally, my skin has been scurring a lot.
That’s all the fun for now!