How to build an AI product - (2) Dive into Text Embeddings using Sentence Transformers and kNN

Introduction

In our previous post, we took a journey through intriguing datasets. Now, let's dive deeper into the realm of text embeddings. Text embeddings transform human-readable content into numerical vectors, making them palatable for machine learning models.

We'll use the Sentence Transformers library for this purpose and explore how to find similar documents using k-Nearest Neighbors (kNN) with scikit-learn.

Why Sentence Transformers?

Sentence Transformers (also known as SBERT) fine-tune BERT-style models to produce embeddings of entire sentences. The result is a fixed-size vector per sentence that captures semantic meaning and can be compared for similarity.
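
To make that concrete before we set things up below, here is a minimal sketch using the library's util.cos_sim helper (the example sentences are our own): two paraphrases should score much closer to each other than to an unrelated sentence.

# Comparing sentence embeddings with cosine similarity
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
a, b, c = model.encode([
    "The cat sits on the mat.",
    "A cat is lying on a rug.",
    "Interest rates rose again this quarter."
])

print(util.cos_sim(a, b))  # paraphrases: expect a high similarity
print(util.cos_sim(a, c))  # unrelated: expect a much lower score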

Getting Started: Extracting Embeddings

# Installing required libraries
!pip install sentence-transformers scikit-learn

# Importing necessary modules
from sentence_transformers import SentenceTransformer

# Initializing the model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Let's say we have the following sentences from our dataset
documents = [
    "Deep learning achieves state-of-the-art performance in image classification.",
    "Quantum computing is the study of how to use phenomena in quantum mechanics to create new ways of computing.",
    "A black hole is a region in space where the gravitational pull is so much that nothing, not even light, can escape."
]

# Generating embeddings for our documents
embeddings = model.encode(documents)
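
As a quick sanity check (assuming encode returns its default NumPy array), each document is now a fixed-length vector; for this MiniLM model the dimensionality is 384.

# One row per document, one column per embedding dimension
print(embeddings.shape)  # (3, 384) for paraphrase-MiniLM-L6-v2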

Finding Similar Documents using kNN

k-Nearest Neighbors (kNN) is a simple yet powerful algorithm, best known for classification and regression.

Here, we'll use its unsupervised form to find the documents whose embeddings lie closest to a query.

from sklearn.neighbors import NearestNeighbors
import numpy as np

# Fitting the kNN index on our document embeddings (kNN has no real training step)
knn_model = NearestNeighbors(n_neighbors=3, metric='cosine')
knn_model.fit(embeddings)

# Querying the model to find similar documents
query = ["I am studying regions in space that have strong gravitational forces."]
query_embedding = model.encode(query)

distances, indices = knn_model.kneighbors(query_embedding)

# Displaying the similar documents
for distance, idx in zip(distances[0], indices[0]):
    print(f"Document: {documents[idx]}, Distance: {distance}")

Learn more in scikit-learn's documentation (section 1.6, "Nearest Neighbors"), which covers both unsupervised and supervised neighbors-based learning methods.

Conclusion

Embeddings created by models like Sentence Transformers effectively capture the essence of textual data.

Combining them with algorithms like kNN allows us to harness their potential in applications like document similarity, clustering, and more.
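
As a taste of the clustering use case, here is a minimal sketch with scikit-learn's KMeans, reusing the embeddings from above (the cluster count of 2 is an arbitrary choice for our toy corpus):

from sklearn.cluster import KMeans

# Group documents by the proximity of their embeddings
# (KMeans uses Euclidean distance; normalize the embeddings first
# if you want behavior closer to cosine similarity)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(labels)  # one cluster id per document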

In our next post, we'll explore Annoy, a vector search library that offers a more efficient way to search through large collections of embeddings.

Stay tuned!

🤖 Want to Build the Next Big AI Product?

Join our hands-on, real-life bootcamp and transform your ideas into groundbreaking AI solutions.

Sign Up Now