How to build an AI product - (3) Exploring the World of Embedding Models for Diverse Tasks
Introduction
Text embeddings are the backbone of many modern Natural Language Processing (NLP) tasks. By transforming text into numerical vectors, we capture the semantic essence of content, making it accessible for machine learning algorithms.
Over the years, a plethora of embedding models have been designed, each fine-tuned for specific tasks. Let’s embark on a journey exploring these models, with a particular focus on the offerings from the Sentence Transformers library.
all-mpnet-base-v2
What is it? A general-purpose model that maps sentences and paragraphs to 768-dimensional vectors; in the Sentence Transformers documentation it is the recommended all-round model when embedding quality matters most.
Best for: General-purpose tasks such as semantic search, clustering, and sentence similarity. If speed and model size are essential, such as on mobile devices or in web applications, the smaller all-MiniLM-L6-v2 is a common alternative.
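As a minimal sketch of how this model is typically used (the example sentences are invented for illustration), you can encode a few sentences and compare them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

# Load the general-purpose model (downloads on first use)
model = SentenceTransformer('all-mpnet-base-v2')
sentences = [
    'The new movie was fantastic',
    'I really enjoyed the latest film',
    'I had pasta for dinner',
]
# Encode each sentence into a 768-dimensional vector
embeddings = model.encode(sentences)
# Compare the first sentence against the other two
print(util.cos_sim(embeddings[0], embeddings[1:]))

The paraphrase pair should score noticeably higher than the unrelated sentence.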
Paraphrase Models
Model Names in Sentence Transformers: paraphrase-xlm-r-multilingual-v1, paraphrase-distilroberta-base-v1
What are they? These models are fine-tuned specifically for paraphrasing tasks, making them adept at understanding sentences that convey similar meanings but are phrased differently.
Best for: Tasks that require capturing semantic equivalence, such as duplicate question detection, paraphrase mining, or semantic textual similarity, as in the sketch below.
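For instance, a rough sketch of a duplicate-question check (the question pairs are made up for illustration) might look like this:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v1')
pairs = [
    ('How do I reset my password?', 'What is the process for changing my password?'),
    ('How do I reset my password?', 'What time does the store open?'),
]
for question_a, question_b in pairs:
    emb_a, emb_b = model.encode([question_a, question_b])
    # A high cosine similarity suggests the two questions are duplicates
    score = util.cos_sim(emb_a, emb_b).item()
    print(f'{score:.3f}  {question_a} | {question_b}')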
Multilingual Models
Model Name in Sentence Transformers: quora-distilbert-multilingual
What is it? Multilingual models are trained on text from multiple languages, enabling them to understand and generate embeddings for a diverse range of languages.
Best for: Applications that cater to global audiences, such as multilingual chatbots, cross-language information retrieval, or global sentiment analysis.
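As a quick sketch (the sentences are invented for this example), the same model can compare text across languages:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('quora-distilbert-multilingual')
english = 'How can I learn to play the guitar?'
spanish = '¿Cómo puedo aprender a tocar la guitarra?'
unrelated = 'What is the capital of France?'
embeddings = model.encode([english, spanish, unrelated])
# The English/Spanish paraphrase pair should score higher than the unrelated pair
print('EN vs ES:       ', util.cos_sim(embeddings[0], embeddings[1]).item())
print('EN vs unrelated:', util.cos_sim(embeddings[0], embeddings[2]).item())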
This is by no means an exhaustive list; a large number of pre-trained models are available for various other tasks.
Learn more at https://www.sbert.net/docs/pretrained_models.html
All of these models can be used to build applications such as semantic search. For example, the snippet below scores two passages against a query:
from sentence_transformers import SentenceTransformer, util

# Model tuned for question-answer retrieval
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# Encode the query and the candidate passages, then score each passage against the query
query_embedding = model.encode('How big is London')
passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census', 'London is known for its financial district'])
print("Similarity:", util.dot_score(query_embedding, passage_embedding))
Conclusion
The realm of text embeddings is vast and evolving. The Sentence Transformers library offers a versatile set of models optimized for various tasks. Depending on the specifics of your application, whether it's the need for speed, deep semantic understanding, or multilingual capabilities, there's likely a model that's just right for you.
In our next post, we'll delve deeper into building a vector index and search using Annoy.
🤖 Want to Build the Next Big AI Product?
Join our hands-on, real-life bootcamp and transform your ideas into groundbreaking AI solutions.
Sign Up Now