White Paper: Enhancing Search Similarity with Vector Embeddings and n-gram Indices
In today's data-rich environments, efficient and accurate search is paramount. Traditional keyword-based searches often fall short in capturing semantic nuances and user intent, leading to suboptimal search experiences. This paper describes our journey to enhance search similarity, moving beyond exact string matching to a more intelligent system capable of understanding and ranking relevant results. Our solution combines the power of PostgreSQL's trigram GIN indices for efficient approximate string matching with the sophisticated capabilities of SearchVector and SearchRank for more refined text analysis and relevance scoring.
The core challenge in search similarity lies in bridging the gap between a user's query and the multitude of ways information can be expressed in text. Key issues include:
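To address these challenges, we implemented a hybrid approach that leverages the strengths of both n-gram based indexing and vector-based search. Here "vector" refers to PostgreSQL full-text search vectors (tsvector), which Django exposes as SearchVector.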
N-grams, particularly trigrams (sequences of three characters), are powerful tools for approximate string matching and identifying similarities even with minor variations or errors. By indexing text fields using trigrams, we can quickly identify documents that share a significant number of trigrams with the query.
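To make the idea concrete, the following is a minimal plain-Python sketch of trigram similarity. It follows pg_trgm's documented convention of lowercasing the text and padding each word with two leading spaces and one trailing space, and it scores two strings as shared trigrams divided by total distinct trigrams. The function names are illustrative, and the ASCII-only word splitting is a simplifying assumption, not the extension's actual tokenizer.

```python
import re


def trigrams(text):
    """Return the set of trigrams in text, using pg_trgm-style padding."""
    result = set()
    # Assumption: words are runs of ASCII letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # A one-letter typo still shares most trigrams with the intended word.
    print(trigram_similarity("streamer", "steamer"))
```

Under these assumptions, "streamer" and "steamer" share 6 of 11 distinct trigrams, so a single missing letter still scores well above pg_trgm's default similarity threshold of 0.3.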
We utilized PostgreSQL's pg_trgm extension to create GIN (Generalized Inverted Index) indices on key text fields: title and description. The Django models implement this using GinIndex:
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class StreamerPV(models.Model):
    # ... other fields ...

    class Meta:
        # gin_trgm_ops requires the pg_trgm extension to be installed in the
        # database, e.g. via a migration that runs TrigramExtension().
        indexes = [
            GinIndex(
                name="streamer_pv_title_trgm_gin_idx",
                fields=["title"],
                opclasses=["gin_trgm_ops"],
            ),
            GinIndex(
                name="streamer_pv_desc_trgm_gin_idx",
                fields=["description"],
                opclasses=["gin_trgm_ops"],
            ),
        ]
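These indices significantly accelerate queries involving LIKE, ILIKE, and the similarity operator (%) by providing a fast lookup for documents containing similar character sequences. Note that ordering by the trigram distance operator (<->) is accelerated by GiST indices rather than GIN, so distance-ordered queries do not benefit from the indices above.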
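To illustrate why an inverted index over trigrams makes such lookups fast, here is a conceptual sketch in plain Python. It is not PostgreSQL's GIN implementation: the TrigramIndex class and its methods are hypothetical names introduced for illustration, and the default threshold of 0.3 simply mirrors pg_trgm's default similarity threshold. The point is that only documents sharing at least one trigram with the query are ever examined.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Trigram set with pg_trgm-style padding (ASCII words only, an assumption)."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


class TrigramIndex:
    """Toy inverted index mapping each trigram to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_trigrams = {}

    def add(self, doc_id, text):
        grams = trigrams(text)
        self.doc_trigrams[doc_id] = grams
        for gram in grams:
            self.postings[gram].add(doc_id)

    def search(self, query, threshold=0.3):
        query_grams = trigrams(query)
        # Collect candidates from the posting lists; unrelated documents are skipped.
        candidates = set()
        for gram in query_grams:
            candidates |= self.postings.get(gram, set())
        results = []
        for doc_id in candidates:
            grams = self.doc_trigrams[doc_id]
            # Recheck step: compute the actual similarity for each candidate.
            score = len(query_grams & grams) / len(query_grams | grams)
            if score >= threshold:
                results.append((doc_id, score))
        return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    index = TrigramIndex()
    index.add(1, "streamer")
    index.add(2, "keyboard")
    print(index.search("steamer"))
```

In this toy setup, a query for "steamer" never touches the "keyboard" document because the two share no trigrams, which is the same reason an index lookup can avoid scanning every row.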
While n-gram indices provide efficient approximate matching, SearchVector and SearchRank offer a more sophisticated way to analyze text content. The text is normalized into lexemes using a text search configuration (stemming and stop-word removal), and relevance scores are assigned according to a weighting scheme. This allows matching on word forms rather than raw character sequences, and ranking by how prominently the query terms appear in a document.
We used Django's SearchVector, SearchQuery, and SearchRank functionalities, which abstract PostgreSQL's Text Search features.
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector


# Example usage in a Django view or manager.
# MyModel stands for any model with title and description fields.
def perform_search(query_string):
    # Weight title matches (A) above description matches (B)
    search_vector = (
        SearchVector('title', weight='A')
        + SearchVector('description', weight='B')
    )
    search_query = SearchQuery(query_string)
    # Annotate results with a rank based on the search vector and query
    results = (
        MyModel.objects.annotate(search=search_vector)
        .filter(search=search_query)
        .annotate(rank=SearchRank(search_vector, search_query))
        .order_by('-rank')
    )
    return results
By assigning different weights to fields (e.g., title having a higher weight than description), we can prioritize matches in more critical areas of the document.
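The effect of field weights can be sketched with a deliberately simplified scorer in plain Python. It is not PostgreSQL's ts_rank formula, and it does no stemming or stop-word removal. It only multiplies the number of query-term hits in each field by the weight of that field's label, using PostgreSQL's default label weights (A = 1.0, B = 0.4, C = 0.2, D = 0.1). The function names are hypothetical.

```python
import re

# PostgreSQL's default ts_rank weights for the four labels.
LABEL_WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2, "D": 0.1}


def tokenize(text):
    """Lowercase and split into ASCII word tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_score(query, fields):
    """Score a document given as a list of (text, label) pairs.

    Each occurrence of a query term contributes the weight of the
    field in which it occurs.
    """
    terms = set(tokenize(query))
    score = 0.0
    for text, label in fields:
        hits = sum(1 for token in tokenize(text) if token in terms)
        score += LABEL_WEIGHTS[label] * hits
    return score


if __name__ == "__main__":
    title_match = [("Python tips", "A"), ("Assorted notes", "B")]
    description_match = [("Assorted notes", "A"), ("Python tips", "B")]
    print(weighted_score("python", title_match))
    print(weighted_score("python", description_match))
```

With these weights, a document whose title contains the query term outranks one that only mentions it in the description, which is the prioritization the weighting scheme is meant to achieve.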
The true power of our solution lies in the synergistic combination of these two techniques:
SearchVector and SearchRank can then be applied to the subset of candidates matched by the trigram indices (or to the entire dataset for highly precise queries) to provide a nuanced relevance score. This allows us to prioritize documents where the query terms appear more frequently, in more important fields, or as part of a more semantically relevant phrase.
The implementation of this hybrid search similarity engine yielded significant improvements across several key metrics:
Our approach is built upon a foundation of established research in information retrieval and natural language processing. The following arXiv papers and foundational works provide context:
SearchVector. Modern advancements in NLP, particularly with word embeddings, align with the broader principles of our system.
SearchRank function is an implementation of ranking algorithms commonly employed in IR to order search results based on estimated relevance (e.g., based on term frequency and field weights).
While our current system provides substantial improvements, future enhancements could include:
SearchVector representations.
By strategically combining PostgreSQL's robust n-gram GIN indices with the powerful SearchVector and SearchRank functionalities, we have developed a highly effective and efficient solution for enhancing search similarity. This hybrid approach addresses common challenges in information retrieval, leading to more relevant results, improved recall for approximate matches, and a superior user experience. Our work underscores the importance of leveraging both efficient indexing techniques and intelligent text analysis for modern Django-powered search systems.
How Alkane Live Engineered a Hybrid Search System for Relevance, Speed & Semantic Accuracy