Hybrid Vector Search Index – An example

Posted on

Over the past couple of years, we have seen Vector Search and Vector Indexes being added to Databases. These allow us to perform similarity searches on text, images and other types of object, with text being the typical examples demonstrated. One thing that seems to get lost in all the examples, is that the ability to perform text search in Databases has been around for a long, long time. Most databases come with various functions, search features, similarity search and indexes to find the text you are looking for. Here are links to two examples DBMS_SEARCH and Explicit Semantic Analysis [also check out Oracle Text]. But what if, instead of having multiple different features, which for each need setting up and configuring etc, you could combine Vector and Text Search into one Index. This is what Hybrid Vector Search Index does. Let’s have a look at how to setup and use this type of Index.

A Hybrid Vector Search index combines an Oracle Text domain index with an AI Vector Search index, which can be either an HSNW or IVF vector index. One key advantage of using a hybrid vector index is that it handles both chunking and vector embedding automatically during the index creation process, while at the same time setups up Oracle Text text search features. These are combined into one index.

Chunking in Vector Indexes is a technique used to break down large bodies of text into smaller, more manageable segments. This serves two main purposes: first, it improves the relevance of the content being vectorized, since lengthy text can contain multiple unrelated ideas that may reduce the accuracy of similarity searches. Second, it accommodates the input size limitations of embedding models. There’s no universal rule for determining the best chunk size—it often requires some trial and error to find what works best for your specific dataset.

Think of a “hybrid search” as a search engine that uses two different methods at once to find what you’re looking for.

  1. Keyword Search: This is like a standard search, looking for the exact words you typed in.
  2. Similarity Search: This is a smarter search that looks for things that are similar in meaning or concept, not just by the words used.

A hybrid search runs both types of searches and then mixes the results together into one final list. It does this by giving each result a “keyword score” (for the exact match) and a “semantic score” (for the conceptual match) to figure out the best ranking. You also have the flexibility to just do one or the other—only a keyword search or only a similarity search—and you can even adjust how the system combines the scores to get the results you want.

The example below will use a Wine Reviews (130K) dataset. This is available on Kaggle and other websites.This data set contain descriptive information of the wine with attributes about each wine including country, region, number of points, price, etc as well as a text description contain a review of the wine. The following are 2 files containing the DDL (to create the table) and then Import the data set (using sql script with insert statements). These can be run in your schema (in order listed below).

Insert records into WINEREVIEWS_130K_IMP table

Create table WINEREVIEWS_130K_IMP

I’ve also loaded the all-MiniLM-L12-v2 ONNX model into my schema. We’ll use this model for the Vector Indexing. The DESCRIPTION column contains the wine reviews details. It is this column we want to index.

To create a basic hybrid-vector index with the detailed settings we can run.

CREATE HYBRID VECTOR INDEX wine_reviews_hybrid_idx
ON WineReviews130K(description)
PARAMETERS ('MODEL all_minilm_l12_v2');

You can alter the parameters for this index. This can be done by defining your preferences. Here is an example where all the preferences are the defaults used in the above CREATE. Alter these to suit your situation and text.

BEGIN
  DBMS_VECTOR_CHAIN.CREATE_PREFERENCE(
    'my_vector_settings',
     dbms_vector_chain.vectorizer,
        json('{
            "vector_idxtype":  "ivf",
            "distance"      :  "cosine",
            "accuracy"      :  95,
            "model"         :  "minilm_l12_v2",
            "by"            :  "words",
            "max"           :  100,
            "overlap"       :  0,
            "split"         :  "recursively"          }'
        ));
END;

CREATE HYBRID VECTOR INDEX wine_reviews_hybrid_idx 
ON WineReviews130K(description)
PARAMETERS('VECTORIZER my_vector_settings');

After creating the hybrid-vector index you can explore some of the characteristics of the index and how the text was chunked, etc, using the dictionary view which was created for the index, for example.

desc wine_reviews_hybrid_idx$vectors

Name                          Null?    Type
----------------------------- -------- --------------------
DOC_ROWID                     NOT NULL ROWID
DOC_CHUNK_ID                  NOT NULL NUMBER
DOC_CHUNK_COUNT               NOT NULL NUMBER
DOC_CHUNK_OFFSET              NOT NULL NUMBER
DOC_CHUNK_LENGTH              NOT NULL NUMBER
DOC_CHUNK_TEXT                         VARCHAR2(4000)
DOC_EMBEDDING                          VECTOR(*, *, DENSE)

You are now ready to start querying your data and using the hybrid-vector index. The syntax for these queries can be a little complex to start with, but with a little practice, you’ll get the hang of it. Check out the documentation for more details and examples. You can perform the following types of searches:

  • Semantic Document
  • Semantic Chunk Mode
  • Keyword on Document
  • Keyword and Semantic on Document
  • Keyword and Semantic on Chunk

Here’s an example of using ‘Keyword and Semantic on Document’.

select json_Serialize(
  DBMS_HYBRID_VECTOR.SEARCH(
    json(
      '{
         "hybrid_index_name" : "wine_reviews_hybrid_idx",
         "search_scorer"     : "rsf",
         "search_fusion"     : "UNION",
         "vector":
          {
             "search_text"   : "What wines are similar to Tempranillo",
             "search_mode"   : "DOCUMENT",
             "aggregator"    : "MAX",
             "score_weight"  : 1,
             "rank_penalty"  : 5
          },
         "text":
          {
             "contains"      : "$tempranillo"
             "score_weight"  : 10,
             "rank_penalty"  : 1
          },
         "return":
          {
             "values"        : [ "rowid", "score", "vector_score", "text_score" ],
             "topN"          : 10
          }
      }'
    )
  ) RETURNING CLOB pretty);

This search will return the top matching records. The ROWID is part of the returned results, allowing you to look up the corresponding records or rows from the table.

Check out the other types of searches to see what might work best for your data and search criteria.