Oracle Semantic Search using Vectors on Iceberg Tables

Posted on

In my previous blog posts I’ve explored how to use Iceberg Tables and how to integrate these in with your Data Lake. Additionally, I showed how to setup your Oracle Data Lake (Database) to access the data stored in Iceberg Tables stored in OCI Object Storage. To access this Iceberg Table data from the Oracle Database we created an External Table. This allows us to query the Iceberg Table data as if it was internal to the database. With all new releases there is continuous improvement in the features and to make them easier to use. One such new feature (as of 23.26.1) is the ability to read vector data types from an External Table. This new feature is called or referred to as ‘Vectors on Ice’.

With Oracle Database External Tables now supporting vector embedding stored in Iceberg Tables, means you can generate vector embeddings with your preferred embedding model (external to Oracle using your faviourite tool/library), store them in Iceberg Tables in cloud object storage (OCI Object Storage, AWS S3, etc.), and run semantic search from Oracle AI Database, accessing vector data stored within the database and externally with the minimum of data movement and with similar SQL queries.

Summary: Vectors on Ice lets you ask semantic questions of your data lake using the same SQL and vector search tooling you already use in Oracle Database, without copying or moving the data.

Oracle vector indexes can be created to speed up semantic search over the vectors in the Iceberg Tables. You don’t need to copy Iceberg data into an Oracle Database to use them or to update the Vector Index. The vector indexes are stored standalone in the database and (depending on the type of Vector Index used) can automatic sync as data is added to the files on Object Storage. The syncing of the Vector Index is not immediately updated but is updated based on a background process, so you can look on this index being eventually consistent. In might happen fairly quickly or during large write periods in might take seconds to a few minutes to get fully up-to-date.

Let’s have a look at an example of this. The following example builds upon my example in my previous posts which walked through the setup sets needed for gaining access to Object Storage, etc

CREATE TABLE IF NOT EXISTS external_vector_file
(
id VARCHAR(10) PRIMARY KEY,
text_desc VARCHAR2(1000),
vec_embedding VECTOR(1024, float32)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY DATA_PUMP_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.credential.name=OCI_CRED
com.oracle.bigdata.fileformat=parquet
com.oracle.bigdata.access_protocol=iceberg
)
LOCATION ('<path-to-iceberg-table>/metadata/v1.metadata.json')
)
REJECT LIMIT UNLIMITED;

and to create a Vector Index on this external file

CREATE VECTOR INDEX external_vector_file_idx_ivf
ON external_vector_file (vec_embedding)
ORGANIZATION NEIGHBOR PARTITIONS;

We can now run our SQL queries on this external vector data. This query assumes we have the same vector embedding model loaded into the database.

SELECT id,
text_desc,
vector_distance(vec_embedding,
vector_embedding(my_onnx_model using :INPUT_TERM as data) DIST
FROM external_vector_file
ORDER BY DIST
FETCH FIRST 10 ROWS ONLY;

Why is this important? This allows organisations that already store vectors or embeddings in Iceberg Tables, as part of an existing ML pipeline, and who want to query them, performing semantic search, alongside the data in their Oracle Database, can now do so with the minimum of setup. This brings the benefits of Semantic Search to a wider audience within the organisation.