Thank you very much for your interest..
Problem statement: Many
organizations uses various document store application to store their important
process/technical documents. Those document repository grows over the time and
finding right document during actual need becomes challenging. Specifically,
someone who joins the team as a fresher find it difficult to find right
documents for reference. This article demonstrates how to efficiently search
using oracle database 23 ai as document store and APEX 24.2 as a frontend
application.
Assumption: Prior knowledge in basic database administration
and Apex workspace setup
Pre-requisite: -
1.
Install oracle database 23 ai software in
engineered system like Exadata or Oracle database appliance
2.
Create a pluggable database and a dedicated
tablespace for storing data related to documents.
3.
Create a schema(docadmin) with necessary
privileges and create below 2 tables.I will explain in detail later how these
tables work
CREATE TABLE "DOCSTORE"."MYDOCS1"
("ID" INTEGER
GENERATED BY DEFAULT ON NULL AS IDENTITY ( START WITH 1 CACHE 20) PRIMARY KEY,
"FILENAME"
VARCHAR2(500 BYTE),
"FILE_SIZE"
NUMBER(*,0),
"FILE_TYPE"
VARCHAR2(100 BYTE),
"FILE_CONTENT"
BLOB
) LOGGING;
CREATE TABLE
"DOCSTORE"."VECTORE_STORE1"
( "DOCID" NUMBER(*,0) NOT NULL ENABLE,
"EMBED_ID"
NUMBER,
"EMBED_DATA"
VARCHAR2(4000 BYTE),
"EMBED_VECTOR"
VECTOR,
FOREIGN KEY ("DOCID")
REFERENCES
"DOCSTORE"."MYDOCS" ("ID") ENABLE);
4.
Choose a
file path where we need to copy our documents and create an oracle directory
based on that.
5.
Insert some document in the table mydocs using
below insert statement from sqlplus or any other client using variable where
FILE_TRANSFER is the name of the oracle directory
insert into
docstore.mydocs(filename,file_size,file_type,file_content)
values('&&filename',dbms_lob.getlength(to_blob(bfilename('FILE_TRANSFER','&&filename'))),'DOCX',to_blob(bfilename('FILE_TRANSFER','&&filename')));
commit;
Lets deep dive now how we can
utilize vector database as backend and apex as a front end to support search
based on RAG aka retrieval-augmented generation.
Login in your workspace and create
a simple application from App builder.We want to create a search page where we
can search based on similarity using RAG.To do that we have to create search
configuration like below
Follow below navigation in apex
Shared Components>Search
Configurations
Here we can see that provider name is showing as 'ALL_MINILM_L12_V2'.Lets take a pause and see in below steps how this is appearing in drop down list.
Ø
Download onnx model using below link
once
we unzip we can see file with .onnx model and a read me how to import that
model within the database.It is self explanatory so just validate whether you
can see imported model using below query.
select model_name, algorithm,
mining_function from user_mining_models where
model_name='ALL_MINILM_L12_V2';
Ø
Apex is still not aware of this imported model.Go
to Workspace Utilities and then Vector Providers. Create following entry so
that Apex is aware of the imported model.Once it is done it will appear as drop
down as a vector provider in search configuration.
Ø
Continue search configuration now. Here we
provide our Label,Table/View name and column name which must be a vector
column,in our case it is embed_vector.You can chose search type between exact
and similar.Find below reference to know more details.
Ø
If you want to use similarity search then set
vector_meory_size database parameter as 1G and create below index.
CREATE VECTOR INDEX vector_store_idx ON
docstore.vectore_store (embed_vector) ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE WITH TARGET ACCURACY 95;
Ø Now focus on column mapping where we
want to see real textual data not vectorized data.You are now ready to
similarity search using vector database
Ø
Now,we are in the most critical steps of
vectorizing our text/clob data.Every model has some limitation utpo which it
can generate embeddings and that limitation depends on number of token in the
text.For my case I have assumed it as 300 so I have split document content with
300 tokens and generated vector embedding based on that.Below is the code for
generating vector embedding
insert into
DOCSTORE.VECTORE_STORE (docid,embed_id,embed_data,embed_vector)
select
id,embed_id,text_chunk,embed_vector from docstore.mydocs doc cross join table(
dbms_vector_chain.utl_to_embeddings(
dbms_vector_chain.utl_to_chunks(
dbms_vector_chain.utl_to_text(doc.file_content),
json('{"by":"words","max":"300","split":"sentence","normalize":"all"}')
),
json('{"provider":"database","model":"ALL_MINILM_L12_V2"}')
)
) embed
cross join json_table(
embed.column_value,
'$[*]' COLUMNS (
embed_id NUMBER PATH '$.embed_id',
text_chunk varchar2(4000) PATH '$.embed_data',
embed_vector CLOB PATH '$.embed_vector')) as et ;
Basically, we should
write a trigger for every row insert on master table mydocs so that it can
generate vector embedding and insert immediately in child table.
For more information
refer below link
https://docs.oracle.com/en/database/oracle/oracle-database/23/arpls/dbms_vector_chain1.html
Ø Once embedding is generated search using app like below.I have not searched with ORA- error rather I have written plain text asking oracle error
Hope you have learned something useful.