Following the in-depth exploration of PieCloudVector's applications in image and audio data in the previous articles of this series, this article will focus on text data, examining PieCloudVector's vectorization, storage, and retrieval of textual data, and ultimately integrating it with LLMs to create a chatbot.
PieCloudVector, one of the core computing engines of OpenPie's large model data computing system PieDataCS, represents the dimensional upgrade of analytical databases in the era of LLMs and is designed specifically for multimodal AI applications. Besides PieCloudVector, PieDataCS also provides two other computing engines, PieCloudDB Database and PieCloudML.
In the realm of natural language processing, we often encounter vast quantities of text data that require sophisticated processing, analysis, and comprehension, where vector databases are indispensable. This article, the third installment in the "PieCloudVector Advanced Series," will guide you through the process of leveraging PieCloudVector to build a Chatbot.
The process of building a chatbot with PieCloudVector can be divided into three steps:

1. Vectorize the text corpus with an Embedding model and store the vectors in PieCloudVector.
2. When a user submits a question, vectorize it and retrieve the most relevant texts from the database.
3. Feed the retrieved texts to a large language model, which crafts a response based on that input.

This article illustrates the first two stages through practical examples: the vectorization and storage of data, and the construction of the retrieval mechanism.
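Before diving in, here is a rough sketch of how the three steps fit together in code. It is only an illustrative outline, not the final implementation: the wiki_text table matches the example built later in this article, and call_llm is a placeholder for whichever LLM API is used in the generation step.

from sentence_transformers import SentenceTransformer
import psycopg2

emb_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

def call_llm(prompt):
    # Placeholder for a real LLM API call (the generation step, covered later)
    raise NotImplementedError

def answer(question, top_k=5):
    # Step 2: embed the question and retrieve the most similar stored articles
    q_vec = emb_model.encode([question])[0].tolist()
    conn = psycopg2.connect('postgresql://usr:pswd@192.168.***.***:5432/db')
    cur = conn.cursor()
    cur.execute("SELECT title FROM wiki_text ORDER BY embedding <-> '" + str(q_vec) + "' LIMIT " + str(top_k))
    context = "\n".join(row[0] for row in cur.fetchall())
    conn.close()
    # Step 3: have the LLM answer on the basis of the retrieved context
    return call_llm("Context:\n" + context + "\n\nQuestion: " + question)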
When processing textual data, we employ a method similar to image data processing: using language Embedding models to convert text into vector form and storing these vectors in the database. Below, we will detail the process of converting textual data into vector form and conducting similarity queries.
Dataset Preparation
The Wikipedia dataset used in the example comes from Hugging Face[1].
from datasets import load_dataset
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:500]")
The dataset has the following four features:
In: dataset.features
Out: {'id': Value(dtype='string', id=None),
'url': Value(dtype='string', id=None),
'title': Value(dtype='string', id=None),
'text': Value(dtype='string', id=None)}
Taking the first data entry as an example, the data corresponding to each feature is as follows:
print("**text id**:", dataset['id'][0])
print("**text url**:", dataset['url'][0])
print("**text title**:", dataset['title'][0])
print("**text content**:", dataset['text'][0])
The output is as follows:
**text id**: 1
**text url**: https://simple.wikipedia.org/wiki/April
**text title**: April
**text content**: April is the fourth month of the year in the Julian and Gregorian calendars, and comes between March and May. It is one of four months to have 30 days.
April always begins on the same day of week as July, and additionally, January in leap years. April always ends on the same day of the week as December.
April's flowers are the Sweet Pea and Daisy. Its birthstone is the diamond. The meaning of the diamond is innocence.
The Month
April comes between March and May, making it the fourth month of the year. It also comes first in the year out of the four months that have 30 days, as June, September and November are later in the year.
April begins on the same day of the week as July every year and on the same day of the week as January in leap years. April ends on the same day of the week as December every year, as each other's last days are exactly 35 weeks (245 days) apart.
From the results, it can be seen that the first data entry is an English Wikipedia article about April, with identifier (id) 1. When storing the data in the database, we retain all features except the "text" field; in place of the original text content, we store the vector produced by the Embedding model.
Vectorization and Similarity Search
The text Embedding model used here is paraphrase-MiniLM-L6-v2[2] from Hugging Face, a relatively lightweight model that is particularly suitable for text clustering and similarity queries. Loading it requires the sentence_transformers library.
Next, the target Embedding model will be loaded, and textual data will be converted into vectors.
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = emb_model.encode(dataset['text'])
embeddings_lst = embeddings.tolist()
After the textual data is fed into the model, each text is encoded as a 384-dimensional vector.
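A quick sanity check confirms the shape of the output:

# encode() returns one 384-dimensional vector per input text
print(embeddings.shape)                              # (500, 384)
print(emb_model.get_sentence_embedding_dimension())  # 384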
We will write the data and vector results into the database. Before that, we need to create the corresponding table in the database.
CREATE TABLE wiki_text (id int PRIMARY KEY, url text, title text, embedding vector(384));
We use the PostgreSQL driver psycopg2 to write the data into PieCloudVector.
import psycopg2

conn = psycopg2.connect('postgresql://usr:pswd@192.168.***.***:5432/db')
cur = conn.cursor()
# Start at index 1, skipping the first article ("April"),
# whose vector serves as the query vector below
for i in range(1, len(embeddings_lst)):
    cur.execute('INSERT INTO wiki_text (id, url, title, embedding) values (%s,%s,%s,%s)',
                (dataset['id'][i], dataset['url'][i], dataset['title'][i], embeddings_lst[i]))
conn.commit()
conn.close()
Use the L2 distance operator <-> to find the top 10 most similar documents.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://usr:pswd@192.168.***.***:5432/db', echo=False)
text_id = pd.read_sql('select id, title from wiki_text order by embedding <-> ' + "'" + str(embeddings_lst[0]) + "'" + ' limit 10',
con=engine)
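If you also want to see how close each match is, the L2 distance itself can be selected alongside the title; a minimal variation of the query above:

# Return the L2 distance as a column so the ranking can be inspected
text_id = pd.read_sql("select id, title, embedding <-> '" + str(embeddings_lst[0]) + "' as l2_distance "
                      "from wiki_text order by l2_distance limit 10",
                      con=engine)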
Inspecting the ten returned titles, it is noticeable that all of them except the seventh are related to months and years. The seventh entry, "Alanis Morissette", is a biography of a singer, with content as follows:
In: data_14 = dataset.filter(lambda x: x['id']=='14')
print(data_14['text'][0])
Out: Alanis Nadine Morissette (born June 1, 1974) is a Grammy Award-winning Canadian-American singer and songwriter. She was born in Ottawa, Canada. She began singing in Canada as a teenager in 1990. In 1995, she became popular all over the world.
As a young child in Canada, Morissette began to act on television, including 5 episodes of the long-running series, You Can't Do That on Television. Her first album was released only in Canada in 1990.
Her first international album was Jagged Little Pill, released in 1995. It was a rock-influenced album. Jagged has sold more than 33 million units globally. It became the best-selling debut album in music history. Her next album, Supposed Former Infatuation Junkie, was released in 1998. It was a success as well. Morissette took up producing duties for her next albums, which include Under Rug Swept, So-Called Chaos and Flavors of Entanglement. Morissette has sold more than 60 million albums worldwide.
She also acted in several movies, including Kevin Smith's Dogma, where she played God.
About her life
Alanis Morissette was born in Riverside Hospital of Ottawa in Ottawa, Ontario. Her father is French-Canadian. Her mother is from Hungary. She has an older brother, Chad, and a twin brother, Wade, who is 12 minutes younger than she is. Her parents had worked as teachers at a military base in Lahr, Germany.
Morissette became an American citizen in 2005. She is still Canadian citizen.
On May 22, 2010, Morissette married rapper Mario "MC Souleye" Treadway.
Jagged Little Pill
Morissette has had many albums. Her 1995 album Jagged Little Pill became a very popular album. It has sold over 30 million copies worldwide. The album caused Morissette to win four Grammy Awards. The album Jagged Little Pill touched many people.
This biography appears among the results because it mentions dates, months, and years so frequently that the Embedding model judges it highly similar to the articles about months and years.
Vector Indexing: Fuzzy Queries and Exact Queries
In the previous examples, the dataset was deliberately small for demonstration purposes. As the volume of data grows, however, an exact query must compare the input vector against every record in the database, and the computational cost rises with the data. A vector index precomputes approximate neighborhood relationships between vectors, which improves query speed significantly at the cost of some accuracy; querying through such an index is known as fuzzy querying.
The core idea of fuzzy querying is to build the index with an Approximate Nearest Neighbor (ANN) algorithm, so that the relationships between vectors are largely determined in advance and a query no longer needs to scan the entire dataset. PieCloudVector offers two ANN algorithms for index creation, IVFFlat and HNSW, and we can choose the more suitable one based on the characteristics of the data.
In the month example above, we took the first 500 entries of the training split; below, we take the first 8,000 entries of the Wikipedia dataset to demonstrate fuzzy querying.
Note that once a table has been indexed, PieCloudVector enables fuzzy querying by default. The following command disables it for the current session:
set enable_indexscan to off;
First, we reload the Wikipedia dataset and Embedding model, converting textual data into vectors.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
dataset_large = load_dataset("wikipedia", "20220301.simple", split="train[:8000]")
emb_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings_large = emb_model.encode(dataset_large['text'])
Then, we create the target table for the data in the database.
CREATE TABLE wiki_text_8000 (id int PRIMARY KEY, url text, title text, embedding vector(384));
Write the processed vectors into the database. Because of the larger data volume, we batch the inserts here with execute_values.
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect('postgresql://usr:pswd@192.168.***.***:5432/db')
cur = conn.cursor()
embeddings_large_lst = embeddings_large.tolist()  # convert to a list once, not inside the loop
# As before, start at index 1 and skip the first article
data_list = [(dataset_large['id'][i], dataset_large['url'][i], dataset_large['title'][i], embeddings_large_lst[i])
             for i in range(1, len(embeddings_large_lst))]
execute_values(cur, 'INSERT INTO wiki_text_8000 (id, url, title, embedding) values %s', data_list)
conn.commit()
conn.close()
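As a quick sanity check, count the rows that were written (7,999 here, since the loop above starts at index 1):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://usr:pswd@192.168.***.***:5432/db', echo=False)
print(pd.read_sql('select count(*) from wiki_text_8000', con=engine))  # expect 7999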
Next, we select the indexing algorithm. PieCloudVector supports two ANN algorithms:

- IVFFlat divides the vectors into clusters and searches only the clusters closest to the query; it builds quickly and uses less memory, but gives up some recall.
- HNSW organizes the vectors into a hierarchical navigable small world graph; it is slower to build and more memory-hungry, but delivers higher query accuracy and speed.
When the amount of data is not particularly large, we tend to pursue query accuracy, so we choose the HNSW algorithm here. When building an index, we need to specify a distance measurement method for the index. For example, we used L2 distance in the aforementioned search for related articles, which means we need to create an HNSW index for L2 distance. Similarly, if we choose to use cosine distance, we also need to create a corresponding index for it.
Create an HNSW index for the L2 distance algorithm.
CREATE INDEX ON wiki_text_8000 USING pdb_nn (embedding vector_l2_ops) WITH (dimension = '384', index_key = 'HNSW32', search_k=10, hnsw_efsearch = 16, hnsw_efconstruction = 32);
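If cosine distance were chosen instead, a corresponding index would be created in the same way. The statement below is a sketch that assumes PieCloudVector follows the pgvector naming convention vector_cosine_ops for the cosine operator class; check the product documentation before relying on it.

CREATE INDEX ON wiki_text_8000 USING pdb_nn (embedding vector_cosine_ops) WITH (dimension = '384', index_key = 'HNSW32', search_k=10, hnsw_efsearch = 16, hnsw_efconstruction = 32);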
Perform fuzzy querying.
set enable_indexscan to on; -- Make sure fuzzy query is enabled
select id, title from wiki_text_8000 where id != 2 order by embedding <-> (select embedding from wiki_text_8000 where id = 2) limit 10;
The results are as follows:
Turn off fuzzy querying and perform exact querying.
set enable_indexscan to off;
select id, title from wiki_text_8000 where id != 2 order by embedding <-> (select embedding from wiki_text_8000 where id = 2) limit 10;
The results are as follows:
As can be seen, the fuzzy query quickly returns results consistent with the exact query: we gain query efficiency without losing accuracy on this dataset. In practice, this means faster responses for users with no noticeable drop in result quality, and this balance between speed and accuracy matters in many business scenarios. As data volume grows, the indexing strategy can be adjusted to keep both query efficiency and result accuracy at acceptable levels, ensuring the system continues to meet user needs.
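To quantify the speedup on a real deployment, both query modes can be timed within one session. Below is a minimal sketch (sqlalchemy-based, like the earlier queries); absolute numbers will depend on data volume and hardware.

import time
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://usr:pswd@192.168.***.***:5432/db', echo=False)
query = text("select id, title from wiki_text_8000 where id != 2 "
             "order by embedding <-> (select embedding from wiki_text_8000 where id = 2) limit 10")

with engine.connect() as conn:
    for flag in ('on', 'off'):  # on = fuzzy (index scan), off = exact (sequential scan)
        conn.execute(text('set enable_indexscan to ' + flag))
        start = time.perf_counter()
        conn.execute(query).fetchall()
        print('enable_indexscan =', flag, ':', round(time.perf_counter() - start, 4), 's')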