With the rapid advancement of artificial intelligence and the continuous growth of data volumes, traditional databases struggle to handle high-dimensional, massive datasets. To address this challenge, OpenPie has drawn on its deep expertise in distributed databases to develop the cloud-native vector database PieCloudVector. This series is divided into three parts, using image, audio, and text, three typical types of unstructured data, as examples to introduce the applications of PieCloudVector.
PieCloudVector, one of the core computing engines of OpenPie's large-model data computing system PieDataCS, represents the dimensional upgrade of analytical databases in the era of LLMs and is designed specifically for multimodal AI applications. In addition to PieCloudVector, PieDataCS also supports two other computing engines: PieCloudDB Database and PieCloudML.
PieCloudVector adopts a technical approach that integrates mature open-source algorithm implementations with a relational database built on the PostgreSQL kernel. This approach enables the storage and management of vector representations of raw data while supporting both precise and fuzzy queries, allowing users to perform efficient similarity searches through the Postgres client.
Harnessing its distributed framework, PieCloudVector significantly boosts the efficiency of vector computation and provides a complete suite of upstream and downstream tools. Technically, PieCloudVector is divided into five core layers according to the actual application process: raw data storage, embedding, index construction, vector search, and data application. These layers correspond to different application scenarios in vector data processing and analysis, forming a complete technical framework, as shown below:
Overall Technical Framework of PieCloudVector
Kicking off the "PieCloudVector Advanced Series," this inaugural article walks you through the process of constructing a product recommendation system using image data. It breaks down the crucial steps of data vectorization, including Embedding computation, storing vectors in the database, and conducting similarity searches, using data from Hugging Face for demonstration.
Vector databases play a crucial role in similarity query tasks. For instance, in the apparel industry's use case, PieCloudVector can effectively utilize the vector representation of product images in conjunction with the K-Nearest Neighbors (KNN) algorithm to retrieve and recommend similar products from the existing product catalog. The overall framework of the system is shown in the figure below:
System Overall Process Framework
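At its core, the retrieval step described above is a k-nearest-neighbor search in embedding space. The toy sketch below (pure NumPy, an illustration of the idea rather than PieCloudVector's internal implementation; all names are made up) shows the principle on a small 3-dimensional catalog:

```python
import numpy as np

def knn(query, catalog, k=10):
    """Return the indices of the k catalog vectors closest to query (L2 distance)."""
    dists = np.linalg.norm(catalog - query, axis=1)
    return np.argsort(dists)[:k]

# Toy catalog: 5 items embedded in a 3-dimensional space.
catalog = np.array([
    [0.0, 0.0, 1.0],
    [0.1, 0.0, 0.9],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
query = np.array([0.0, 0.0, 1.0])
nearest = knn(query, catalog, k=2)  # indices of the 2 closest items
```

In a real catalog with hundreds of thousands of 768-dimensional vectors, a brute-force scan like this becomes the bottleneck, which is why the index-construction layer mentioned above exists: the database answers the same query through an index instead.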
Moving forward, this article will detail the step-by-step process of constructing a product recommendation system with PieCloudVector, focusing on four key stages: "Dataset Preparation," "Data Vectorization," "Vector Data Storage," and "Similarity Search."
Dataset Preparation
First, download the fashion mnist data from Hugging Face, which includes clothing images and types, as detailed in fashion_mnist[1].
We select the first 1000 data entries from the training set as demonstration data.
from datasets import load_dataset
dataset = load_dataset("fashion_mnist", split="train[:1000]")
The data includes the following information:
print(dataset.features)
{'image': Image(decode=True, id=None),
'label': ClassLabel(names=['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'],
id=None)}
Data Vectorization
Next, we transform the data into numerical arrays using a suitable model, thereby creating embedding vectors that encapsulate the intrinsic characteristics of the original data.
Taking the first data entry as an example, the objective is to find the 10 most similar clothing items for recommendation. As shown in the accompanying figure, its category is Ankle boot.
Using the imgbeddings[2] library, we convert the image data into vectors. This library is based on the OpenAI CLIP model and is lightweight and fast (especially on CPU), making it well suited to datasets like fashion mnist, which feature straightforward product images.
import numpy as np
from imgbeddings import imgbeddings
ibed = imgbeddings()
Generating a vector based on the image for the first piece of data.
embedding_0 = ibed.to_embeddings(dataset['image'][0])
imgbeddings generates a vector of 768 dimensions, as shown below.
In: embedding_0.shape
(1, 768)
In: embedding_0[0]
array([-2.11191326e-01, 2.07909331e-01, -6.71815038e-01, -1.66335583e+00,
-1.57210135e+00, -5.19429862e-01, -8.80079985e-01, 2.29999766e-01,
1.67191553e+00, -9.89815831e-01, 6.54723167e-01, -2.75861591e-01,
5.89815438e-01, 2.61584610e-01, 8.86729777e-01, 5.67858696e-01,
4.75497782e-01, 3.40062588e-01, -4.25924629e-01, 8.74885023e-01,
-3.10492903e-01, 2.72458225e-01, -3.28680307e-01, -4.51324023e-02,
-6.83538735e-01, -2.32427925e-01, 5.95779240e-01, 5.50612807e-01,
7.26937175e-01, 6.75487295e-02, -7.40724325e-01, -2.07319453e-01,
1.37214720e-01, 1.55591702e+00, 1.24170937e-01, -3.53575408e-01,
-7.43186593e-01, 9.77323204e-02, 4.97219563e-02, 1.00773001e+00,
1.24602437e+00, -1.76177248e-01, 5.85671842e-01, -4.85404104e-01,
-5.25022328e-01, -1.84076607e-01, -4.65092547e-02, 7.65870810e-01,
1.27615702e+00, 7.38422930e-01, 2.59102374e-01, 5.86230934e-01,
-1.34280175e-01, -4.21402991e-01, 1.31635904e-01, 6.08720705e-02,
3.83820683e-01, 9.36180592e-01, 4.59356755e-02, 3.50226104e-01,
-5.04337013e-01, -5.55240333e-01, -7.46359229e-02, 3.54337037e-01,
-6.38039052e-01, 8.85763526e-01, -2.85562664e-01, 9.87186372e-01,
1.74211636e-01, -4.21855748e-02, 2.72517443e-01, -3.59927297e-01
...
Then, vectorize the data for all items except the first one.
embedding = ibed.to_embeddings(dataset['image'][1:])
Vector Data Storage
After processing the data, the next step is to store the vectors in the database. Before writing, we need to create a target table for these vectors in the database.
Note that when defining a vector column, the dimension is an optional parameter. If a dimension is specified, every vector subsequently written must match it; vectors of any other dimension are rejected. If no dimension is specified, the first write fixes the dimension: all later vectors must match the ones already in the table. If such a table is then cleared, vectors of a different dimension can be written, and that new dimension again becomes binding for subsequent writes. In every case, all vectors stored in a given vector column must share the same dimension.
CREATE TABLE pictures (id bigserial PRIMARY KEY, embedding vector(768));
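Because a mismatched vector is rejected by the table, a small client-side check before the INSERT can fail fast with a clearer error message. A minimal sketch (the helper name and error text are illustrative, not part of PieCloudVector):

```python
def check_dimensions(vectors, expected_dim=768):
    """Raise ValueError if any vector's length differs from the column's declared dimension."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, expected {expected_dim}")
    return True
```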
Once the vector table is created, write the vector data into the database sequentially.
import psycopg2
conn = psycopg2.connect('postgresql://usr:pswd@192.138.***.***:5432/db')
cur = conn.cursor()
embedding_lst = embedding.tolist()
for i in range(len(embedding_lst)):
    cur.execute('INSERT INTO pictures (embedding) values (%s)', (embedding_lst[i],))
conn.commit()
conn.close()
Similarity Search
After data writing, you can use PieCloudVector to retrieve the most similar data to the target vector using the K-Nearest Neighbors (KNN) algorithm. Currently, there are three different distance metrics to choose from: L2 distance, Inner Product, and Cosine Similarity. You can flexibly choose the most suitable distance calculation algorithm based on the specific application scenario, data characteristics, and requirements for search result accuracy.
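In pgvector-style syntax (the article's query below uses the `<->` operator), these metrics correspond to `<->` for L2 distance, `<#>` for negative inner product, and `<=>` for cosine distance, all defined so that smaller means more similar. The NumPy sketch below shows what each one computes; it is a plain illustration, not PieCloudVector's implementation:

```python
import numpy as np

def l2_distance(a, b):
    # pgvector-style <-> : Euclidean distance
    return float(np.linalg.norm(a - b))

def neg_inner_product(a, b):
    # pgvector-style <#> : negative inner product, negated so smaller = more similar
    return float(-np.dot(a, b))

def cosine_distance(a, b):
    # pgvector-style <=> : 1 - cosine similarity
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])  # same direction as a, twice the magnitude
```

For vectors pointing in the same direction, cosine distance is 0 regardless of magnitude, while L2 distance still grows with the length difference; this difference in sensitivity to magnitude is usually what drives the choice of metric.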
In the previous steps, the target data (the first data entry in the original dataset, a pair of ankle boots) has been vectorized. You can directly send a query to the database via Python to search for the 10 most similar items (the ones closest to the target data vector), using the L2 Distance algorithm.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://usr:pswd@192.138.***.***:5432/db', echo=False)
img_id = pd.read_sql('select id from pictures where id != 1 order by embedding <-> ' + "'" + str(embedding_0.tolist()[0]) + "'" + ' limit 10',
con=engine)
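Concatenating the vector into the SQL string works, but a small formatting helper makes the intent explicit and keeps the literal in the `'[x,y,...]'` text form that the vector type parses (the helper name is illustrative, not a PieCloudVector API):

```python
def to_vector_literal(vec):
    """Format a list of floats as a vector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# e.g. quote to_vector_literal(embedding_0.tolist()[0]) into the query
# in place of str(embedding_0.tolist()[0])
```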
After finding these items, you can retrieve their images to display the results of the similarity search.
id_lst = img_id['id'].to_list()
for i in id_lst[:5]:
    display(dataset['image'][i])
The result is shown in the figure below:
Additionally, the categories of the retrieved similar items can be used to infer the category of the target image, which can serve as auxiliary information to refine the recommendation results.
def most_common(lst):
    return max(set(lst), key=lst.count)
label = most_common([dataset['label'][i] for i in id_lst])
print(dataset.features["label"].int2str(label))
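The same majority vote can be written with the standard library's `collections.Counter`, an equivalent alternative that also exposes the full label distribution if you want to weight the recommendation by it:

```python
from collections import Counter

def most_common(lst):
    # Counter.most_common(1) returns [(element, count)] for the top element
    return Counter(lst).most_common(1)[0][0]

votes = [9, 9, 7, 9, 5, 9, 9, 7, 9, 9]  # e.g. label ids of the 10 neighbors
print(most_common(votes))  # 9
```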
As shown, the majority label among the similar items is Ankle boot, matching the target image, so the system can recommend products of this type to consumers.
dataset.features["label"].int2str(label)
'Ankle boot'
This is the entire process of using PieCloudVector to build a product recommendation system based on image data. Through vectorization processing and similarity search, the system can efficiently find similar recommendation items to the target product, improving the accuracy of recommendations and enhancing the user experience. In the next article, we will continue to explore how to use PieCloudVector to process audio data and achieve audio content recognition.