OpenPie's PieDataComputingSystem (πDataCS), officially launched at the 2023 OpenPie Annual Technology Forum on October 24th, rebuilds data storage and computation on cloud-native technology. With the motto "One Storage, Multiple Engine Data Computing," πDataCS helps AI models grow larger and run faster, comprehensively upgrading big data systems for the era of large models. In addition to the cloud-native virtual data warehouse PieCloudDB Database, πDataCS introduces its second computing engine: the cloud-native vector computing engine PieCloudVector. PieCloudVector supports massive vector data storage and efficient queries, empowering multimodal large-model AI applications.
AI is poised to lead the next wave of global GDP growth. According to a June 2023 McKinsey report, generative AI (based on large models) is expected to contribute approximately $2.6 to $4.4 trillion to global GDP annually, comparable to the entire 2021 GDP of the United Kingdom ($3.1 trillion). Goldman Sachs likewise noted in an April 2023 report that generative AI could lift global GDP by 7%. The rapid rise of large models is driving continuous innovation in generative AI applications, and the growing demand for large-scale vector data handling, similarity search, and related capabilities is in turn fostering the further development of vector databases.
OpenPie's self-developed cloud-native vector computing engine, PieCloudVector, the second computing engine of πDataCS, represents a dimensional upgrade of analytical databases for the era of large models. Its goal is to support multimodal AI applications built on large models by enabling the storage and efficient querying of massive vector data. PieCloudVector is designed to integrate seamlessly with the embeddings produced by large models, facilitating rapid adaptation and secondary development of foundation models in AI scenarios.
With the explosive growth of data and the improvement of computational capabilities, large models have become crucial tools for tackling complex problems and analyzing massive datasets. Large models refer to machine learning models with massive parameter scales, high complexity, and powerful learning capabilities. These models typically consist of millions or even billions of parameters, acquired through training on large-scale data to gain knowledge and reasoning abilities. The advent of large models has led to significant breakthroughs in tasks across various domains, including natural language processing, image recognition, speech recognition, and recommendation systems.
Vectorized Representation of Features
In mathematics and computer science, a vector is a quantity that has both magnitude and direction. In machine learning, a vector represents a set of "features" as an array of floating-point numbers. These features are extracted, typically by large models, from the digital representation (text, image, audio, video, etc.) of a real-world object (a cat, a flower, etc.), as shown in the figure above. By converting real objects into vector representations, calculations and comparisons can be performed in vector space, such as similarity calculation, cluster analysis, and classification. Vector representations also provide the basis for recommendation systems, sentiment analysis, information retrieval, and other tasks.
A vector database is a database system designed specifically to store and manage vector data, providing efficient storage, indexing, and query capabilities for vectors.
In vector search, different distance measures (such as Euclidean distance, cosine similarity, and Manhattan distance) can be used to compute the distance between two vectors: the closer the distance, the more similar the vectors. As shown in the figure below, the similarity between "Paipai" and "Sloth" can be determined by computing the cosine similarity of their vectors.
Calculate Cosine Similarity of Vectors
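As a minimal illustration of the metric above, the sketch below computes cosine similarity between toy feature vectors with NumPy; the vectors are made-up stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot(a, b) / (|a|*|b|), ranging over [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional feature vectors standing in for real embeddings.
paipai = np.array([0.9, 0.1, 0.3, 0.7])
sloth = np.array([0.8, 0.2, 0.4, 0.6])
bird = np.array([0.1, 0.9, 0.8, 0.1])

print(cosine_similarity(paipai, sloth))  # close to 1.0: very similar
print(cosine_similarity(paipai, bird))   # much smaller: less similar
```

A value near 1 means the feature directions almost coincide; values near 0 indicate unrelated features.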
Traditional databases excel at exact matching but lack the storage layout and query operators needed for high-dimensional floating-point vectors, so they cannot process vector data efficiently. Vector databases emerged to meet this need.
A vector database addresses the specific needs of storing and processing vector data: it efficiently stores vectors alongside the original entities (text/images/speech) and associates the two. This enables efficient similarity search, large-scale data management, complex vector computation, and real-time recommendation, helping users better utilize and analyze vector data and build large-model applications.
OpenPie believes that beyond efficient vector storage and similarity search, an excellent vector database must also provide transactional ACID guarantees and user permission control, so that inserts, updates, and deletes on vector data execute correctly and data remains consistent under concurrent access. This yields a stable, reliable, and secure service suitable for a wide range of data management and application scenarios, and it is the design philosophy behind PieCloudVector.
After evaluating the performance of open-source implementations such as pgvector and pgembedding, the OpenPie team chose not to adopt them and instead developed PieCloudVector entirely in-house to match its users' scenarios. PieCloudVector offers efficient storage and retrieval of vector data, similarity search, vector indexing, vector clustering and classification, high-performance parallel computing, strong scalability, and fault tolerance.
Architecture of PieCloudVector
In its architecture design, the OpenPie team drew on the experience and advantages it accumulated in eMPP (elastic MPP) and distributed architecture while building PieCloudDB, πDataCS's first computing engine and cloud-native virtual data warehouse, to give the vector computing engine PieCloudVector an eMPP distributed architecture. As shown in the figure below, each Executor of PieCloudVector corresponds to a PieCloudVector instance, achieving high performance, scalability, and reliability for vector storage and similarity search. Converted vector representations are stored in "JANM," the unified storage engine of πDataCS.
PieCloudVector’s eMPP Distributed Architecture
Users need only a single client to perform similarity searches from any language. With PieCloudVector, users can not only store and manage the vectors corresponding to their original data, but also call PieCloudVector's tools to perform fuzzy search, which, compared with a global search, sacrifices some accuracy to achieve millisecond-level latency and further improve query efficiency.
Functionality of PieCloudVector
PieCloudVector provides two search modes: precise search and fuzzy search. Below, we introduce two of its core features in detail: approximate search (KNN/ANN) and product quantization.
Approximate Search KNN-ANN
K-Nearest Neighbor (KNN) search is one of the basic problems in vector search: given a query vector, find the K closest vectors among the N stored vectors. The K-nearest-neighbor algorithm underpins applications such as similar-image retrieval, related-news recommendation, and user-profile matching. It quickly finds the vectors most similar to a given vector based on inter-vector distance or similarity, providing efficient similarity search and recommendation services.
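The exact-search baseline can be sketched in a few lines of NumPy; this is a generic brute-force illustration, not PieCloudVector's internal implementation:

```python
import numpy as np

def knn(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Exact K-nearest-neighbor search under L2 distance.

    Compares the query against every stored vector (cost grows linearly
    with N) and returns the indices of the k closest ones.
    """
    dists = np.linalg.norm(vectors - query, axis=1)  # distance to each row
    return np.argsort(dists)[:k]                     # indices of the k smallest

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))              # N=1000 stored 64-d vectors
query = vectors[42] + 0.01 * rng.normal(size=64)   # a point very near vector 42

print(knn(query, vectors, k=3))  # index 42 ranks first
```

The full scan over N rows is exactly the cost that the approximate indexes discussed next are designed to avoid.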
However, as data volume grows, an exact query must compare the input vector against every record, so its computational cost grows linearly with the dataset and quickly becomes prohibitive. To solve this problem, PieCloudVector builds vector indexes that capture approximate relationships between data in advance to speed up queries. It introduces Approximate Nearest Neighbor (ANN) algorithms to build these indexes: by sacrificing some accuracy, ANN avoids an exhaustive global search, accelerating queries to millisecond-level latency and enabling fuzzy search.
PieCloudVector offers a variety of ANN algorithms for building vector indexes, including the popular IVFFlat (Inverted File with Flat) and HNSW (Hierarchical Navigable Small World) algorithms. Users can choose according to the characteristics of their data: IVFFlat builds faster and uses less memory, while the graph-based HNSW typically delivers faster queries and higher recall at the cost of more memory and longer build times.
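To illustrate the idea behind an IVFFlat-style index (a toy sketch, not PieCloudVector's actual implementation), the code below partitions vectors into cells with a tiny k-means, then at query time probes only the few cells whose centroids are nearest the query:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 32)).astype(np.float32)

# Build phase: partition the vectors into nlist cells with a tiny k-means.
nlist = 16
centroids = data[rng.choice(len(data), nlist, replace=False)].copy()
for _ in range(10):  # a few Lloyd iterations
    assign = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
# Final assignment, consistent with the final centroids.
assign = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
cells = [np.flatnonzero(assign == c) for c in range(nlist)]

# Search phase: scan only the nprobe cells nearest the query, not all N rows.
def ivf_search(query: np.ndarray, k: int = 5, nprobe: int = 4) -> np.ndarray:
    order = ((centroids - query) ** 2).sum(-1).argsort()[:nprobe]
    cand = np.concatenate([cells[c] for c in order])  # candidate row ids
    dists = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

query = data[7] + 0.01 * rng.normal(size=32).astype(np.float32)
print(ivf_search(query))  # vector 7 ranks first among the results
```

Raising `nprobe` trades speed for recall: probing more cells approaches the exact result at higher cost, which is the IVFFlat tuning knob in practice.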
Product Quantization
Vector similarity search requires a large amount of memory when processing large-scale data. For example, an index of 1 million dense 1024-dimensional float32 vectors needs roughly 1,000,000 × 1024 × 4 bytes ≈ 4 GB just to store the raw vectors. High-dimensional data makes the memory problem even more serious: as dimensionality increases, the vector representation space becomes extremely large and requires still more memory.
Product Quantization (PQ) is a common method for relieving this memory pressure. It compresses high-dimensional vectors, significantly reducing memory usage: each vector is split into several subvectors, each subvector is quantized against a small codebook for its subspace, and the original high-dimensional vector is thereby replaced by a handful of compact codebook indices.
With PQ, the memory required to store an index can be reduced by up to 97%, allowing PieCloudVector to manage memory more efficiently on large-scale datasets and speed up similarity search. PQ also accelerates nearest-neighbor search itself, typically by about 5.5 times. Furthermore, the IVF+PQ composite index formed by combining PQ with an Inverted File (IVF) can increase search speed by a further 16.5 times with little impact on search accuracy, for an overall speedup of up to 92 times compared with unquantized indexing.
Product Quantization
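The core of product quantization can be sketched with NumPy: split each vector into subvectors, train a small codebook per subspace, and store one-byte codes instead of floats. This is a toy illustration (not PieCloudVector's implementation); with 64-dimensional float32 vectors and 8 subspaces it yields a 32x compression:

```python
import numpy as np

rng = np.random.default_rng(2)
D, m, ksub = 64, 8, 16                    # 8 subspaces, 16 centroids each
data = rng.normal(size=(5000, D)).astype(np.float32)
sub = data.reshape(len(data), m, D // m)  # split each vector into m subvectors

# Train one small codebook per subspace with a few k-means iterations.
codebooks = np.stack([s[rng.choice(len(s), ksub, replace=False)]
                      for s in sub.transpose(1, 0, 2)])
for _ in range(10):
    for j in range(m):
        d = ((sub[:, j, None, :] - codebooks[j][None]) ** 2).sum(-1)
        a = d.argmin(1)
        for c in range(ksub):
            pts = sub[a == c, j]
            if len(pts):
                codebooks[j, c] = pts.mean(0)

# Encode: each vector becomes m one-byte codes instead of D float32 values.
codes = np.stack(
    [((sub[:, j, None, :] - codebooks[j][None]) ** 2).sum(-1).argmin(1)
     for j in range(m)], axis=1).astype(np.uint8)

raw_bytes = data.nbytes  # D * 4 bytes per vector
pq_bytes = codes.nbytes  # m bytes per vector
print(f"compression: {raw_bytes // pq_bytes}x")  # prints "compression: 32x"
```

Search then compares a query against the codebook centroids rather than the raw vectors, which is where the speedups quoted above come from; the compression ratio depends on the chosen D, m, and codebook size.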
Based on how vectors are used in practice, the application scenarios of PieCloudVector can be roughly divided into four layers, each corresponding to a different stage of working with vectors.
Data Preparation and Segmentation (Images, Text, Audio, etc.)
This layer covers data preparation and segmentation for raw inputs such as images, text, and audio. The original data is preprocessed, cleaned, and feature-extracted to obtain a representation suitable for subsequent processing; this step transforms raw data into the input used to create embeddings.
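As a minimal example of the segmentation step for text, the character-based chunker below splits raw text into overlapping chunks before embedding; real pipelines often split on sentences or tokens instead, and the sizes here are arbitrary:

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split raw text into overlapping fixed-size chunks.

    The overlap keeps content that straddles a chunk boundary visible
    in both neighboring chunks, a common trick before embedding.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "PieCloudVector stores and searches vector data. " * 40
chunks = split_text(doc)
print(len(chunks), len(chunks[0]))
```

Each chunk is then embedded independently, so the chunk size effectively sets the granularity at which later similarity search can match.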
Create Embeddings
At this layer, the data is converted into vector representations by appropriate algorithms or models; these vectors capture the features and semantic information of the data. For example, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Transformers can be used to generate embedding representations of images, text, or audio.
Vector Storage
At this layer, the created vector representation is stored for subsequent vector searches. PieCloudVector supports distributed vector storage, can flexibly expand storage resources, and reduces memory usage through vector compression.
Vector Search
At this layer, similarity search is performed over the stored vectors. PieCloudVector provides efficient vector search through algorithms such as KNN and ANN, and supports L2 distance, inner product, and cosine distance as distance measures, quickly finding the vectors most similar to a given query vector. This capability is widely used in similar-image retrieval, related-news recommendation, user-profile matching, and other scenarios.
The figure below shows the application process architecture of PieCloudVector in a knowledge base system, covering six steps from text segmentation to returning an answer to the user. The system leverages PieCloudVector for semantic search and answer retrieval: it converts text into vector representations and performs vector similarity search to find relevant answers. This architecture can efficiently process large-scale text datasets and return accurate answers to users.
Application Process Architecture of Knowledge-based System
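The end-to-end flow of such a knowledge base system can be sketched as follows. Everything here is an assumption for illustration: the `embed` function is a toy hashed bag-of-words stand-in for a real embedding model, the in-memory `store` stands in for PieCloudVector, and the passages are invented examples:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hashed bag-of-words, unit-normalized.

    A real pipeline would call a trained embedding model here; this
    stand-in only captures word overlap, not semantics.
    """
    v = np.zeros(dim)
    for word in text.lower().split():
        v[zlib.crc32(word.strip(".,?!").encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Steps 1-3: segment the knowledge base, embed each passage, store the vectors.
passages = [
    "PieCloudVector supports similarity search over vector data.",
    "Product quantization compresses vectors to reduce memory usage.",
    "The eMPP architecture scales executors elastically in the cloud.",
]
store = np.stack([embed(p) for p in passages])

# Steps 4-6: embed the user's question, run a cosine-similarity search,
# and return the most relevant passage as the answer context.
def ask(question: str) -> str:
    q = embed(question)
    return passages[int(np.argmax(store @ q))]  # unit vectors: dot = cosine

print(ask("How does product quantization reduce memory?"))
```

In a production deployment, the retrieved passage would typically be handed to a large model as context for generating the final answer, rather than returned verbatim.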
In the future, PieCloudVector will continue to iterate, providing dedicated memory and support for large models. As generative AI and large models evolve, PieCloudVector will incorporate the strengths of vector databases more deeply and integrate tightly with other technologies and algorithms.
PieCloudVector will continue to improve its storage, indexing and query capabilities to cope with increasingly complex and large vector data. It will explore new quantization algorithms, approximate search methods, and parallel computing strategies to improve query efficiency and accuracy.
At the same time, PieCloudVector will focus on integration with application scenarios across different fields and will gradually expand its support for multi-modal data processing and analysis, providing more comprehensive and flexible solutions.