From Technology Theory to Practical Applications: The Key Points for Vector Database Selection

JUNE 18TH, 2024

Three Key Points for Vector Database Selection

Vector databases, designed specifically for vector search, have made significant progress in both academic research and industrial practice. However, with the breakthroughs in large language model technology, the volume of vector data closely tied to natural language has grown exponentially. This not only intensifies the demand for efficient search but also poses new challenges for managing vector and scalar data together.

Traditional databases struggle with mixed queries over vector and scalar data. They do not adapt well to multimodal data processing or efficient similarity search, and so cannot keep up with the growing business needs of enterprises. Vector databases, by contrast, accommodate multimodal data such as images, audio, and text: they map the data to vector representations and use vector similarity to associate and retrieve it.

When selecting a vector database, users need to consider three key aspects: vector algorithms, general data management, and the ecosystem of supporting tools. Candidates should also be evaluated and tested against specific business needs and technical requirements. Because the technology keeps evolving, it is also worth tracking product updates and upgrades to make sure the chosen database continues to meet business needs.

Vector Algorithm Optimization

Vector search algorithms are the core functionality of a vector database. Different algorithms have different strengths depending on the scenario and the performance requirements. Evaluation usually focuses on key performance indicators such as queries per second (QPS), Recall (or Accuracy), CPU and memory consumption, and support for GPU acceleration.

These performance indicators often require trade-offs, and no single algorithm can be optimal in all dimensions. Therefore, offering a variety of algorithm options and detailed parameter adjustment capabilities is essential. This helps users find the best balance between various performance indicators, thereby expanding the applicable scenarios of the database and enhancing its versatility.
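As a rough illustration of how two of these indicators can be measured, the sketch below computes Recall against exact results and times a search function to estimate QPS. The function names `recall_at_k` and `measure_qps` are hypothetical helpers written for this example, not part of any product API.

```python
import time


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that the approximate search found.

    approx_ids and exact_ids are lists of result-ID lists, one per query.
    """
    if not exact_ids or k <= 0:
        raise ValueError("need at least one query and k > 0")
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))


def measure_qps(search_fn, queries):
    """Run search_fn once per query; return (results, queries per second)."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, qps


# Two queries, k = 3: the first approximate result misses one true neighbour.
exact = [[1, 2, 3], [4, 5, 6]]
approx = [[1, 2, 9], [4, 5, 6]]
print(recall_at_k(approx, exact, 3))  # 5 of 6 true neighbours were found
```

Measuring both numbers for the same index and parameter setting makes the trade-off visible: a setting that raises Recall typically lowers QPS, and the other way round.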

General Data Management

General data management is an indispensable part of a vector database. It focuses on effectively integrating vector data with its accompanying metadata, such as the original text, creation time, user identifiers, and source paths or URLs. This ancillary information is collectively referred to as scalar data, and a vector search ultimately returns this associated information.

Data consistency, operation atomicity, mixed query capabilities, multi-user support, and permission management are key indicators for measuring the general data management capabilities of a vector database.
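A mixed query combines a scalar condition with a vector ranking. The sketch below is a minimal in-memory illustration, assuming each row is a dictionary with hypothetical `user` and `embedding` fields; it filters on the scalar field first and then ranks the remaining rows by cosine similarity. A real database would push both steps into its query engine.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(rows, query_vec, predicate, k):
    """Keep rows that satisfy the scalar predicate, then return the k most similar."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: cosine(row["embedding"], query_vec), reverse=True)
    return candidates[:k]


# Hypothetical rows: scalar metadata stored next to each embedding.
rows = [
    {"id": 1, "user": "alice", "embedding": [1.0, 0.0]},
    {"id": 2, "user": "bob", "embedding": [0.9, 0.1]},
    {"id": 3, "user": "alice", "embedding": [0.0, 1.0]},
    {"id": 4, "user": "alice", "embedding": [0.8, 0.6]},
]

top = hybrid_search(rows, [1.0, 0.0], lambda row: row["user"] == "alice", k=2)
print([row["id"] for row in top])  # [1, 4]
```

Row 2 is close to the query but is excluded by the scalar filter, which is the behavior a mixed query is meant to guarantee.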

Ecosystem of Supporting Tools

The ecosystem of supporting tools is directly related to the user-friendliness and practicality of vector databases. Key optimization points include SDK development, data import and export, backup and recovery, data visualization, and integration with large language model ecosystems.

Two Technological Theories of Vector Databases

Currently, vector database technology is divided into two major camps. The first is dedicated vector databases such as Pinecone, Zilliz, and Chroma, which are known for their excellent vector retrieval speed but lack flexibility when handling complex, multi-dimensional general data processing.

The other is traditional databases like PostgreSQL, which have enhanced their ability to handle vector data by integrating extensions like pgvector. Although this improves generality, it still falls short of the performance and scalability of proprietary vector databases.

In fact, the former take vector search algorithms as their core and build a comprehensive ecosystem around them. Products such as Pinecone and Zilliz are mostly based on the powerful open-source library faiss, and their performance benefits directly from the optimization of faiss. The latter add vector search capabilities on top of a mature SQL database, with PostgreSQL and pgvector being a typical example, to make queries over vectorized data convenient.

In the design of PieCloudVector, OpenPie strives to integrate the strengths of two major technological directions. OpenPie has chosen to integrate the faiss component with its self-developed relational database based on the Postgres kernel. This approach not only achieves performance on par with products like Pinecone but also retains the general database capabilities of Postgres.

PieCloudVector: Provides Long-Term Memory for Large Models

Adhering to the mission of "Data Computing for New Discoveries", OpenPie's large model data computing system (PieDataComputing System, PieDataCS) has achieved seamless integration of AI mathematical models, data, and computing, jointly promoting the continuous growth of social and economic benefits. As one of the core computing engines of the large model data computing system, PieCloudVector is an upgrade of the analytical database in the era of large models, specifically designed for multimodal large model AI applications.

Compared to traditional databases, PieCloudVector has broken through technical bottlenecks, achieving vectorized storage and computational resource elasticity, improving ease of use and performance, enhancing metadata change functions, solving data consistency issues, and overcoming technical challenges in security, reliability, and online capabilities.

PieCloudVector is fully compatible with SQL:2016 and the PostgreSQL ecosystem, supporting row storage and a mix of row and column storage. Built on the eMPP (Elastic Massive Parallel Processing) architecture, PieCloudVector not only supports retrieval of unstructured data through SQL interfaces but also allows it to be analyzed together with structured data.

In terms of functionality, PieCloudVector is built on the PostgreSQL kernel and the faiss algorithm library, with complete ACID data management capabilities, supporting mixed queries of scalars and vectors. It supports mainstream ANN algorithms and vector encoding or compression algorithms, supports SIMD/GPU acceleration, and is compatible with the large model tool ecosystem such as langchain.

PieCloudVector supports flexible single-node deployment and can also be expanded to a distributed architecture. In a distributed deployment, each node holds one slice of the dataset, and the per-node search results are aggregated and re-sorted across nodes so that the globally best results are returned. This design allows PieCloudVector to scale its data processing capacity linearly as nodes are added, so that it can cope with massive data volumes.
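The aggregate-and-re-sort step can be sketched as follows. This is a generic scatter-gather illustration in plain Python, not PieCloudVector's implementation: the function names are hypothetical, each shard is modeled as a list of `(id, vector)` pairs, and every shard runs an exact local search.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Local top-k on one node: (distance, id) pairs, nearest first."""
    return heapq.nsmallest(k, ((squared_l2(vec, query), vid) for vid, vec in shard))


def search_cluster(shards, query, k):
    """Gather every shard's local top-k, then re-sort to get the global top-k."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nsmallest(k, partial)


# Two hypothetical nodes, each holding a slice of the dataset.
shards = [
    [(0, [0.0, 0.0]), (1, [5.0, 5.0])],
    [(2, [1.0, 0.0]), (3, [0.0, 2.0])],
]
print([vid for _, vid in search_cluster(shards, [0.0, 0.0], k=2)])  # [0, 2]
```

Because every global top-k result must also be in the top-k of its own shard, merging the local top-k lists loses nothing when each local search is exact. If the per-shard search is approximate, the merged result is approximate as well.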

In terms of performance tuning, PieCloudVector offers a flexible parameter adjustment mechanism, focusing particularly on the optimization of vector search algorithm parameters. Taking the IVF algorithm as an example, users can adjust the total number of partitions and the number of partitions searched each time according to their needs. A lower number of partitions helps to shorten the index creation time, while increasing the number of search partitions can enhance the Recall of a single search, but may be accompanied by an increase in search time.
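To make the two IVF parameters concrete, here is a toy IVF index in plain Python. It is a teaching sketch and not the faiss or PieCloudVector implementation; the names `nlist` (total partitions) and `nprobe` (partitions searched per query) follow common usage and are assumptions here, as are the naive centroid initialization and the fixed number of k-means rounds.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_l2(centroids[i], vec))


def build_ivf(vectors, nlist, iterations=5):
    """Toy IVF build: a few k-means rounds, then one inverted list per centroid."""
    centroids = [list(v) for v in vectors[:nlist]]  # naive initialization
    for _ in range(iterations):
        groups = [[] for _ in range(nlist)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for i, members in enumerate(groups):
            if members:  # an empty partition keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    lists = [[] for _ in range(nlist)]
    for vid, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(vid)
    return centroids, lists


def search_ivf(vectors, centroids, lists, query, k, nprobe):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: squared_l2(centroids[i], query))
    candidates = [vid for i in order[:nprobe] for vid in lists[i]]
    candidates.sort(key=lambda vid: squared_l2(vectors[vid], query))
    return candidates[:k]


random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(200)]
centroids, lists = build_ivf(vectors, nlist=8)
query = [0.5, 0.5]
fast = search_ivf(vectors, centroids, lists, query, k=5, nprobe=1)  # fewer vectors scanned
full = search_ivf(vectors, centroids, lists, query, k=5, nprobe=8)  # scans every partition
```

With `nprobe` equal to `nlist`, every partition is scanned and the search is exhaustive. Smaller values of `nprobe` scan fewer vectors and run faster, but may miss neighbours that fall just across a partition boundary.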

In terms of data security, OpenPie has also created a transparent encryption feature for PieCloudVector. This feature automatically encrypts data when it is written to the disk, without the need for additional user operations, thereby greatly simplifying the data encryption process. Transparent encryption not only ensures the confidentiality of the data but also ensures that even if the data is illegally obtained on the storage medium, it cannot be easily decrypted and read.

With its excellent performance and wide applicability, PieCloudVector has been successfully applied to large model projects in various industries, and it shows particular advantages for large models in finance. Soochow Securities, the first successful PieCloudVector case, has provided valuable experience for understanding market demand in depth and for optimizing product design and functionality.

As technology evolves and market demands change, vector databases will become more comprehensive and more intelligent, developing into AI databases that support text search directly. To this end, OpenPie is actively exploring cutting-edge techniques such as large model integration and built-in vector conversion, with the goal of converting text to vectors automatically and searching them efficiently.

In the future, OpenPie will continue to follow market and technology trends and to explore new application scenarios for databases in multimodal large model systems, with PieCloudVector at the center of this work. By improving PieCloudVector's ability to process multimodal data, OpenPie aims to provide users with a richer and more efficient AI application experience.

About PieCloudVector

PieDataComputing System (PieDataCS) currently supports three computing engines: PieCloudDB Database, PieCloudVector, and PieCloudML. PieCloudVector, the second of these engines, is a cloud-native vector computing engine and an upgrade of the analytical database for the era of large models. It supports multimodal large model AI applications by storing massive volumes of vector data and querying them efficiently, works together with the embeddings of large models, and helps foundation models adapt quickly and be built upon in AI scenarios, making it an essential component of large model applications.
