PieCloudVector, OpenPie's Vector Database, Empowers the AIGC Application Upgrade of Soochow Securities

APRIL 06TH, 2024

User Case Background

With continuous innovation in artificial intelligence (AI) technology, its applications are spreading across more and more fields. Deep learning has shown remarkable performance in image recognition, speech recognition, and natural language processing. Advances in machine learning methods such as reinforcement learning, transfer learning, and federated learning make it possible to tackle more practical problems and more complex data. Ongoing progress in natural language processing enables more natural conversation and communication, with extensive applications in intelligent customer service, virtual assistants, machine translation, and more. The fusion of data and AI is an unstoppable trend: big data and AI technologies stimulate and complement each other, driving their combined development. This synergy has the potential to transform the financial industry once again.

AIGC applications are a typical form of data intelligence and depend on robust data governance. Pre-trained large language models (LLMs) are continuously trained, iterated, and optimized on high-quality data, which significantly improves the intelligent applications built on them. These LLMs unlock the potential of unstructured data within the securities industry and make it more efficient to extract value from that data. Their application reaches every stage of business operations and brings a new round of productivity upgrades. The emergence of LLMs has propelled the financial industry into a new era, yet it has also presented the sector with some challenges.

Current Situation and Pain Points

Data Security

When business applications involve sensitive information, data privacy cannot be ignored. Some scenarios require calling an LLM API service, but business data should not be passed to that service directly because of the risk of data leakage.

Private Domain Data

General-purpose LLMs are not exposed to an enterprise's private domain data or specific business scenarios during training. As a result, they cannot fully meet the actual requirements of enterprises or optimize their specific business processes. These models must be integrated with internal enterprise knowledge and data to address this limitation.

Real-Time Issue

LLMs are trained and optimized on historical data up to a certain cutoff date and do not receive real-time updates. As a result, if a user asks a question about the latest data, the generated answer may be incorrect, which is known as the LLM hallucination problem. In addition, an LLM needs time to compute and generate an answer, usually around 3 to 5 seconds, which increases the delay users experience during interaction.

Long-Term Memory

LLMs process and generate text within a limited context window and do not retain long-term memory on their own. This is a critical problem for AIGC scenarios that require continuous interaction, because long-term memory is essential for maintaining contextual understanding and providing a more natural and personalized user experience.

AIGC Application Based on Vector Database

Overall Architecture

Solutions

Soochow Securities has developed an AIGC application platform using its self-developed LLM, Xiucai GPT, along with the LangChain development framework and the PieCloudVector vector database. The platform integrates structured and unstructured data from transactional applications. The unstructured data is primarily text, such as laws and regulations, financial news, and research reports.

The current production version of Soochow Securities' Xiucai GPT has 13 billion parameters, and by mid-April 2024 the company plans to complete the training of a trillion-parameter LLM. The training dataset consists of approximately 223.5 trillion tokens of Chinese and English corpora, including 400 billion tokens of financial data. Training runs on 40 servers equipped with 8 H800 GPUs each, or 320 GPUs in total.

OpenPie provided PieCloudVector, a distributed vector database, which is deployed across 4 nodes. The total data volume exceeds 4TB, and each collection can hold nearly 200 million vectors. PieCloudVector supports various index types and mainstream retrieval algorithms, offering extensive search capabilities.

PieCloudVector, in conjunction with Soochow Securities Xiucai GPT, forms the overall RAG architecture. PieCloudVector primarily stores the embedded vector data while also supporting storage of scalar data for applications. Additionally, it provides an SDK for LangChain, enabling seamless integration into the AIGC application development framework.
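The retrieval-augmented flow described above can be illustrated with a minimal sketch. This is not PieCloudVector's SDK or the LangChain integration: `ToyVectorStore`, `embed`, and `build_prompt` are hypothetical names, the bag-of-words embedding stands in for a real embedding model, and the sample documents are invented for illustration. The sketch only shows the shape of the idea: store vectors alongside scalar metadata, retrieve the closest text for a question, and place it in the prompt sent to the LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): bag-of-words counts instead of a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store of (vector, text, scalar metadata) rows."""
    def __init__(self):
        self.rows = []

    def add(self, text, **metadata):
        self.rows.append((embed(text), text, metadata))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]

def build_prompt(question, store, k=2):
    """Retrieve the k most similar passages and put them in front of the question."""
    hits = store.search(question, k)
    context = "\n".join(f"- [{meta.get('source', 'unknown')}] {text}" for text, meta in hits)
    return f"Answer using only the context below.\nContext:\n{context}\nQuestion: {question}"

store = ToyVectorStore()
store.add("The annual report shows revenue growth of the brokerage business", source="annual_report")
store.add("New regulations on margin trading take effect next quarter", source="regulation")
store.add("The fund invests mainly in government bonds", source="fund_prospectus")

# The resulting prompt would then be passed to the LLM (not called here).
prompt = build_prompt("What do the new regulations on margin trading say?", store, k=1)
print(prompt)
```

In a real deployment the store would be the vector database and `embed` would call an embedding model; the prompt assembly step stays essentially the same.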

PieCloudVector is developed by OpenPie and has successfully passed the vector database capability test conducted by the China Academy of Information and Communications Technology. For the deployment at Soochow Securities, OpenPie utilized domestically manufactured HYGON servers and the Kylin operating system, ensuring compliance with the requirements for autonomous controllability.

Practice and Benefits

Soochow Securities' Xiucai GPT has been developed around four major application paradigms: text understanding and generation, RAG-enhanced search, enterprise intelligence center, and intelligent BI. Building on these paradigms, the company has created numerous AI-driven applications tailored to the securities industry. Existing applications include analysis of stock price movements and post-market summaries, AI-powered customer service assistants, intelligent Q&A for annual reports and mutual funds, quantitative investment support, and internal training, among others.

PieCloudVector supports the classification, deduplication, and cleansing of massive data during the training phase of Soochow Securities' Xiucai GPT. This significantly reduces costs and enhances the efficiency of training for LLMs.
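As an illustration of how vector similarity can support deduplication of training data, here is a minimal sketch of greedy near-duplicate removal. The source does not describe the actual method used, so the algorithm, the `deduplicate` function, the 0.95 threshold, and the hand-made embeddings are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def deduplicate(vectors, threshold=0.95):
    """Keep a vector only if it is below the similarity threshold to every kept one.

    Returns the indices of the vectors that survive.
    """
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Hand-made embeddings (assumption): rows 1 and 3 are near-duplicates of rows 0 and 2.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.05],
    [0.5, 0.5, 0.7],
]
unique_ids = deduplicate(embeddings)
print(unique_ids)  # [0, 2, 4]
```

This brute-force loop compares each candidate with every kept vector, which is quadratic in the worst case. At the data volumes described here, the inner comparison would presumably be replaced by an indexed similarity query in the vector database.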

Plugging in a knowledge base built on PieCloudVector enhances the LLM's ability to handle new questions. This approach overcomes the knowledge cutoff imposed by pre-training and mitigates hallucinations.

During the inference stage, PieCloudVector's built-in access control keeps private data under control within the enterprise's own domain. In addition, caching avoids repeated LLM inference for the same questions, which reduces redundant computation and improves response speed and performance.
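The caching idea can be sketched as a semantic cache: before calling the LLM, look for a previously answered question whose embedding is close enough to the new one, and reuse its answer. This is a hedged illustration, not PieCloudVector's implementation. `SemanticCache`, the bag-of-words `embed`, the stand-in `fake_llm`, and the 0.9 threshold are all assumptions.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding (assumption): bag-of-words counts, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Hypothetical cache that reuses answers for sufficiently similar questions."""
    def __init__(self, llm, threshold=0.9):
        self.llm = llm
        self.threshold = threshold
        self.entries = []   # list of (question vector, cached answer)
        self.llm_calls = 0

    def ask(self, question):
        q = embed(question)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer            # cache hit: no LLM inference
        self.llm_calls += 1              # cache miss: call the model and remember the result
        answer = self.llm(question)
        self.entries.append((q, answer))
        return answer

def fake_llm(question):
    """Stand-in for a slow LLM call."""
    return f"answer to: {question}"

cache = SemanticCache(fake_llm)
first = cache.ask("What is the closing price of the index today?")
second = cache.ask("what is the closing price of the index today")   # same question, reworded casing
third = cache.ask("Explain the margin trading rules")
print(cache.llm_calls)  # 2
```

The second question is served from the cache, so only two of the three questions reach the model. A production cache would also need an invalidation policy so that answers about fast-changing data do not go stale.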

In terms of contextual limitations, PieCloudVector possesses the capability to persist historical data. It leverages built-in KNN and ANN algorithms to perform similarity searches and retrieve the most relevant content. This breakthrough enables the LLM to overcome contextual restrictions and achieve long-term memory. By caching the results of the model's question-answering process, data consistency is ensured, avoiding inconsistencies caused by model updates or changes in data. This enhances user trust in the system.
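To make the similarity search concrete, the following sketch performs an exact k-nearest-neighbor (KNN) lookup over persisted conversation history. It is an illustration under assumptions: the `knn` function, the memory records, and the three-dimensional hand-made vectors are invented, and a brute-force scan stands in for the database's built-in KNN and ANN indexes.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def knn(query, memory, k=2):
    """Exact KNN: score every stored vector and return the k most similar records."""
    return heapq.nlargest(k, memory, key=lambda item: cosine(query, item["vector"]))

# Persisted history (invented for illustration); vectors are hand-made stand-ins for embeddings.
memory = [
    {"turn": 1, "text": "User prefers low-risk bond funds", "vector": [0.9, 0.1, 0.0]},
    {"turn": 2, "text": "User asked about today's index close", "vector": [0.1, 0.9, 0.1]},
    {"turn": 3, "text": "User holds a government bond ETF", "vector": [0.8, 0.2, 0.1]},
    {"turn": 4, "text": "User asked how to reset a password", "vector": [0.0, 0.1, 0.9]},
]

# Stand-in for the embedding of a new question such as "recommend a fund for me".
query_vector = [1.0, 0.0, 0.0]
recalled = knn(query_vector, memory, k=2)
print([m["turn"] for m in recalled])  # [1, 3]
```

Only the recalled turns need to be added to the prompt, so earlier context can be used without resending the whole history. Exact KNN scans every vector; ANN indexes trade a small amount of recall for much faster lookups on large collections.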

PieCloudVector can perform fast queries over trillion-scale vector datasets. It supports multi-threaded index creation on a single node, making full use of the available hardware compute resources. This results in a five-fold improvement in index creation performance, a six-fold improvement in retrieval performance, and a three-fold improvement in interactive response speed.

PieCloudVector’s Advantages

PieCloudVector’s Application Scenarios
