Embracing AGI: OpenPie's PieDataCS Leads the New Paradigm of Cloud-Native Data Computing Systems
JULY 11TH, 2024
Since 2023, AI has entered a more mature and widely applied phase, and the concept of Artificial General Intelligence (AGI) has become a hot topic. Against the backdrop of the AGI era, this article introduces OpenPie's cloud-native data computing system, PieDataCS, in detail, from architectural design to practical implementation.


Trends of AGI Development in China


Market and Technological Development Trends of AGI


The year 2023 has been hailed as the "Year of AGI," with large language models (LLMs) taking the field of artificial intelligence by storm. Although LLMs have made significant progress in imitating human cognition, there is still a long way to go to achieve true general intelligence. Given that foundation models and computing power remain somewhat distant from the enterprise market, we believe that the development of AGI will be driven by applications.


AI Agents simplify the interaction between users and LLMs: users specify their goals, and the agent drives the LLM to complete the tasks. Because the main advantage of AI Agent applications is high adaptability to their environment, well-defined scenarios in the enterprise environment provide an ideal setting for AI Agents, and vertical industries have become the first areas where they are applied.


Hierarchical Structure of China's AGI Market


The technical framework of China's AGI market can be divided into four levels from bottom to top: infrastructure layer, model layer, middleware layer, and application layer. 

 

  • Infrastructure Layer: It is the cornerstone of realizing AGI, providing the computing power that underpins model training and inference deployment.


  • Model Layer: It is the core of AGI, and its capabilities directly affect the effectiveness of AGI applications. Products in the industry fall into two implementation schemes: self-developed models and variants of open-source models.


  • Middleware Layer: It provides the core functions and services required by AGI applications, serving as the bridge between user application scenarios and models, and is a key layer for bringing LLM capabilities into production. This is also the role that OpenPie plays in the AGI market.


  • Application Layer: It is the interface through which users and customers directly consume AGI technology, delivering concrete services and solving specific business problems, for example SaaS software on mobile and desktop devices.


Hierarchical Structure of China's AGI Market


PieDataCS: Cloud-Native Data Computing System


To adapt to the AGI era, OpenPie has developed the cloud-native data computing system PieDataCS, which reconstructs data storage and computing so that AI models, data, and computing reinforce one another. PieDataCS implements "one storage, multiple engines" for data computing, fully upgrading the big data stack for the era of LLMs and empowering industry AI applications.


Overall Architecture of PieDataCS


As a proponent of data warehouse virtualization technology, OpenPie's cloud-native data computing system, PieDataCS, is data-centric in its approach to computation. It employs a pioneering cloud-native eMPP (elastic Massive Parallel Processing) architecture, which facilitates a complete decoupling of metadata, data, and computing functions. This separation allows for the independent management of cloud storage and computing resources. The system enables dynamic scaling of data computing resources to meet fluctuating demands, thereby optimizing the allocation of computational resources.


Architecture of PieDataCS


The system architecture of PieDataCS can be divided from bottom to top into the data storage layer, hardware acceleration layer, data storage engine layer, and data computing engine layer: 


  • Data Storage Layer: PieDataCS adopts a storage-computing separation architecture, separating metadata, data, and computing resources. Through the storage engine JANM, it achieves unified data management and fully exploits the advantages of cloud storage and other storage systems.


  • Hardware Acceleration Layer: It adopts heterogeneous FPGA technology, focusing on maximum performance. In the SQL computing engine it accelerates data filtering and sorting; in the storage engine it accelerates encryption, decryption, and decompression; at the model level it integrates algorithms such as GEMM (General Matrix Multiplication) and GEMV (General Matrix-Vector Multiplication) to accelerate selected operators.


  • Data Storage Engine Layer: PieDataCS combines the capabilities of cloud storage to create the JANM storage system, which is compatible with S3 object storage, HDFS, and other distributed file systems, and can connect multiple storage technologies to achieve unified data management. 


  • Data Computing Engine Layer: It currently supports the SQL computing engine PieCloudDB Database, the vector computing engine PieCloudVector, and the machine learning engine PieCloudML; all computing engines share the same underlying data.


Design of PieDataCS 


The goal of the cloud-native data computing system PieDataCS is to empower industry AI large models. Its design centers on five aspects:


  • Data Preparation 

 

Data is the cornerstone of LLMs: data quality directly determines the effectiveness of model training and is key to the emergence of LLM capabilities. PieDataCS improves the accuracy, completeness, and consistency of data through processes such as cleaning, classification, deduplication, annotation, and enhancement of real business data (structured, semi-structured, and unstructured), building high-quality datasets that provide a reliable basis for subsequent model training and application and improve model performance and applicability.
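As a simplified illustration (not PieDataCS's actual pipeline), the cleaning and deduplication steps above can be sketched as a small Python pipeline; the function names and rules here are hypothetical:

```python
import hashlib
import re

def clean(record: str) -> str:
    """Toy cleaning rule: strip control characters and normalize whitespace."""
    record = re.sub(r"[\x00-\x1f]", " ", record)
    return re.sub(r"\s+", " ", record).strip()

def deduplicate(records: list[str]) -> list[str]:
    """Drop exact duplicates by hashing the cleaned text."""
    seen, out = set(), []
    for r in records:
        digest = hashlib.sha256(r.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(r)
    return out

raw = ["Quarterly  report\t2023", "Quarterly report 2023", "Sales figures"]
dataset = deduplicate([clean(r) for r in raw])
print(dataset)  # the two normalized duplicates collapse into one record
```

A production pipeline would add classification, annotation, and enhancement stages in the same map-then-filter shape, typically running distributed over the lake rather than in memory.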

 

  • Data Sharing


Built on PieDataCS's storage base JANM, all computing engines share a single copy of the data. Data from different business domains can be stored and managed uniformly in daily operations, then shared with LLMs for specific problem domains through data-sharing technology for model fine-tuning and optimization.


  • Data Security 


Data security and privacy have always been among users' foremost concerns, so a data computing system must solve data protection and access-permission management. PieDataCS provides enterprise-grade Transparent Data Encryption (TDE), which combines real-time encryption, advanced encryption algorithms, and multi-level keys to ensure that all data is encrypted before being written to disk, and offers fine-grained role and permission control so that private data remains controlled and never leaves the domain, fully ensuring data security.
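As a toy illustration of the multi-level key idea behind TDE (a master key wraps a data key, and pages are encrypted before they reach disk), the sketch below substitutes a SHA-256 keystream for a real cipher; it is not PieDataCS's implementation and must never be used for real encryption:

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream from SHA-256 (illustration only, NOT a real cipher)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(data: bytes, pad: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, pad))

# Multi-level keys: a master key wraps a per-table data key,
# and the data key encrypts pages before they are written to disk.
master_key = secrets.token_bytes(32)
data_key = secrets.token_bytes(32)
wrapped_key = xor(data_key, keystream(master_key, b"wrap", 32))

page = b"row: account=42, balance=100"
nonce = secrets.token_bytes(16)
on_disk = xor(page, keystream(data_key, nonce, len(page)))  # encrypted before write

# Read path: unwrap the data key with the master key, then decrypt the page.
unwrapped = xor(wrapped_key, keystream(master_key, b"wrap", 32))
assert xor(on_disk, keystream(unwrapped, nonce, len(page))) == page
```

The point of the two-level scheme is that rotating the master key only requires re-wrapping the small data keys, not re-encrypting every page.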

 

  • Inference Acceleration  


During inference, PieDataCS provides a RAG architecture for AI models that caches the results of previous computations and matches them against the current input. When a similar query arrives, the system can return the cached result directly instead of re-running inference, avoiding large amounts of repeated computation and greatly improving response speed and inference efficiency.
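A minimal sketch of such a result cache, assuming a stand-in embedding function and cosine-similarity matching (names hypothetical, not PieDataCS's actual mechanism):

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in embedding: character-frequency vector (a real system uses a model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class InferenceCache:
    """Return a stored answer when a new query is similar enough to a past one."""
    def __init__(self, threshold=0.95):
        self.entries = []          # list of (embedding, answer) pairs
        self.threshold = threshold

    def lookup(self, query):
        q = embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer      # cache hit: skip the expensive LLM call
        return None                # cache miss: run inference, then store()

    def store(self, query, answer):
        self.entries.append((embed(query), answer))

cache = InferenceCache()
cache.store("What is the 2023 revenue?", "Revenue was 1.2B.")
print(cache.lookup("what is the 2023 revenue"))  # similar query hits the cache
```

In practice the linear scan over entries would itself be a vector-index lookup so the cache stays fast as it grows.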


  • Improving Accuracy  


LLMs generate results from the data they were trained on, which leads to gaps in domain knowledge and data freshness and limits their performance on new problems. PieDataCS uses RAG to break through the knowledge limits of pre-training by introducing external knowledge bases, effectively improving retrieval accuracy, reducing hallucinations, and avoiding inconsistent results caused by model updates or data changes, thereby increasing user trust.
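The retrieve-then-ground pattern behind this can be sketched as follows, with a toy word-overlap retriever standing in for embedding-based retrieval; names and data are illustrative:

```python
def retrieve(question: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Rank knowledge-base passages by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, knowledge_base: list[str]) -> str:
    """Ground the LLM in retrieved passages instead of stale training data."""
    context = "\n".join(retrieve(question, knowledge_base))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

kb = [
    "The 2024 capital adequacy rule raises the minimum ratio to 8.5 percent.",
    "Loan loss provisions are reported quarterly.",
    "The cafeteria menu changes weekly.",
]
prompt = build_prompt("What is the minimum capital adequacy ratio in 2024?", kb)
print(prompt)
```

Because the answer is constrained to retrieved passages, updating the knowledge base changes the answers without retraining or fine-tuning the model.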


PieDataCS Empowers Industry Large Models


PieDataCS Virtual Data Warehouse Engine


PieDataCS's first data computing engine, PieCloudDB Database, uses leading data warehouse virtualization technology to unify multiple physical data warehouses into one highly available virtual data warehouse. Based on users' business scenarios, it pools resources, supports dynamic creation of virtual data warehouses under data authorization, breaks down data silos, and eliminates redundant data copies.


Virtual Warehouse Engine PieCloudDB


  • Architecture and Main Module Design


In PieCloudDB, data can be stored locally or on shared storage such as S3 and HDFS. Its flexible architecture supports both storage-computing separation and storage-computing integration.


For metadata, PieCloudDB extracts and stores it in its self-developed distributed KV system, which implements indexing based on the natural sorting of keys and efficient distributed lock management based on the watcher mechanism, offering higher performance and further releasing the advantages of PieCloudDB's storage-computing separation architecture. When the data volume is small, a centralized deployment of a lightweight cluster can also be adopted to quickly support business scenarios.
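The key-ordering index idea can be illustrated with a toy KV store in which keys are kept sorted, so a prefix scan becomes a binary search followed by a sequential walk; this is a sketch, not PieCloudDB's actual KV system:

```python
import bisect

class SortedKV:
    """Toy KV store indexed by the natural sort order of keys,
    so prefix/range scans are binary search plus a sequential walk."""
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key: str, value: str):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value            # overwrite existing key
        else:
            self.keys.insert(i, key)          # insert in sorted position
            self.values.insert(i, value)

    def scan(self, prefix: str):
        """Return all (key, value) pairs whose key starts with `prefix`."""
        i = bisect.bisect_left(self.keys, prefix)
        out = []
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            out.append((self.keys[i], self.values[i]))
            i += 1
        return out

kv = SortedKV()
kv.put("table/1/col/a", "int")
kv.put("table/1/col/b", "text")
kv.put("table/2/col/a", "date")
print(kv.scan("table/1/"))  # metadata for table 1 only
```

Hierarchical metadata keys (catalog/table/column) make such range scans the natural access path, which is why key ordering doubles as the index.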


For computing performance, PieCloudDB provides a SIMD vectorized executor that fully exploits CPU parallelism for more efficient data processing. PieCloudDB also provides management services that help users automate cluster installation and deployment, with unified resource monitoring and management to ensure system stability and reliability. Through a visual interface, users can easily perform fault troubleshooting, permission management, security audits, and other maintenance work, reducing maintenance costs.


  • Distributed Optimizer Design


For cloud-native and distributed scenarios, PieDataCS has made substantial improvements to the query optimizer, including aggregation pushdown. In internal tests, enabling aggregation pushdown improved performance by roughly 300x compared with leaving it disabled. PieDataCS has also implemented multi-stage aggregation, partition-table pruning, recursive CTE optimization, optimal join-order search for multi-table joins, and other optimizations that greatly improve query performance.
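Aggregation pushdown works by pre-aggregating on each node so that only small partial results cross the network. A minimal sketch of the two-phase idea (illustrative, not the PieDataCS optimizer itself):

```python
from collections import Counter

# A SUM(amount) GROUP BY region query over rows spread across three nodes.
rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3), ("east", 1)]
partitions = [rows[:2], rows[2:4], rows[4:]]

def partial_aggregate(partition):
    """Phase 1: each node aggregates locally, shrinking what crosses the network."""
    acc = Counter()
    for region, amount in partition:
        acc[region] += amount
    return acc

def final_merge(partials):
    """Phase 2: the coordinator merges the small partial results."""
    total = Counter()
    for p in partials:
        total += p
    return dict(total)

result = final_merge(partial_aggregate(p) for p in partitions)
print(result)  # {'east': 18, 'west': 8}
```

Without pushdown, every raw row would travel to the coordinator before aggregation; with it, each node ships at most one row per group, which is where order-of-magnitude speedups come from.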


  • Synchronization of Structured and Semi-Structured Data


PieDataCS is compatible with various file formats, including its self-developed janm format as well as mainstream formats such as Parquet, ORC, CSV, and JSON, and can run SQL queries directly on these files without data import or conversion.


In addition, to meet real-time analysis needs, PieDataCS provides the DataFlow tool, which extracts data from various sources in real time, writes it into PieDataCS, and supports visual operation through the cloud-native platform. If the original data is large, users can also offload files to S3 object storage and compress them with different algorithms to reduce storage costs.


DataFlow Supports Real-Time Data Synchronization Scenarios


PieDataCS Vector Computing Engine


The cloud-native vector computing engine PieCloudVector, the second computing engine of PieDataCS, upgrades the analytical database for the era of LLMs, helping multi-modal LLM applications achieve massive vector data storage and efficient querying.


Vector Computing Engine PieCloudVector


PieCloudVector integrates mainstream embedding algorithms and models (ChatGLM, LLaMA, Qwen, etc.); users can call the built-in algorithms or packaged API interfaces directly, or choose local or public-cloud model APIs as needed to embed their data.


For vector databases, index algorithms accelerate vector retrieval and are the key to efficient search. PieCloudVector supports mainstream vector index algorithms such as IVF_FLAT and HNSW as well as hybrid indexing, and implements an index caching mechanism to further improve retrieval speed and reduce response time. PieCloudVector also provides multiple distance metrics, including L2 distance, dot product, and cosine similarity.
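The three distance metrics, together with the brute-force (FLAT) search that index structures like IVF and HNSW approximate, can be sketched as:

```python
import math

def l2(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: angle between vectors, magnitude-independent."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, vectors, k=2, metric=cosine, largest=True):
    """Brute-force (FLAT) search over all vectors; IVF or HNSW indexes
    approximate this ranking while touching far fewer vectors."""
    scored = sorted(enumerate(vectors),
                    key=lambda iv: metric(query, iv[1]),
                    reverse=largest)
    return [i for i, _ in scored[:k]]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.05], docs, metric=cosine))                  # nearest ids first
print(top_k([1.0, 0.05], docs, k=1, metric=l2, largest=False))  # smallest distance
```

Note the `largest` flag: similarity metrics rank descending while distance metrics rank ascending, a detail every vector store's query API has to expose one way or another.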


At the application level, PieCloudVector has adapted to mainstream LLM application development frameworks (LangChain, FinGPT, etc.) and provides corresponding SDKs, so users can work directly in existing frameworks without redevelopment: call an embedding algorithm, store the vectors in PieCloudVector, and build RAG (Retrieval-Augmented Generation) or semantic reasoning and retrieval applications.


Unlike most traditional computing engines, PieCloudVector can be deployed on GPU compute nodes as well as CPUs, making full use of the GPU's parallel computing power, and can also apply hardware acceleration technologies such as SIMD to further speed up vector computing and data processing, providing the performance needed for large-scale vector computation.

 

PieDataCS Machine Learning Engine

 

PieDataCS's third computing engine, PieCloudML, integrates an enterprise's multi-modal data resources to provide strong computing support for multi-modal LLMs and to meet the needs of data scientists.


Machine Learning Engine PieCloudML


PieCloudML has designed a flexible computing and storage architecture to support machine learning tasks of different scales and needs. It can fully integrate with the mainstream machine learning ecosystem, supporting Python, R, and other languages to meet the preferences of different data scientists. PieCloudML integrates popular deep/machine learning frameworks such as TensorFlow, PyTorch, Keras, Scikit-Learn, and provides an interactive development environment based on Jupyter Notebook, making it easy for users to quickly call various development libraries for model development and training through a visual management interface. 

 

PieCloudML leverages container orchestration technology Kubernetes to implement automated container deployment, upgrades, and rollbacks, and uses Kubernetes's elastic scaling capabilities to dynamically adjust the resource requests and limits of Pods according to real-time load, coping with different load pressures. Kubernetes's self-healing capabilities ensure the high availability of PieCloudML services. In the event of a failure, it can automatically restart failed containers or replace unhealthy Pods. 
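A hedged sketch of how such a deployment might look in Kubernetes; the names, image, and values below are illustrative assumptions, not PieCloudML's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: piecloudml-worker            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: piecloudml-worker
  template:
    metadata:
      labels:
        app: piecloudml-worker
    spec:
      containers:
        - name: worker
          image: example.com/piecloudml-worker:latest   # placeholder image
          resources:                 # requests/limits drive scheduling and scaling
            requests: {cpu: "2", memory: 4Gi}
            limits:   {cpu: "4", memory: 8Gi}
          livenessProbe:             # failed probes trigger automatic restart
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 10
            periodSeconds: 15
```

A HorizontalPodAutoscaler targeting this Deployment would then adjust `replicas` under real-time load, while the liveness probe supplies the signal Kubernetes's self-healing uses to replace unhealthy Pods.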


In addition, PieCloudML also provides various data access interfaces such as Spark Connector, JDBC, ODBC, etc., to facilitate the connection with various data sources and business systems, simplifying data access and usage. 

 

Multi-Modal Data Sharing 


JANM, the cloud storage base of PieDataCS, aims to be a storage foundation that serves high-performance computing engines in multi-cloud scenarios. Built on cloud-native design and modern hardware, it is committed to simplifying the entire big data pipeline of loading, reading, and computing, supporting data computing and analysis tasks across a wide range of scenarios.


Storage Engine JANM


JANM supports multi-modal data sharing: it connects the enterprise's diverse data, manages structured, semi-structured, and unstructured data uniformly, and exposes a highly abstract data access protocol. Its fully self-developed Table Format technology interoperates seamlessly with storage formats such as Apache Iceberg, Apache Hudi, and Delta Lake to build unified data lake management, sharing data with SQL, unified batch-stream, LLM, and other data computing engines through one interface (one data store, multiple computing engines), achieving true data interoperability across services.


PieDataCS Use Cases


Since its establishment, OpenPie has focused on the field of data computing. PieDataCS reconstructs data storage and computing with cloud-native technology, enabling LLM technology to fully empower industry AI applications, creating greater business value for enterprises, serving as a foundational technology base for AI, and opening a new paradigm for AI technology.


At present, PieDataCS is offered as a Public Cloud Edition, Community Edition, Enterprise Edition, and All-In-One Machine to meet different enterprise business scenarios, and has built AI data foundations for users in the finance, manufacturing, healthcare, and education industries.

 

Data Infrastructure Platform Use Case 

 

Driven by digital transformation, the customer adopted PieDataCS as its new-generation digital foundation, replacing the original data platform. PieDataCS connects with internal applications such as OA, CRM, and ERP; integrates internal office data, business application data, and external data into the JANM data lake; and then processes the data with the appropriate PieDataCS computing engine according to its format.


Structured and semi-structured data are analyzed in real time with the virtual data warehouse engine PieCloudDB, optionally combined with Flink for stream processing. Through data layering, subject-oriented datasets are formed and exposed externally through standard APIs.


For design-assistance requirements, data such as 3D/2D drawings are embedded through a model, and similar content is retrieved approximately with the vector engine. The machine learning engine PieCloudML manages traditional machine learning algorithms in a unified way, enhancing integrated research and development.


Data Infrastructure Platform Use Case


AIGC Application Use Case of Financial Industry


In one financial customer case, investment managers write large volumes of investment material in their daily work and need to quickly retrieve laws and regulations, policy documents, and investment research reports to produce analysis reports and provide customers with investment-related data support.


To improve retrieval efficiency and accuracy, the customer built an AIGC application platform on PieDataCS, combining its self-developed LLM Xiucai GPT, the LangChain development framework, and the vector computing engine PieCloudVector. Traditional text data is embedded and imported into PieCloudVector, enabling precise search or full-text search of the content as needed. The platform has met the customer's needs to build GPT-based AI applications such as investment research analysis, quantitative trading, intelligent advisory, and sentiment analysis.


AIGC Application Use Case


Outlook and Expectations


In the AGI era, the value of data is increasingly prominent. OpenPie is committed to being a reliable partner for customers in the field of data computing, providing stronger, more reliable data services and industry-leading data technology support. We will continue to innovate, continuously optimizing product functionality and performance to meet customers' growing data needs.
