On October 24th, Programmer's Day, OpenPie's 2023 Annual Technology Forum, entitled "Large Model Data Computing System", concluded successfully in Shanghai. At the event, OpenPie released its new large model data computing system, PieDataComputingSystem (πDataCS). Rebuilt on cloud-native technology for data storage and computation, πDataCS enables one data storage with multi-engine data computation, allowing AI models to become larger and faster.
πDataCS aims to help enterprises overcome computational bottlenecks, fully exploit the advantages of data scale, and build core technological barriers, so as to better support business development, keep this autonomous and controllable large model data computing system globally leading, and bring large model technology to every industry.
Computing platforms have undergone three major shifts, from mainframes to PCs to today's cloud platforms. Cloud platforms offer the greatest computing capacity, storage capacity, and horizontal scalability. In the PC era, metadata and user data were mapped to local hard drives and computation was mapped to local CPUs, so storage and computation were tightly coupled on the same server.
By reconstructing data storage and computation with cloud-native technology, πDataCS first separates computation from data in the data computing system to enhance system elasticity. Then, with future data governance and data transactions in mind, OpenPie further separates metadata from user data, arriving at a brand-new eMPP architecture. Metadata is mapped to block storage and managed by the metadata management system 「Mundo」; user data is mapped to object storage and managed by the 「JANM」 storage system; computation is mapped to containers or virtual machines and managed by the computation system.
Through Data Mesh, πDataCS upgrades data governance and realizes the value of data. πDataCS takes full account of the requirements of global data transactions and data governance. Data, as a new means of production, is essential fuel for model development. Under the premise of privacy and security, data owners can share metadata containing data catalogs with other users; data operators can discover an owner's user data through that metadata and, when needed, access it for a fee under authorization. When accessing an owner's data, data operators call the data computing engine provided by data processors.
The overall architecture of πDataCS is divided into four layers, as shown in the figure below.
Architecture of πDataCS
The top layer consists of the computing engines supported by πDataCS. Currently, πDataCS supports the following computing engines:
PieCloudDB: The First Cloud-Native Data Warehouse Computing Engine
As the first computing engine of πDataCS, PieCloudDB, a cloud-native virtual data warehouse, supports all product editions of πDataCS, including the public cloud, community, enterprise, and integrated machine editions. It can be deployed on public cloud, private cloud, or bare metal, and employs data warehouse virtualization technology to help enterprises break down data silos, integrate all structured data resources, and effortlessly handle complex logical computations.
Its cloud-native storage-compute separation architecture uses a three-layer structure that separates metadata, computation, and data, so that storage and computing resources can be managed independently in the cloud. On the cloud, PieCloudDB uses eMPP (elastic MPP) technology to execute tasks concurrently across multiple clusters, enabling enterprises to scale up or down flexibly, adapt efficiently to changes in workload, and handle petabyte-scale data with ease.
PieCloudVector: Cloud-Native Vector Computing Engine
A vector database is a specialized database system designed for storing, querying, and analyzing vector data, such as feature vectors. After comparing the implementation and performance of pgvector and pgembedding, OpenPie chose not to use existing open-source solutions. Instead, OpenPie independently developed PieCloudVector to meet the specific usage scenarios of our users. PieCloudVector features efficient storage and retrieval of vector data, similarity searches, vector indexing, vector clustering and classification, high-performance parallel computing, and robust scalability and fault tolerance.
Cloud-Native Vector Computing Engine: PieCloudVector
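PieCloudVector's client API is not covered in this article, so the snippet below is only a minimal sketch of the core operation a vector engine performs: a brute-force cosine-similarity top-k search over feature vectors. The data and function names are hypothetical, and a real engine such as PieCloudVector would rely on vector indexes rather than a full scan.

```python
# Minimal sketch of the similarity search a vector engine performs.
# This is NOT the PieCloudVector API: it is a brute-force, in-memory
# cosine-similarity search used only to illustrate the concept.
import numpy as np

def top_k_similar(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Return (index, cosine similarity) of the k vectors closest to `query`."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity for every stored vector
    order = np.argsort(-scores)[:k]     # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(10_000, 128))   # hypothetical feature vectors
    query = rng.normal(size=128)              # hypothetical query embedding
    for idx, score in top_k_similar(query, corpus, k=3):
        print(f"vector {idx}: similarity {score:.4f}")
```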
PieCloudML: Cloud-Native Machine Learning Engine
With the continued advancement of artificial intelligence, more and more economic activity will be driven by AI. Within πDataCS, a cloud-native machine learning engine, PieCloudML, has been built. Through the machine learning, graph, and large model algorithms embedded in PieCloudML, data scientists can use familiar tools such as Python and R to accomplish a wide range of tasks, leveraging the data computing system to produce the models they need.
Cloud-Native Machine Learning Engine: PieCloudML
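The article only states that data scientists can work in familiar languages such as Python and R; the snippet below is a hedged sketch of what such a workflow might look like, using synthetic data and scikit-learn in place of PieCloudML's actual API, which is not described here. In practice the DataFrame would be read from the shared storage foundation instead of being fabricated.

```python
# A sketch of a typical Python modeling workflow, not PieCloudML's API.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# In PieCloudML this DataFrame would come from the data computing system;
# here it is synthetic so the sketch runs standalone.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1_000),
    "feature_b": rng.normal(size=1_000),
})
df["label"] = (df["feature_a"] + 0.5 * df["feature_b"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.3f}")
```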
To accelerate big data processing and computation, πDataCS makes extensive use of new hardware such as GPUs and FPGAs for asynchronous computing. Through a unified metadata management layer called 「Mundo」, the three major computing engines share a data storage foundation named 「JANM」, achieving the one data storage, multi-engine computation paradigm.
Next, we will provide a detailed introduction to 「JANM」, the cloud storage foundation of this big data computing system.
As the cloud storage foundation of πDataCS, the JANM storage system aims to provide a data management and storage foundation for high-performance computing systems across cloud scenarios. Leveraging modern hardware and infrastructure, JANM maximizes the potential of the cloud and is committed to simplifying the entire process of data loading, reading, and computation in big data processing. It also provides features such as adaptive data governance and ACID transaction support, ensuring data security and delivering optimal performance for data computation and analysis tasks in a wide range of scenarios.
To achieve this goal, the evolution of JANM mainly goes through three stages:
Stage One: Next-Generation Cloud-Native Storage
In the first stage, JANM primarily serves as the cloud-native storage for PieCloudDB, the cloud-native virtual data warehouse, and the development work for this stage has been completed.
JANM is compatible with various cloud environments, including public cloud, private cloud, and hybrid cloud, and uses object storage as the persistent storage layer. To handle data distribution and elasticity under the elastic MPP (eMPP) architecture, it employs consistent hashing so that each node in the distributed environment accesses roughly the same amount of data, and so that cache invalidation is kept to a minimum when the cluster scales. JANM also takes data security into account: working together with the transparent encryption used in the cloud-native virtual data warehouse PieCloudDB, it encrypts data when it is written to disk, and the transparent encryption scheme employs a three-tiered key system to keep data secure. Moreover, JANM has undergone extensive optimization of read and write performance, significantly improving the efficiency of data loading and querying.
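As a rough illustration of the consistent-hashing property mentioned above (nodes receive a roughly equal share of the data, and scaling remaps only a small fraction of it), here is a minimal, generic hash-ring sketch with virtual nodes. It is not JANM's actual implementation.

```python
# Standard consistent-hashing technique with virtual nodes, for illustration only.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Return the node responsible for `key` (next position clockwise on the ring)."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

if __name__ == "__main__":
    blocks = [f"block-{i}" for i in range(10_000)]
    ring3 = HashRing(["node-1", "node-2", "node-3"])
    ring4 = HashRing(["node-1", "node-2", "node-3", "node-4"])
    before = {b: ring3.node_for(b) for b in blocks}
    after = {b: ring4.node_for(b) for b in blocks}
    moved = sum(before[b] != after[b] for b in blocks)
    print(f"blocks remapped after adding a node: {moved}/{len(blocks)}")
```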
Brand New File Format: janm
「JANM」, the new generation of cloud-native storage, is built around the JANM file format. The JANM file format uses a design that combines row and column storage. This hybrid design gives the system the efficiency of row storage when data is reorganized, together with the high compression ratio and cache-line-friendly access of column storage. The JANM file format also supports vectorized (SIMD) and parallel computing. Its design further considers the storage representation both in memory and on disk, redefining the on-disk and in-memory formats of table data so that converting data between disk and memory incurs no additional overhead.
Within the file format, JANM also collects statistical information about the data to accelerate queries, supporting performance optimization features such as precomputation. To speed up I/O, the JANM file format incorporates various compression algorithms, such as zstd and lz. Depending on the data type, JANM can adaptively choose different encoding methods, including delta encoding, dictionary encoding, and others.
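To make the adaptive encoding idea concrete, the sketch below shows textbook versions of two of the encodings mentioned, delta encoding and dictionary encoding. These are illustrative implementations of the general techniques, not the JANM file format's on-disk layout.

```python
# Illustrative delta and dictionary encodings; not JANM's actual format.

def delta_encode(values):
    """Store the first value, then successive differences (suits
    monotonically increasing data such as timestamps or IDs)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def dict_encode(values):
    """Replace repeated values with small integer codes plus a dictionary
    (suits low-cardinality string columns)."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

if __name__ == "__main__":
    ts = [1700000000, 1700000003, 1700000007, 1700000007, 1700000012]
    assert delta_decode(delta_encode(ts)) == ts
    print("delta:", delta_encode(ts))

    cities = ["Shanghai", "Beijing", "Shanghai", "Shanghai", "Beijing"]
    dictionary, codes = dict_encode(cities)
    print("dictionary:", dictionary, "codes:", codes)
```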
Through block-level MVCC (Multi-Version Concurrency Control), JANM provides complete transaction support. Whether the data in each file block is visible is determined by JANM based on the MVCC information of the file it belongs to, according to the current transaction isolation level. In PieCloudDB, JANM has been deeply customized for the access layer to ensure that PieCloudDB fully leverages the various optimizations provided by JANM.
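The following is a minimal sketch of the kind of visibility check described above: each block carries MVCC information (here modeled as the transactions that created and deleted it), and a reader's snapshot decides whether the block is visible. The field names and snapshot model are illustrative assumptions, not JANM's actual metadata.

```python
# Illustrative block-level MVCC visibility check; structures are hypothetical.
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class BlockMVCC:
    xmin: int                   # transaction that created the block
    xmax: Optional[int] = None  # transaction that deleted it, if any

@dataclass
class Snapshot:
    """Snapshot-style view: transactions below `xmax` are visible unless
    they were still in progress when the snapshot was taken."""
    xmax: int
    in_progress: Set[int]

    def sees(self, xid: int) -> bool:
        return xid < self.xmax and xid not in self.in_progress

def block_visible(block: BlockMVCC, snap: Snapshot) -> bool:
    if not snap.sees(block.xmin):                      # creator not yet visible
        return False
    if block.xmax is not None and snap.sees(block.xmax):
        return False                                   # deletion already visible
    return True

if __name__ == "__main__":
    snap = Snapshot(xmax=100, in_progress={97})
    print(block_visible(BlockMVCC(xmin=42), snap))            # True
    print(block_visible(BlockMVCC(xmin=97), snap))            # False (in flight)
    print(block_visible(BlockMVCC(xmin=42, xmax=50), snap))   # False (deleted)
```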
Currently, JANM has undergone extensive optimization for data reading and querying, implementing numerous features, including Data Skipping, precomputation to accelerate aggregate queries, support for Smart Analyze, and TOAST.
With the completion of this phase, and in response to the needs of πDataCS, the development team has undertaken the design and implementation of the second phase for JANM. The goal is to transform JANM into the cloud storage foundation of the big data computing system.
Stage Two: Cloud Storage Foundation of the Big Data Computing System
In this phase, JANM will serve as the cloud storage foundation for πDataCS, with the goal of truly achieving "one data storage, multiple engine computation." Corresponding development work is currently underway.
To achieve this goal, JANM plans to implement a series of new capabilities, which are organized into the layers described below.
The diagram below details all the levels of the JANM Table Format, where each level depends on the one below it and draws the necessary capabilities from it. Users store data in the corresponding file format in an extremely scalable cloud storage to provide data for upper-level computations.
JANM: Cloud Storage Foundation of the Big Data Computing System
Storage Access Abstraction Layer
At the bottom is the storage access abstraction layer of JANM. JANM interacts with any type of storage, including cloud object storage (such as S3) and HDFS, through abstract APIs, ensuring compatibility with all storage engines. In addition, JANM wraps the file system to further optimize storage functionality, for example by providing monitoring and various read/write strategies.
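A storage access abstraction layer of this kind can be pictured as one abstract interface with one implementation per backend. The sketch below uses hypothetical class and method names and ships only a local-filesystem backend so it runs standalone; an S3 or HDFS backend would implement the same interface on top of its own client library.

```python
# Hypothetical storage abstraction sketch; not JANM's actual interfaces.
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def list(self, prefix: str) -> list[str]: ...

class LocalStorage(StorageBackend):
    """Filesystem backend; an S3 or HDFS backend would implement the same
    three methods on top of its own client library."""
    def __init__(self, root: str):
        self.root = Path(root)
    def read(self, path: str) -> bytes:
        return (self.root / path).read_bytes()
    def write(self, path: str, data: bytes) -> None:
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    def list(self, prefix: str) -> list[str]:
        return [str(p.relative_to(self.root))
                for p in self.root.glob(f"{prefix}*") if p.is_file()]

if __name__ == "__main__":
    store: StorageBackend = LocalStorage("/tmp/janm-demo")
    store.write("tables/t1/part-0.janm", b"demo bytes")
    print(store.list("tables/t1/"))
    print(store.read("tables/t1/part-0.janm"))
```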
Data File Format Abstraction Layer
At this layer, JANM supports multiple file formats and provides a unified access interface to simplify data access, allowing users to freely choose different file formats for storing their data. On top of that, JANM's unique file layout scheme records every change to each file, enabling JANM to maintain an independent redo log and implement richer functionality.
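The "record every change to each file" idea can be illustrated with an append-only change log that is replayed to reconstruct the table's file layout. The action names and structures below are assumptions for illustration, not JANM's actual redo log format.

```python
# Illustrative append-only file layout change log; structures are hypothetical.
from dataclasses import dataclass

@dataclass
class LayoutChange:
    action: str      # "add" or "remove"
    file_path: str

def replay(log, upto=None):
    """Rebuild the set of live data files by replaying the change log."""
    live = set()
    for change in (log if upto is None else log[:upto]):
        if change.action == "add":
            live.add(change.file_path)
        elif change.action == "remove":
            live.discard(change.file_path)
    return sorted(live)

if __name__ == "__main__":
    log = [
        LayoutChange("add", "part-0.janm"),
        LayoutChange("add", "part-1.janm"),
        LayoutChange("remove", "part-0.janm"),   # e.g. rewritten by compaction
        LayoutChange("add", "compacted-0.janm"),
    ]
    print(replay(log))          # current layout
    print(replay(log, upto=2))  # layout after the first two changes
```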
Core Layer of Table Format
The core layer of the table format provides functional encapsulation and implementation of various features. The core layer includes the following five subsystems:
The core layer includes the table transaction engine, which implements file-level MVCC (Multi-Version Concurrency Control). It supports visibility judgments based on the database's isolation level, providing a degree of concurrency control. For transaction guarantees, the fundamental idea of JANM is that logs are data, where the data in question is transaction visibility information.
Indexes help organize queries better, reduce overall I/O, and provide faster response times. In OLAP (Online Analytical Processing) scenarios, the file list and column index information are enough for OLAP engines to quickly generate efficient query plans. Currently, JANM supports the indexes required for data skipping; in the future, OpenPie will continue to explore more index implementations, including row-level indexes.
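Data skipping of the kind mentioned above is commonly built on per-file column statistics: files whose value range cannot match the predicate are pruned before any I/O happens. The sketch below illustrates that idea with a hypothetical statistics layout; it is not JANM's index format.

```python
# Illustrative min/max (zone map) pruning for data skipping.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_value: int   # min/max of the filtered column within this file
    max_value: int

def files_to_scan(stats, lo, hi):
    """Return only files whose [min, max] range overlaps the predicate
    `lo <= column <= hi`; everything else is skipped without any I/O."""
    return [s.path for s in stats if s.max_value >= lo and s.min_value <= hi]

if __name__ == "__main__":
    catalog = [
        FileStats("part-0.janm", min_value=1,    max_value=1000),
        FileStats("part-1.janm", min_value=1001, max_value=2000),
        FileStats("part-2.janm", min_value=2001, max_value=3000),
    ]
    # Predicate: WHERE order_id BETWEEN 1500 AND 1600
    print(files_to_scan(catalog, 1500, 1600))   # only part-1.janm is read
```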
The adaptive management functions supported by JANM for table data mainly include:
➢ VACUUM: Data cleanup that reclaims the space left behind by earlier operations
➢ Smart Analyze: Sampling of data distribution information
➢ Compaction: Merging small files to improve I/O efficiency (a minimal sketch follows this list)
➢ Cluster: Attempting to cluster similar data in the same file to enhance data skipping efficiency and improve query speed
➢ Sort: Sorting data based on specified fields or conditions
...
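As referenced in the Compaction item above, here is a minimal sketch of the idea of merging small files into larger ones to reduce the number of objects read per query. The size thresholds, file naming, and byte-level concatenation are purely illustrative; a real table format would rewrite files format-aware and update its metadata transactionally.

```python
# Illustrative compaction sketch; not JANM's actual implementation.
from pathlib import Path

def compact(directory: str, target_size: int = 64 * 1024 * 1024,
            small_threshold: int = 8 * 1024 * 1024) -> None:
    small = sorted(p for p in Path(directory).glob("part-*.janm")
                   if p.stat().st_size < small_threshold)
    batch, batch_bytes, merged_id = [], 0, 0
    for path in small + [None]:                    # None flushes the final batch
        if path is not None:
            batch.append(path)
            batch_bytes += path.stat().st_size
        if batch and (path is None or batch_bytes >= target_size):
            merged = Path(directory) / f"compacted-{merged_id}.janm"
            with merged.open("wb") as out:
                for p in batch:
                    out.write(p.read_bytes())      # concatenate the small inputs
                    p.unlink()                     # drop each merged input file
            merged_id += 1
            batch, batch_bytes = [], 0

if __name__ == "__main__":
    compact("/tmp/janm-demo/tables/t1")            # hypothetical table directory
```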
At this layer, JANM supports the control of the composition and layout of tables, encapsulating functions such as traversing table files and statistics on table data size. In object storage, listing files is a costly operation, and JANM facilitates fast file traversal and data size statistics through the functions provided at the table format layer.
Scalable Programming Interface
For the upper-layer interfaces, JANM provides a unified API for interacting with external services, facilitating the integration of third-party applications. JANM supports different implementations of extension services without requiring additional application development, saving users both costs and effort. It provides entry points for data access, table access services, snapshot-based operations, and rich functionalities including Time Travel.
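Snapshot-based operations and Time Travel are typically built on the idea that every commit records the file list valid at that point, so reading "as of" an earlier time simply resolves an older snapshot. The sketch below illustrates that idea with hypothetical structures, not JANM's actual snapshot metadata.

```python
# Illustrative snapshot-based Time Travel sketch; structures are hypothetical.
from dataclasses import dataclass
from bisect import bisect_right

@dataclass
class TableSnapshot:
    snapshot_id: int
    committed_at: float          # unix timestamp of the commit
    data_files: list             # files that make up the table at this point

class TimeTravelTable:
    def __init__(self):
        self.snapshots: list[TableSnapshot] = []

    def commit(self, snapshot: TableSnapshot) -> None:
        self.snapshots.append(snapshot)

    def files_as_of(self, timestamp: float) -> list:
        """Return the file list of the latest snapshot committed at or before
        `timestamp`; this is what an 'AS OF' style query would scan."""
        times = [s.committed_at for s in self.snapshots]
        idx = bisect_right(times, timestamp) - 1
        if idx < 0:
            raise ValueError("no snapshot exists at or before that time")
        return self.snapshots[idx].data_files

if __name__ == "__main__":
    t = TimeTravelTable()
    t.commit(TableSnapshot(1, 1_700_000_000, ["part-0.janm"]))
    t.commit(TableSnapshot(2, 1_700_000_600, ["part-0.janm", "part-1.janm"]))
    print(t.files_as_of(1_700_000_300))   # sees only the first snapshot
```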
For table application services, JANM provides stateless data management applications that can be registered with any service, thereby achieving adaptive data management.
After the completion of the second phase, πDataCS's 「JANM」 plans to embrace open source, enabling true interoperability of data across different services. It aims to fully support numerous services, including Spark and ClickHouse, realizing the vision of one data storage, multi-engine computation.
Stage Three: Unified Storage Engine for the Big Data Computing System
In the third phase, which is still evolving, JANM aims to become a unified storage engine for big data computing systems. The goal is to create a unified access protocol that brings together table formats, data lakes, table engines, and more, simplifying user access operations. We hope everyone will continue to follow JANM's progress!