On October 24th, Programmer's Day, OpenPie's 2023 Annual Technology Forum, entitled "Large Model Data Computing System", concluded successfully in Shanghai. At the event, OpenPie released its new large model data computing system, PieDataComputingSystem (πDataCS). Rebuilt on cloud-native technology for data storage and computation, πDataCS enables one data storage with multi-engine data computation, allowing AI models to become larger and faster.
πDataCS aims to help enterprises overcome computational bottlenecks, fully exploit the advantages of data scale, and build core technological barriers, so as to better support business development, keep this autonomous and controllable large model data computing system globally leading, and bring large model technology to every industry.
Computing platforms have undergone three major shifts, from mainframes to PCs to today's cloud platforms. Cloud platforms offer the greatest computing capacity, storage capacity, and horizontal scalability. In the PC era, metadata and user data were mapped to local hard drives and computation was mapped to local CPUs, so storage and computation were tightly coupled on the same server.
By reconstructing data storage and computation with cloud-native technology, πDataCS first separates computation from data in the data computing system to enhance system elasticity. Then, with future data governance and data transactions in mind, OpenPie further separates metadata from user data, arriving at a brand-new eMPP architecture. Metadata is mapped to block storage and managed by the metadata management system 「Mundo」; user data is mapped to object storage and managed by the 「JANM」 storage system; computation is mapped to containers or virtual machines and managed by the computation system.
Through Data Mesh, πDataCS upgrades data governance and realizes the value of data. πDataCS takes full account of the requirements of global data transactions and data governance. Data, as a new means of production, is essential fuel for model development. Under the premise of privacy and security, data owners can share metadata containing data catalogs with other users; data operators can discover an owner's user data through that metadata and, when needed, access it for a fee under authorization. When accessing an owner's data, data operators call the data computing engine provided by data processors.
The overall architecture of πDataCS is divided into four layers, as shown in the figure below.
Architecture of πDataCS
The top layer consists of the computing engines supported by πDataCS. Currently, πDataCS supports the following computing engines:
PieCloudDB: The First Cloud-Native Data Warehouse Computing Engine
As the first computing engine of πDataCS, PieCloudDB, a cloud-native virtual data warehouse, supports all product editions of πDataCS, including the public cloud, community, enterprise, and integrated machine editions. It can be deployed on public cloud, private cloud, or bare metal, and employs data warehouse virtualization technology to help enterprises break down data silos, integrate all structured data resources, and effortlessly handle complex logical computations.
Its cloud-native storage-compute separation architecture uses a three-layer structure that separates metadata, computation, and data, so that storage and computing resources can be managed independently in the cloud. On the cloud, PieCloudDB uses eMPP (elastic MPP) technology to execute tasks concurrently across multiple clusters, enabling enterprises to scale up or down flexibly, adapt efficiently to changes in workload, and handle petabyte-scale data with ease.
PieCloudVector: Cloud-Native Vector Computing Engine
A vector database is a specialized database system designed for storing, querying, and analyzing vector data, such as feature vectors. After comparing the implementation and performance of pgvector and pgembedding, OpenPie chose not to use existing open-source solutions. Instead, OpenPie independently developed PieCloudVector to meet the specific usage scenarios of our users. PieCloudVector features efficient storage and retrieval of vector data, similarity searches, vector indexing, vector clustering and classification, high-performance parallel computing, and robust scalability and fault tolerance.
Cloud-Native Vector Computing Engine: PieCloudVector
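PieCloudVector's client API is not covered in this article, so the snippet below is only a minimal sketch of the core operation a vector engine performs: a brute-force cosine-similarity top-k search over feature vectors. The data and function names are hypothetical, and a real engine such as PieCloudVector would rely on vector indexes rather than a full scan.

```python
# Minimal sketch of the similarity search a vector engine performs.
# This is NOT the PieCloudVector API: it is a brute-force, in-memory
# cosine-similarity search used only to illustrate the concept.
import numpy as np

def top_k_similar(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Return (index, cosine similarity) of the k vectors closest to `query`."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                      # cosine similarity for every stored vector
    order = np.argsort(-scores)[:k]     # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(10_000, 128))   # hypothetical feature vectors
    query = rng.normal(size=128)              # hypothetical query embedding
    for idx, score in top_k_similar(query, corpus, k=3):
        print(f"vector {idx}: similarity {score:.4f}")
```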
PieCloudML: Cloud-Native Machine Learning Engine
With the continued advancement of artificial intelligence, more and more economic activity will be driven by AI. Within πDataCS, a cloud-native machine learning engine, PieCloudML, has been built. Through the machine learning, graph, and large model algorithms embedded in PieCloudML, data scientists can use familiar tools such as Python and R to accomplish a wide range of tasks, leveraging the data computing system to produce the models they need.
Cloud-Native Machine Learning Engine: PieCloudML
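The article only states that data scientists can work in familiar languages such as Python and R; the snippet below is a hedged sketch of what such a workflow might look like, using synthetic data and scikit-learn in place of PieCloudML's actual API, which is not described here. In practice the DataFrame would be read from the shared storage foundation instead of being fabricated.

```python
# A sketch of a typical Python modeling workflow, not PieCloudML's API.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# In PieCloudML this DataFrame would come from the data computing system;
# here it is synthetic so the sketch runs standalone.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1_000),
    "feature_b": rng.normal(size=1_000),
})
df["label"] = (df["feature_a"] + 0.5 * df["feature_b"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print(f"hold-out accuracy: {model.score(X_test, y_test):.3f}")
```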
To accelerate big data processing and computation, πDataCS makes extensive use of new hardware such as GPUs and FPGAs for asynchronous computing. Through a unified metadata management layer called 「Mundo」, the three major computing engines share a data storage foundation named 「JANM」, achieving the one data storage, multi-engine computation paradigm.
Next, we will provide a detailed introduction to 「JANM」, the cloud storage foundation of this big data computing system.
As the cloud storage foundation of πDataCS, the JANM storage system aims to provide a data management and storage foundation for high-performance computing systems across cloud scenarios. Leveraging modern hardware and infrastructure, JANM maximizes the potential of the cloud and is committed to simplifying the entire process of data loading, reading, and computation in big data processing. It also provides features such as adaptive data governance and ACID transaction support, ensuring data security and delivering optimal performance for data computation and analysis tasks in a wide range of scenarios.
To achieve this goal, the evolution of JANM mainly goes through three stages:
Stage One: Next-Generation Cloud-Native Storage
In the first stage, JANM primarily serves as the cloud-native storage for PieCloudDB, the cloud-native virtual data warehouse, and the development work for this stage has been completed.
JANM is compatible with various cloud environments, including public cloud, private cloud, and hybrid cloud, and uses object storage as the persistent storage layer. To handle data distribution and elasticity under the elastic MPP (eMPP) architecture, it employs consistent hashing so that each node in the distributed environment accesses roughly the same amount of data, and so that cache invalidation is kept to a minimum when the cluster scales. JANM also takes data security into account: working together with the transparent encryption used in the cloud-native virtual data warehouse PieCloudDB, it encrypts data when it is written to disk, and the transparent encryption scheme employs a three-tiered key system to keep data secure. Moreover, JANM has undergone extensive optimization of read and write performance, significantly improving the efficiency of data loading and querying.
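As a rough illustration of the consistent-hashing property mentioned above (nodes receive a roughly equal share of the data, and scaling remaps only a small fraction of it), here is a minimal, generic hash-ring sketch with virtual nodes. It is not JANM's actual implementation.

```python
# Standard consistent-hashing technique with virtual nodes, for illustration only.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """Return the node responsible for `key` (next position clockwise on the ring)."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

if __name__ == "__main__":
    blocks = [f"block-{i}" for i in range(10_000)]
    ring3 = HashRing(["node-1", "node-2", "node-3"])
    ring4 = HashRing(["node-1", "node-2", "node-3", "node-4"])
    before = {b: ring3.node_for(b) for b in blocks}
    after = {b: ring4.node_for(b) for b in blocks}
    moved = sum(before[b] != after[b] for b in blocks)
    print(f"blocks remapped after adding a node: {moved}/{len(blocks)}")
```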
Brand New File Format: janm
「JANM」, the new generation of cloud-native storage, is built around the JANM file format. The JANM file format uses a design that combines row and column storage. This hybrid design gives the system the efficiency of row storage when data is reorganized, together with the high compression ratio and cache-line-friendly access of column storage. The JANM file format also supports vectorized (SIMD) and parallel computing. Its design further considers the storage representation both in memory and on disk, redefining the on-disk and in-memory formats of table data so that converting data between disk and memory incurs no additional overhead.
Within the file format, JANM also collects statistical information about the data to accelerate queries, supporting performance optimization features such as precomputation. To speed up I/O, the JANM file format incorporates various compression algorithms, such as zstd and lz. Depending on the data type, JANM can adaptively choose different encoding methods, including delta encoding, dictionary encoding, and others.
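To make the adaptive encoding idea concrete, the sketch below shows textbook versions of two of the encodings mentioned, delta encoding and dictionary encoding. These are illustrative implementations of the general techniques, not the JANM file format's on-disk layout.

```python
# Illustrative delta and dictionary encodings; not JANM's actual format.

def delta_encode(values):
    """Store the first value, then successive differences (suits
    monotonically increasing data such as timestamps or IDs)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def dict_encode(values):
    """Replace repeated values with small integer codes plus a dictionary
    (suits low-cardinality string columns)."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

if __name__ == "__main__":
    ts = [1700000000, 1700000003, 1700000007, 1700000007, 1700000012]
    assert delta_decode(delta_encode(ts)) == ts
    print("delta:", delta_encode(ts))

    cities = ["Shanghai", "Beijing", "Shanghai", "Shanghai", "Beijing"]
    dictionary, codes = dict_encode(cities)
    print("dictionary:", dictionary, "codes:", codes)
```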
Through block-level MVCC (Multi-Version Concurrency Control), JANM provides complete transaction support. Whether the data in each file block is visible is determined by JANM based on the MVCC information of the file it belongs to, according to the current transaction isolation level. In PieCloudDB, JANM has been deeply customized for the access layer to ensure that PieCloudDB fully leverages the various optimizations provided by JANM.
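The following is a minimal sketch of the kind of visibility check described above: each block carries MVCC information (here modeled as the transactions that created and deleted it), and a reader's snapshot decides whether the block is visible. The field names and snapshot model are illustrative assumptions, not JANM's actual metadata.

```python
# Illustrative block-level MVCC visibility check; structures are hypothetical.
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class BlockMVCC:
    xmin: int                   # transaction that created the block
    xmax: Optional[int] = None  # transaction that deleted it, if any

@dataclass
class Snapshot:
    """Snapshot-style view: transactions below `xmax` are visible unless
    they were still in progress when the snapshot was taken."""
    xmax: int
    in_progress: Set[int]

    def sees(self, xid: int) -> bool:
        return xid < self.xmax and xid not in self.in_progress

def block_visible(block: BlockMVCC, snap: Snapshot) -> bool:
    if not snap.sees(block.xmin):                      # creator not yet visible
        return False
    if block.xmax is not None and snap.sees(block.xmax):
        return False                                   # deletion already visible
    return True

if __name__ == "__main__":
    snap = Snapshot(xmax=100, in_progress={97})
    print(block_visible(BlockMVCC(xmin=42), snap))            # True
    print(block_visible(BlockMVCC(xmin=97), snap))            # False (in flight)
    print(block_visible(BlockMVCC(xmin=42, xmax=50), snap))   # False (deleted)
```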
Currently, JANM has undergone extensive optimization for data reading and querying, implementing numerous features, including Data Skipping, precomputation to accelerate aggregate queries, support for Smart Analyze, and TOAST.
With the completion of this phase, and in response to the needs of πDataCS, the development team has undertaken the design and implementation of the second phase for JANM. The goal is to transform JANM into the cloud storage foundation of the big data computing system.
Stage Two: Cloud Storage Foundation of the Big Data Computing System
In this phase, JANM will serve as the cloud storage foundation for πDataCS, with the goal of truly achieving "one data storage, multiple engine computation." Corresponding development work is currently underway.
To achieve this goal, JANM plans to implement a series of new capabilities, which are organized into the layers described below.
The diagram below details all the levels of the JANM Table Format, where each level depends on the one below it and draws the necessary capabilities from it. Users store data in the corresponding file format in an extremely scalable cloud storage to provide data for upper-level computations.
JANM: Cloud Storage Foundation of the Big Data Computing System
Storage Access Abstraction Layer
At the bottom is the storage access abstraction layer of JANM. JANM interacts with any type of storage, including cloud object storage (such as S3) and HDFS, through abstract APIs, ensuring compatibility with all storage engines. In addition, JANM wraps the file system to further optimize storage functionality, for example by providing monitoring and various read/write strategies.
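A storage access abstraction layer of this kind can be pictured as one abstract interface with one implementation per backend. The sketch below uses hypothetical class and method names and ships only a local-filesystem backend so it runs standalone; an S3 or HDFS backend would implement the same interface on top of its own client library.

```python
# Hypothetical storage abstraction sketch; not JANM's actual interfaces.
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...
    @abstractmethod
    def list(self, prefix: str) -> list[str]: ...

class LocalStorage(StorageBackend):
    """Filesystem backend; an S3 or HDFS backend would implement the same
    three methods on top of its own client library."""
    def __init__(self, root: str):
        self.root = Path(root)
    def read(self, path: str) -> bytes:
        return (self.root / path).read_bytes()
    def write(self, path: str, data: bytes) -> None:
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    def list(self, prefix: str) -> list[str]:
        return [str(p.relative_to(self.root))
                for p in self.root.glob(f"{prefix}*") if p.is_file()]

if __name__ == "__main__":
    store: StorageBackend = LocalStorage("/tmp/janm-demo")
    store.write("tables/t1/part-0.janm", b"demo bytes")
    print(store.list("tables/t1/"))
    print(store.read("tables/t1/part-0.janm"))
```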
Data File Format Abstraction Layer
At this layer, JANM supports multiple file formats and provides a unified access interface to simplify data access, allowing users to freely choose different file formats for storing their data. On top of that, JANM's unique file layout scheme records every change to each file, enabling JANM to maintain an independent redo log and implement richer functionality.
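The "record every change to each file" idea can be illustrated with an append-only change log that is replayed to reconstruct the table's file layout. The action names and structures below are assumptions for illustration, not JANM's actual redo log format.

```python
# Illustrative append-only file layout change log; structures are hypothetical.
from dataclasses import dataclass

@dataclass
class LayoutChange:
    action: str      # "add" or "remove"
    file_path: str

def replay(log, upto=None):
    """Rebuild the set of live data files by replaying the change log."""
    live = set()
    for change in (log if upto is None else log[:upto]):
        if change.action == "add":
            live.add(change.file_path)
        elif change.action == "remove":
            live.discard(change.file_path)
    return sorted(live)

if __name__ == "__main__":
    log = [
        LayoutChange("add", "part-0.janm"),
        LayoutChange("add", "part-1.janm"),
        LayoutChange("remove", "part-0.janm"),   # e.g. rewritten by compaction
        LayoutChange("add", "compacted-0.janm"),
    ]
    print(replay(log))          # current layout
    print(replay(log, upto=2))  # layout after the first two changes
```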
Core Layer of Table Format
The core layer of the table format provides functional encapsulation and implementation of various features. The core layer includes the following five subsystems:
The core layer includes the table transaction engine, which implements file-level MVCC (Multi-Version Concurrency Control). It supports visibility judgments based on the database's isolation level, providing a degree of concurrency control. For transaction guarantees, the fundamental idea of JANM is that logs are data, where the data in question is transaction visibility information.
Indexes help organize queries better, reduce overall I/O, and provide faster response times. In OLAP (Online Analytical Processing) scenarios, the file list and column index information are enough for OLAP engines to quickly generate efficient query plans. Currently, JANM supports the indexes required for data skipping; in the future, OpenPie will continue to explore more index implementations, including row-level indexes.
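Data skipping of the kind mentioned above is commonly built on per-file column statistics: files whose value range cannot match the predicate are pruned before any I/O happens. The sketch below illustrates that idea with a hypothetical statistics layout; it is not JANM's index format.

```python
# Illustrative min/max (zone map) pruning for data skipping.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_value: int   # min/max of the filtered column within this file
    max_value: int

def files_to_scan(stats, lo, hi):
    """Return only files whose [min, max] range overlaps the predicate
    `lo <= column <= hi`; everything else is skipped without any I/O."""
    return [s.path for s in stats if s.max_value >= lo and s.min_value <= hi]

if __name__ == "__main__":
    catalog = [
        FileStats("part-0.janm", min_value=1,    max_value=1000),
        FileStats("part-1.janm", min_value=1001, max_value=2000),
        FileStats("part-2.janm", min_value=2001, max_value=3000),
    ]
    # Predicate: WHERE order_id BETWEEN 1500 AND 1600
    print(files_to_scan(catalog, 1500, 1600))   # only part-1.janm is read
```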
The adaptive management functions supported by JANM for table data mainly include:
➢ VACUUM: Data cleanup that reclaims the space left behind by earlier operations
➢ Smart Analyze: Sampling of data distribution information
➢ Compaction: Merging small files to improve I/O efficiency (a minimal sketch follows this list)
➢ Cluster: Attempting to cluster similar data in the same file to enhance data skipping efficiency and improve query speed
➢ Sort: Sorting data based on specified fields or conditions
...
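As referenced in the Compaction item above, here is a minimal sketch of the idea of merging small files into larger ones to reduce the number of objects read per query. The size thresholds, file naming, and byte-level concatenation are purely illustrative; a real table format would rewrite files format-aware and update its metadata transactionally.

```python
# Illustrative compaction sketch; not JANM's actual implementation.
from pathlib import Path

def compact(directory: str, target_size: int = 64 * 1024 * 1024,
            small_threshold: int = 8 * 1024 * 1024) -> None:
    small = sorted(p for p in Path(directory).glob("part-*.janm")
                   if p.stat().st_size < small_threshold)
    batch, batch_bytes, merged_id = [], 0, 0
    for path in small + [None]:                    # None flushes the final batch
        if path is not None:
            batch.append(path)
            batch_bytes += path.stat().st_size
        if batch and (path is None or batch_bytes >= target_size):
            merged = Path(directory) / f"compacted-{merged_id}.janm"
            with merged.open("wb") as out:
                for p in batch:
                    out.write(p.read_bytes())      # concatenate the small inputs
                    p.unlink()                     # drop each merged input file
            merged_id += 1
            batch, batch_bytes = [], 0

if __name__ == "__main__":
    compact("/tmp/janm-demo/tables/t1")            # hypothetical table directory
```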
At this layer, JANM supports the control of the composition and layout of tables, encapsulating functions such as traversing table files and statistics on table data size. In object storage, listing files is a costly operation, and JANM facilitates fast file traversal and data size statistics through the functions provided at the table format layer.
Scalable Programming Interface
For the upper-layer interfaces, JANM provides a unified API for interacting with external services, facilitating the integration of third-party applications. JANM supports different implementations of extension services without requiring additional application development, saving users both costs and effort. It provides entry points for data access, table access services, snapshot-based operations, and rich functionalities including Time Travel.
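Snapshot-based operations and Time Travel are typically built on the idea that every commit records the file list valid at that point, so reading "as of" an earlier time simply resolves an older snapshot. The sketch below illustrates that idea with hypothetical structures, not JANM's actual snapshot metadata.

```python
# Illustrative snapshot-based Time Travel sketch; structures are hypothetical.
from dataclasses import dataclass
from bisect import bisect_right

@dataclass
class TableSnapshot:
    snapshot_id: int
    committed_at: float          # unix timestamp of the commit
    data_files: list             # files that make up the table at this point

class TimeTravelTable:
    def __init__(self):
        self.snapshots: list[TableSnapshot] = []

    def commit(self, snapshot: TableSnapshot) -> None:
        self.snapshots.append(snapshot)

    def files_as_of(self, timestamp: float) -> list:
        """Return the file list of the latest snapshot committed at or before
        `timestamp`; this is what an 'AS OF' style query would scan."""
        times = [s.committed_at for s in self.snapshots]
        idx = bisect_right(times, timestamp) - 1
        if idx < 0:
            raise ValueError("no snapshot exists at or before that time")
        return self.snapshots[idx].data_files

if __name__ == "__main__":
    t = TimeTravelTable()
    t.commit(TableSnapshot(1, 1_700_000_000, ["part-0.janm"]))
    t.commit(TableSnapshot(2, 1_700_000_600, ["part-0.janm", "part-1.janm"]))
    print(t.files_as_of(1_700_000_300))   # sees only the first snapshot
```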
For table application services, JANM provides stateless data management applications that can be registered with any service, thereby achieving adaptive data management.
After the completion of the second phase, πDataCS's 「JANM」 plans to embrace open source, enabling true interoperability of data across different services. It aims to fully support numerous services, including Spark and ClickHouse, realizing the vision of one data storage, multi-engine computation.
Stage Three: Unified Storage Engine for the Big Data Computing System
In the third phase, which is still evolving, JANM aims to become a unified storage engine for big data computing systems. The goal is to create a unified access protocol that brings together table formats, data lakes, table engines, and more, simplifying user access operations. We hope everyone will continue to follow JANM's progress!