Introducing PieCloudDB Database V3.0: The Database Kernel Is Further Upgraded!

MARCH 20TH, 2024

On March 14, the OpenPie 2024 Strategy and New Product Launch Event concluded successfully at the Shanghai International Conference Center. Themed “Data Warehouse Virtualization to Mobilize Your Data”, the event brought together senior industry experts and partners to discuss topics such as data element mobilization and digital technology innovation.

Ray Von, founder and CEO of OpenPie, launched the cloud-native virtual data warehouse PieCloudDB V3.0, and shared the latest achievements of data warehouse virtualization technology and its best practices in the data element industry.

PieCloudDB V3.0 Release Ceremony

During the conference, OpenPie’s founder and CEO, Ray Von, shared the company's strategic direction in the data field for 2024: using data warehouse virtualization technology to mobilize data and release the value of data elements. PieCloudDB applies data warehouse virtualization to build an eMPP (elastic MPP) architecture that separates metadata, data assets (storage), and compute within private and public clouds. The architecture is designed to address the data privacy, flexibility, and scalability challenges that traditional approaches face in large-model computation. At the architectural level, it removes structured data silos and supports data element mobilization on a larger scale, achieving 'data usability without visibility', enabling larger, faster, and more accurate models, and ensuring a secure state of 'No data movement, focusing on computation'.

PieCloudDB's kernel technology continues to advance. In this new version, modules including storage, metadata, and the executor have all been extensively upgraded.

JANM: Self-Developed Data Storage Engine

The vision of OpenPie's self-developed JANM Storage is to leverage cloud-native design and modern hardware and technology to create a data storage engine that meets the needs of high-performance computing systems in different cloud scenarios.

In the era of big data, data is stored in files with specific formats, and many companies have made in-depth innovations in storage formats and data organization. To pursue the best possible performance, obtain more flexible data units, build file-level statistics, and closely support file-level query optimization and upper-layer features, OpenPie developed a new storage format, "janm". A preliminary performance comparison between janm and the open source storage format Parquet shows janm performing substantially better in many respects.

Performance Comparison of janm and parquet File Formats

In addition, to simplify the various stages of data processing in the big data era, JANM puts considerable effort into organizing data files more efficiently. Its design also takes cloud-native operation and elasticity into consideration, with features that avoid global ordering, simplify data organization, reduce data movement, improve efficiency, support distributed computing, avoid data skew, support multiple clusters and elastic scaling, and maximize cluster resource utilization.

JANM automatically and adaptively manages data files on a regular basis: it quickly identifies the files that need to be reclustered and incrementally merges their data into new files organized by the index columns. JANM can also use data files to generate new forms of index, improving the performance of point queries on indexed columns.
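The reclustering idea above can be pictured with a small sketch. This is not JANM's actual algorithm or API. It assumes each data file keeps min/max statistics for one index column, treats a file as needing reclustering when its key range overlaps another file's, and uses hypothetical names throughout (`DataFile`, `files_to_recluster`, `recluster`, `prune`).

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[int, str]  # (index column value, payload); hypothetical row shape


@dataclass
class DataFile:
    name: str
    rows: List[Row]

    @property
    def min_key(self) -> int:
        # File-level statistic: smallest index value stored in this file
        return min(r[0] for r in self.rows)

    @property
    def max_key(self) -> int:
        # File-level statistic: largest index value stored in this file
        return max(r[0] for r in self.rows)


def overlaps(a: DataFile, b: DataFile) -> bool:
    """True when the key ranges of two files intersect."""
    return a.min_key <= b.max_key and b.min_key <= a.max_key


def files_to_recluster(files: List[DataFile]) -> List[DataFile]:
    """Assumed policy: pick files whose key range overlaps another file's."""
    return [f for f in files
            if any(f is not g and overlaps(f, g) for g in files)]


def recluster(files: List[DataFile], rows_per_file: int) -> List[DataFile]:
    """Merge the selected files into new files sorted by the index column."""
    merged = sorted(r for f in files for r in f.rows)
    return [DataFile(f"reclustered_{i // rows_per_file}",
                     merged[i:i + rows_per_file])
            for i in range(0, len(merged), rows_per_file)]


def prune(files: List[DataFile], key: int) -> List[DataFile]:
    """File-level pruning for a point query, using only min/max statistics."""
    return [f for f in files if f.min_key <= key <= f.max_key]


files = [
    DataFile("a", [(1, "x"), (9, "y")]),
    DataFile("b", [(5, "z"), (12, "w")]),
    DataFile("c", [(20, "u"), (25, "v")]),
]
selected = files_to_recluster(files)            # files "a" and "b" overlap
untouched = [f for f in files if all(f is not s for s in selected)]
new_files = recluster(selected, rows_per_file=2) + untouched
```

In this example, a point query for key 9 must open two files before reclustering and only one afterwards, because the new files cover disjoint key ranges.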

A New Generation of Vectorized Execution Engine

PieCloudDB's new-generation vectorized executor uses a plug-in execution model that adaptively selects an execution engine based on cost, automatically matching each query to the most suitable engine. The execution engine is built on an efficient in-memory columnar format and efficiently converts data from the janm storage format, which mixes row and column storage, into that in-memory representation. It also supports most existing data types for complete function coverage.
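As a rough illustration of cost-based engine selection, the sketch below picks between a row-at-a-time engine and a vectorized engine using a toy cost model. The cost formulas, constants, and names (`PlanStats`, `ENGINES`, `choose_engine`) are all assumptions made for this example and do not describe PieCloudDB's actual optimizer.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PlanStats:
    estimated_rows: int   # optimizer's row estimate for the plan
    columns_read: int     # columns the query actually needs
    total_columns: int    # columns in the table


def row_engine_cost(s: PlanStats) -> float:
    # Assumed model: no startup cost, but every column of every row is touched
    return s.estimated_rows * s.total_columns * 1.0


def vectorized_engine_cost(s: PlanStats) -> float:
    # Assumed model: fixed setup cost for batches and format conversion,
    # then a much cheaper per-value cost over only the needed columns
    return 5_000.0 + s.estimated_rows * s.columns_read * 0.1


# Pluggable registry: adding an engine means registering one more cost function
ENGINES: Dict[str, Callable[[PlanStats], float]] = {
    "row": row_engine_cost,
    "vectorized": vectorized_engine_cost,
}


def choose_engine(stats: PlanStats) -> str:
    """Return the name of the engine with the lowest estimated cost."""
    return min(ENGINES, key=lambda name: ENGINES[name](stats))


point_lookup = PlanStats(estimated_rows=10, columns_read=2, total_columns=20)
large_scan = PlanStats(estimated_rows=1_000_000, columns_read=2, total_columns=20)
```

Under these made-up constants, the small lookup is routed to the row engine because the vectorized setup cost dominates, while the large scan is routed to the vectorized engine.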

The PieCloudDB vectorized executor has so far converted most operators, including sort, agg, join, scan, motion, and filter, and work continues on further optimizations such as runtime filters and low-cardinality optimization. The vectorized executor has already demonstrated impressive performance improvements on TPC-H, a decision support benchmark widely adopted in the industry. The executor is also equipped with a trace system that enables query visualization and tracing of the full query path.

SIMD Execution Engine Improvement
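The runtime filter optimization mentioned above is a general technique in query execution: the keys seen while building the hash table for a join are used to discard probe-side rows early, before they reach the join. The following is a minimal sketch of that general idea, not PieCloudDB's implementation. It uses a plain Python set where a production engine would more likely use a Bloom filter or a min/max range, and all function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Row = Tuple[int, str]  # (join key, payload); hypothetical row shape


def build_hash_table(build_rows: Iterable[Row]) -> Dict[int, List[str]]:
    """Build phase of a hash join: group build-side payloads by join key."""
    table: Dict[int, List[str]] = {}
    for key, payload in build_rows:
        table.setdefault(key, []).append(payload)
    return table


def scan_with_runtime_filter(probe_rows: Iterable[Row],
                             key_filter: Set[int]) -> Iterator[Row]:
    """Scan that drops rows whose key cannot match any build-side row."""
    for row in probe_rows:
        if row[0] in key_filter:
            yield row


def hash_join(build_rows: Iterable[Row],
              probe_rows: Iterable[Row]) -> List[Tuple[int, str, str]]:
    table = build_hash_table(build_rows)
    # The runtime filter is derived from the build side at execution time
    runtime_filter = set(table)
    out = []
    for key, probe_payload in scan_with_runtime_filter(probe_rows, runtime_filter):
        for build_payload in table[key]:
            out.append((key, build_payload, probe_payload))
    return out


build = [(1, "a"), (2, "b")]
probe = [(1, "x"), (3, "y"), (1, "z")]
result = hash_join(build, probe)
```

Here the probe row with key 3 is dropped by the scan itself, so it never reaches the join. In a distributed engine the same filter can also reduce the data shipped between nodes.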

PieCloudDB vectorized executor will continue to iterate, and there will be more improvements in pipeline, serverless, software and hardware integration, and scheduling in the near future.

MUNDO: The Next Generation Metadata Management System

PieCloudDB's original metadata management system separates metadata, uses the open source KV database FoundationDB to store metadata, transactions and lock data, and uses the global cache system GMEMOS to cache metadata, transaction IDs, snapshots and other data. In the original system, metadata is stored persistently and can support features such as multi-cluster and multi-tenant.

To further align with πDataCS's mission of "One Storage, Multiple Engines Computation", PieCloudDB has evolved to create the next generation metadata management system, MUNDO. The new system is fully self-developed; it further unlocks the advantages of PieCloudDB's storage and computation separation architecture and delivers greater value in the mobilization of data elements.

Compared with the previous generation of the metadata management system, MUNDO improves performance several times over: overall DDL performance has increased by more than 40 times, DML metadata query latency has been reduced by 60%, and the number of concurrent connections has increased by more than 20 times.

Performance Improvement Compared With Previous Generation Mstore

Architecturally, the new generation metadata management system MUNDO uses a newly designed M (meta) node to replace FoundationDB, adopting a fully modular design and achieving higher performance. It is fully compatible with various tools in the PostgreSQL ecosystem, making it more open and inclusive.

The M node uniformly manages PieCloudDB's metadata and status information and is connected to the JANM storage base. Storage is unified, and the same storage base is used to store catalog data. Independent lock, transaction, and snapshot managers have been created to further improve concurrency performance. In addition, MUNDO supports high availability and incremental backup; its unified cache can be used by multiple clusters; and executors can query metadata and transaction information directly, which reduces executor query latency and system load.

MUNDO Metadata Management System Architecture
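To make the split between transaction and snapshot managers more concrete, here is a minimal sketch of snapshot-based visibility in the style of multi-version concurrency control. It is a simplified, generic illustration, not a description of MUNDO's internals. The class names (`TransactionManager`, `SnapshotManager`, `Snapshot`) and the visibility rule are assumptions for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Snapshot:
    xmax: int                # first transaction ID not yet assigned at snapshot time
    active: FrozenSet[int]   # transactions still in progress at snapshot time

    def can_see(self, xid: int, committed: Set[int]) -> bool:
        """A change is visible if its transaction started before the snapshot,
        was not in progress when the snapshot was taken, and has committed."""
        return xid < self.xmax and xid not in self.active and xid in committed


class TransactionManager:
    """Assigns transaction IDs and tracks which transactions are active."""

    def __init__(self) -> None:
        self._next_xid = 1
        self.active: Set[int] = set()
        self.committed: Set[int] = set()

    @property
    def next_xid(self) -> int:
        return self._next_xid

    def begin(self) -> int:
        xid = self._next_xid
        self._next_xid += 1
        self.active.add(xid)
        return xid

    def commit(self, xid: int) -> None:
        self.active.remove(xid)
        self.committed.add(xid)


class SnapshotManager:
    """Produces immutable snapshots from the transaction manager's state."""

    def __init__(self, txns: TransactionManager) -> None:
        self._txns = txns

    def take(self) -> Snapshot:
        return Snapshot(self._txns.next_xid, frozenset(self._txns.active))


tm = TransactionManager()
sm = SnapshotManager(tm)
t1, t2 = tm.begin(), tm.begin()
tm.commit(t1)
snap = sm.take()      # t1 committed, t2 still running
tm.commit(t2)
t3 = tm.begin()
tm.commit(t3)
```

A reader holding `snap` sees the changes made by `t1` but not those of `t2` or `t3`, even though both commit later. Because a snapshot is immutable, it can be handed to readers such as executors without holding locks in the transaction manager.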

The coordinator node (C node) in MUNDO is responsible for distributing queries to executors and collecting the required information from the metadata cache. This simplifies the functions of the original QD and reduces the load on the master node.

Ecosystem and Platform Evolution

In addition to the iteration of the storage engine, metadata management system and executor module, the PieCloudDB ecosystem and platform have also released a large number of features and updates, including:

Queries on the open source table format Iceberg
Direct SQL queries on CSV, JSON, Parquet, and ORC files
PieCloudVector enhancements (performance, HA, GPU)
Flink Connector
Spark Connector
Data source consistency check
End-to-end ARM support
More complete visual monitoring of the database system and queries

...

In the future, OpenPie will continue to explore the data field in depth, strengthen core technology research capabilities, and work closely with industry and ecological partners to jointly explore the best practices in the data element industry. Through continuous product innovation, we look forward to providing customers with more powerful and reliable data technology support.
