Doubling Efficiency: PieCloudDB Database's Pre-Aggregation Feature Takes Query Speed to A New Height!

OCTOBER 30TH, 2023

In the big data era, effectively managing and processing massive amounts of data has become a core challenge for enterprises. To address this, OpenPie has introduced the first data computing engine, PieCloudDB Database, as a brand new cloud-native virtual data warehouse, aimed at providing a more efficient and flexible data processing solution.

The design of PieCloudDB is rooted in a deep understanding of user experience and query efficiency. While implementing the separation of storage and computation, PieCloudDB has specifically designed and created a new storage engine, "JANM". Tailored for cloud and analytical scenarios, PieCloudDB has also designed efficient pre-aggregation feature. This article will provide a detailed insight into how PieCloudDB leverages pre-aggregation technology to optimize the data processing flow, enhancing the user experience.

As a cloud-native virtual data warehouse, PieCloudDB fully leverages the infrastructure services provided by cloud computing, including large-scale distributed clusters, virtual machines, containers, and more. These features enable PieCloudDB to better adapt to dynamic and evolving workload demands. Simultaneously, PieCloudDB is actively expanding its own capabilities to achieve high availability, easy scalability, and elastic scaling to meet the growing business needs of enterprises.

PieCloudDB has accomplished a significant innovative feature: Pre-Aggregate. This feature, powered by PieCloudDB's new storage engine "JANM," calculates aggregate information for data columns in real-time during data insertion and stores it for future use. This approach eliminates the traditional method of complex calculations during queries, significantly enhancing query speed. Furthermore, as the aggregated data is saved in files, it can be quickly accessed and directly applied to queries.

PieCloudDB automatically generates plans with Pre-Aggregate based on user queries, making the query process as fast and accurate as possible. When aggregated data is needed, the system checks the pre-stored aggregate values and directly retrieves the matching aggregate data. This eliminates the need to scan the entire dataset during queries and greatly improves query speed.

For some blocks that meet the conditions, PieCloudDB reverts to the original processing method to calculate aggregate values. This approach utilizes the pre-aggregated data while only computing the missing parts, reducing computational costs and enhancing operational efficiency.

Principle of Pre-Aggregation

In order to enhance the query performance of Aggregate, PieCloudDB adopts a "space for time" strategy. When writing data, it precomputes and stores relevant Aggregate in the storage layer. This allows for quick retrieval of the needed Aggregate data during queries.

Having addressed the source of Aggregate data, now let's discuss how to obtain the precomputed Aggregate data. To correctly obtain the pushed-down Aggregate data, PieCloudDB's optimizer and executor are further optimized, adding two new Pre-Aggregate computation nodes. The comparison of the plan tree before and after these changes is illustrated in the following diagram:

Comparison of Plan Trees Before and After the Transformation

The storage engine "JANM" updates Aggregate information in real-time when data is inserted. In the figure above, the Pre-Aggregate calculation node retrieves precomputed Aggregate data from the Access Method (AM). If suitable Aggregate data is not found, the Pre-Aggregate calculation node will compute the corresponding Aggregate data from the tuple based on the conditions and return it to the upper-level calculation node. This solves the issue of correctly finding the pushed-down Aggregate data.

Pre-Aggregate is the specific implementation of Zone Maps in OLAP optimization techniques. It involves precomputing aggregates for a batch of tuple attribute values and storing them in advance. The database checks the precomputed aggregate information to determine whether to access the block. As mentioned earlier, if usable Aggregate data is found, it is returned directly; otherwise, the block is accessed to retrieve specific tuples.

For conditional Pre-Aggregate, the effectiveness depends on the data range involved in precomputation. PieCloudDB narrows the pre-aggregation scope down to block files and performs precomputation storage for each block file, ensuring the effectiveness of conditional pre-aggregate queries.

Demonstration of Pre-Aggregate Usage

Below are instructions on how to enable Preagg Block Scan and utilize Preagg Bitmap Block Scan with Block Skipping support. Finally, performance comparison charts are provided.

Usage of Preagg Block Scan

create table t(a int, b int, c int);

insert into t values(1,2,3);
insert into t values(3,3,5);
insert into t values(4,4,6);

set pdb_enable_preagg = on;

explain (costs off) select sum(b), avg(c), count(*) from t;
                   QUERY PLAN
------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Pre-Aggregate Block Scan on t
 Optimizer: Postgres query optimizer
(4 rows)

select sum(b), avg(c), count(*) from t;
 sum |        avg         | count
-----+--------------------+-------
   9 | 4.6666666666666667 |     3
(1 row)

set pdb_enable_preagg = off;

explain (costs off) select sum(b), avg(c), count(*) from t;
                   QUERY PLAN
------------------------------------------------
 Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Seq Scan on t
 Optimizer: Postgres query optimizer
(4 rows)

select sum(b), avg(c), count(*) from t;
 sum |        avg         | count
-----+--------------------+-------
   9 | 4.6666666666666667 |     3
(1 row)

Usage of Preagg Bitmap Block Scan

create table t(a int, b int);
insert into t values(generate_series(1, 20), generate_series(100, 120));
insert into t values(generate_series(21, 60), generate_series(121, 160));

set pdb_enable_preagg = on;

set enable_seqscan = off;
set enable_bitmapscan = on;
set enable_indexscan = on;

explain (costs off) select max(a), sum(a) from t where a > 10 and a < 50;
                          QUERY PLAN
---------------------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Partial Aggregate
               ->  Pre-Aggregate Bitmap Block Scan on t
                     Recheck Cond: ((a > 10) AND (a < 50))
                     ->  Bitmap Index Scan on t
                           Index Cond: ((a > 10) AND (a < 50))
 Optimizer: Postgres query optimizer
(8 rows)

select max(a), sum(a) from t where a > 10 and a < 50;
 max | sum
-----+------
  49 | 1170
(1 row)

set pdb_enable_preagg = off;

explain (costs off) select max(a), sum(a) from t where a > 10 and a < 50;
                          QUERY PLAN
---------------------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Partial Aggregate
               ->  Bitmap Heap Scan on t
                     Recheck Cond: ((a > 10) AND (a < 50))
                     ->  Bitmap Index Scan on t
                           Index Cond: ((a > 10) AND (a < 50))
 Optimizer: Postgres query optimizer
(8 rows)

select max(a), sum(a) from t where a > 10 and a < 50;
 max | sum
-----+------
  49 | 1170
(1 row)

Performance Comparison

Test Table:

create table preaggdata (a int, b int);

Test Statements:

explain analyze select sum(a), avg(a), count(*), max(b) from preaggdata;

The time consumption comparison chart is as follows:

Time Consumption Comparison

From the test data and comparison chart above, it can be observed that without enabling Pre-Agg, as the data volume increases, the time consumption steadily increases, and the rate of increase becomes faster. On the other hand, with Pre-Agg enabled, the time consumption increases steadily but not too rapidly. When the data volume reaches 10,000K, it achieves nearly a 28-fold speed improvement.

The Future Evolution of Pre-Aggregation

Currently, Pre-Aggregate employs a "space for time" strategy to enhance performance efficiency. To expand the scope of Pre-Aggregate, optimize user experience, and provide more diverse and enriched functionalities, we will continuously drive technological research and development.

Whether it's through refining data processing methods, extending support for various function types, or introducing new query processing mechanisms, we are unwavering in our pursuit of this goal. We believe that soon, Pre-Aggregate will be able to offer more efficient and precise solutions for complex query scenarios, thereby progressively deepening its impact in the field of data analysis and processing.

Related Blogs:

no related blog