PieCloudDB Database's Cutting-Edge Vectorized Executor Unleashes Unprecedented Performance Improvements

DECEMBER 20TH, 2023

The importance of data analysis and applications is growing, and for data platforms and computing systems, achieving ultimate performance is a key requirement. To enable more efficient data parallel computing, an excellent executor needs to fully utilize hardware resources, such as the parallel computing capabilities of CPUs and SIMD instruction sets. In addition, optimizing data storage and retrieval, proper task scheduling and resource management, along with continuous improvement and optimization, are critical factors to ensure performance.

PieCloudDB, aiming to help enterprises establish a competitive edge with data assets at the core and provide customers with outstanding performance and efficient data processing capabilities, has disrupted the traditional executor design. It has independently developed an efficient new vectorized executor. The introduction of vectorization technology allows PieCloudDB Database to fully leverage the parallel computing capabilities of modern processors, enabling rapid parallel processing of data.

What is Vectorization?

Vectorization is a design approach for computer processors or computing engines that leverages SIMD instruction sets to operate on vector data. It allows parallel computation on a set of data under the control of a single instruction, thereby enhancing computational efficiency and performance.

CPU Architecture

The dedicated SIMD registers and hardware units supported by modern CPUs provide powerful support for vectorized operations, making parallel computing using SIMD instruction sets more efficient and allowing the full utilization of CPU computational capabilities. Next, let's delve into the architecture of the CPU to help everyone understand.

The modern Von-Neumann Computer Model generally has five core components: arithmetic logic unit (ALU), memory, control unit, input, and output. The CPU execution process typically involves fetching, decoding, executing, and writing back instructions, which are the most basic stages. To improve CPU performance, modern CPUs often introduce techniques such as pipelining and out-of-order execution. For example, a 5-stage pipeline means that within a single CPU cycle, five different operations can be processed.

The TMAM (Top-down Microarchitecture Analysis Method) has two metrics for CPU performance optimization: CPU clock cycles and CPU pipeline slots. This method assumes that each CPU core has a 4-slot pipeline for each clock cycle, meaning the pipeline width of the CPU is 4.

The diagram below illustrates the different states of the four slots in each clock cycle. It is important to note that only the Cycle utilization of Clockticks 4 is 100%, while others experience Cycle Stall (pause, bubble).

Different Statuses of Slot

For a CPU pipeline, there are many dependencies during execution, as illustrated in the diagram below:

Dependencies of CPU Pipeline

From the above diagram, it can be seen that the efficiency of CPU pipeline execution depends on the efficiency of the resources it relies on. This can be summarized as follows:

Retiring: Represents the Pipeline Slot that runs effective uOps and can be used to evaluate the program's relatively real efficiency for the CPU.

Bad Speculation: Represents wasted Pipeline resources due to incorrect predictions, which can occur with if/switch/while/for statements.

Front-End-Bound: Involves fetching instructions, decoding, sending instructions to the back-end, with a maximum of 4 uOps dispatched per cycle.

Back-End-Bound: Involves accepting uOps submitted by the front-end, instruction reordering, fetching data from memory, execution, and submitting results to memory.

Evaluation metrics for CPU pipeline execution can be measured from several dimensions:

Instruction Number: The number of instructions. When we write a CPU program, it is translated into CPU instructions, and the number of instructions generally depends on the complexity of the program.

CPI（Cycle Per Instruction）: The number of cycles needed to execute one instruction.

Clock Cycle Time:The time required for one CPU cycle, which is strongly associated with CPU hardware features.

Clearly, optimization can be done at the code level for the first two dimensions. Common optimization techniques include:

Cache-Friendly, such as adjusting data structure sizes according to cacheline, avoiding cache pseudo-sharing.

Branch prediction optimization, avoiding the use of goto statements.

Reducing instruction data dependencies.

Utilizing new CPU hardware features, such as SIMD.

Compared to scalar computation, SIMD instruction sets can significantly reduce the number of instructions needed to perform the same operation, thus significantly improving performance. This performance improvement is substantial. Next, let's understand what SIMD is.

What is SIMD?

SIMD stands for Single Instruction Multiple Data, which is a computing paradigm. It is a parallel computing technique that efficiently processes data in parallel by simultaneously performing the same operation on multiple data elements with a single instruction.

In SIMD computing, one instruction can operate on a vector or multiple data elements simultaneously. These data elements are typically organized into vector registers, which can contain multiple data values, and these values are processed simultaneously. This approach effectively enhances the performance of parallel computing, especially in cases where the same operation needs to be performed on a large amount of data.

SIMD has advantages such as parallel computing, data locality, hardware acceleration, high-performance applications, and energy efficiency, making it an effective tool for handling large-scale data and high-performance computing tasks.

As shown in the diagram below, scalar operations can only perform addition on one pair of data at a time, while using SIMD instructions allows simultaneous operations on multiple pairs of instructions. Clearly, SIMD instructions can significantly increase the processing speed of data, with tremendous performance advantages when dealing with columnar data.

Scalar Arithmetic vs SIMD Arithmetic

SIMD is widely used in various fields such as image and video processing, signal processing, scientific computing, and databases. It can accelerate common operations such as vector addition, multiplication, average calculation, peak detection, etc., significantly improving computational efficiency.

PieCloudDB has undergone revolutionary improvements and optimizations for the executor to better utilize the capabilities of SIMD in handling large-scale data computing tasks through vectorized calculations. By organizing data in vector form and using SIMD instructions to perform the same operation, PieCloudDB allows the executor to process multiple data elements simultaneously within a single instruction cycle, improving computational efficiency.

Additionally, for different types of data computation problems, PieCloudDB has designed optimized strategies for SIMD. For example, in common data calculation operations such as aggregation, scanning, joining, and filtering, PieCloudDB leverages the parallel computing capabilities of SIMD instructions to optimize key nodes and enhance overall performance.

PieCloudDB's Vectorization Implementation

PieCloudDB's Storage Mode

To maximize the advantages of SIMD and improve query execution efficiency, optimizing the data storage mode is crucial. It is necessary to make data storage more friendly to parallel computing using SIMD instructions.

In relational data storage implementations, there are mainly two ways of storing data:

Row-based: This means organizing and storing data by rows. In row-based storage, all column values for each row are stored together consecutively. Row-based storage has advantages such as a compact data structure, reading the entire row at once, atomicity of transaction modifications, suitability for point queries, etc. However, row-based storage also has some drawbacks, including additional I/O overhead and data fragmentation.

Column-based: This involves storing the data for each column together consecutively. Compared to traditional row storage, column storage has the following advantages:

Data Compression: Column storage can apply more efficient compression algorithms, reducing the space overhead of data storage and increasing the bandwidth of data reading.
Data Locality: Since the data for each column in column storage is stored consecutively, it can better leverage the data locality of the processor, reducing the cache miss rates for instructions and data, and improving access efficiency.
Data Skip: In column storage, queries can operate only on the selected columns, reducing unnecessary data transfer and computation, improving query efficiency.

However, column storage also has some disadvantages, such as reading by rows during selection, which may require multiple I/O operations.

The following provides a simple example of a data table to illustrate the difference between row-based and column-based storage.

Sample Data

If row-based storage is used, the organization of data in memory and on disk is as shown in the following figure:

Row-Based Organization Form

If column-based storage is used, the organization of data in memory and on disk is as shown in the following figure:

Column-Based Organization Form

PieCloudDB currently employs a mixed storage implementation of row and column storage. This hybrid storage approach combines the advantages of row and column storage to adapt to different query patterns and requirements. It allows users to choose different storage methods for maximum efficiency in various business scenarios. Columnar storage organizes data by column, enabling contiguous data to better utilize the parallel computing features of SIMD instructions, fully harnessing the performance benefits of SIMD. To maximize the value of data processing, especially in conjunction with the columnar storage features supported by the JANM storage system, PieCloudDB's executor needs to undergo vectorized computing optimization.

PieCloudDB Executor

PieCloudDB's cloud-native virtual data warehouse adopts a new eMPP (elastic MPP) architecture. As a relational database, its query execution strictly follows relational algebra. Currently, data processing is based on tuples, and the volcano model is used as the specific execution model. This recursive invocation of upper-level operators to retrieve and process tuples has some drawbacks, including a relatively high number of virtual function calls and high instruction or data cache miss rates. At the same time, processing one tuple at a time cannot fully utilize CPU's SIMD instructions for optimization, leading to inefficient query execution.

SIMD instruction sets can process multiple data elements simultaneously. Combined with columnar storage technology, more efficient data processing can be achieved. Columnar storage organizes data by column, allowing contiguous data to better utilize the parallel computing features of SIMD instructions.

To build a vectorized executor and implement SIMD optimization, the following aspects need to be considered:

Vectorization Operations: Redesign and optimize the executor to support vectorized operations of SIMD instruction sets. Refactor the code logic to use SIMD instructions to process entire data columns instead of processing each tuple individually.

Data Layout Optimization: Optimize the storage method and layout of data based on the characteristics of SIMD instruction sets. Further optimize columnar storage to ensure contiguous data can fully leverage the parallel computing capabilities of SIMD instructions.

Platform Adaptation: Adapt and optimize for different hardware architectures and SIMD instruction sets. Consider the restrictions and requirements of specific platforms to ensure that SIMD optimization can be effectively implemented in different environments.

Through these optimization measures, PieCloudDB's executor can better leverage the parallel computing capabilities of SIMD instruction sets, accelerating data processing speed, and improving system performance and efficiency. This enables faster and more efficient data processing in OLAP scenarios, fully realizing the maximum value of data processing.

Vectorized Design Approach of PieCloudDB

There are many ways to implement vectorized processing in a vectorized executor. For example, one can optimize certain key processing workflows with SIMD code, handle data processing, and use Hash Tables that are SIMD-friendly. PieCloudDB ultimately decided to completely rewrite a brand-new vectorized executor, primarily based on the following considerations:

Excellent Current Executor: PieCloudDB's current executor has demonstrated high performance advantages in many OLAP scenarios, completing numerous HTAP functionalities, and showing some advantages in certain OLTP scenarios.

Local Optimization Cannot Fully Exploit the Maximum Advantage of SIMD: By rewriting an entirely new vectorized executor, PieCloudDB can better leverage the parallel computing capabilities of SIMD instruction sets, further enhancing data processing performance.

Further Construction of Computational Forms: In the new executor, PieCloudDB will be able to more conveniently implement the integration of lakehouse and stream-batch computations, adapting better to different computing scenarios and various data processing requirements.

Embrace Big Data Computing Ecosystem: As the first computing engine of the πDataCS —— large-scale data computing system, PieCloudDB, through rewriting a completely new vectorized executor, can better interface and integrate with the big data computing ecosystem, providing users with a more extensive range of data computing solutions.

PieCloudDB team believes that rewriting a completely new vectorized executor is necessary and beneficial. Despite the potential challenges in this process, it allows for better empowerment of data computing, achieving optimal performance. After careful consideration and design iterations, the overall architecture diagram of the PieCloudDB vectorized executor is as follows:

PieCloudDB Vectorized Executor Overall Architecture

After enabling the vectorized executor, when the plan-rewriter receives a query plan from the optimizer, it will replace the vectorized operators and send the modified query plan to the vectorized executor. If vectorization is not performed, the old executor will be used for the query.

Clearly, the core of the vectorized executor is the various operators on the right side of the diagram. Thus, extensive modifications are needed for these operators to fully utilize the CPU's ultimate capabilities. To achieve this goal, we primarily focus on SIMD transformations in the following aspects:

Row Storage to Column Storage: Transforming data from row storage to column storage enhances the efficiency of the SIMD instruction set in data access and processing. This way, continuous data can better leverage the parallel computing characteristics of SIMD.

Row Processing to Column Processing: Converting the originally row-based processing to column-based processing involves adjusting and improving numerous algorithms to adapt to the column-based SIMD parallel computing model. This enhances the efficiency of data processing.

Code-Level Optimization: During the vectorization transformation, code-level optimization is necessary. This includes branch elimination, code structure adjustments, and reducing branch jumps to enhance code continuity, thereby improving the utilization of SIMD instructions.

Data Structure Adjustments: To better adapt to SIMD processing, adjustments to data structures are required. For example, traditional hash tables can be replaced with more SIMD-friendly data structures to improve the efficiency of lookup and insertion operations.

Through these vectorization transformation directions, PieCloudDB will be able to better harness the CPU's potential in data processing, achieving faster and more efficient data calculations.

PieCloudDB Database Vectorized Executor

PieCloudDB's vectorized executor has demonstrated remarkable performance improvements in the widely used decision support benchmark test, TPC-H. Compared to the original executor, it achieved an order of magnitude improvement at critical nodes, including Aggregation (Agg), Scan, Join, Filter, and Expression Computation (Expr Compute).

Agg: The vectorized executor performs aggregation operations much faster, significantly reducing query execution time. This improvement is particularly evident in query scenarios that involve aggregation operations on large datasets.

Scan: The vectorized executor accelerates the scanning and reading of data. By utilizing SIMD instruction sets for vectorized operations, it can process multiple data elements simultaneously, enhancing data access and processing efficiency.

Join: The vectorized executor executes table join operations more rapidly, handling associations between multiple tables. Through parallel computing and vectorized operations, the speed of join operations is increased, improving query execution efficiency.

Filter: The vectorized executor efficiently processes filtering conditions in queries. Using SIMD instruction sets for vectorized operations allows simultaneous condition checks on multiple data elements, reducing the number of loop iterations and speeding up filtering operations.

Expr Compute: The vectorized executor calculates expressions and functions more quickly. By leveraging SIMD instruction sets for vectorized computation, it can simultaneously evaluate expressions for multiple data elements, accelerating the computation process.

These significant improvements in the scale of these key operators enable PieCloudDB to handle TPC-H benchmark tests and complex queries in daily OLAP scenarios more quickly. It provides more efficient and faster decision support capabilities.

Success Belongs to the Persevering

The last part of a journey is its toughest. PieCloudDB's venture into building a vectorized executor is still ongoing. Currently, PieCloudDB has implemented support for the SIMD instruction set, fully capitalizing on the advantages of data parallel computing. By bundling multiple data elements into vectors and simultaneously performing the same operations on them, it has successfully increased computational efficiency and throughput.

However, we are well aware that this is just the beginning. We are committed to further optimizing and enhancing the PieCloudDB vectorized executor to meet the ever-growing demands and the continually changing technological landscape. Our goal is to explore higher-level vectorized operations and stronger support for SIMD instruction sets, aiming to further boost the performance and efficiency of the database. Specific directions include:

Ultimate Performance Optimization
Extreme Resource Management
Multi-Modal Data Computing Capability

The development path of PieCloudDB's vectorized executor is filled with challenges and opportunities. We firmly believe that through continuous innovation and unwavering commitment, we will keep progressing, delivering outstanding performance, and extending the application scenarios for our users.

Related Blogs:

no related blog