Indexes on Cloud: How "JANM" creates a brand-new index for cloud-native scenarios

SEPTEMBER 06TH, 2023

OpenPie's flagship data computing engines, PieCloudDB Database, is a brand-new cloud-native virtual data warehouse. In order to enhance user experience and improve query efficiency while implementing storage-computing separation, PieCloudDB has designed and developed a new storage engine called "JANM" along with other modules. Additionally, it developed efficient "Data Skipping" indexes tailored for cloud and analytical scenarios. This article will provide a detailed overview of PieCloudDB's storage and index design and creation process, and will demonstrate how PieCloudDB uses Data Skipping indexes to accelerate query efficiency through examples.

As a cloud-native virtual data warehouse, PieCloudDB relies on the infrastructure services provided by cloud computing, including large-scale distributed clusters, virtual machines, and containers. By leveraging these services, PieCloudDB can better adapt to dynamic and evolving workload demands and achieve features such as high availability, scalability, multi-region support, and elastic scaling.

Indexes are a key technology for improving query efficiency in a database system, and their design is closely related to storage. In order to better meet the requirements of cloud-native and analytical scenarios, PieCloudDB must use a sensible storage architecture and technology to create a brand-new storage engine. It aims to implement efficient cloud-based indexing technology to fulfill user query needs. PieCloudDB's storage serves as a critical bridge connecting applications and user data, constituting a core component of cloud-native virtual data warehouse applications. The new storage engine, "JANM" is designed specifically for cloud-native and analytical scenarios, aiming to provide exceptional query performance and flexible indexing technology to meet users' data query needs in the cloud. Its name is inspired by "bamboo slips" a reference to ancient Chinese writings.

Storage Engine 「JANM」

Before introducing the cloud-based index "Data Skipping", let's first understand the design logic behind PieCloudDB's storage.

Detailed Storage Design

In order to meet the requirements of different types of applications, the storage system created for PieCloudDB is divided into two storage layers: the persistent layer and the data layer.

Persistent Layer: The persistent layer serves as the foundational storage in PieCloudDB and typically utilizes cloud-native storage solutions like distributed file systems or object storage systems such as AWS S3, Azure Blob Storage, etc. The persistent layer is characterized by high availability and durability, ensuring secure data retention over the long term.

Data Layer: The data layer is the upper-level abstraction in PieCloudDB, providing standardized access interfaces for applications. It encompasses semi-structured data, structured data, and supports SQL for unstructured data storage.

Based on this layered storage design, PieCloudDB is capable of meeting diverse application requirements. Additionally, leveraging cloud computing infrastructure, PieCloudDB achieves containerized deployment, automated operations, microservices architecture, and more. This architecture design provides enterprise users with solutions that are more efficient, reliable, flexible, and cost-effective.

Design of PieCloudDB Data Persistence

For the design of data persistence, there are generally three forms:

N-tuple storage model (often referred to as row-based storage)
Decomposed storage model (often referred to as column-based storage)
Hybrid storage model

PieCloudDB adopts the third approach: the hybrid storage model. In this model, data is horizontally grouped and attributes are vertically partitioned into columns. This storage model enables PieCloudDB to benefit from the efficiency and compression advantages of column-based storage, while retaining the spatial locality advantages of row-based storage, thereby reducing data restructuring overhead. This choice of storage model also influences the index design in PieCloudDB.

PieCloudDB Storage Foundation

PieCloudDB employs object storage technology as the underlying storage foundation for its cloud-native virtual data warehouse. Object storage brings scalability, elastic scaling, and high fault tolerance. However, practical usage also presents certain limitations, including:

Latency: Compared to traditional block storage technologies, object storage often exhibits higher latency, potentially affecting database performance in certain scenarios.

Challenges with Large-scale Rewrite Operations: Implementing complex storage operations (such as large-scale data rewriting) in a distributed system, which object storage is based on, can be challenging. Yet, such operations are common in relational databases.

Transaction Management: Object storage typically provides simple transaction management mechanisms like optimistic concurrency control, but handling distributed transactions can be complex and maintaining global locks or other coordination mechanisms across all nodes is difficult due to their dispersion.

Data Consistency: Despite the high reliability and redundancy of object storage, its asynchronous feature means that maintaining consistency for distributed data requires additional measures. This is more complex and cost-intensive compared to other distributed database solutions.

During the construction of the storage system, PieCloudDB undertook extensive design efforts to address these limitations and ensure a user-friendly experience. For instance, to address the second limitation, PieCloudDB's persistent files cannot be modified in place after generation. Therefore, when performing update/delete operations, PieCloudDB generates new files that include both unmodified data and the newly modified data, while retaining the old data files.

Indexing in PieCloudDB

Based on the characteristics of cloud-based infrastructure and the storage design philosophy of PieCloudDB, there are two crucial features that define PieCloudDB's storage system.

Usage of a Hybrid Storage Model
Persistence of Files without In-Place Modification

These features also influence the approach to building indexes in PieCloudDB. Before delving into the specific features of PieCloudDB's indexes, let's familiarize ourselves with common types of indexes.

Common Types of Indexes

In OLTP scenarios, databases typically handle numerous short-term transactions, requiring efficient execution of individual record read and write operations. To avoid full scans of data, tree-based index structures (such as B+Trees) can accelerate the querying of a small amount of data. These indexes help database engines quickly locate specific records, enhancing read and write performance. As data is incrementally updated, indexes need to be updated to maintain data consistency and performance.

Conversely, in OLAP scenarios, databases often deal with extensive analytical queries involving data warehousing and analysis. Here, individual record lookup is less common; instead, the focus is on aggregation, filtering, and analysis of substantial datasets. Traditional index structures may no longer be suitable as full scans of large datasets can become time-consuming.

To expedite OLAP query execution, PieCloudDB employs Data Skipping technology. Data Skipping is an advanced optimization technique aimed at minimizing I/O overhead during data scanning. The core idea is to skip scanning data blocks or partitions that don't meet query conditions. This effectively reduces I/O operations, thereby accelerating query execution.

Data Skipping Indexes in PieCloudDB

Zone Map indexes and BRIN (Block Range Index) indexes are specific implementations of Data Skipping technology commonly found in OLAP scenarios. They leverage pre-computed statistical information to skip data blocks that don't match query criteria, thereby accelerating query execution.

Zone Map involves storing pre-computed statistical information about selected columns in each data block, such as minimum and maximum values, along with other aggregated data. During queries, the database uses this statistical information to trim the accessed data blocks, reducing unnecessary I/O operations and enhancing query performance. This technique is particularly useful for querying large datasets in OLAP scenarios.

In PieCloudDB, each data block consists of columnar data for a set of records. During data import, corresponding statistical information for the required columns in each data block is computed. Thanks to the storage implementation, both the column-wise statistics computation and storage process have become more streamlined and efficient.

Example

In PieCloudDB, when a user performs a query, for each data block, the statistical information of the corresponding column is used to determine whether it meets the query conditions. If it does, the data block is accessed; if not, the data block is skipped. Next, we will provide a detailed demonstration of PieCloudDB's Data Skipping functionality through an example.

To begin, let's create a table and import some test data step by step.

create table dataskip (a int, b int); 
insert into dataskip select i, i*2 from generate_series(1, 1000)i; 
insert into dataskip select i, i*2 from generate_series(1001, 2000)i; 
insert into dataskip select i, i*2 from generate_series(2001, 3000)i; 
insert into dataskip select i, i*2 from generate_series(3001, 4000)i; 
insert into dataskip select i, i*2 from generate_series(4001, 5000)i; 
insert into dataskip select i, i*2 from generate_series(5001, 6000)i; 
insert into dataskip select i, i*2 from generate_series(6001, 7000)i; 
insert into dataskip select i, i*2 from generate_series(7001, 8000)i; 
insert into dataskip select i, i*2 from generate_series(8001, 9000)i; 
insert into dataskip select i, i*2 from generate_series(9001, 10000)i;

Now let's execute the query:

demo=# explain analyze select * from dataskip where a < 10; 
                                  QUERY PLAN 
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=2.00..10.21 rows=3 width=8) (actual time=34.361..36.928 rows=9 loops=1) 
-> Bitmap Heap Scan on dataskip (cost=2.00..10.17 rows=1 width=8) (actual time=16.189..31.790 rows=5 loops=1) 
	Recheck Cond: (a < 10) 
	Rows Removed by Index Recheck: 316 
	-> Bitmap Index Scan on dataskip (cost=0.00..2.00 rows=333 width=0) (actual time=2.908..2.910 rows=1 loops=1) 
	Index Cond: (a < 10) 
Planning Time: 4.259 ms 
  (slice0) Executor memory: 159K bytes. 
  (slice1) Executor memory: 32972K bytes avg x 3 workers, 32972K bytes max (seg0). 
Memory used: 128000kB 
Optimizer: Postgres query optimizer 
Execution Time: 55.895 ms 
(12 rows)

If Data Skipping queries are disabled

demo=# set enable_bitmapscan = off; 
SET 
demo=# explain analyze select * from dataskip where a < 10; 
                                  QUERY PLAN 
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..51.71 rows=3 width=8) (actual time=129.916..140.925 rows=9 loops=1) 
	-> Seq Scan on dataskip (cost=0.00..51.67 rows=1 width=8) (actual time=2.939..132.546 rows=5 loops=1) 
	Filter: (a < 10) 
	Rows Removed by Filter: 3292 
Planning Time: 0.099 ms 
  (slice0) Executor memory: 123K bytes. 
  (slice1) Executor memory: 32825K bytes avg x 3 workers, 32825K bytes max (seg0). 
Memory used: 128000kB 
Optimizer: Postgres query optimizer 
Execution Time: 154.416 ms 
(10 rows)

As you can see, when Data Skipping is disabled, the execution time is three times longer than when it is enabled.

Here, we also encounter a challenge for the query optimizer. In complex JOIN queries, it's essential to push down JOIN conditions or WHERE conditions to the scan nodes as much as possible to leverage Data Skipping efficiently. In this regard, PieCloudDB outperforms other products by a significant margin.

demo=# explain analyze select * from dataskip join jtbl on dataskip.a = jtbl.a and jtbl.a < 10; 
                                    QUERY PLAN 
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=2.00..15.47 rows=3 width=16) (actual time=33.638..33.712 rows=9 loops=1) 
	-> Nested Loop (cost=2.00..15.43 rows=1 width=16) (actual time=33.300..33.405 rows=5 loops=1) 
	Join Filter: (dataskip.a = jtbl.a) 
	Rows Removed by Join Filter: 20 
	-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..5.21 rows=2 width=8) (actual time=0.003..0.013 rows=5 loops=1) 
	Hash Key: jtbl.a 
	-> Seq Scan on jtbl (cost=0.00..5.17 rows=2 width=8) (actual time=3.144..20.979 rows=3 loops=1) 
	Filter: (a < 10) 
	Rows Removed by Filter: 356 
	-> Materialize (cost=2.00..10.19 rows=1 width=8) (actual time=5.547..5.554 rows=4 loops=6) 
	-> Redistribute Motion 3:3 (slice3; segments: 3) (cost=2.00..10.19 rows=1 width=8) (actual time=33.130..33.269 rows=5 loops=1) 
	Hash Key: dataskip.a 
	-> Bitmap Heap Scan on dataskip (cost=2.00..10.17 rows=1 width=8) (actual time=11.766..24.910 rows=5 loops=1) 
	Recheck Cond: (a < 10) 
	Rows Removed by Index Recheck: 316 
	-> Bitmap Index Scan on dataskip (cost=0.00..2.00 rows=333 width=0) (actual time=2.783..2.784 rows=1 loops=1) 
	Index Cond: (a < 10) 
Planning Time: 6.522 ms 
  (slice0) Executor memory: 220K bytes. 
  (slice1) Executor memory: 79K bytes avg x 3 workers, 80K bytes max (seg0). Work_mem: 17K bytes max. 
  (slice2) Executor memory: 32826K bytes avg x 3 workers, 32826K bytes max (seg0). 
  (slice3) Executor memory: 32975K bytes avg x 3 workers, 32975K bytes max (seg0). 
Memory used: 128000kB 
Optimizer: Postgres query optimizer 
Execution Time: 68.989 ms 
(25 rows)

For the same query, when we disable Data Skipping, the query time increases to three times that of the previous case.

demo=# set enable_bitmapscan = off; 
SET 
demo=# explain analyze select * from dataskip join jtbl on dataskip.a = jtbl.a and jtbl.a < 10; 
                                   QUERY PLAN 
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..56.97 rows=3 width=16) (actual time=139.811..139.886 rows=9 loops=1) 
	-> Nested Loop (cost=0.00..56.92 rows=1 width=16) (actual time=139.587..139.691 rows=5 loops=1) 
	Join Filter: (dataskip.a = jtbl.a) 
	Rows Removed by Join Filter: 20 
	-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..5.21 rows=2 width=8) (actual time=0.003..0.011 rows=5 loops=1) 
	Hash Key: jtbl.a 
	-> Seq Scan on jtbl (cost=0.00..5.17 rows=2 width=8) (actual time=1.758..21.023 rows=3 loops=1) 
	Filter: (a < 10) 
	Rows Removed by Filter: 356 
	-> Materialize (cost=0.00..51.69 rows=1 width=8) (actual time=23.262..23.269 rows=4 loops=6) 
	-> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..51.69 rows=1 width=8) (actual time=136.260..139.557 rows=5 loops=1) 
	Hash Key: dataskip.a 
	-> Seq Scan on dataskip (cost=0.00..51.67 rows=1 width=8) (actual time=1.730..134.913 rows=5 loops=1) 
	Filter: (a < 10) 
	Rows Removed by Filter: 3292 
Planning Time: 0.248 ms 
  (slice0) Executor memory: 185K bytes. 
  (slice1) Executor memory: 79K bytes avg x 3 workers, 80K bytes max (seg0). Work_mem: 17K bytes max. 
  (slice2) Executor memory: 32826K bytes avg x 3 workers, 32826K bytes max (seg0). 
  (slice3) Executor memory: 32827K bytes avg x 3 workers, 32827K bytes max (seg0). 
Memory used: 128000kB 
Optimizer: Postgres query optimizer 
Execution Time: 155.026 ms 
(23 rows)

Support for Primary Keys

In typical OLTP databases, Primary Keys serve several purposes, including:

Accelerating Point Lookup.
Ensuring UNIQUE Constraint.
Enforcing NOT NULL Constraint.

However, in OLAP databases, where the focus is on analytical queries and there is less need for Point Lookup, full table scans, and large-scale data skipping techniques become more important. Therefore, some OLAP databases make adjustments when implementing Primary Keys.

For instance, in ClickHouse, the Primary Key is used to sort data during loading, and sorted data significantly improves the performance of Data Skipping. Snowflake uses a similar concept called Cluster Key to give data blocks clustering properties, further enhancing Data Skipping performance through Auto Cluster.

In PieCloudDB, Primary Key support NOT NULL Constraint, but for query acceleration, Data Skipping is generally used. PieCloudDB does not currently support UNIQUE Constraint.

PieCloudDB's Exploration of Indexing

In addition to Data Skipping, PieCloudDB has been researching and implementing a variety of indexes to provide performance improvements for different scenarios. This includes exploration of other index types based on the Data Skipping and indexes tailored for OLTP scenarios.

Exploration of Other Index Types Based on the Data Skipping

In the discussions above, the current indexes in PieCloudDB are categorized as sparse index. In addition to the commonly used Zone Map-type indexes, other implementations include:

Bitmap-based indexes
Bloom filters
Column sketches
Column imprints
…

Indexes for OLTP Scenarios

As mentioned earlier, traditional tree-based indexes like B+Trees are also relevant, often referred to as dense index. For PieCloudDB, we do not rule out their practical significance for queries, and we will continue to explore this area.

Related Blogs:

no related blog