Multiple Compression Methods in PieCloudDB Database: Reducing Costs and Increasing Efficiency

JANUARY 18TH, 2024

As the scale of enterprise data continues to grow, database users' critical business systems face the challenge of storing and processing huge amounts of data. This means not only higher IT and data costs, but also greater resource consumption and management complexity. To meet these needs, the cloud-native virtual data warehouse PieCloudDB Database helps enterprises reduce costs and improve efficiency through a series of innovative technical means.

From product design to product development, PieCloudDB focuses on reducing costs and increasing efficiency for users. PieCloudDB adopts a storage-computing separation architecture, which can significantly reduce costs and increase efficiency when analyzing hot and cold data and when workloads alternate between peaks and troughs. The pay-as-you-go model of the PieCloudDB Cloud on Cloud edition helps users maximize cost-effectiveness. In addition, PieCloudDB has made many optimizations in data compression, creating an adaptive compression solution that significantly reduces storage space requirements and therefore hardware costs. It also reduces the time and cost of data backup and recovery. This article introduces how PieCloudDB uses a variety of adaptive compression and encoding technologies to reduce costs and increase efficiency for enterprises while maintaining performance.

PieCloudDB Compression and Encoding

ZSTD

ZSTD (Zstandard) is a high-performance lossless compression algorithm, open sourced by Facebook in 2016. The algorithm is designed to provide fast compression and decompression speeds while achieving a high compression ratio. The following are some features of ZSTD:

High Performance: ZSTD provides very fast compression and decompression, in many cases faster than other popular compression algorithms at a comparable compression ratio. It combines LZ77-style dictionary matching with fast entropy coding (Huffman coding and Finite State Entropy) to achieve this performance.
Adjustable Compression Ratio: ZSTD supports adjustable compression levels, allowing you to make the trade-off between speed and compression ratio as needed. Lower compression levels provide faster speeds, while higher compression levels result in higher compression ratios.
Compatibility: ZSTD’s compression format is self-contained and platform-independent, which means data compressed on one platform or system can be decompressed on another by any ZSTD-compatible decoder.
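
The trade-off between compression level and compression ratio can be illustrated with a small sketch. Note the assumptions: this uses zlib from the Python standard library as a stand-in, not ZSTD itself, and the sample data is made up. It only shows the general pattern that a higher level spends more effort to produce a smaller output; the actual numbers for ZSTD in PieCloudDB will differ.

```python
import zlib

# Made-up, repetitive sample data loosely resembling exported table rows.
data = b"".join(b"row-%06d,status=active,region=cn-north\n" % i for i in range(20000))

def compressed_size(payload: bytes, level: int) -> int:
    """Return the size in bytes of payload compressed at the given zlib level (1-9)."""
    return len(zlib.compress(payload, level))

# Compare a fast level, the usual default level, and the strongest level.
sizes = {level: compressed_size(data, level) for level in (1, 6, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes, ratio {len(data) / size:.1f}x")
```

Wrapping each call with time.perf_counter() would show the other side of the trade-off, the time spent per level.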

Based on the above characteristics, and because ZSTD achieves a better compression ratio than PieCloudDB's existing compression algorithm, PieCloudDB now supports ZSTD.

Compression Method Selection

When creating a table in PieCloudDB, you can select the compression method through the compresstype option. If it is not set, the table defaults to ZSTD:

CREATE TABLE ctbl_none (s text) WITH (compresstype = 'none'); 
CREATE TABLE ctbl_pglz (s text) WITH (compresstype = 'pglz'); 
CREATE TABLE ctbl_zstd (s text) WITH (compresstype = 'zstd');

Compression Efficiency Comparison

The following compares the size of a table under different compression methods. The test data is a CSV file (536MB) of 86,400 rows, randomly generated by a script for a wide table with 500 columns and imported into a table created with each compression method:

Without Compression (cluster size is 1):

pglz (cluster size is 1):

zstd(3) (cluster size is 1):

Test Statistics

HLL With Sparse Form

PieCloudDB's storage system JANM uses a mixed row and column storage format, and each file block contains part of the row data of the table. To count the number of unique values in each column of the table, PieCloudDB uses the HLL (HyperLogLog) structure to perform cardinality estimation. In the past, HLL only used an encoding form called "dense", in which one HLL structure occupies about 12KB of space (an HLL has 16384 buckets of 6 bits each). For wide tables, however, the HLL structures in each file block become very large. For example, if a wide table has 1500 columns, the HLL portion of each file block takes up approximately 18MB of space.
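
The sizes quoted above follow from simple arithmetic. The short sketch below reproduces it, assuming one dense HLL per column per file block and ignoring any header bytes, which the text does not describe:

```python
BUCKETS = 16384        # buckets per HLL, as stated in the text
BITS_PER_BUCKET = 6    # bits per bucket, as stated in the text
COLUMNS = 1500         # columns in the example wide table

# One dense HLL: 16384 buckets * 6 bits, converted to bytes.
dense_bytes = BUCKETS * BITS_PER_BUCKET // 8

# One dense HLL per column in every file block.
per_block_bytes = dense_bytes * COLUMNS

print(dense_bytes)                             # 12288 bytes, i.e. 12KB
print(round(per_block_bytes / 1024 ** 2, 1))   # 17.6, i.e. roughly 18MB
```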

To address this problem, note that a file block of a wide table does not hold many rows, so many of the buckets in its HLL structures are empty. RLE (run-length encoding) can therefore be used as the sparse encoding form of HLL. The basic principles of RLE are:

Continuous Sequence Detection: RLE first scans the data sequence to be compressed and detects the continuous occurrence of the same data items.
Count: For each consecutive sequence, RLE counts the number of data items in the sequence and converts it into a count value.
Encoding: RLE replaces the original data sequence with a set of (data item, count value) pairs, representing consecutive occurrences of the data item and its number.
Storage: Finally, the compressed data can be stored as a series of data pairs in sequence to reduce storage space.
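
The four steps above can be sketched in a few lines of Python. This is a generic illustration of RLE on a mostly empty bucket array, not PieCloudDB's actual sparse layout; the bucket positions and values are made up.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal items into an (item, count) pair."""
    return [(item, sum(1 for _ in run)) for item, run in groupby(values)]

def rle_decode(pairs):
    """Expand (item, count) pairs back into the original sequence."""
    out = []
    for item, count in pairs:
        out.extend([item] * count)
    return out

# Hypothetical HLL bucket array: 16384 buckets, only two of them non-empty.
registers = [0] * 16384
registers[100] = 3
registers[9000] = 1

pairs = rle_encode(registers)
print(pairs)  # [(0, 100), (3, 1), (0, 8899), (1, 1), (0, 7383)]
```

Five pairs stand in for 16384 buckets, and an entirely empty array collapses to the single pair (0, 16384).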

Compression Efficiency Comparison

In the initial case, the dense form needs 12KB of space even if all HLL buckets are empty, while the sparse form needs only 2B. The reduction in space is most obvious when the cardinality is low. As shown below, 100 rows of data are inserted into a 1500-column Simplified Chinese table, and the space occupied is compared before and after HLL supports the sparse form:

HLL does not support Sparse form:

HLL supports Sparse form:

Test Statistics

Delta Encoding

Delta Encoding is a data compression technology used to store continuous or repetitive data. It reduces storage space by recording differences between adjacent data.

The basic principles are as follows:

Initial Value: Choose an initial value as a baseline. This can be the previous data point, the first data point, or any other appropriate value.
Calculate Difference: For each subsequent data point, calculate the difference from the previous data point. This can be achieved using a simple subtraction operation.
Store Difference: Store the calculated difference value. Typically, the difference value is encoded in binary form and stored in an appropriate data structure (such as an array).
Reconstruct the Original Data: When the data needs to be used, the original data is reconstructed by accumulating difference values and baseline values. Starting from the baseline value, and adding the accumulated difference values in sequence, the sequence of the original data can be obtained.
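
A minimal sketch of these steps, assuming the first data point is chosen as the baseline (the text allows other choices) and using made-up timestamps:

```python
from itertools import accumulate

def delta_encode(values):
    """Return (baseline, differences), using the first value as the baseline."""
    if not values:
        return None, []
    diffs = [b - a for a, b in zip(values, values[1:])]
    return values[0], diffs

def delta_decode(baseline, diffs):
    """Rebuild the original sequence by accumulating differences onto the baseline."""
    if baseline is None:
        return []
    return list(accumulate(diffs, initial=baseline))

# Made-up, slowly increasing timestamps.
timestamps = [1705536000, 1705536001, 1705536002, 1705536004, 1705536007]

baseline, diffs = delta_encode(timestamps)
print(baseline, diffs)  # 1705536000 [1, 1, 2, 3]
```

The differences are far smaller than the original values, so they can be stored in fewer bits.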

Compression Efficiency Comparison

The advantages of Delta Encoding are its ability to efficiently handle contiguous or repeated data and its performance in terms of storage space. It is particularly suitable for compressing time series data, sorted lists, or other data with increasing or decreasing trends.

For variable-length data, PieCloudDB stores offsets that record the length of each value. For some types, such as Decimal, the change in offset from one value to the next is the same across values. Delta Encoding can therefore significantly reduce the storage space taken by the offsets. The effect is especially noticeable for a type like Decimal, whose values are themselves very short, because the offsets make up a larger share of the stored data.
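
To make the offset case concrete, the sketch below builds the offsets for values of equal length and compares them with their differences. The 8-byte value length and the helper name offsets_for are assumptions for illustration only; the text does not state how PieCloudDB lays out Decimal values.

```python
def offsets_for(lengths):
    """End offset of each value when the values are stored back to back."""
    offsets, position = [], 0
    for n in lengths:
        position += n
        offsets.append(position)
    return offsets

# Assumption: 1000 values that each occupy 8 bytes.
lengths = [8] * 1000
offsets = offsets_for(lengths)

# Difference between each offset and the one before it.
deltas = [b - a for a, b in zip(offsets, offsets[1:])]

print(max(offsets).bit_length())  # 13 bits needed for the largest raw offset (8000)
print(set(deltas))                # {8}: every difference is identical
print(max(deltas).bit_length())   # 4 bits needed for a difference
```

Raw offsets keep growing with the number of values, while the differences stay constant, which is why they compress so well.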

For example, a wide table with 1500 columns of type NUMERIC(20,10) takes 550MB without Delta Encoding and 377MB with Delta Encoding, a reduction of about 31%. The result is shown in the figure below:

Test Statistics

Future Outlook

By supporting ZSTD, the HLL sparse representation, and Delta Encoding, PieCloudDB builds adaptive compression that significantly reduces the storage size of general data, metadata, and string type data. In the future, PieCloudDB will continue to iterate on and optimize its compression methods and will add support for more encoding methods, such as Dict Encoding, BIT_PACKED, and RLE, selecting the appropriate encoding method for each data type to achieve a better compression ratio.
