As enterprise data volumes continue to grow, critical business systems must store and process ever larger amounts of data. This means not only higher IT and data costs, but also greater resource consumption and management complexity. To meet these needs, the cloud-native virtual data warehouse PieCloudDB Database helps enterprises reduce costs and improve efficiency through a series of technical innovations.
From product design through development, PieCloudDB focuses on reducing costs and increasing efficiency for users. PieCloudDB adopts a storage-computing separation architecture, which can significantly cut costs in hot and cold data analysis and in workloads with pronounced peaks and troughs. The pay-as-you-go model of the PieCloudDB Cloud on Cloud edition helps users get the most value for what they spend. In addition, PieCloudDB has made many optimizations in data compression, building an adaptive compression solution that significantly reduces storage space requirements and therefore hardware costs. It can also reduce the time and cost of data backup and recovery. This article introduces how PieCloudDB uses a variety of adaptive compression and encoding technologies to reduce costs and increase efficiency for enterprises while maintaining performance.
ZSTD (Zstandard) is a high-performance lossless compression algorithm that Facebook open-sourced in 2016. It is designed to provide fast compression and decompression while still achieving a high compression ratio.
Based on these characteristics, and because ZSTD achieves a better compression ratio than PieCloudDB's existing compression algorithm, PieCloudDB now supports ZSTD.
Compression Method Selection
When you create a table in PieCloudDB, you can select the compression method with the compresstype option. If it is not set, the default is ZSTD:
CREATE TABLE ctbl_none (s text) WITH (compresstype = 'none');
CREATE TABLE ctbl_pglz (s text) WITH (compresstype = 'pglz');
CREATE TABLE ctbl_zstd (s text) WITH (compresstype = 'zstd');
Compression Efficiency Comparison
The test uses a wide table with 500 columns. An import script randomly generates 86,400 rows as a CSV file (536 MB), which is loaded into the table under each compression method. The resulting table sizes are:
[Figures: table size without compression, with pglz, and with zstd(3), each with a cluster size of 1; caption: Test Statistics]
PieCloudDB's storage system, JANM, uses a hybrid row-column storage format in which each file block contains a subset of the table's rows. To count the number of unique values in each column, PieCloudDB uses the HLL (HyperLogLog) structure for cardinality estimation. Previously, HLL used only the "dense" encoding, in which one HLL structure occupies about 12 KB (16,384 buckets of 6 bits each). For wide tables, however, the HLL structures in each file block become very large. For example, if a wide table has 1,500 columns, the HLL portion of each file block takes up approximately 18 MB.
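The sizes quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: it assumes each column carries exactly one dense HLL per file block and ignores any headers or padding, and the function names are hypothetical rather than PieCloudDB identifiers.

```python
# Back-of-the-envelope check of the dense HLL sizes quoted in the text.
BUCKETS = 16384        # buckets per HLL (from the text)
BITS_PER_BUCKET = 6    # bits per bucket (from the text)


def dense_hll_bytes(buckets: int = BUCKETS, bits: int = BITS_PER_BUCKET) -> int:
    """Size in bytes of one dense HLL: every bucket is stored, empty or not."""
    return buckets * bits // 8


def hll_bytes_per_block(columns: int) -> int:
    """Assumption: one dense HLL per column in each file block."""
    return columns * dense_hll_bytes()


if __name__ == "__main__":
    print(dense_hll_bytes())          # 12288 bytes, i.e. 12 KB
    print(hll_bytes_per_block(1500))  # 18432000 bytes, roughly 18 MB
```

Because the dense size is fixed regardless of how many buckets are actually used, the cost grows linearly with the number of columns.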
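To address this problem, note that a file block of a wide table holds relatively few rows, so many of the HLL buckets are empty. RLE (run-length encoding) can therefore serve as the Sparse encoding form of HLL. The basic principle of RLE is to replace each run of consecutive identical values with a single value and a count of how many times it repeats, so a long run of empty buckets takes very little space.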
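As a rough illustration of why run-length encoding suits a mostly empty bucket array, here is a minimal sketch. It is not PieCloudDB's on-disk format: buckets are modeled as plain Python integers, runs as (value, length) pairs, and the names `rle_encode` and `rle_decode` are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (bucket value, run length)


def rle_encode(buckets: List[int]) -> List[Run]:
    """Collapse consecutive identical bucket values into (value, length) runs."""
    runs: List[Run] = []
    for value in buckets:
        if runs and runs[-1][0] == value:
            # Same value as the previous bucket: extend the current run.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            # Value changed: start a new run.
            runs.append((value, 1))
    return runs


def rle_decode(runs: List[Run]) -> List[int]:
    """Expand (value, length) runs back into the full bucket array."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


if __name__ == "__main__":
    empty = [0] * 16384
    print(rle_encode(empty))      # [(0, 16384)]

    sparse = [0] * 16384
    sparse[100] = 3               # only two buckets are non-empty
    sparse[9000] = 1
    runs = rle_encode(sparse)
    print(len(runs))              # 5
    assert rle_decode(runs) == sparse
```

An entirely empty array collapses to a single run, and each non-empty bucket adds only a couple of runs, so the encoded size tracks the number of occupied buckets rather than the total bucket count.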
Compression Efficiency Comparison
In the initial state, a dense HLL needs 12 KB even when all of its buckets are empty, whereas the Sparse form needs only 2 bytes. The space saving is most pronounced when the cardinality is low. For example, for a 1,500-column Simplified Chinese table into which 100 rows are inserted, the space occupied before and after HLL supports the Sparse form compares as follows:
[Figures: space occupied when HLL does not support the Sparse form and when HLL supports the Sparse form; caption: Test Statistics]
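Delta Encoding is a data compression technique for storing sequential or repetitive data. It reduces storage space by recording the differences between adjacent values.
The basic principle is to store the first value as-is and then, for each subsequent value, store only its difference from the previous value. Decoding restores the original values by accumulating the differences.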
Compression Efficiency Comparison
The advantages of Delta Encoding are that it handles sequential or repeated data efficiently and needs little storage space. It is particularly suitable for compressing time series data, sorted lists, or other data with increasing or decreasing trends.
For variable-length data, PieCloudDB uses offsets to record the length of each value. For some types, such as Decimal, the difference between consecutive offsets is constant, so Delta Encoding can significantly reduce the space the offsets occupy. The effect may be especially noticeable for a type like Decimal, whose values are themselves very short.
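The effect on offsets can be sketched as follows. This is a minimal illustration under stated assumptions, not PieCloudDB's implementation: the offsets are made-up values for a column whose entries all happen to be 9 bytes long, and `delta_encode` and `delta_decode` are hypothetical names.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values as a running sum of the deltas."""
    values: List[int] = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


if __name__ == "__main__":
    # Hypothetical end offsets for five values that are each 9 bytes long.
    offsets = [9, 18, 27, 36, 45]
    deltas = delta_encode(offsets)
    print(deltas)                 # [9, 9, 9, 9, 9]
    assert delta_decode(deltas) == offsets
```

The offsets themselves grow without bound, while the deltas stay small and identical, so they fit in fewer bits and compress well with a follow-up step such as run-length encoding or a general-purpose compressor.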
For example, a wide table with 1,500 columns of type NUMERIC(20,10) occupies 550 MB without Delta Encoding and 377 MB with Delta Encoding, a reduction of about 31%.
[Figure: Test Statistics for the NUMERIC(20,10) wide table, with and without Delta Encoding]
By supporting ZSTD, the HLL sparse representation, and Delta Encoding, PieCloudDB builds an adaptive compression solution that significantly reduces the storage size of general data, metadata, and string-type data. In the future, PieCloudDB will continue to iterate on and optimize its compression methods and add support for more encodings, such as Dict Encoding, BIT_PACKED, and RLE, so that an appropriate encoding can be selected for each data type to achieve a better compression ratio.