With the advancement of science and technology, particularly information technology, there has been revolutionary progress in our ability to acquire, store, and analyze data. In recent years, the continuous development of mobile internet, IoT, and 5G technologies has led to exponential growth in the global datasphere. IDC predicts that the global datasphere will grow to 175ZB by 2025, with China's datasphere expected to become the world's largest by the same year. Numerous applications centered on data acquisition, storage, and analysis have also sprung up like mushrooms after rain. These facts indicate that data is likely to become the next tipping point for scientific and economic progress. As a result, the concept of "data elements" has been proposed, placing data on par with "labor" and "resources" as factors of production, reflecting a fundamental change in the perception of data.
In 2023, the National Data Administration was officially established in China to coordinate the advancement of foundational data systems, integrate and share data resources, and promote the planning and construction of a digital China, digital economy, and digital society. This reflects not only the strategic management and standardized utilization of data resources but also the emphasis placed at the national level on the development of the digital economy and data governance.
Moreover, the rapid development of cloud platforms has also brought new opportunities for data systems. Cloud platforms represent the largest computing power, storage capacity, and horizontal scalability currently available. They provide data systems with virtually unlimited storage and computing resources, making applications like ChatGPT possible. As a result, more and more enterprises are migrating their applications to the cloud, and more data is flowing to the cloud.
To help enterprises optimize computational bottlenecks, fully leverage the advantages of data scale, build core technical barriers, and better empower business development, the new generation of data warehouses needs to integrate all of an enterprise's multimodal data resources, provide data computation support for multimodal large models, and align more closely with the needs and working habits of data scientists.
The Core Value of Data Warehouse Virtualization
The new generation of data warehouses needs to adopt leading data warehouse virtualization technology to unify multiple data warehouses into a highly available cloud virtual data warehouse, connect multi-cloud pipelines, and enable data computing resources to be scaled on demand. This enhances the agility and elasticity of the data warehouse and helps enterprises reduce the complexity of data warehouse management, while offering scalability, flexibility, and reliability. Typical products include the cloud-native virtual data warehouse PieCloudDB Database from OpenPie.
Through a compute and storage separation architecture, physical data warehouses are integrated into a cloud-native data computing platform that supports the dynamic creation of virtual data warehouses based on data authorization. This breaks down data silos, solves the issue of data duplication, and achieves more flexible configuration of storage and computing resources in the cloud at a lower cost.
Data computing resources are scaled on demand to optimize the configuration of computing resources, enhance the agility and elasticity of the data warehouse, open up unlimited data computation space, and support the data and computation required for larger models.
The new generation of data warehouses is naturally compatible with cloud environments, eliminating the need for additional customization. Enterprises can flexibly expand storage or computing resources at low cost and with high efficiency according to their resource needs, improving resource utilization and reducing space costs and energy consumption. These warehouses also scale efficiently with changes in load, easily handling PB-level data with high fault tolerance, easy management, and easy observation. Combined with reliable automation, the system can easily make frequent and predictable changes.
The "out-of-the-box" nature of cloud-native systems saves enterprises a lot of operational cost. Because the computing nodes are deployed in the cloud, the warehouse is freed from physical limitations and potential delays and can be managed over the Internet anytime and anywhere without any on-premises hardware. Data is available anytime and anywhere, with no backend issues for users to deal with, paving the way for enterprises to share data and collaborate across departments and regions and supporting their globalization.
When data warehouses move to the cloud, users' most important concern is data security. Traditional data platforms store files and resources on the same host and compensate for node downtime with primary and backup node data, which seriously affects data timeliness and increases operational cost and difficulty. Data warehouses in the cloud era address this in three ways. Technologies such as Transparent Data Encryption (TDE) guarantee that all data is encrypted before being written to disk. Serverless technologies use the elasticity and virtually unlimited computing resources of the cloud to keep computing available. S3 storage and cross-cloud disaster recovery capabilities guard against data loss, ensuring business continuity.
The Technological Breakthroughs of Data Warehouse Virtualization
To achieve data warehouse virtualization, provide enterprises with a new cloud-based digital solution, help them build a competitive barrier centered on data assets, and optimize cloud resources to achieve unlimited data computation, the new generation of data warehouses needs to achieve the following technological breakthroughs:
In the design and implementation of data warehouses in the cloud era, it is necessary to fully consider the elasticity and distributed nature of the cloud platform and to decouple metadata, user data, and computing resources from one another. With this decoupling, metadata can be regarded as the digital key to a safe, and user data as the assets inside the safe. Users only need to exchange the digital key to access the data inside the safe. The cloud's computing resources, which users do not need to manage directly, can be regarded as a pile of calculators: when needed, they pull the corresponding data according to the authorized digital key and perform the calculation.
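The safe-and-key analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `MetadataService`, `ObjectStore`, and `ComputeNode`, and the object key `bucket/sales.col`, are hypothetical and do not describe any product's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Hypothetical catalog: maps table names to object keys and grants."""
    locations: dict = field(default_factory=dict)   # table -> object key
    grants: set = field(default_factory=set)        # (user, table) pairs

    def authorize(self, user: str, table: str) -> str:
        # The "digital key": only an authorized user learns where the data is.
        if (user, table) not in self.grants:
            raise PermissionError(f"{user} may not read {table}")
        return self.locations[table]


@dataclass
class ObjectStore:
    """Hypothetical shared storage holding user data (the "safe")."""
    objects: dict = field(default_factory=dict)     # object key -> rows

    def get(self, key: str) -> list:
        return self.objects[key]


class ComputeNode:
    """Stateless worker (a "calculator"): it holds no data of its own."""

    def __init__(self, metadata: MetadataService, store: ObjectStore):
        self.metadata = metadata
        self.store = store

    def total(self, user: str, table: str) -> int:
        key = self.metadata.authorize(user, table)   # exchange the key
        return sum(self.store.get(key))              # pull data, then compute


metadata = MetadataService(
    locations={"sales": "bucket/sales.col"},
    grants={("alice", "sales")},
)
store = ObjectStore(objects={"bucket/sales.col": [10, 20, 30]})

# Compute nodes can be created or dropped without touching data or metadata.
node_a = ComputeNode(metadata, store)
node_b = ComputeNode(metadata, store)
print(node_a.total("alice", "sales"))  # 60
```

Because the compute nodes keep no state, adding or removing one changes neither the catalog nor the stored data, which is the property the decoupling is meant to provide.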
The new generation of data warehouses needs to adopt an efficient parallel method for data loading and processing, with processing speed increasing with the number of nodes and with support for fast loading of streaming data. Through the compute and storage separation architecture of the cloud-native virtual data warehouse, tasks execute concurrently across multiple clusters, and enterprises can flexibly scale up or down as load changes, easily handling PB-level data with high fault tolerance, easy management, and easy observation. Combined with reliable automation, the system can easily make frequent and predictable major changes.
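As a rough illustration of parallel loading, the sketch below splits an input into chunks and parses them concurrently. It assumes a simple `id,amount` line format, and the function names `parse_chunk` and `parallel_load` are hypothetical. Threads stand in for cluster nodes here; in CPython they do not provide real CPU parallelism for this kind of work, so the example shows the partitioning pattern rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Parse one chunk of CSV-like lines into (id, amount) tuples."""
    rows = []
    for line in lines:
        ident, amount = line.split(",")
        rows.append((int(ident), float(amount)))
    return rows


def parallel_load(lines, workers=4):
    """Split the input into roughly equal chunks and parse them concurrently."""
    size = max(1, -(-len(lines) // workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so loaded rows stay in input order.
        parsed = pool.map(parse_chunk, chunks)
        return [row for chunk in parsed for row in chunk]


raw = [f"{i},{i * 1.5}" for i in range(10)]
table = parallel_load(raw, workers=3)
print(len(table), table[2])  # 10 (2, 3.0)
```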
Object storage is naturally adapted to the cloud-native environment and seamlessly integrates with other cloud-native technologies such as cloud computing platforms and container orchestration technologies. In addition, compared to traditional storage methods, object storage usually has a lower cost, becoming a more advantageous storage choice for data warehouses. However, since object storage is usually based on distributed systems and network storage, data transmission and retrieval usually require the network, so there will be a certain delay compared to local storage. This shortcoming also poses new challenges to the storage engines of data warehouses in the cloud era.
The new generation of data warehouses needs a strong storage adaptation interface to ensure support for various types of storage and compatibility with different cloud environments. In addition, each computing node needs a multi-level cache structure for metadata and user data to reduce network latency and data movement, improve computing efficiency, and meet users' real-time requirements. For the underlying object storage, data warehouses need an efficient file format that saves network requests while improving computing efficiency.
The optimizer, as a key technology in the database management system, has an important impact on the performance and efficiency of the database. For cloud-native and distributed scenarios, the optimizer needs to implement advanced features such as aggregation pushdown, pre-computation, and block skipping to fully meet various complex analytical query needs.
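A multi-level cache of the kind described above might look like the following sketch. It is an assumption-laden illustration, not a real storage engine: plain dictionaries stand in for the local disk cache and the remote object store, and the class name `TieredCache` is hypothetical.

```python
from collections import OrderedDict


class TieredCache:
    """Two cache levels (memory, local disk) in front of remote object storage."""

    def __init__(self, remote, memory_slots=2):
        self.remote = remote            # dict standing in for object storage
        self.disk = {}                  # dict standing in for a local disk cache
        self.memory = OrderedDict()     # small LRU cache
        self.memory_slots = memory_slots
        self.remote_reads = 0

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)        # mark as most recently used
            return self.memory[key]
        if key in self.disk:
            value = self.disk[key]
        else:
            value = self.remote[key]            # the only network round trip
            self.remote_reads += 1
            self.disk[key] = value
        self.memory[key] = value
        if len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)     # evict least recently used
        return value


cache = TieredCache(remote={"a": b"1", "b": b"2", "c": b"3"})
for key in ["a", "b", "c", "a", "a"]:
    cache.get(key)
print(cache.remote_reads)  # 3
```

Five reads cost only three remote requests: the fourth read of `a` is served from the disk level after it was evicted from memory, and the fifth is served from memory.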
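Block skipping is commonly implemented by keeping minimum and maximum statistics for each block and skipping blocks that cannot contain matching rows. The sketch below illustrates that general idea only. The names `Block`, `make_blocks`, and `count_greater_than` are hypothetical, and a real engine would read these statistics from file metadata rather than compute them in memory.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list
    min_value: int
    max_value: int


def make_blocks(values, block_size):
    """Split a column into blocks and record per-block min/max statistics."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def count_greater_than(blocks, threshold):
    """Count values > threshold, reading only blocks that can contain matches."""
    matched, blocks_read = 0, 0
    for block in blocks:
        if block.max_value <= threshold:
            continue                    # skip: no value here can qualify
        blocks_read += 1
        matched += sum(1 for v in block.values if v > threshold)
    return matched, blocks_read


column = list(range(100))              # sorted data makes skipping effective
blocks = make_blocks(column, block_size=10)
print(count_greater_than(blocks, 84))  # (15, 2)
```

Only two of the ten blocks are read. On object storage, each skipped block is a network request that never has to be made.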
The importance of data analysis and application is growing day by day, and for data platforms, ultimate performance is a key requirement. To achieve more efficient data parallel computing, the new generation of data warehouses needs an excellent executor that can fully utilize hardware resources, such as the parallel computing capabilities of CPUs and SIMD instruction sets. By packaging multiple data elements into vectors and performing the same operations simultaneously, computing efficiency and throughput are improved.
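The batch-at-a-time control flow of a vectorized executor can be sketched as follows. Pure Python cannot issue SIMD instructions, so this shows only the shape of the execution model, in which each operator call handles a whole vector of values rather than a single row. The function names and the tiny batch size are illustrative assumptions.

```python
from array import array

BATCH_SIZE = 4  # illustrative; real engines use much larger batches


def batches(column, size=BATCH_SIZE):
    """Yield fixed-size slices (vectors) of a column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def multiply_batch(prices, quantities):
    """One operator call processes a whole vector of values."""
    return array("d", (p * q for p, q in zip(prices, quantities)))


def total_revenue(prices, quantities):
    """Sum price * quantity, one batch at a time."""
    total = 0.0
    for p_vec, q_vec in zip(batches(prices), batches(quantities)):
        total += sum(multiply_batch(p_vec, q_vec))
    return total


prices = array("d", [1.0, 2.0, 3.0, 4.0, 5.0])
quantities = array("d", [10, 10, 10, 10, 10])
print(total_revenue(prices, quantities))  # 150.0
```

In a compiled engine, the body of `multiply_batch` is the tight loop that the compiler or hand-written intrinsics can map onto SIMD instructions.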
The new generation of data warehouses is an upgrade of the analytical database for the era of large models. The data that large models consume has undergone vectorization, and vectorized data can greatly improve the model's search efficiency and reduce training costs. In this era, data warehouses need to store massive amounts of vector data and query it efficiently in order to support multimodal AI applications. That means working with embeddings and providing efficient storage, indexing, and retrieval of vectors, including similarity search, vector indexing, and vector clustering and classification, together with high-performance parallel computing, strong scalability, and fault tolerance. These capabilities help foundation models adapt quickly and be further developed for AI scenarios.
In order to accelerate big data processing and computing, the new generation of data warehouses in the cloud era needs to make full use of new hardware for asynchronous computing, such as GPUs and FPGAs. By fully utilizing the new generation of hardware accelerators, data warehouses can achieve higher computing performance, lower latency, and better scalability. This will make big data processing and computing more efficient and reliable, promoting further development of data analysis and decision support capabilities in the cloud era.
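Similarity search over embeddings can be illustrated with a brute-force cosine-similarity scan. This is a minimal sketch with made-up three-dimensional vectors and hypothetical document ids. It assumes non-zero vectors, and a production system would use a vector index rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k ids whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query, vec), ident)
              for ident, vec in vectors.items()]
    scored.sort(reverse=True)           # highest similarity first
    return [ident for _, ident in scored[:k]]


# Made-up embeddings, for illustration only.
embeddings = {
    "doc_cat": [0.9, 0.1, 0.0],
    "doc_dog": [0.8, 0.3, 0.0],
    "doc_car": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], embeddings, k=2))  # ['doc_cat', 'doc_dog']
```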
About PieCloudDB
PieCloudDB, a cloud-native virtual data warehouse, follows a technical approach based on the abstraction and reuse of design principles from top databases. It has achieved the virtualization of analytical data warehouses in the cloud, integrating physical data warehouses and dynamically creating virtual data warehouses based on data authorization. It offers flexible computing on demand, breaking down data silos and supporting the data and computation required for larger models. With PieCloudDB, storage and computation are treated as independent variables, each scaling elastically in the cloud. It addresses the shortcomings of traditional MPP architectures, not only enabling instant scaling but also allowing users to run multiple clusters simultaneously for data computation in the cloud. It can continuously store all data in the cloud, truly achieving data sharing for existing and future applications and helping enterprises maximize data value.
With the rapid development of cloud computing, artificial intelligence (AI), and other technologies, data systems are also trending toward cloud-native and intelligent designs. Data analysis technology is evolving from traditional BI to support new applications such as deep learning and large language models. Making data systems cloud-native and AI-oriented is the new trend in their development.
What people usually refer to as a database is a Relational Database Management System (RDBMS). In addition to relational databases, there are also databases for various kinds of unstructured data, such as document databases, graph databases, and streaming databases.
Enterprise data platforms are generally composed of multiple databases, with different types of databases handling different data. To unify management, the concept of the "enterprise data lake" has been proposed. Generally, an enterprise data lake is composed of multiple data sources and the processing tools for those sources. In the data lake, data is classified as "cold", "warm", or "hot" according to how frequently it is used. "Hot" data is generally stored in the data warehouse. "Warm" or "cold" data generally has to go through an ETL ("extract, transform, load") operation from the lake to the warehouse.
In order to break down the boundaries between data types and reduce the cost of the ETL operations that move data, the "lakehouse" architecture has been proposed to solve issues such as transaction support, data modeling, and data governance. The development of cloud-native and artificial intelligence technology has also subtly influenced the architecture of databases and data lakes.
A new type of data architecture, the "data computing system", has been proposed. Data and computation are two independent subsystems of the data computing system, with data as the core and computation as the means of creating data value.
Data Subsystem
Computing Subsystem
Different computing engines are used for different types of data:
In addition, the data computing system has the following features:
ACID support for transactions ensures the consistency and correctness of concurrent data access.
Ensuring data integrity and having a sound governance and audit mechanism.
Support for using BI tools directly on source data, which speeds up analysis and reduces data latency.
Adopting open, standardized storage formats with rich API support, so that various tools and engines (including machine learning and Python/R libraries) can efficiently access data directly.
Multimodal data support capability, supporting data types including structured data, semi-structured data, unstructured data, and binary data.
Support for various workload types including data science, machine learning, SQL queries and analysis.
About PieDataCS
OpenPie’s PieDataComputing System (PieDataCS) revolutionizes data storage and computation with cloud-native technology. With “one-storage, multi-engine data computation”, it fully upgrades big data systems to the era of large models. This ensures that PieDataCS, which is independently controllable, remains at the forefront globally. It not only serves as the foundational technology for AI but also pioneers a new paradigm in AI technology. PieDataCS is designed to help enterprises optimize computational bottlenecks, fully leverage the advantages of data scale, and build core technical barriers. It empowers large model technologies to comprehensively enhance AI scenario applications, creating greater commercial value for enterprises.