PieCloudDB Database's brand-new eMPP (Elastic MPP) architecture: Beyond MPP with super elasticity

OCTOBER 19TH, 2022

The concept of the "Big Data Era" was first proposed by the renowned consulting firm McKinsey. McKinsey stated, "Data has permeated every industry and business function today and has become an important production factor." Data mining through sophisticated algorithms and data analysis has become crucial, and a consensus has been reached: "Data computing can lead to new discoveries."

In the book "The Mathematical Corporation: Where Machine Intelligence and Human Ingenuity Achieve the Impossible," Josh Sullivan, a partner at Booz Allen Hamilton, mentions that his team studied hundreds of organizations and identified the key elements that constitute successful organizations of the future, known as "Data Companies." The key to evolving into such a company is that "organizations are data-driven." In the era of big data, companies no longer simply delete data but instead store it for analysis. Databases have become an indispensable part of enterprise infrastructure.

What is MPP?

MPP (Massively Parallel Processing) has long been recognized as the mainstream architecture for analytical databases and is widely used in database products including Greenplum, Teradata, and Vertica. MPP databases are optimized for analytical workloads, meeting users' need to aggregate and process large datasets. An MPP analytical database splits a task across multiple servers and nodes that run in parallel, then consolidates the partial results and returns them, which is how it completes the analysis of massive data.

Advantages of MPP databases

MPP database clusters offer scalability, high availability, high performance, and many other advantages. The advent of MPP databases addresses the challenge of storing massive data that cannot be accommodated by a single SQL database and the difficulty of fulfilling analytical demands on a single physical machine.

Capability for processing massive data

MPP architecture databases scale storage and computing through clusters of PC servers, as shown in the diagram below. For example, if a wide table has 300 million records, an MPP database running on three PC servers would store approximately 100 million records on the hard drives of each server. During data computation, all machines compute in parallel, theoretically reducing the computation time to 1/n of that of a single-machine deployment, where n is the number of machines. This significantly reduces processing time for massive data.

Figure: Traditional MPP Database Architecture
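
The 1/n estimate can be illustrated with a minimal Python sketch. It assumes three nodes, a simple modulo rule for placing rows, and a scaled-down table of 300,000 rows in place of 300 million. The function names (distribute, local_sum, parallel_sum) are hypothetical and do not come from any particular MPP product. The thread pool only mimics the scatter/gather structure; it is not meant to demonstrate a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, n_nodes):
    """Place each (key, value) row on a node using key modulo node count (assumed rule)."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in rows:
        nodes[key % n_nodes].append((key, value))
    return nodes


def local_sum(partition):
    """Work a single node does on its own share of the rows."""
    return sum(value for _, value in partition)


def parallel_sum(rows, n_nodes):
    """Scatter rows to nodes, aggregate locally, then combine the partial results."""
    partitions = distribute(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, partitions))
    return sum(partials), [len(p) for p in partitions]


# Scaled-down stand-in for the 300-million-row wide table.
rows = [(i, i % 7) for i in range(300_000)]
total, sizes = parallel_sum(rows, 3)
print(sizes)   # number of rows held by each node
print(total)   # combined result of the three partial sums
```

Each node scans only its own third of the rows, which is where the theoretical 1/n figure comes from; the final combine step touches just one partial result per node.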

Perfect compatibility with SQL

Most traditional MPP databases are highly compatible with SQL, including the ANSI SQL 2008 standard and the SQL 2003 OLAP extensions. This comprehensive SQL support allows MPP databases to integrate with the Extract/Transform/Load (ETL) and Business Intelligence (BI) tools common in the industry, and standard database interfaces are fully supported and certified. With minimal integration effort, enterprises can run applications on the database using existing analysis tools built on standard SQL structures and interfaces, avoiding vendor lock-in and helping businesses reduce operational risk while driving innovation.

Highly parallelized computing

The MPP architecture brings great elasticity to database concurrency. The architecture enables automatic parallelization of data and queries within the database. Data can be automatically partitioned across all nodes of the database and queries are planned and executed in a highly coordinated manner using all nodes. Enterprises can scale the cluster according to their concurrency requirements to meet the desired levels of concurrency.

Horizontal scalability

MPP databases have excellent horizontal scalability. Enterprises can increase the number of servers and use more nodes to support larger analytical demands based on their business needs.

Bottlenecks of Traditional MPP Databases

While MPP databases have numerous advantages and have become the mainstream architecture for many analytical databases, they also have several bottlenecks and limitations.

Coupling of Storage and Computation

In traditional data warehouses, computation and storage are tightly coupled, with computing resources and storage resources bound together in a fixed ratio. Therefore, when scaling up, users must scale both computing and storage resources simultaneously, which poses challenges in terms of scalability, operations, and migration. Because business growth is hard to predict, traditional data warehouses may fail to scale resources in time during peak load periods, delaying the analysis of business data and causing enterprises to miss opportunities to extract the full value of their data.
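
The cost of a fixed compute-to-storage ratio can be made concrete with a small sizing calculation. This is only a sketch under assumed numbers: the node shape (32 cores and 10 TB per node) and the workload (64 cores, 400 TB) are hypothetical figures chosen for illustration, not measurements of any real system.

```python
import math


def coupled_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """With coupled nodes, the cluster must be sized for the larger of the two demands."""
    return max(math.ceil(cores_needed / cores_per_node),
               math.ceil(tb_needed / tb_per_node))


# Hypothetical workload: storage-heavy, compute-light.
cores_needed, tb_needed = 64, 400
cores_per_node, tb_per_node = 32, 10

nodes = coupled_nodes(cores_needed, tb_needed, cores_per_node, tb_per_node)
idle_cores = nodes * cores_per_node - cores_needed

# If compute could be sized on its own, only this many compute nodes would be needed.
compute_only_nodes = math.ceil(cores_needed / cores_per_node)

print(nodes, idle_cores, compute_only_nodes)
```

In this example the storage demand alone dictates the cluster size, so most of the purchased cores sit idle.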

Business Limitations

Although traditional MPP databases have achieved horizontal scalability, the coupling of storage and computation makes horizontal scaling complex and slow. As data volume grows, each scaling operation that adds nodes triggers a large number of I/O requests, which slows business processing and disrupts its continuity. Because computing power cannot be added quickly when workloads surge, nor released when loads fall, businesses cannot allocate resources dynamically according to actual needs, which limits their operations.
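
A rough sense of why adding nodes generates so much I/O can be had from the sketch below. It assumes rows are placed by key modulo node count, which is a simplification; real products may use other placement schemes that move less data. The function name moved_fraction is hypothetical.

```python
def moved_fraction(n_keys, old_nodes, new_nodes):
    """Fraction of rows whose home node changes when the cluster is resized."""
    moved = sum(1 for key in range(n_keys) if key % old_nodes != key % new_nodes)
    return moved / n_keys


# Growing an assumed 3-node cluster to 4 nodes.
fraction = moved_fraction(120_000, 3, 4)
print(fraction)
```

Under this placement rule, most rows must be read from one node and rewritten on another before the enlarged cluster is balanced again, and that traffic competes with regular queries.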

High Costs

The high costs of traditional database software and hardware result in significant upfront investments. As storage and workload requirements grow, database scaling and upgrades become costly and time-consuming for enterprises. Due to the tight coupling of storage and computation in traditional MPP database architectures, enterprises often face significant maintenance and time costs during these processes, making them cumbersome to operate.

Bucket Effect

The traditional MPP database architecture suffers from the "bucket effect," where the overall execution speed of the database depends on the "weakest link," that is, the slowest-performing node. A single failing or underperforming node can significantly degrade the performance of the whole database and slow down queries. For this reason, traditional MPP architectures often require newly added machines to match the configuration of the existing, older ones; otherwise, the performance of the entire database suffers. As a result, even though hardware keeps improving in line with Moore's Law, the storage and performance of an MPP cluster are limited by its lowest-performing machines.
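
The bucket effect reduces to a one-line formula: a query finishes only when its slowest node finishes. The sketch below assumes each node holds an equal share of rows and scans them at a fixed rate; the throughput figures are hypothetical.

```python
def query_time(rows_per_node, rows_per_second):
    """Elapsed time is set by the slowest node, not by the average."""
    return max(rows / speed for rows, speed in zip(rows_per_node, rows_per_second))


even_share = [100_000_000] * 3

# Three identical nodes.
uniform = query_time(even_share, [1_000_000, 1_000_000, 1_000_000])
# One node four times slower than the others.
mixed = query_time(even_share, [1_000_000, 1_000_000, 250_000])
# Two nodes upgraded to faster hardware, one old node left in place.
upgraded = query_time(even_share, [4_000_000, 4_000_000, 1_000_000])

print(uniform, mixed, upgraded)
```

In the last case the two faster machines bring no benefit, because the remaining old node still determines when the query completes.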

Data Silos

With business growth, increasing data volume, and the need for information technology development, enterprises often build separate business information systems for different departments. However, the horizontal scaling capability of MPP conflicts with the static way projects are implemented in practice. Scaling is inherently something that happens over time, but an MPP design built on a fixed set of PC machines does not accommodate change over time. Because of the coupling of storage and computation and the "bucket effect" described above, enterprises often opt to start anew and create a separate cluster when purchasing new machines, resulting in "data silos" that severely hinder their ability to achieve their big data goals.

The Advanced Version of Traditional MPP: eMPP

Facing the limitations of traditional MPP databases, the OpenPie team has created PieCloudDB, a cloud-native database, introducing a brand new eMPP distributed architecture that serves as an engine for data computation in an analytical distributed database platform.

What is eMPP?

eMPP, developed by the OpenPie team, stands for Elastic Massive Parallel Processing.

It goes beyond the traditional MPP architecture and aligns better with the requirements of the cloud era. Cloud platforms have revolutionized the field of information technology, offering not only convenience and speed but also significant flexibility and configurability. Users can define the configuration and quantity of cloud instances, easily scaling them up or down. In other words, cloud platforms provide businesses with great elasticity in their application architectures.

By combining the MPP architecture with cloud platforms, eMPP was born. To adapt to the elasticity of cloud platforms, the new eMPP architecture achieves a separation of storage and computation in the cloud. This means that computing resources and storage resources can be independently horizontally scaled in the cloud.

Advantages of eMPP

The separation of storage and computation empowers eMPP databases with true elasticity. The eMPP architecture inherits all the advantages of traditional MPP databases mentioned earlier while fundamentally avoiding their shortcomings. Here are several advantages it possesses:

Elastic Scalability

Based on cloud computing platforms and the separation of storage and computation, the eMPP architecture offers multi-dimensional and intelligent elastic scalability, allowing users to scale horizontally or vertically based on their business needs.

On the storage side, eMPP supports standard object storage, leveraging the advantages of cloud computing platforms to provide virtually unlimited storage capacity. This avoids the resource waste caused by coupled compute and storage when scaling clusters, and allows compute or storage resources to be expanded independently and cost-effectively.

On the computation side, eMPP is designed with statelessness in mind, allowing the database to draw on the vast pool of compute nodes available in the cloud platform. Enterprises can dynamically adjust the number of computing nodes in the database cluster as their business and data volume change, keeping resource allocation matched to their needs.

Flexibility and Agility

The separation of computation and storage in the eMPP architecture eliminates resource waste. Enterprises can flexibly and cost-effectively expand storage or compute resources according to their resource requirements, improving resource utilization and saving costs in space and energy consumption.

Cost Reduction and Efficiency Improvement

The dynamic scalability provided by the eMPP architecture allows enterprises to expand resources based on their specific needs, avoiding resource waste. Compared to traditional databases, eMPP offers higher cost-effectiveness.

High Availability

In the eMPP architecture, computing nodes do not store user data, ensuring the statelessness of the computing nodes. Starting and stopping stateless computing nodes is straightforward, and enterprises can launch sufficient redundant computing nodes according to their needs to ensure high availability of the eMPP database. In eMPP databases, user data is stored in the cloud platform's object storage, leveraging the advantages of cloud storage to ensure high data availability.
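
Why stateless computing nodes are easy to add, remove, or replace can be sketched as follows. This is a toy model under stated assumptions, not PieCloudDB's implementation: a Python dict stands in for cloud object storage, and the node names, partition names, and round-robin assignment rule are all hypothetical.

```python
# Shared storage: every compute node can read every object (assumed model).
object_store = {
    "part-0": [1, 2, 3],
    "part-1": [4, 5],
    "part-2": [6],
    "part-3": [7, 8, 9],
}


def assign(parts, nodes):
    """Round-robin partitions over whichever compute nodes are currently alive."""
    plan = {node: [] for node in nodes}
    for i, part in enumerate(sorted(parts)):
        plan[nodes[i % len(nodes)]].append(part)
    return plan


def run_sum(store, nodes):
    """Each node sums the partitions assigned to it; partial sums are then combined."""
    plan = assign(store.keys(), nodes)
    return sum(sum(store[part]) for parts in plan.values() for part in parts)


before = run_sum(object_store, ["c1", "c2", "c3"])
after_failure = run_sum(object_store, ["c1", "c3"])  # node c2 is gone
scaled_out = run_sum(object_store, ["c1", "c2", "c3", "c4", "c5"])

print(before, after_failure, scaled_out)
```

Because no node owns any data, losing or adding a node only changes the assignment plan; nothing has to be copied between nodes and the query result stays the same.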

PieCloudDB Database: A New eMPP Architecture Based on Cloud Computing

PieCloudDB Database adopts a new eMPP (Elastic MPP) parallel computing architecture based on cloud computing, integrating the advantages of MPP databases while addressing the shortcomings of traditional PC-based MPP databases. The separation of computation and storage lets the two act as independent variables that scale elastically and independently in the cloud, thereby avoiding resource waste. Enterprises can flexibly and cost-effectively scale storage or computational resources based on their business needs, improving resource utilization and saving on space and energy costs.

The three-layer architecture, which separates metadata, computation, and data, enables PieCloudDB to store data centrally while keeping metadata stored independently. Enterprises can manage the metadata of their data products in the same way they manage the product data itself. By storing all data in the cloud, enterprises can truly achieve data sharing for existing and future applications.
