Postgres Conference 2024丨A High-Level Introduction to The Query Planner in PostgreSQL

APRIL 29TH, 2024

The Postgres Conference 2024, one of the largest PostgreSQL conferences globally, was held in San Jose, USA from April 17th to 19th. This conference featured four tracks: Ops, Dev, Essentials, and Google Cloud. The event centered around topics such as PostgreSQL core, database management and applications, user cases and experiences. Esteemed speakers from notable companies like Google, AWS, EDB, Yugabyte, and DBeaver were invited to present at the conference. Leveraging its strong influence in international technical community, OpenPie had the honor of being a sponsor for this prestigious event and delivered a technical presentation.

During the conference, Richard Guo from OpenPie, newly recognized to be a PostgreSQL Contributor, was invited to deliver a technical presentation titled "A high-level introduction to the query planner in PostgreSQL." Utilizing his experience of PieCloudDB database optimizer, he provided a comprehensive explanation of the inner workings of the PostgreSQL optimizer from a developer's standpoint, dive into the intricacies of transforming a query tree into a plan tree. Richard's speech received positive feedback from the attendees, leading to engaging interactions and discussions.

In a DBMS, SQL query processing is a complex and critical process. For PostgreSQL, a SQL query needs to go through the following five main steps from receipt to execution:

Parser: responsible for checking syntax errors and generating a parser tree;
Analyzer: Perform semantic analysis based on the parser tree and generate a query tree;
Rewriter: Rewrite the query tree according to the rules that exist in the system;
Planner: Generate a plan tree with the highest execution efficiency based on the query tree;
Executor: Access tables and indexes in the order in the plan tree and execute the corresponding query.

For the same query, there are generally multiple ways to execute it. As a crucial component of a database, optimizer's role is to identify the query plan with the minimum cost from all possible execution methods and transform it into an executable plan tree.

The following will focus on the planning phase in PostgreSQL query processing, which is also the most important and complex phase in the entire process. The process is generally divided into four stages: preprocessing, scan/join planning, post scan/join planning, and postprocessing.

Preprocessing

In the early stage of the preprocessing phase, queries are generally simplified as much as possible by simplifying scalar expressions (functions, Boolean, CASE, etc.) and expand simple SQL functions in-line. At the same time, the join tree will be simplified by converting IN, EXISTS and other types of subqueries to semi-joins, flattening subqueries, and reducing outer joins (converting them to inner joins or anti-joins).

In addition to these methods, a variety of optimization methods are used in the later preprocessing stage, including:

Distribute WHERE and JOIN/ON qual clauses;
Build equivalence classes for provably-equal expressions;
Gather information about join ordering restrictions;
Remove useless joins;
...

Scan/Join Planning

The scan/join planning mainly processes the FROM and WHERE parts of the query, and also considers the ORDER BY information. This part is all driven by cost.

This stage first determines the scan path for base relations, estimates the cost of the scan path, and then uses dynamic programming or heuristic GEQO method to identify feasible plans for join relations. When searching the join order space, it is also necessary to honor outer-join ordering constraints.

In dynamic programming, join search will proceed as follows:

Generate paths for each base relation;
Generate paths for each possible two-way join;
Generate paths for each possible three-way join;
Generate paths for each possible four-way join;
…
Continue until all base relations are joined into a single join relation.

However, the cost of this process is very high. Theoretically there are n! different join orders for n relations. It is unrealistic to traverse all possible join orders. Therefore, some heuristic methods are usually used to reduce the search space. For relations that do not have join conditions, try not to join them; decompose a large problem into multiple sub-problems to reduce complexity.

Post Scan/Join Planning

At this stage, optimizer will first process GROUP BY, aggregation, window functions and DISTINCT, then process set (UNION/INTERSECT/EXCEPT) operations, and finally process ORDER BY. Each of the above steps will generate one or more paths. The optimizer will filter these paths based on cost and add LockRows, Limit and ModifyTable nodes to the filtered paths.

Postprocessing

At this stage, the optimizer needs to convert the least-cost path into a plan tree and adjust some details in the plan tree:

Flatten subquery rangetables into a single list;
Label Vars in upper plan nodes as OUTER_VAR or INNER_VAR, to refer to the outputs of their subplans;
Remove unnecessary SubqueryScan, Append, and MergeAppend plan nodes.

After completing this step, the optimizer will obtain the complete plan tree, and can hand the plan tree to the executor for execution, and finally obtain the query results.

As a high-tech innovation enterprise, OpenPie has been deeply involved in the international open-source technology and ecosystem through various forms such as code contribution, lecturer presentations, conference sponsorship and participation, and ecosystem collaborations. In the future, OpenPie will continuously broaden its international perspective, actively integrate into the global wave of technological innovation, expand its international influence, and strive to become a tech-driven enterprise with an international footprint.

Related Blogs:

no related blog