Large Models and Databases: Mutual Driving Forces in the AI Era
SEPTEMBER 27TH, 2023

With the advent of the AIGC era, large language models (LLMs) led by GPT have become one of the hottest topics in artificial intelligence today. These powerful models not only excel at tasks such as content creation, language translation, and code assistance but have also had a revolutionary impact on the development of databases. 

 

Large Language Models: A New Era of Human-Machine Interaction 

 

Throughout the evolution of human civilization, language has been an essential component. From the earliest oral traditions to the emergence of written language, the ways in which language is conveyed and expressed have continuously improved, enabling knowledge and ideas to be passed across time and space. 


The ongoing advancement of technology led to one of humanity's greatest inventions – the computer – and with it an entirely new language: machine language. Machine language consists of instructions that computers can understand and execute directly. While it is highly efficient for the machine, writing and reading raw machine code is tedious and error-prone for humans. To simplify interaction with computers, people invented assembly language, which represents machine instructions with mnemonic symbols; even so, it still demands a high level of technical expertise to write and understand. 

 

As computer technology continued to evolve, people developed high-level programming languages that read much more like natural language, making programming simpler and more accessible. Yet high-level languages remain bound by the strict grammar their compilers and interpreters can parse, so they cannot accommodate the ambiguity and flexibility of everyday speech. There was a pressing need to make interaction with machines even simpler, ideally by letting machines truly understand natural language. 

 

In response to this demand, artificial intelligence emerged. Over the past sixty-plus years since its inception, researchers have been diligently working on Natural Language Processing (NLP) within the field of artificial intelligence. Their goal has been to enable machines to better understand natural language and execute corresponding commands accurately, ultimately achieving more intelligent interactions with humans. 


NLP: The Link Between Humans and Machines 

(source: easyai.tech)


On November 30, 2022, OpenAI released ChatGPT, a large language model based on GPT technology. It demonstrated astonishing levels of artificial intelligence and quickly became the focus of attention across various sectors of society. Prior to this, there had never been a language model as powerful as ChatGPT. Its release marked the beginning of a new era in human-machine interaction. 

 

The Empowering Potential of Large Language Models 

 

The emergence of ChatGPT has triggered a new wave of AI enthusiasm. More and more technology companies are developing their own large language models to keep pace with the era ChatGPT set in motion. AI tools built on these large models are numerous, spanning domains including programming, databases, audio, video, language translation, and conversational chat, among many others. 



Application Fields of Large Language Models 

(source: aigeneration.substack.com)

 

For example, in the field of programming, GitHub Copilot and Mintlify are both AI code assistants built on large models. The former generates code suggestions from developers' code context and comments, helping developers improve programming efficiency and quality, reduce repetitive and tedious work, and more easily realize their ideas. 


AI programming assistant GitHub Copilot

(source: github.blog)

 

The latter can generate code comments based on the semantics and context of the code, thereby reducing the burden on developers to write comments and improving the readability and maintainability of the code. 


Code commenting tool Mintlify 

(source: g2.com)

 

In addition, in other fields, large language models have also had a wide-ranging impact. In writing, large language models can be used for text generation, rewriting paragraphs, intelligent review, and more. In the field of images, large language models can perform functions such as image generation, image repair, and image background removal. 

 

Large language models are not just a technology but also an important driver of the digital economy. With the digital economy's vigorous growth, data has to some extent surpassed land, labor, technology, and capital in importance, becoming a fifth factor of production that drives economic growth. In the era of the digital economy, massive volumes of data are generated and processed every day, and behind this stands one especially important technology: the "root technology" of the digital economy, the link between upper-layer applications and underlying infrastructure, even known as the "crown jewel" of foundational software. That technology is the database. 

 

When Large Language Models Meet Databases 

 

Databases are a core component of modern information systems, used for storing, managing, and retrieving large amounts of structured and unstructured data. With the explosive growth of data and users' demands for more advanced queries and analytics, traditional database systems face challenges. As a result, databases have begun to integrate and innovate with various emerging technologies such as cloud computing, big data, blockchain, etc., resulting in a series of more powerful new databases that provide modern information systems with more choices and solutions. 

 

So, what kind of creativity can be generated when large language models meet databases? 

 

Applications of Large Models in the Database Domain 

 

Large language models can empower database systems in various ways, resulting in improved performance and intelligence. Below are some dimensions of the applications of large language models in the database field: 

 

  • NL2SQL (Natural Language to SQL) 

 

Traditional database interactions require the use of structured query language (SQL) or other programming languages, which may pose a certain learning and comprehension challenge for non-technical professionals. NL2SQL refers to the technology of translating natural language (NL) into structured query language (SQL). Its goal is to enable non-technical professionals to interact with databases using natural language, without the need to write complex query statements. 
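
The idea can be sketched in a few lines of code: hand the model the table schema together with the user's question, and ask it to reply with SQL only. The snippet below is a minimal illustration rather than SQL Chat's actual implementation; it assumes the pre-1.0 openai Python SDK, an OPENAI_API_KEY in the environment, and an invented orders table.

    # Minimal NL2SQL sketch (illustrative only, not SQL Chat's implementation).
    # Assumes the pre-1.0 openai Python SDK with OPENAI_API_KEY set; the
    # orders schema below is invented for the example.
    import openai

    SCHEMA = """CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount NUMERIC,
        created_at DATE
    );"""

    def nl_to_sql(question: str) -> str:
        """Translate a natural-language question into a SQL query via the LLM."""
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,  # deterministic output suits code generation
            messages=[
                {"role": "system",
                 "content": f"Given this schema:\n{SCHEMA}\n"
                            "Reply with one SQL query and nothing else."},
                {"role": "user", "content": question},
            ],
        )
        return response.choices[0].message.content.strip()

    print(nl_to_sql("What were the total sales per customer last month?"))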

 

SQL Chat is a conversational SQL client tool based on large models. It provides a user-friendly interface that allows users to interact with databases through natural language conversations. 

 

In comparison to traditional GUI-based tools, SQL Chat places a stronger emphasis on user-friendliness and naturalness. It simulates a conversation between people, allowing users to ask questions in plain language without being familiar with the specific syntax and structure of SQL. This chat-based approach enables users without technical backgrounds to easily communicate with and query databases. 

 

SQL Chat translates natural language into SQL query 

 

By providing a more intuitive and natural mode of interaction, SQL Chat lowers the barrier to using SQL and offers a more convenient and user-friendly database operation experience for non-technical users. This mode of interaction significantly simplifies the user's interaction with the database, enhancing database accessibility and usability. 

 

  • Database Performance Optimization 

 

Database performance optimization has always been one of the most challenging issues for DBAs and developers. It is a highly complex task that involves multiple aspects, including hardware, system design, database schema design, SQL query optimization, index strategies, cache management, and more. 

 

Among these, SQL query optimization is the most common method used for optimizing database performance and is frequently encountered by developers. The goal of SQL query optimization is to reduce query response times, decrease database loads, and enhance query efficiency through various means. 

 

Typically, the execution speed of an SQL query is related to various factors, such as the quality of the SQL statement itself, the execution plan generated by the database, database caching mechanisms, the size of data tables, and the complexity of query conditions. Execution plans and caching mechanisms are determined by the database's own development and design standards and cannot be easily changed. Therefore, in the same database environment, query execution efficiency depends on the quality of SQL query statements. The performance difference between high-quality and low-quality SQL statements is substantial. 
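
As a hypothetical illustration (the table, columns, and index are invented), consider two queries that return the same September 2023 orders. With an index on created_at, the second typically runs far faster, because a bare range predicate lets the optimizer use the index and listing only the needed columns reduces I/O:

    # Two equivalent queries over a hypothetical orders table that has an
    # index on created_at. Wrapping the column in functions defeats the index:
    slow_sql = """
    SELECT * FROM orders
    WHERE YEAR(created_at) = 2023 AND MONTH(created_at) = 9;
    """
    # A sargable range predicate can use the index, and naming only the
    # required columns trims the data read:
    fast_sql = """
    SELECT id, customer_id, amount FROM orders
    WHERE created_at >= '2023-09-01' AND created_at < '2023-10-01';
    """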


However, many SQL programmers struggle to write high-quality SQL statements, and even experienced DBAs can spend significant time and effort optimizing complex queries. With the emergence of large language models, SQL tuning no longer has to be a nightmare for DBAs. 

 

Large language models can analyze a given SQL query statement and provide query rewriting and optimization suggestions. They can infer potentially more efficient ways of executing the query based on its structure and semantics and quickly offer relevant optimization recommendations, significantly reducing the burden on developers and maintainers. 
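
Such an assistant can be approximated with the same kind of prompt as the NL2SQL sketch above; real tools typically also pass the schema and the EXPLAIN plan along with the statement. A hypothetical sketch, under the same SDK assumption:

    # Hypothetical sketch: ask the model for a rewrite plus its reasoning
    # (pre-1.0 openai SDK assumed, as in the earlier example).
    import openai

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   "Rewrite this SQL to run faster and explain why:\n"
                   "SELECT * FROM orders WHERE YEAR(created_at) = 2023;"}],
    )
    print(response.choices[0].message.content)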


Optimizing Query Statements Using SQL Chat 

 

Databases Driving Large Model Advancements 

 

Large language models are fundamentally neural network-based language models that are pretrained on massive datasets and have an enormous number of parameters, typically in the billions or more. Computational power, algorithms, and data, as the three major elements of artificial intelligence, are also crucial factors driving the development of large models. 

 

The training and inference of large language models require substantial computational resources. Improvements in computational power enable models to undergo deeper training on larger datasets, thereby enhancing their language understanding and generation capabilities. Evolving algorithms can optimize model structures and training methods, making them more efficient in utilizing computational resources, accelerating convergence, and improving training efficiency. Data is a critical factor in the emergence of large models, as large language models are entirely data-driven, and the training process demands a substantial amount of data resources. The quantity, quality, and diversity of training data are crucial for training large language models. 

 

Databases, as core tools for storing and managing data, can provide efficient data storage and retrieval capabilities, supporting the training of large language models. By storing data in databases, batch reading and processing can be conveniently performed, enhancing data availability and training efficiency. 
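
What such batch reading might look like can be sketched with Python's built-in sqlite3 module; the corpus table and content column are invented for illustration, and a production pipeline would use a client for its actual database:

    # Stream training text from a database in batches instead of loading the
    # whole corpus into memory. Table and column names are hypothetical.
    import sqlite3

    def iter_batches(db_path: str, batch_size: int = 1024):
        """Yield lists of documents, batch_size rows at a time."""
        conn = sqlite3.connect(db_path)
        try:
            cursor = conn.execute("SELECT content FROM corpus")
            while True:
                rows = cursor.fetchmany(batch_size)
                if not rows:
                    break
                yield [text for (text,) in rows]
        finally:
            conn.close()

    for batch in iter_batches("corpus.db"):
        pass  # tokenize the batch and feed it to the training loop here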

 

Taking the currently most talked-about large language model, ChatGPT, as an example: the underlying GPT-3 model has as many as 175 billion parameters. Public figures indicate that a single training run of GPT-3 consumes about 3,640 PF-days of compute, at a cost of approximately 12 million USD. More astounding still, according to estimates circulating among industry insiders, the latest GPT-4 model has on the order of 1.76 trillion parameters. Larger parameter counts make models more capable, but they also drive up costs: compute requirements scale with parameter count, and parameter scale is an essential reference for gauging the training of a large model. In other words, computational power is the underlying driving force for training large models, and a robust compute foundation can significantly improve training outcomes. The success of ChatGPT owes much to the powerful cloud computing services provided by Microsoft Azure. 


The computational power requirements for training large models 
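
The quoted compute figure can be sanity-checked with a standard rule of thumb from outside this article (training FLOPs ≈ 6 × parameters × tokens), combined with the roughly 300 billion training tokens reported in the GPT-3 paper:

    # Back-of-envelope check on the 3,640 PF-days figure. The 6*N*D rule of
    # thumb and the 300B-token count are outside assumptions, not from the
    # article itself.
    params = 175e9                          # GPT-3 parameters
    tokens = 300e9                          # training tokens, per the GPT-3 paper
    train_flops = 6 * params * tokens       # ~3.15e23 FLOPs
    pf_days = train_flops / (1e15 * 86400)  # one PF-day = 1e15 FLOP/s for a day
    print(f"{train_flops:.2e} FLOPs ~= {pf_days:.0f} PF-days")  # ~3646, close to 3640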

 

From this, it can be seen that for enterprises seeking to have their own large models, the enormous data computation requirements and high computing costs are two significant hurdles to overcome. Even if one has access to the complex code of a large model, it's not something that everyone can afford to run. Therefore, behind large language models, it's not just the merit of complex algorithms but also relies on the support of cloud computing services, including computing, storage, databases, and various other resource provisions. 

 

Large Models + Databases: 1 + 1 > 2 

 

The fusion of large language models with databases will drive the development of human-machine interaction and database applications. The combination of the two is a win-win situation. By leveraging the language understanding and generation capabilities of large language models, the use and management of databases will become more convenient and intelligent. Databases, in turn, provide high-quality datasets and efficient data management to support the training and application of large language models. The integration of databases and large models is bound to become a major trend in the development of both in the future. 

 

OpenPie’s first data computing engine, PieCloudDB Database, was launched in 2022. On October 24th this year, OpenPie’s πDataComputing System (πDataCS), a data computing system built for large language models, is set to launch at the company's annual tech forum. It aims to become a foundational technology base for AI, and its technological innovation and product capabilities are highly anticipated by the industry. It is believed that this large-model data computing system will usher in a new paradigm for AI technology. 
