Introduction:
As one of the joint initiators, OpenPie collaborates with the 1024Foundation to promote the AI4AI (AI for All Initiative) public welfare program. The Pairing {Large Language Model / Programming} Bootcamp is a flagship project of the AI for Young Eagle Program, tailored for students with no prior experience and reflecting the core philosophy of AI4AI: making AI accessible to the general public rather than exclusive to elites. The project is strongly supported by OpenPie's large language model data computing system, PieDataCS, which provides the technical platform, offers participants a low-barrier hands-on experience, and efficiently empowers the project's implementation.
In this "Pairing LLM" Bootcamp, two high school students participated, and I served as a mentor to assist in the pairing practice. The project theme is "Hangzhou University Student Entrepreneurship and Innovation Policy Q&A Chatbot," aiming to build an intelligent chatbot using RAG (Retrieval-Augmented Generation) technology that can provide intelligent and precise answers to user questions, along with corresponding policy references. During the bootcamp, students and mentors "pair up" to delve into project development step by step. This "apprenticeship" teaching method not only accelerates the transfer of technical knowledge but also significantly enhances the students' problem-solving abilities. Throughout the project, students started with the basic framework and gradually learned about the RAG process, large pre-trained models, and LoRA fine-tuning techniques. By testing various combinations of large language models and chunking, they successfully improved the model's accuracy and adaptability.
The project received core technical support from OpenPie's DataComputing System (PieDataCS). Among these, the vector computing engine—PieCloudVector—played a crucial role, with its outstanding multimodal computing capabilities providing stable, fast, and accurate data processing for key project components such as data embedding, parallel vector index construction, full-text retrieval based on the BM25 algorithm (Best Matching 25), and mainstream approximate search algorithms. This deep integration with AI large language models addresses issues like model hallucinations and inference security. Additionally, OpenPie's PieAIStudio also deeply empowered this project, providing platform support in areas such as RAG, pre-trained models, and LoRA fine-tuning.
In the process of building a policy Q&A chatbot, we employ RAG technology, which is a hybrid natural language processing technique that combines retrieval and generation. It enhances the context understanding capability of the generative model by retrieving relevant information. The main advantage of RAG is its ability to effectively reduce the "hallucination" issue in generative models, where the model generates content that does not align with reality, thereby improving the accuracy and reliability of the answers. We divide the entire construction process into three key stages: data preprocessing, inference, and evaluation.
General Framework of RAG
During the preprocessing stage, we completed data cleaning, tokenization, and feature extraction to ensure data quality. First, we converted the PDF policy texts into TXT format, a step implemented with the open-source Tesseract OCR engine and its Simplified Chinese language pack.
from pdf2image import convert_from_path
from PIL import ImageOps
import pytesseract

def process_pages(pdf_path, start_page, end_page):
    # Render the selected pages to images, then OCR each one (Simplified Chinese + English)
    images = convert_from_path(pdf_path, dpi=300, first_page=start_page, last_page=end_page)
    text_pages = {}
    for i, image in enumerate(images, start=start_page):
        gray_image = ImageOps.grayscale(image)
        text = pytesseract.image_to_string(gray_image, lang='chi_sim+eng')
        print(f"\nPage {i} Text:\n{text}")  # Print recognized text
        text_pages[i] = text + "\n"
    return text_pages
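A minimal usage sketch follows (the file names here are illustrative, not taken from the project):

# OCR the 43-page policy document in one pass and save the result as TXT
text_pages = process_pages('policy.pdf', start_page=1, end_page=43)
with open('policy.txt', 'w', encoding='utf-8') as f:
    for page_num in sorted(text_pages):
        f.write(text_pages[page_num])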
During implementation, we selected a 43-page policy document. After tokenization with jieba, it yielded 13,194 tokens across 471 sentences.
import re
import jieba

words = jieba.lcut(text)
num_words = len(words)

# Split on Chinese sentence-ending punctuation (period, exclamation mark, question mark)
sentence_delimiters = r'[。!?]'
sentences = re.split(sentence_delimiters, text)
sentences = [s.strip() for s in sentences if s.strip()]
num_sentences = len(sentences)
Next is the chunking phase. We researched various common chunking methods and conducted experiments with two of them in particular.
● Method One: Equal Character Chunking
This is also the most common chunking method. To stay within the maximum number of tokens allowed per input to the large language model, and given that the average sentence length is about 28 characters, we used a window width of 300 with an overlap of 50.
W = 300        # window width (tokens per chunk)
overlap = 50   # tokens shared between consecutive chunks
chunks = []
start = 0
total_words = len(words)
while start < total_words:
    end = start + W
    chunk_words = words[start:end]
    chunk_text = ''.join(chunk_words)  # Concatenate words without spaces
    chunks.append(chunk_text)
    start = end - overlap  # Move the window forward with overlap
● Method Two: Semantic Double-Pass Merging Chunking
Semantic double-pass merging chunking is a two-step process. The first pass detects topic shifts between adjacent sentences and groups sentences that belong to the same topic. The second pass then merges these smaller chunks into larger blocks, each covering its own theme. For detecting topic changes, we set a similarity threshold of 0.7.
The sentence encoder used here is sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, which produces 384-dimensional sentence embeddings.
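The chunking code below relies on a get_sentence_embedding helper. A minimal sketch of how it might be implemented with the sentence-transformers package is shown here (the helper name comes from the code; this particular implementation is an assumption):

from sentence_transformers import SentenceTransformer

# Load the multilingual MiniLM encoder once and reuse it for every sentence
_encoder = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

def get_sentence_embedding(sentence):
    # encode() returns a 384-dimensional numpy vector for a single sentence
    return _encoder.encode(sentence)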
from sklearn.metrics.pairwise import cosine_similarity

# item_pattern (defined below) marks itemized headings; get_sentence_embedding is the helper sketched above
chunks = []
current_chunk = []
i = 0
while i < len(sentences):
    sentence = sentences[i]
    is_item = bool(item_pattern.match(sentence))
    if is_item:
        # Start a new chunk for the itemized list
        if current_chunk:
            chunks.append(current_chunk)
        current_chunk = [sentence]
        i += 1
        # Add subsequent itemized entries to the current chunk
        while i < len(sentences) and (bool(item_pattern.match(sentences[i])) or sentences[i].startswith(('(', '('))):
            current_chunk.append(sentences[i])
            i += 1
        # Add the completed itemized list chunk
        chunks.append(current_chunk)
        current_chunk = []
    else:
        # Regular sentence processing with semantic similarity
        if not current_chunk:
            current_chunk = [sentence]
        else:
            # Compute similarity with the previous sentence
            embedding_prev = get_sentence_embedding(current_chunk[-1])
            embedding_curr = get_sentence_embedding(sentence)
            sim = cosine_similarity(
                embedding_prev.reshape(1, -1),
                embedding_curr.reshape(1, -1)
            )[0][0]
            if sim >= 0.7:  # Adjust the threshold as needed
                current_chunk.append(sentence)
            else:
                chunks.append(current_chunk)
                current_chunk = [sentence]
        i += 1

# Add any remaining chunk
if current_chunk:
    chunks.append(current_chunk)
In practice, we found that with the same large language model, the predictions produced by the second (semantic) chunking method were better than those of the first (equal-character) method. One likely reason is that during the first pass of method two we handled the case where sentence chunks become overly fragmented: itemized clauses look like chunk boundaries, yet they actually belong to the same provision and should be kept together. We therefore made sure that items at the "()" level of itemization were merged into the same chunk.
item_pattern = re.compile(r'^(\(?[一二三四五六七八九十0-9]+\)?[.。、])')  # matches itemized headings such as numbered clauses (e.g. 一、 or 1.)
The chunks and their embeddings produced by the two chunking methods are stored as chunks.pkl and chunk_embeddings.pkl.
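A minimal persistence sketch follows (the file names come from the text; computing the chunk embeddings with the same sentence encoder is an assumption):

import pickle

# Embed each chunk with the same sentence encoder used during chunking
chunk_embeddings = [get_sentence_embedding(''.join(c) if isinstance(c, list) else c) for c in chunks]

with open('chunks.pkl', 'wb') as f:
    pickle.dump(chunks, f)
with open('chunk_embeddings.pkl', 'wb') as f:
    pickle.dump(chunk_embeddings, f)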
During the inference phase, we leverage the deep learning capabilities of large language models, enhancing the model's comprehension and responses through fine-tuning and optimization. We retrieve the text relevant to the user's question, design the prompt, and then call the large language model to produce an answer.
First, embed the query and find the top K chunks with the highest similarity (K = 5):
def get_top_k_chunks(query_embedding, chunk_embeddings, K):
    similarities = []
    for idx, chunk_embedding in enumerate(chunk_embeddings):
        sim = cosine_similarity(
            query_embedding.reshape(1, -1),
            chunk_embedding.reshape(1, -1)
        )[0][0]
        similarities.append((idx, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    top_k = similarities[:K]
    return top_k
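A usage sketch of the retrieval step (the example question is taken from the terminal transcript below; embedding the query with the same sentence encoder is an assumption):

# Embed the user's question with the same encoder used for the chunks, then retrieve the top K = 5
query = "杭州市海外高层次人才创新创业有哪些补助?"
query_embedding = get_sentence_embedding(query)
top_k_chunks = get_top_k_chunks(query_embedding, chunk_embeddings, K=5)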
To balance model performance and the potential for parameter optimization, we selected Llama-2-7b-hf as the large language model. After designing a set of prompts, we can begin the Q&A process.
context = ''
for idx, sim in top_k_chunks:
    chunk_text = ''.join(chunks[idx]) if isinstance(chunks[idx], list) else chunks[idx]
    context += f"【内容{idx+1}】\n{chunk_text}\n\n"  # label each retrieved chunk as 【Content {idx+1}】

# The prompt (in Chinese) says: "You are an intelligent assistant. Answer the user's question based on
# the policy content provided below. Make sure the answer is accurate and grounded in the provided
# content. If the answer cannot be found, tell the user."
prompt = f"""你是一名智能助理,请根据以下提供的政策内容,回答用户的问题。请确保回答准确且基于提供的内容。如果无法找到答案,请告知用户。
{context}
用户提问:
{query}
你的回答:
"""
terminal>>
请输入您的问题:杭州市海外高层次人才创新创业有哪些补助?
(Please enter your question: What subsidies does Hangzhou offer for innovation and entrepreneurship by high-level overseas talent?)
生成的回答:
参照中国杭州大学生创业大赛在杭落地项目资助条目
(Generated answer: Refer to the funding clauses for projects from the China Hangzhou College Student Entrepreneurship Competition that land in Hangzhou.)
While this answer is reasonable, there is still room for improvement. The question is, how can we quantify the accuracy of the model's answers? To address this, we introduced multiple-choice questions (MCQ) as an evaluation set.
Given the uncertainty of natural language answers generated by large language models, quantifying their accuracy is quite challenging. Therefore, constructing an evaluation set that contains definitive answers is particularly crucial. We expect this evaluation set to meet the following characteristics:
First, each question should have one correct answer, and the other three incorrect answers should be plausible within the realm of common sense. This ensures that the judgment is based on the content retrieved from the documents, rather than the model's prior knowledge.
Second, the correct answers should be randomly distributed among the options to prevent overfitting during training. With the help of manual annotation and AI technology, we successfully constructed 30 sets of evaluation questions and answers, ensuring the quality and practicality of the evaluation set.
Here is an example question (it asks which enterprises can receive Hangzhou's entrepreneurship subsidies; the correct option B reads "innovative enterprises that meet the government's subsidy requirements"):
{
    "query": "哪些企业能获得杭州市的创业补助?",
    "options": {
        "A": "所有注册在杭的企业均可申请。",
        "B": "符合政府补助要求的创新型企业。",
        "C": "补助只提供给年收入超过一定标准的企业。",
        "D": "只限于科技创新型企业。"
    },
    "ground_truth": "B"
},
On this evaluation set, we validated both chunking methods. Since the generated answers do not always strictly follow the instructions to provide only the options A, B, C, or D, we extracted the first valid uppercase letter that appeared in the answer as the predicted answer.
def extract_choice(predicted_answer):
    # Return the first A/B/C/D that appears in the generated answer
    for char in predicted_answer:
        if char in ['A', 'B', 'C', 'D']:
            return char
    return None
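A hedged sketch of the evaluation loop (it assumes dataset.json holds the 30 MCQ records in the format shown above, and that a hypothetical answer_question() helper wraps the retrieval-plus-generation pipeline end to end):

import json

with open('dataset.json', 'r', encoding='utf-8') as f:
    mcq_set = json.load(f)

correct = 0
for item in mcq_set:
    # Present the question together with its four options
    options_text = '\n'.join(f"{key}. {text}" for key, text in item['options'].items())
    predicted_answer = answer_question(item['query'] + '\n' + options_text)  # hypothetical RAG wrapper
    if extract_choice(predicted_answer) == item['ground_truth']:
        correct += 1

print(f"Accuracy: {correct / len(mcq_set):.1%}")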
After multiple runs, the equal-character chunking method achieved an accuracy of 13.3% to 20%, while the semantic chunking method reached 26.7% to 50%. Overall, the answers generated with semantic chunking proved more reliable on this evaluation set. Besides the itemization-merging reason discussed above, this may also be because equal-character chunking feeds the model a fixed-width top-K context that is comparatively long, making it harder for the model to follow the instructions accurately. Below are examples of one correct and one incorrect prediction for reference:
Up to this point, we can summarize the performance combinations of different large models and chunking methods on this evaluation set.
Accuracy Rates of Different Large Language Models Combined with Chunking Methods
Note: The accuracy rates displayed are the highest values obtained from multiple experiments. When calling smaller models, we correspondingly changed the chunking strategy. For example, for microsoft/phi-2, we selected W=80, overlap=40. For Open_llama_7b, we selected top K=3.
As the students went through the process of building a policy Q&A chatbot using RAG technology, they conducted thorough research and debugging, evolving from initial explorers to experts capable of handling complex tasks independently. However, there is still room for improvement in this project.
We noticed that the model's training and inference processes were not fully separated, so the hyperparameter settings relied too heavily on the initial design and lacked an iterative optimization loop, a common issue in early-stage machine learning practice. With the continued progress of deep learning, many parameter-efficient fine-tuning methods have emerged. Among them, LoRA is favored for its ability to fine-tune large models locally at low cost. By introducing LoRA, we can optimize the model more effectively and make more precise adjustments, thereby improving overall performance.
LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique that adapts a model to new settings quickly. Unlike RAG, which grounds answers in a specific document set, LoRA adapts the model itself to the requirements of a specific task. When many specific tasks must be handled, fully fine-tuning a large model is usually too costly, and LoRA offers an economical and fast alternative. It adds a low-rank update to the model's QKV (Query, Key, Value) projection weights of the form W' = W + BA, with W ∈ R^(m×n), B ∈ R^(m×r), A ∈ R^(r×n), where r is much smaller than m and n, so only the two low-rank matrices A and B need to be trained. This greatly reduces the number of trainable parameters while still adjusting the model's self-attention and cross-attention layers, enabling fast and effective fine-tuning.
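As a quick worked example of the parameter savings (the dimensions below are illustrative for a 7B-class model; r = 8 matches the configuration used later):

# Parameter count for one 4096x4096 attention projection: full fine-tuning vs. LoRA with r = 8
m, n, r = 4096, 4096, 8
full_params = m * n              # 16,777,216 weights updated by full fine-tuning
lora_params = m * r + r * n      # 65,536 weights in the low-rank matrices B (m x r) and A (r x n)
print(full_params // lora_params)  # 256x fewer trainable parameters for this projection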
We first split the earlier evaluation set dataset.json into train : valid : test with a ratio of 20 : 5 : 5 (a minimal split sketch follows), then set the parameters for lora_config.
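A minimal sketch of that split (assuming dataset.json is the JSON list of 30 MCQ records shown earlier; the fixed seed is illustrative, and the tokenization/formatting the Trainer needs is omitted here):

import json
import random

with open('dataset.json', 'r', encoding='utf-8') as f:
    records = json.load(f)

random.seed(42)          # illustrative seed for a reproducible shuffle
random.shuffle(records)
train_dataset = records[:20]
valid_dataset = records[20:25]
test_dataset = records[25:]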
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

def fine_tune_lora(model_name, train_dataset, valid_dataset):
    # Load the pre-trained LLaMA model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    # Apply LoRA to the model
    lora_config = LoraConfig(
        r=8,               # Low-rank approximation factor
        lora_alpha=16,     # Scaling factor for the LoRA weights
        target_modules=["q_proj", "k_proj", "v_proj"],  # Target the attention projections
        lora_dropout=0.1   # Dropout rate for LoRA layers
    )
    model = get_peft_model(model, lora_config)
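As an optional sanity check (using PEFT's built-in helper), you can confirm how small the trainable parameter set is:

model.print_trainable_parameters()  # prints trainable vs. total parameter counts and the percentage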
Set the evaluation metric to accuracy, then define compute_metrics and the trainer. To conserve GPU memory, you can reduce the precision (here, fp16 mixed precision).
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    fp16=True,  # Enable mixed precision
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir='./logs',
    logging_steps=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    dataloader_num_workers=4,
    push_to_hub=False,
    metric_for_best_model="accuracy",
)
import torch
from sklearn.metrics import accuracy_score

def compute_metrics(p):
    # The Trainer passes predictions and labels as numpy arrays; convert them to tensors first
    logits, labels = p
    logits = torch.tensor(logits)
    labels = torch.tensor(labels)
    predictions = torch.argmax(logits, dim=-1)
    loss = torch.nn.CrossEntropyLoss()(logits.view(-1, logits.size(-1)), labels.view(-1))
    return {
        'eval_loss': loss.item(),
        'accuracy': accuracy_score(labels.view(-1).cpu().numpy(), predictions.view(-1).cpu().numpy())
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # defined above so the Trainer can report accuracy each epoch
)

# Train the model
trainer.train()

# Save the fine-tuned LoRA adapter
model.save_pretrained('fine_tuned_lora_llama')
tokenizer.save_pretrained('fine_tuned_lora_llama')
Finally, an estimated 64.00 MiB of GPU memory is required. The technical principles have been explained above; due to limited computational resources, the engineering part is left for future expansion and practice.
● Data Processing IDE: VSCode
● Testing Environment: Pie DataComputing System (PieDataCS); PieCloudDB Database; PieCloudVector; PieAIStudio
● Training Environment: NVIDIA A100-SXM4-80GB; Driver Version: 535.183.06; CUDA Version: 12.2
References:
1. GitHub - flyyuan/pdf2txt-chinese: convert scanned PDF books to TXT for use as a knowledge base for GPTs
2. Chunking methods in RAG: comparison - BitPeak
Jiayu Zhou (周嘉宇)
AI4AI Volunteer / Pairing Bootcamp Mentor
Jiayu is a senior undergraduate majoring in Computer Engineering in the ECE Department of ZJU&UIUC. His primary research focus is LLM + Knowledge Graph reasoning, with three papers submitted or published, and he aspires to develop the next generation of multimodal LLM agents. Jiayu has a diverse technical skill set, having interned in the research department of a securities firm and at a database company. He is also passionate about community service both on and off campus, and looks forward to collaborating with like-minded individuals.