Full project: RAG (Retrieval-Augmented Generation) II

Optimizing Document Retrieval for RAG Systems: Enhancing the Search Process for SMEs Using Metadata and PGVector

For small and medium-sized businesses (SMEs) in Spain, access to up-to-date, trustworthy information on government grants and financial support is essential. One effective way to help SMEs navigate this complex landscape is a Retrieval-Augmented Generation (RAG) system. In this project, I used a specialized resource, the Plataforma Pyme guide to government grants, as the source for a RAG system focused on providing tailored financial advice.

The Plataforma Pyme website offers a dynamic, searchable guide to grants and public assistance. Because the data is extracted directly from this resource, the retrieved information is as current as the guide itself and stays focused on grants a business can actually apply for. The guide's structured layout is well suited to RAG systems: the retriever can pull narrowly focused passages before the model generates a concise, helpful response to the user's query.

https://plataformapyme.es/es-es/AyudasPublicas/GuiasDinamicas/Paginas/GuiaAyudas.aspx?CCAA=0

Document Processing and Metadata Extraction

To get the most out of retrieval, the project processes the PDFs from the platform, extracts their text, and builds metadata that is stored in PGVector alongside the text chunks. Using metadata during retrieval lets us filter out irrelevant documents early in the search, which improves both efficiency and accuracy.

Below is the core function, which goes through every page of the PDF documents. For each grant detail page it extracts the metadata, downloads any linked documents, splits them into chunks, and attaches the metadata to every chunk so that it can be stored in the PGVector database.

import os

import PyPDF2

# Folders holding the summary PDFs (metadata) and the downloaded detail documents (text)
pathToMetadata = './ayudas/metadatos'
pathToText = './ayudas/texto'

# Process one summary PDF and return its chunks, each tagged with the page's metadata.
# get_metadata, download_linked_files and load_and_split_text are defined elsewhere in the project.
def process_pdf(pdf_path):
    all_pages = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page_num, summary_page in enumerate(reader.pages):
            # Guard against pages with no extractable text
            text = (summary_page.extract_text() or "").replace("\n", " ").replace("\x00", "").strip()

            # Only pages that contain a grant detail sheet are of interest
            if "Ayudas e incentivos (detalle)" in text:
                output_dir = pathToText + "/" + os.path.basename(pdf_path) + "/" + "Page_" + str(page_num)

                extra_metadata = get_metadata(text)
                downloaded_url = download_linked_files(summary_page, output_dir)
                extra_metadata['download_url'] = downloaded_url
                pages = load_and_split_text(output_dir)

                for page in pages:
                    # Strip NUL characters and flatten newlines in each chunk
                    page.page_content = page.page_content.replace("\x00", "").replace("\n", " ").strip()
                    # Page-level metadata is merged into whatever the chunk already carries
                    page.metadata = {**page.metadata, **extra_metadata}
                all_pages.extend(pages)
    return all_pages

# Process every PDF in the folder
allDocs = []

for file in os.listdir(pathToMetadata):
    if file.endswith(".pdf"):
        allDocs.extend(process_pdf(os.path.join(pathToMetadata, file)))
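
The metadata merge inside process_pdf is easier to follow in isolation. The sketch below uses a hypothetical Chunk dataclass as a stand-in for the objects that load_and_split_text returns (an assumption: the real objects only need page_content and metadata attributes), and the field values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for the chunk objects returned by load_and_split_text."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_chunks(chunks, extra_metadata):
    """Clean each chunk's text and merge the page-level metadata into it."""
    for chunk in chunks:
        chunk.page_content = chunk.page_content.replace("\x00", "").replace("\n", " ").strip()
        # On a key clash, the value from extra_metadata replaces the chunk's own value
        chunk.metadata = {**chunk.metadata, **extra_metadata}
    return chunks


chunks = [
    Chunk("Primera parte\ndel documento\x00", {"source": "a.pdf"}),
    Chunk("  Segunda parte ", {"source": "a.pdf"}),
]
# Illustrative values only, not taken from a real grant
extra = {"Sector": "Industria", "download_url": "https://example.org/ayuda.pdf"}
tagged = tag_chunks(chunks, extra)
print(tagged[0].page_content)
print(tagged[0].metadata)
```

Because the merge builds a new dictionary for every chunk, later changes to one chunk's metadata do not leak into its siblings.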

Metadata Extraction and PGVector for Efficient Search

The key to improving retrieval accuracy in a RAG system lies in metadata extraction, and the get_metadata function is pivotal in this process. With metadata tags such as Título (title), Organismo (issuing body), Sector, Destinatarios (recipients), and Plazo de solicitud (application deadline), we can narrow a search down to the documents that match the user's situation. This filtering means that PGVector retrieves chunks based not only on semantic similarity but also on the structured attributes of the grant they belong to.

# extract_substring_index is defined elsewhere in the project
def get_metadata(text):
    AMBITO = 'Ámbito Geográfico'
    INFORMACION = 'Información Detallada'
    AMBITO_CLEAN = 'AmbitoGeografico'

    # Field labels, in the order they appear on a grant detail page
    document_tags = ['Referencia', 'Título', 'Organismo', 'Sector', 'Subsector',
                     AMBITO, 'Información Adicional', 'Tipo', 'Destinatarios',
                     'Plazo de solicitud', 'Referencias de la Publicación']

    metadata = {}

    # Each value is the text between one label and the next. The final label
    # only acts as the closing delimiter, so it gets no entry of its own.
    for start, end in zip(document_tags, document_tags[1:]):
        value = extract_substring_index(text, start, end)
        if start == AMBITO:
            # Stored under a key without spaces or accents; leftover label text is dropped
            metadata[AMBITO_CLEAN] = value.replace(AMBITO, '').replace(INFORMACION, '').strip()
        else:
            metadata[start] = value.replace("\x00", "").replace("\n", " ").strip()

    return metadata
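
get_metadata relies on extract_substring_index, which is not shown in this article. A minimal sketch of what such a helper could look like follows; the behavior for missing labels is an assumption, and the sample text is made up rather than taken from a real grant page.

```python
def extract_substring_index(text, start, end):
    """Return the text between the first `start` label and the next `end` label.

    Hypothetical reimplementation: returns '' when `start` is missing and
    runs to the end of the text when `end` is missing.
    """
    start_pos = text.find(start)
    if start_pos == -1:
        return ""
    value_pos = start_pos + len(start)
    end_pos = text.find(end, value_pos)
    if end_pos == -1:
        end_pos = len(text)
    return text[value_pos:end_pos]


# Made-up sample in the flattened one-line form produced by the PDF extraction
sample = (
    "Referencia 12345 Título Ayudas a la digitalización "
    "Organismo Ministerio de Industria Sector Industria"
)
print(extract_substring_index(sample, "Título", "Organismo").strip())
```

One limitation of label-based slicing is that a label word appearing inside a value (for example "Sector" inside a title) would cut that value short, so the extracted fields are worth spot-checking.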


Text Chunking for RAG Systems: Improving Search Relevance

Another critical part of this project is text chunking. Large documents often exceed the token limit of language models, and a long passage embedded as a single vector blurs many topics together, which makes it harder to return precise, relevant responses. To address this, the load_and_split_text function breaks the PDF text into smaller, more manageable chunks while trying to keep each chunk semantically coherent, which matters for both semantic search and the quality of the generated answer.

Using a MarkdownSplitter, we can split the document at logical boundaries such as headings or paragraphs rather than at arbitrary character offsets. Each chunk therefore keeps its local context, which is essential for accurate retrieval and response generation in a RAG system.
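
To make the idea of splitting at logical boundaries concrete, here is a simplified sketch written with the standard library only. It is not the splitter used in the project: split_markdown, its max_chars parameter, and the sample text are all hypothetical.

```python
import re

HEADING_START = re.compile(r"^(?=#{1,6} )", re.MULTILINE)
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def split_markdown(text, max_chars=500):
    """Split Markdown text at headings, then at blank lines if a section is too long.

    Simplified, hypothetical stand-in for a Markdown-aware splitter. A single
    paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks = []
    # Zero-width split: a new section starts at every heading line
    for section in HEADING_START.split(text):
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Section too long: pack paragraphs greedily up to max_chars
        current = ""
        for paragraph in PARAGRAPH_BREAK.split(section):
            candidate = current + "\n\n" + paragraph if current else paragraph
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


doc = (
    "# Objeto\nTexto del objeto.\n\n"
    "## Beneficiarios\nPymes y autónomos.\n\n"
    "## Cuantía\n"
    + "\n\n".join("Párrafo %d " % i + "x" * 40 for i in range(5))
)
chunks = split_markdown(doc, max_chars=120)
for chunk in chunks:
    print(len(chunk), repr(chunk[:20]))
```

Short sections stay intact under their heading, and only the oversized section is broken up, at paragraph boundaries.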

Improved Search and Retrieval Efficiency

Combining metadata filtering with text chunking lets the system retrieve small, relevant chunks from the vector database. When a user submits a query, candidate chunks are selected using both their metadata and their semantic similarity to the query.

This two-step retrieval process (first filtering by metadata, then running a vector similarity search over what remains) narrows the search space before any vectors are compared. That helps both the speed and the accuracy of responses, so businesses can get quick, reliable answers about available grants and financial support.
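
The two steps can be sketched in memory with plain Python. This is only an illustration of the logic: the row layout, the retrieve function, and the toy three-dimensional vectors are assumptions, and in PostgreSQL with pgvector the same shape would typically be a WHERE clause on the metadata followed by an ORDER BY on a distance operator such as <=> (cosine distance).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, rows, metadata_filter, k=2):
    """Step 1: keep rows whose metadata matches every filter key.
    Step 2: rank the survivors by cosine similarity to the query vector."""
    candidates = [
        row for row in rows
        if all(row["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    candidates.sort(
        key=lambda row: cosine_similarity(query_vector, row["embedding"]),
        reverse=True,
    )
    return candidates[:k]


# Toy "embeddings" and made-up metadata, for illustration only
rows = [
    {"text": "Ayuda A", "metadata": {"Sector": "Industria"}, "embedding": [0.9, 0.1, 0.0]},
    {"text": "Ayuda B", "metadata": {"Sector": "Turismo"}, "embedding": [1.0, 0.0, 0.0]},
    {"text": "Ayuda C", "metadata": {"Sector": "Industria"}, "embedding": [0.0, 1.0, 0.0]},
]
results = retrieve([1.0, 0.0, 0.0], rows, {"Sector": "Industria"}, k=1)
print([row["text"] for row in results])
```

Note that "Ayuda B" is the closest vector to the query but never reaches the ranking step, because the metadata filter has already excluded it.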

Conclusion

This project shows how document processing, metadata extraction, and text chunking combine into a more effective RAG system. With PGVector handling the vector search, SMEs can quickly find relevant information in government grant guides and get precise answers that support informed business decisions.

If you want to improve information retrieval in a business application, integrating metadata and structure-aware text splitting into your RAG system can improve retrieval quality and help keep answers grounded in relevant, up-to-date content.

See the full code here: https://github.com/dorapps/RAG_Project