Building a Retrieval-Augmented Generation (RAG) System for Domain-Specific Document Querying


In recent years, Retrieval-Augmented Generation (RAG) has emerged as a powerful method for enhancing large language models with structured access to external document collections. By combining dense semantic search with contextual text generation, RAG systems have proven particularly useful for tasks such as answering questions based on extensive documentation, enabling natural language access to technical materials, and building assistants that cite sources with high factual accuracy. In this post, we walk through the process of building a functioning RAG pipeline from scratch. As a working example, we use the Horizon Europe 2025 Work Programme to demonstrate how RAG can support structured search and retrieval across hundreds of pages of official funding-related content.

Domain Definition and Document Collection

The first step in building a functional RAG system is to clearly define the knowledge domain and assemble a representative and comprehensive document collection. In our case, the target domain was the Horizon Europe 2025 Work Programme—a dense corpus of official EU documents that governs eligibility, funding rules, thematic calls, and implementation frameworks across Europe’s flagship research and innovation programme. The corpus comprised twelve core PDF documents, ranging in length from several dozen to several hundred pages. These included the general introduction, cluster-specific work programmes, cross-cutting sections, and legal annexes.

Vectorisation and Chunking

Once the document corpus had been assembled, the next stage involved converting the raw text into a format that supports efficient semantic retrieval. Using the PyPDFLoader module from LangChain, we extracted text from each PDF while preserving page-level metadata. Basic cleaning was applied to remove headers, footers, and other non-informative elements. The processed documents were then split into manageable units—text chunks of 1,000 characters with 100-character overlaps. This chunk size was chosen to balance contextual coherence with indexing efficiency, ensuring that individual answers could draw on meaningful and self-contained content segments.
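A minimal sketch of this ingestion step, using the LangChain components named above, might look like the following. The file names and variable names are illustrative rather than taken from the repository:

```python
# Sketch of the ingestion step: load PDFs and split them into overlapping chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pdf_paths = ["wp_general_introduction.pdf", "wp_cluster_1.pdf"]  # hypothetical file names

documents = []
for path in pdf_paths:
    # PyPDFLoader keeps page-level metadata (source file and page number) on each document
    documents.extend(PyPDFLoader(path).load())

# 1,000-character chunks with 100-character overlaps, as described above
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
```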

For vectorisation, we used OpenAIEmbeddings, which converts each chunk into a high-dimensional embedding based on its semantic content. These embeddings were stored in a local FAISS index, enabling fast similarity-based search. This setup allows the system to retrieve the most relevant sections of the source documents in response to a user query—laying the foundation for accurate, grounded response generation.
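Continuing the sketch above, the vectorisation step can be expressed in a few lines. The directory name mirrors the repository layout; everything else is assumed rather than copied from the actual code:

```python
# Sketch of vectorisation: embed the chunks and persist a local FAISS index.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # requires OPENAI_API_KEY in the environment
vectorstore = FAISS.from_documents(chunks, embeddings)  # chunks from the previous sketch
vectorstore.save_local("vectorstore")  # stored locally for later similarity search
```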

System Implementation and Repository Structure

We implemented the entire RAG pipeline as a standalone GitHub repository to ensure transparency and reproducibility. This served both as a development environment and a structured reference implementation. The repository includes all core components: the document ingestion and preprocessing script (index_pdfs.py), the FAISS-based vectorstore directory (vectorstore/), the Python dependencies (requirements.txt), and the main application logic in a modular Python script. The repository is structured to separate concerns clearly, handling PDF processing, indexing, and inference independently. This makes it easy to inspect, modify, or extend any part of the system.

Horizon-Navigator-WP2025 GitHub repository
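Based on the files named in this post, the layout looks roughly like this (the exact structure may differ in the repository itself):

```
Horizon-Navigator-WP2025/
├── index_pdfs.py             # PDF ingestion, chunking, and FAISS indexing
├── vectorstore/              # prebuilt FAISS index shipped with the repository
├── horizon_rag_streamlit.py  # Streamlit front end (introduced below)
└── requirements.txt          # pinned Python dependencies
```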

Once the backend pipeline was functional and thoroughly tested, we added a Streamlit-based user interface (horizon_rag_streamlit.py). This connects the vectorstore and LLM components to an interactive web front end, enabling natural language queries and source-aware responses via a lightweight interface. The system was then deployed on Streamlit Community Cloud for public access.

This architecture reflects a typical pattern in RAG development: modular backend construction in a version-controlled environment, followed by deployment through a lightweight web interface. In our case, this ensured that the entire document pipeline—from PDF parsing to interactive querying—remains open, inspectable, and reproducible.

Prompt Design

Our RAG system uses the default prompt provided by LangChain’s load_qa_chain with the stuff chain type. No additional system or user prompt was configured. The chain automatically instructs the language model to answer based solely on the content of the retrieved documents, avoiding reliance on external knowledge.

To support this configuration, we selected OpenAI’s gpt-4-1106-preview model, accessed via the Chat API, as the core language model for generation. This version was chosen for its strong performance in factual reasoning, consistent output style, and extended context window capacity—an essential feature when working with multi-paragraph content from policy documents. The temperature was set to zero to ensure deterministic, reproducible results aligned with the source materials. By combining precise retrieval with a deterministic, high-performing model, the system consistently generates citation-style answers based on the original EU documents—without requiring fine-tuning or custom prompt engineering.
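Putting the two previous pieces together, the generation step can be sketched as follows. The example query and the number of retrieved chunks (k) are illustrative choices, not settings documented in the repository:

```python
# Sketch of the generation step, reusing the FAISS vectorstore built earlier.
# load_qa_chain with the "stuff" chain type relies on LangChain's default QA prompt,
# which instructs the model to answer only from the supplied documents.
from langchain.chains.question_answering import load_qa_chain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)  # deterministic output
chain = load_qa_chain(llm, chain_type="stuff")

query = "Which calls target widening countries?"  # example query, not from the post
docs = vectorstore.similarity_search(query, k=4)  # k=4 is an illustrative choice
answer = chain.run(input_documents=docs, question=query)
print(answer)
```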

User Interface (Streamlit)

To make the system easily accessible without technical setup, we deployed the RAG pipeline as a simple web application using Streamlit. The interface includes a single text input field for user queries, followed by two output sections:

  • the generated answer, based on retrieved document chunks;
  • and a list of the source fragments used, including file names and page numbers.

The application runs entirely on a prebuilt FAISS index stored locally in the repository. No user data is uploaded or stored, and the underlying documents are not reprocessed at runtime. This approach ensures rapid responses and predictable query costs, making the system suitable for repeated expert use in information-heavy contexts.
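A compact sketch of such a front end is shown below. It assumes the prebuilt index in vectorstore/; the widget labels and retrieval settings are hypothetical and not copied from horizon_rag_streamlit.py:

```python
# Minimal Streamlit sketch: query box, generated answer, and source fragments.
import streamlit as st
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

st.title("Horizon Navigator '25")
query = st.text_input("Ask a question about the Horizon Europe 2025 Work Programme")

if query:
    # Load the prebuilt FAISS index; the PDFs are not reprocessed at runtime.
    vectorstore = FAISS.load_local(
        "vectorstore", OpenAIEmbeddings(), allow_dangerous_deserialization=True
    )
    docs = vectorstore.similarity_search(query, k=4)

    llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)
    chain = load_qa_chain(llm, chain_type="stuff")

    st.subheader("Answer")
    st.write(chain.run(input_documents=docs, question=query))

    st.subheader("Sources")
    for doc in docs:
        st.write(f"{doc.metadata.get('source')}, page {doc.metadata.get('page')}")
```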

Outputs and Evaluation

We evaluated the RAG-based assistant on a diverse set of queries. Below we highlight three examples that demonstrate its ability to produce accurate, document-grounded answers as well as to handle unanswerable prompts.

Example 1

Horizon Navigator '25 by poltextLAB

The assistant correctly interpreted the term “widening countries” and cited specific eligibility rules from the relevant WIDERA call. It retrieved precise conditions from the call HORIZON-WIDERA-2025-05-ACCESS-01, including regional authority participation requirements and coordinator budget allocation rules. The output included structured bullet points and direct excerpts from the original programme document.

Example 2

User question:
“Under what specific conditions can United States-based partners receive funding under the Horizon Europe 2025 programme, and are there any limitations or eligibility criteria they must meet?”

Horizon Navigator '25 by poltextLAB

The assistant returned a comprehensive eight-point list summarising the legal and administrative conditions that apply to US-based entities. These included eligibility recognition, consortium obligations, use of Copernicus data, legal setup of grant agreements, and funding restrictions on European communication networks. Each point aligned with the General Annexes and referenced the appropriate provisions.

Example 3

As a final test, we submitted a question for which no answer was available in the indexed documents. The system correctly recognised that this information was not present, and did not attempt to fabricate a response or retrieve data from external sources. Instead, it acknowledged the limitation and referred the user to official channels for further guidance.

Horizon Navigator '25 by poltextLAB

Cost and Limitations

The current Horizon Navigator ’25 runs as a stateless RAG application without memory. This design ensures low and predictable costs—around $0.01–$0.03 per query—since the language model only processes the immediate question and the retrieved chunks, without storing past interactions.

While a memory-enabled (chat-style) version is technically feasible—capable of remembering the last 4–5 queries—it incurs significantly higher costs, typically $0.10–$0.20 per question, due to repeated context reconstruction and increased token usage. This limitation reflects a conscious trade-off: the current version prioritises cost-efficiency and fast factual retrieval over sustained multi-turn interactions.

Source: https://platform.openai.com/docs/pricing

Recommendations

The Horizon Navigator ’25 RAG system shows how Retrieval-Augmented Generation can support structured information access across large, multi-document funding programmes. Built entirely from open components and deployed via a lightweight web interface, it enables accurate, source-based answers without requiring any manual prompt design or model fine-tuning. Unlike Custom GPTs—which are limited to 20 uploaded files and offer less control over document processing—RAG systems scale more easily and allow full transparency over indexing, embedding, and retrieval. While our current version runs without conversational memory to ensure low and predictable query costs, memory-enabled variants can also be developed for more interactive use cases.

Both the RAG and Custom GPT approaches offer valuable tools for navigating complex research and funding environments. Choosing between them depends on the desired trade-off between scalability, customisability, and ease of deployment.

The authors used GPT-4 [OpenAI (2025), GPT-4 (accessed on 17 May 2025), Large language model (LLM), available at: https://openai.com] to generate the output.