
Production RAG Pipeline for Document Intelligence

Building an enterprise-grade Retrieval-Augmented Generation system with dbt, Qdrant, and LangChain

The Problem

Enterprise knowledge is trapped in PDFs, wikis, and Slack threads. Traditional search returns documents, but users want answers. This project builds a RAG pipeline that ingests 10K+ documents and answers questions over them, with citations back to the source.

Architecture Overview

Documents → Chunking → Embeddings → Vector DB → Retrieval → Reranking → LLM → Answer

Each component is designed for production reliability; the sections below walk through them in order.

The Data Pipeline

Document Ingestion

  • Unstructured.io handles complex PDFs with tables, images, and multi-column layouts (sketched after this list)
  • Extracted text is cleaned and normalized before chunking
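
A rough sketch of the ingestion step using unstructured's partition_pdf (the file name is a placeholder, and strategy="hi_res" is one of several options):

```python
from unstructured.partition.pdf import partition_pdf

# "report.pdf" is illustrative; hi_res enables layout-aware parsing
# (tables, multi-column text) at the cost of speed.
elements = partition_pdf(filename="report.pdf", strategy="hi_res")

# Keep the cleaned text plus the metadata needed downstream for citations.
records = [
    {
        "text": el.text.strip(),
        "category": el.category,          # e.g. Title, NarrativeText, Table
        "source": el.metadata.filename,
        "page": el.metadata.page_number,
    }
    for el in elements
    if el.text and el.text.strip()        # drop empty/whitespace-only elements
]
```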

Chunking Strategy

After extensive testing, I settled on recursive character splitting (sketched below) with:

  • 512 tokens per chunk
  • 50-token overlap
  • Preserved metadata: source file, page number, section headers
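
In LangChain terms, the splitter looks roughly like this (a sketch; import paths differ across LangChain versions, and cl100k_base is an assumed encoding):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Size chunks in tokens rather than characters, so "512" really means
# 512 tokens; cl100k_base is an assumed tiktoken encoding.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
)

# `records` is the ingestion output sketched earlier; metadata rides along
# with every chunk so answers can cite file and page (section headers can be
# attached the same way, e.g. from Title elements).
chunks = splitter.create_documents(
    texts=[r["text"] for r in records],
    metadatas=[{"source": r["source"], "page": r["page"]} for r in records],
)
```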

dbt Transformation Layer

One of the key production additions was a dbt transformation layer (illustrated below) for:

  • Document metadata aggregation (chunks per doc, avg chunk length)
  • Evaluation metrics computation (faithfulness scores over time)
  • Data quality checks (null chunks, encoding issues)
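
The real logic lives in dbt models, but the core rollup is simple. Here is the same shape expressed in pandas for illustration (table and column names are assumptions):

```python
import pandas as pd

# Toy stand-in for the chunks table the pipeline lands in the warehouse.
chunks = pd.DataFrame({
    "doc_id":     ["a", "a", "b"],
    "chunk_text": ["first chunk", "second chunk", None],  # one bad row to flag
})

# Per-document metadata rollup: chunks per doc and average chunk length.
doc_stats = (
    chunks.assign(chunk_len=chunks["chunk_text"].str.len())
          .groupby("doc_id")
          .agg(n_chunks=("chunk_text", "size"),
               avg_chunk_len=("chunk_len", "mean"))
)

# The pandas equivalent of a dbt not_null test on chunk_text.
null_chunks = chunks[chunks["chunk_text"].isna()]
print(doc_stats)
print(f"{len(null_chunks)} null chunk(s) flagged")
```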

Retrieval: Hybrid is the Answer

Pure semantic search misses exact keyword matches. Pure BM25 misses semantic similarity. I implemented hybrid retrieval (the fusion step is sketched after this list):

  1. Dense retrieval: Sentence-Transformers embeddings → Qdrant ANN search
  2. Sparse retrieval: BM25 keyword matching
  3. Fusion: Reciprocal Rank Fusion (RRF) to combine results
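
RRF itself is only a few lines. Here is a self-contained sketch, with toy rankings standing in for the Qdrant and BM25 outputs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across the input rankings;
    # k = 60 is the constant from Cormack et al. (2009).
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings: d1 is ranked well by both retrievers, so it wins the fusion.
dense  = ["d3", "d1", "d7", "d2"]
sparse = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([dense, sparse])[:3])  # ['d1', 'd3', 'd9']
```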

Cross-Encoder Reranking

The hybrid retriever returns 50 candidates. A cross-encoder (ms-marco-MiniLM) reranks them to the final top 5, improving answer precision by 25%.
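
A rough sketch of the reranking stage with sentence-transformers (the exact checkpoint name is an assumption; only "ms-marco-MiniLM" is given above):

```python
from sentence_transformers import CrossEncoder

# Checkpoint name is an assumption based on the "ms-marco-MiniLM" mention.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Unlike the bi-encoder used for retrieval, the cross-encoder attends over
    # query and passage jointly, which is where the precision gain comes from.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```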

Evaluation: RAGAS Framework

You can’t improve what you don’t measure. I integrated RAGAS for automated evaluation:

  • Faithfulness: Does the answer stay grounded in context?
  • Answer Relevance: Does it actually address the question?
  • Context Precision: Are the retrieved chunks useful?

This runs nightly, catching regressions before they reach users.
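
A sketch of what the nightly evaluation call looks like (ragas' API has changed across releases; this follows the 0.1-era interface, with placeholder data):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One record per question; the nightly job would pull these from a fixed
# regression set. All values here are placeholders.
eval_data = Dataset.from_dict({
    "question":     ["What is the refund window?"],
    "answer":       ["Refunds are accepted within 30 days of purchase."],
    "contexts":     [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

# evaluate() uses an LLM judge under the hood (OpenAI by default), so it
# needs an API key in the environment.
report = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(report)
```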

Key Learnings

  1. Chunking is underrated — bad chunks = bad retrieval = bad answers
  2. Hybrid search beats pure dense for enterprise docs with technical terminology
  3. dbt for ML pipelines works surprisingly well for metrics and metadata management

Tech Stack

LangChain · Qdrant · dbt · Sentence-Transformers · FastAPI · RAGAS · Airflow