Retrieval-Augmented Reasoning with Lean Language Models
This technical report introduces an approach that combines reasoning and retrieval-augmented generation (RAG) within a single, lean language-model architecture. The system pairs fine-tuned Qwen2.5-Instruct models with a dense retriever, leveraging synthetic data and reasoning traces distilled from frontier models. The goal is a performant, privacy-preserving solution deployable in resource-constrained or secure environments; the report demonstrates substantial gains in answer accuracy and consistency.
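As a rough illustration of the retrieve-then-reason pattern the report describes, the sketch below ranks passages against a query and assembles them into a prompt for the model. The bag-of-words scorer and all names here are illustrative stand-ins, not the report's implementation (the actual system uses a dense neural retriever and a fine-tuned Qwen2.5-Instruct model):

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; stands in for the dense neural retriever
    # the report actually uses, purely to show the pipeline shape.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    # Rank the corpus by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, passages):
    # Retrieved passages go into the model's context window so the lean
    # model can reason over them before answering.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Chickenpox causes an itchy, spotty rash and is common in children.",
    "Migraine is a moderate or severe headache felt as a throbbing pain.",
    "Hay fever is an allergic reaction to pollen.",
]
query = "What causes an itchy rash in children?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

In the full system, the prompt would be passed to the fine-tuned lean model rather than printed.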
Article Points:
1. Novel approach combines reasoning and RAG in a single lean LLM.
2. Addresses demand for performant, privacy-preserving local solutions.
3. Uses fine-tuned Qwen2.5-Instruct models with dense retrieval.
4. Leverages synthetic data and reasoning traces from frontier models.
5. Achieves substantial accuracy gains, approaching frontier performance.
6. Demonstrates feasibility for local deployment in resource-constrained settings.
Problem Statement
- Large model limitations
- Privacy & resource constraints
- Integration challenge

Proposed Approach
- Lean LLM architecture
- Reasoning & RAG integration
- Domain-specific fine-tuning

System Architecture
- Lean Language Models
- Retrieval System
- Synthetic Data Generation
- Reasoning Traces
- Fine-tuning Process
- Conversational Interface
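The architecture above hinges on fine-tuning data that pairs retrieved context with reasoning traces distilled from frontier models. Below is a minimal sketch of what one such training record might look like; the chat-message JSONL layout and the `<think>` delimiter are assumptions for illustration, not details taken from the report:

```python
import json

def make_training_record(question, passages, reasoning_trace, answer):
    # Chat-style fine-tuning record: retrieved passages go in the user
    # turn, and the assistant turn contains the distilled reasoning trace
    # followed by the final answer, so the lean model learns to reason
    # over context before responding.
    context = "\n".join(passages)
    return {
        "messages": [
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
            {"role": "assistant",
             "content": f"<think>{reasoning_trace}</think>\n{answer}"},
        ]
    }

record = make_training_record(
    "Is chickenpox contagious?",
    ["Chickenpox is highly infectious and spreads easily."],
    "The passage states chickenpox is highly infectious, so it is contagious.",
    "Yes, chickenpox is very contagious.",
)
print(json.dumps(record, indent=2))
```

Records in this shape can be written one-per-line to a JSONL file and fed to a standard supervised fine-tuning pipeline.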

Evaluation
- NHS A-to-Z Case Study
- Retrieval Performance
- Accuracy Metrics
- Comparison to Baselines
- Distillation Impact

Key Findings
- Substantial accuracy gains
- Feasible local deployment
- Outperforms general lean models
- Comparable to frontier models

Future Directions
- Further model size reduction
- Alternative trace generation
- Broader domain application