mailrag

Your email, answerable on your own hardware.

Turns a mail archive into a queryable knowledge base — hybrid dense+sparse retrieval, thread-aware answers assembled from whole conversations, optional local-LLM cleanup. Runs on your hardware, on open models, with nothing required to leave your network.

Python 3.11+ · LlamaIndex · bge-m3 hybrid · self-hosted · recall@5 46→93% · Apache-2.0

View on GitHub → Frédéric Masi · LinkedIn

thread-aware contextual RAG over the public Enron demo

$ git clone https://github.com/fmasi/mailrag && cd mailrag
$ make demo                # Qdrant + a thread-aware index over 100 public Enron emails
$ ./mailrag ask "who approved the Q3 budget, and when?"
  → retrieves the matching message, expands to its full thread,
    and answers from the whole conversation — with citations.

Why it exists

Making your mailbox searchable by an AI usually means handing your entire email history to someone else's servers. For your real correspondence, that's a non-starter.

So I built the opposite: an email RAG you run yourself — on your hardware, on open models, with nothing required to leave your network. And the result isn't just search — these emails are private context my own AI agents can draw on for total recall, without renting my memory to anyone. mailrag covers email; parley covers calls and meetings. Independent tools; my agents know about both and reach for whatever fits.

What it does

Thread-aware answers

The flagship: match a single message, then answer from its entire conversation. This lifts recall@5 from 64% (message-level) to 93% (thread-level). It is the single biggest lever, and it needs no LLM.
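A minimal sketch of the expansion step, assuming every indexed message carries a thread identifier. The `Message` type, its `thread_id` field and the `expand_to_threads` helper are illustrative names, not mailrag's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str
    thread_id: str  # assumed to be assigned at index time
    sent_at: int    # Unix timestamp, used only for ordering
    body: str


def expand_to_threads(
    hits: list[str], corpus: list[Message], max_threads: int = 5
) -> list[list[Message]]:
    """Map ranked message hits to whole threads, in order of first appearance."""
    by_id = {m.msg_id: m for m in corpus}
    by_thread: dict[str, list[Message]] = defaultdict(list)
    for m in corpus:
        by_thread[m.thread_id].append(m)

    threads: list[list[Message]] = []
    seen: set[str] = set()
    for msg_id in hits:
        thread_id = by_id[msg_id].thread_id
        if thread_id in seen:
            continue  # a better-ranked hit already pulled in this thread
        seen.add(thread_id)
        threads.append(sorted(by_thread[thread_id], key=lambda m: m.sent_at))
        if len(threads) == max_threads:
            break
    return threads


corpus = [
    Message("m1", "t1", 1, "Can we approve the Q3 budget?"),
    Message("m2", "t1", 2, "Approved, go ahead."),
    Message("m3", "t2", 3, "Lunch on Friday?"),
]
# The retriever matched the reply (m2) first; the answer context becomes all of t1.
threads = expand_to_threads(["m2", "m3", "m1"], corpus)
```

The answering step then sees the question and the approval together, even though only one of the two messages matched the query.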

Hybrid retrieval

bge-m3 dense + learned sparse vectors, RRF-fused in Qdrant. Gets both the concept and the rare exact token — acronyms, IDs, reference numbers.
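For readers unfamiliar with RRF (reciprocal rank fusion), here is a plain-Python illustration of the formula: each document scores the sum of 1 / (k + rank) over the rankings it appears in. In mailrag the fusion happens inside Qdrant; the `rrf_fuse` helper and the constant k = 60 (a common default) are assumptions for this sketch.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]   # ranking from the dense vectors
sparse = ["c", "a", "d"]  # ranking from the learned sparse vectors
fused = rrf_fuse([dense, sparse])
```

Documents that rank well in both lists ("a" and "c") rise above those found by only one retriever, which is how a rare exact token and a conceptual match can both surface.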

Email-aware preprocessing

Reply-chain stripping, calendar-invite collapsing, noise/newsletter filtering, and exact-text chunk dedup.
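Two of these steps can be sketched in a few lines, as an illustration rather than mailrag's actual rules. The reply markers in `REPLY_HEADER` are hypothetical examples (real mail clients vary widely), and the helper names are invented for this sketch.

```python
import hashlib
import re

# Hypothetical markers that introduce quoted history.
REPLY_HEADER = re.compile(
    r"^(On .+ wrote:|-{2,}\s*Original Message\s*-{2,})\s*$", re.IGNORECASE
)


def strip_reply_chain(body: str) -> str:
    """Keep only the newly written text of a message."""
    kept = []
    for line in body.splitlines():
        if REPLY_HEADER.match(line.strip()):
            break  # everything below this marker is quoted history
        if line.lstrip().startswith(">"):
            continue  # inline quoted line
        kept.append(line)
    return "\n".join(kept).strip()


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks whose text is byte-for-byte identical to an earlier chunk."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


body = (
    "Approved, go ahead.\n\n"
    "On Mon, Jan 7, 2002, Ana wrote:\n"
    "> Can we approve the Q3 budget?"
)
clean = strip_reply_chain(body)
```

Stripping quoted history before chunking also makes exact-text dedup more effective, since the same quoted paragraph no longer reappears in every reply.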

Attachments, searchable

Text pulled from PDFs, Office files, HTML and images — with OCR for scans (local Tesseract or a local vision model). Each attachment is chunked by its own structure: spreadsheet rows with a repeated header, PDF pages, deck slides — so a number buried in a 500-row sheet stays findable. Chunks trace back to their email and thread.
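The repeated-header idea for spreadsheets might look like the sketch below. The `chunk_rows` helper, the `" | "` cell separator and the chunk size are assumptions chosen for illustration.

```python
def chunk_rows(
    header: list[str], rows: list[list[str]], rows_per_chunk: int = 50
) -> list[str]:
    """Split sheet rows into chunks, repeating the header at the top of each."""
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        lines = [header_line] + [" | ".join(row) for row in block]
        chunks.append("\n".join(lines))
    return chunks


header = ["invoice", "amount"]
rows = [[f"INV-{i:03d}", str(i * 10)] for i in range(1, 6)]
chunks = chunk_rows(header, rows, rows_per_chunk=2)
```

Because every chunk carries the column names, a value deep in the sheet is embedded next to the label that gives it meaning.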

Local-LLM summaries

Optional summarize step: a local LLM writes a per-email summary + noise judgement, content-addressed and cached so re-runs are free. Local by default — but point it at any OpenAI-compatible server (LM Studio, Ollama, vLLM, NVIDIA NIM, OpenAI) with one env var.
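A content-addressed cache can be sketched as follows: the cache key is a hash of the input, so an unchanged email never triggers a second LLM call. Including the model name and a prompt version in the key is an assumption of this sketch, as are the function names and the one-JSON-file-per-entry layout.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def cache_key(email_text: str, model: str, prompt_version: str) -> str:
    """Derive the key from everything that can change the summary."""
    payload = json.dumps([model, prompt_version, email_text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_summary(
    email_text: str,
    model: str,
    prompt_version: str,
    cache_dir: Path,
    summarize: Callable[[str], str],
) -> str:
    """Return the cached summary if present; otherwise compute and store it."""
    path = cache_dir / f"{cache_key(email_text, model, prompt_version)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["summary"]
    summary = summarize(email_text)
    path.write_text(json.dumps({"summary": summary}), encoding="utf-8")
    return summary


calls: list[str] = []


def fake_llm(text: str) -> str:
    """Stand-in for the local LLM call; records how often it is invoked."""
    calls.append(text)
    return text[:20]


with tempfile.TemporaryDirectory() as tmp:
    email = "Approved, go ahead with Q3."
    first = cached_summary(email, "local-model", "v1", Path(tmp), fake_llm)
    second = cached_summary(email, "local-model", "v1", Path(tmp), fake_llm)
```

The second call is served from disk, which is what makes a re-run over an unchanged archive free.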

Pluggable loaders

Public Enron corpus, local .eml archives, or Azure Blob — behind one EmailLoader interface, ready for live sources next.
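One way such an interface could look is sketched below. Only the name `EmailLoader` comes from the project; the `load` method, the `RawEmail` record and the in-memory loader are hypothetical and exist only to show the shape of the abstraction.

```python
from dataclasses import dataclass
from email import message_from_string
from typing import Iterator, Protocol


@dataclass
class RawEmail:
    message_id: str
    subject: str
    body: str


class EmailLoader(Protocol):
    """Anything that can yield emails, whatever the source."""

    def load(self) -> Iterator[RawEmail]: ...


class InMemoryEmlLoader:
    """Toy loader that parses RFC 822 text already held in memory."""

    def __init__(self, eml_texts: list[str]) -> None:
        self._eml_texts = eml_texts

    def load(self) -> Iterator[RawEmail]:
        for text in self._eml_texts:
            msg = message_from_string(text)
            yield RawEmail(
                message_id=msg["Message-ID"] or "",
                subject=msg["Subject"] or "",
                body=msg.get_payload(),  # a str for simple, non-multipart mail
            )


def count_emails(loader: EmailLoader) -> int:
    """Downstream code depends on the protocol, not on a concrete source."""
    return sum(1 for _ in loader.load())


eml = "Message-ID: <1@example.com>\nSubject: Q3 budget\n\nApproved.\n"
loader = InMemoryEmlLoader([eml])
emails = list(loader.load())
```

With this shape, a new source only has to provide its own `load`, and the rest of the pipeline stays unchanged.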

Measured methodology

A 360-query eval that prices every technique, controls for confounds, reports significance — and in several cases overturned the intuitive choice.

Agents connect over MCP

A multi-collection, read-only Model Context Protocol server (./mailrag mcp): any agent can discover your indexed corpora, search threads, ask grounded questions, and read attachment text — five tools, no internals exposed.

How it works — the hard parts

Thread-aware reconstruction is the biggest single win, and it needs no LLM. Matching a single message and returning its whole conversation lifts recall@5 from 64% (message-level) to 93% (thread-level), a gain of 29 points and more than the embedder and reranker choices combined. The target shifts from one message to its thread by design: for a conversation, the thread is the right unit of truth. Most of the headline gain comes from here, not the model.
Cheap regex for the obvious bulk; the LLM only where it earns its keep. Corpus-derived sender/subject rules catch ~65% of the noise at high precision. The other ~35% is interleaved mixed-domain noise you can't write a rule for — that's the local LLM's unique contribution, plus the per-email summaries that power contextual retrieval.
The noise rubric does not port across corpora. The corporate rubric flagged 87.6% of a personal archive as noise — it would have deleted real receipts and bank statements — versus 61.5% from a rubric calibrated for that corpus. A ~200-email calibration caught the gap before a multi-hour run, all on a local model with no cloud spend.
Reranking helps pointed questions but hurts thread-spanning ones. A cross-encoder reranker adds only 2.5 points of recall@5 overall and actively demotes the answer on questions that span several emails, because no single message looks like the whole answer. That's exactly the case thread reconstruction fixes; the two are complementary. (Query-side HyDE never beat the raw query on this entity-rich corpus either.) Both the reranker and HyDE stay in-tree, off by default.
The ceiling is retrieval, not the model. With the answer in context, even a 4B model answered ~88% correctly; the points lost come from queries where retrieval never surfaced the thread. Model size was second-order.

Architecture

From a raw mailbox to an answered question — every stage runs on your own hardware.

[Diagram: the mailrag pipeline, from loaders (.eml / Enron / Azure Blob) through clean, chunk, embed (bge-m3) and the Qdrant hybrid store to thread-aware retrieval and a local-LLM answer, with llm-none / llm-verify / llm-all personas, all on your own machine.]

The guided wizard

./mailrag wizard is a full-screen terminal app (Textual): pick a cost-ordered persona, scope your folders on a tree, review the exact plan, then watch every step run live. Two human checkpoints — the calibrate gate and confirm-before-spend — keep you in control before any LLM cost.

[Screenshot: the live run screen, showing the recipe as a step ladder with tick marks on the left, a streaming log on the right, and an overall progress bar beneath.]

Screenshots are auto-generated from the real app against a synthetic mailbox — see the full walkthrough.

The compound effect

Stacking the ladder — each technique added one at a time and individually measured on 360 real-email questions. recall@5 = how often the right email lands in the top 5 results:

Technique	recall@5	gain (points)
plain dense (baseline)	46%	—
+ learned sparse	49%	+3
+ contextual summary	62%	+13
+ reranking	64%	+2
+ thread reconstruction ★	93%	+29

★ The final step switches the goal from “find the exact email” to “find its thread” — a legitimately easier, and more useful, target. The two biggest levers (thread reconstruction +29, contextual summary +13) are both about understanding the conversation — not a fancier embedding model. Read the full benchmark →

Measured on a real work mailbox — 360 questions, each with a known correct answer (a hard label, no subjective judging) — then cross-checked on the public Enron-QA set (same ordering, so it isn't a quirk of one inbox). Run the identical comparison on legal e-discovery (the TREC Legal benchmark) and the order flips: a general-purpose dense+rerank stack (NVIDIA's retrieval NIMs) wins there instead — same systems, opposite result, because the task is different. All references anonymized; the public make demo reproduces the method, not the figures. Full write-up in the benchmark post and the case study.

On the roadmap

mailrag is built to be one node in a private context stack — so the next steps make it easier for agents to reach, and keep its memory current:

MCP server — shipped ✓

A multi-collection stdio server exposes list_collections, search_email, answer_question, and attachment fetch over the Model Context Protocol — so any agent, yours or a teammate's, can query your mail without touching the internals.
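At the wire level, an MCP tool invocation is a JSON-RPC 2.0 request with method `tools/call`, sent over stdio as one JSON message per line. The sketch below only builds such a request; the argument names (`collection`, `query`) and the collection name are assumptions, since the tools' schemas are not given here, and a real client must complete the protocol's `initialize` handshake before calling tools.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a single line of JSON."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Argument names below are hypothetical.
line = tool_call(
    "search_email",
    {"collection": "enron-demo", "query": "Q3 budget approval"},
)
decoded = json.loads(line)
```

In practice an agent framework or an MCP client library produces these messages; the sketch is only meant to show how little of mailrag's internals an agent needs to know.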

Live ingestion

Move from one-time imports to incremental ingest of incoming mail, so the index stays current — a living context source, not a static snapshot.

Guided TUI — shipped ✓

./mailrag wizard is now a full-screen terminal app (Textual): pick a persona, scope folders on a tree, review the plan, and watch the run live — with the calibrate and confirm-before-spend gates as dialogs.