Turns a mail archive into a queryable knowledge base — hybrid dense+sparse retrieval, thread-aware answers assembled from whole conversations, optional local-LLM cleanup. Runs on your hardware, on open models, with nothing required to leave your network.
$ git clone https://github.com/fmasi/mailrag && cd mailrag
$ make demo   # Qdrant + a thread-aware index over 100 public Enron emails
$ mailrag ask "who approved the Q3 budget, and when?"
→ retrieves the matching message, expands to its full thread, and answers from the whole conversation, with citations.
Making your mailbox searchable by an AI usually means handing your entire email history to someone else's servers. For your real correspondence, that's a non-starter.
So I built the opposite: an email RAG you run yourself — on your hardware, on open models, with nothing required to leave your network. And the result isn't just search — these emails are private context my own AI agents can draw on for total recall, without renting my memory to anyone. mailrag covers email; parley covers calls and meetings. Independent tools; my agents know about both and reach for whatever fits.
The flagship: match a single message, then answer from its entire conversation. This lifts recall@5 from 62% to 93%, the single biggest lever, and it needs no LLM.
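As a rough illustration of "match one message, answer from its thread", here is a minimal sketch. It is not mailrag's actual code: the `Message` fields, the `expand_to_threads` name, and the `max_threads` cap are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record shape; the real index schema is not shown in the source.
    message_id: str
    thread_id: str
    sent_at: str  # ISO 8601 timestamp, so string order matches time order
    body: str


def expand_to_threads(hit_ids, messages, max_threads=3):
    """Map ranked message hits to whole threads, best-ranked thread first."""
    by_id = {m.message_id: m for m in messages}
    by_thread = defaultdict(list)
    for m in messages:
        by_thread[m.thread_id].append(m)

    threads, seen = [], set()
    for mid in hit_ids:
        msg = by_id.get(mid)
        if msg is None or msg.thread_id in seen:
            continue  # unknown id, or this thread was already pulled in by a better hit
        seen.add(msg.thread_id)
        # Return the conversation in chronological order so it reads top to bottom.
        threads.append(sorted(by_thread[msg.thread_id], key=lambda m: m.sent_at))
        if len(threads) == max_threads:
            break
    return threads
```

The expanded threads, rather than the single matched message, would then be handed to whatever produces the answer. No model call is involved in the expansion itself.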
bge-m3 dense + learned sparse vectors, RRF-fused in Qdrant. Gets both the concept and the rare exact token — acronyms, IDs, reference numbers.
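For readers unfamiliar with RRF (reciprocal rank fusion), the idea can be sketched in a few lines of plain Python. This only illustrates the formula; in mailrag the fusion happens inside Qdrant. The function name and the constant `k=60` (a commonly used default) are assumptions, not values taken from the project.

```python
def rrf_fuse(rankings, k=60, top_n=5):
    """Fuse several ranked id lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense_hits = ["m1", "m2", "m3"]   # illustrative ids from the dense search
sparse_hits = ["m3", "m1", "m4"]  # illustrative ids from the sparse search
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, the dense and sparse scores never need to be put on a common scale. A message that both retrievers rank well rises above one that only a single retriever found.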
Reply-chain stripping, calendar-invite collapsing, noise/newsletter filtering, and exact-text chunk dedup.
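Two of these cleanup steps are easy to picture in code. The sketch below is a simplified stand-in, not the project's implementation: the quote markers it looks for (`>` prefixes, an "On ... wrote:" line, an "-----Original Message-----" separator) are assumptions about common mail clients, and both function names are hypothetical.

```python
import hashlib
import re

# Assumption: a line such as "On Mon, Jan 1, 2001, Bob wrote:" starts quoted history.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")


def strip_reply_chain(body):
    """Keep only the newly written part of a reply."""
    kept = []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or line.startswith("-----Original Message-----"):
            break  # everything below is quoted history
        if line.lstrip().startswith(">"):
            continue  # inline quoted line
        kept.append(line)
    return "\n".join(kept).strip()


def dedup_chunks(chunks):
    """Drop chunks whose text is byte-for-byte identical to an earlier one."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Stripping quoted history before chunking matters because a long thread otherwise repeats its first message in every reply, and those copies crowd the top results.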
Optional summarize step: a local LLM writes a per-email summary + noise judgement, content-addressed and cached so re-runs are free. Point it at a model on 127.0.0.1.
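"Content-addressed and cached" can be sketched as follows. The key derivation, the on-disk layout, and the names `cache_key` and `cached_summary` are assumptions for illustration. The `summarize` callable stands in for a call to a local model and is not a real mailrag API.

```python
import hashlib
import json
from pathlib import Path


def cache_key(email_text, model, prompt_version):
    """Derive the key from everything that can change the output."""
    payload = json.dumps([email_text, model, prompt_version], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_summary(email_text, model, prompt_version, summarize, cache_dir):
    """Return the cached result if present, otherwise call `summarize` once and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(email_text, model, prompt_version)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = summarize(email_text)  # hypothetical: returns a JSON-serializable dict
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Keying on the text plus the model and prompt version means an unchanged email is never re-summarized, while changing the model or prompt naturally invalidates old entries.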
Public Enron corpus, local .eml archives, or Azure Blob — behind one EmailLoader interface, ready for live sources next.
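A single loader interface might look roughly like the sketch below. Only the name `EmailLoader` comes from the text above; the `load` method, the `RawEmail` record, and the `EmlDirLoader` class are hypothetical and exist only to show the shape of such an abstraction, using the standard-library `email` package.

```python
from dataclasses import dataclass
from email import message_from_bytes, policy
from pathlib import Path
from typing import Iterator, Protocol


@dataclass
class RawEmail:
    # Hypothetical minimal record handed to the rest of the pipeline.
    message_id: str
    subject: str
    body: str


class EmailLoader(Protocol):
    """Any source (corpus, local archive, blob store) that can yield emails."""

    def load(self) -> Iterator[RawEmail]: ...


class EmlDirLoader:
    """Reads every .eml file under a directory (illustrative implementation)."""

    def __init__(self, root):
        self.root = Path(root)

    def load(self):
        for path in sorted(self.root.rglob("*.eml")):
            msg = message_from_bytes(path.read_bytes(), policy=policy.default)
            part = msg.get_body(preferencelist=("plain",))
            yield RawEmail(
                message_id=str(msg["Message-ID"] or path.name),
                subject=str(msg["Subject"] or ""),
                body=part.get_content() if part is not None else "",
            )
```

With this shape, a new source only has to provide a `load` method that yields the same records, so the indexing code downstream does not change.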
A 360-query eval that prices every technique, controls for confounds, reports significance — and in several cases overturned the intuitive choice.
From a raw mailbox to an answered question — every stage runs on your own hardware.
Stacking the ladder — each technique added one at a time and individually measured on 360 real-email questions. recall@5 = how often the right email lands in the top 5 results:
| Technique | recall@5 | gain (pp) |
|---|---|---|
| plain dense (baseline) | 46% | — |
| + learned sparse | 49% | +3 |
| + contextual summary | 62% | +13 |
| + reranking | 64% | +2 |
| + thread reconstruction ★ | 93% | +29 |
★ The final step switches the goal from “find the exact email” to “find its thread” — a legitimately easier, and more useful, target. The two biggest levers (thread reconstruction +29, contextual summary +13) are both about understanding the conversation — not a fancier embedding model. Read the full benchmark →
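To make the metric and the change of target concrete, here is a small sketch of recall@k scored at message level and at thread level. The function names and the toy data are assumptions for illustration and are unrelated to the benchmark's own harness or figures.

```python
def recall_at_k(results, gold, k=5):
    """Fraction of queries whose gold item appears in the top-k results."""
    hits = sum(1 for q, ranked in results.items() if gold[q] in ranked[:k])
    return hits / len(results)


def to_threads(ranked_ids, thread_of):
    """Map ranked message ids to thread ids, dropping repeats but keeping order."""
    return list(dict.fromkeys(thread_of[mid] for mid in ranked_ids))


# Toy data: q2 retrieves a sibling message ("c") from the right thread,
# but not the exact gold message ("e").
results = {"q1": ["a", "b"], "q2": ["c", "d"]}
gold = {"q1": "a", "q2": "e"}
thread_of = {"a": "t1", "b": "t1", "c": "t2", "d": "t3", "e": "t2"}

message_level = recall_at_k(results, gold)
thread_level = recall_at_k(
    {q: to_threads(r, thread_of) for q, r in results.items()},
    {q: thread_of[g] for q, g in gold.items()},
)
```

In the toy data the second query counts as a miss at message level and as a hit at thread level. That is the sense in which the thread target is easier, and it is also why answering from the whole conversation still works.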
Measured on a real work mailbox: 360 questions, each with a known correct answer (a hard label, no subjective judging), then cross-checked on the public Enron-QA set (same ordering, so it isn't a quirk of one inbox). Run the identical comparison on legal e-discovery (the TREC Legal benchmark) and the order flips: a general-purpose dense+rerank stack wins there instead. Same systems, opposite result, because the task is different. All references anonymized; the public make demo reproduces the method, not the figures. Full write-up in the benchmark post and the case study.
mailrag is built to be one node in a private context stack — so the next steps make it easier for agents to reach, and keep its memory current:
Expose search, ask, and attachment fetch over the Model Context Protocol, so any agent — yours or a teammate's — can query your mail without touching the internals.
Move from one-time imports to incremental ingest of incoming mail, so the index stays current — a living context source, not a static snapshot.
A full-screen terminal UI for the cleanup pipeline: pick a persona, watch the funnel, and approve the calibrate gate — replacing today's prompt-by-prompt flow.