Building email search that actually finds the answer · part 1: retrieval
I rebuilt email search from scratch and measured every technique on 360 real questions. Recall went from 46% to 93% — and the two biggest wins weren’t a fancier model. (I also used NVIDIA’s general-purpose stack as a yardstick — more on that below.)
★ The final step changes the question from “find the exact email” to “find its conversation” — a different, easier (and more useful) measure. Detail and caveats below.
[Chart: techniques ranked by recall@5 points gained. The fancy embedding model barely matters; understanding the conversation does. ★ Thread reconstruction is measured as thread-recall; see methodology.]
Everything below is scored one way: you ask a question, the system returns a ranked list of emails, and we check whether the right one is near the top. The headline metric is recall@5 — how often the correct email lands in the top 5 results. 46% means “right answer in the top 5, just under half the time.” 93% means “almost always.”
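To make the metric concrete, here is a minimal sketch of recall@5 as described above. It is not the actual eval script: the function names and the toy data are made up for illustration, and it assumes each question has exactly one correct email.

```python
def recall_at_k(ranked_ids, correct_id, k=5):
    """Return 1 if the correct email appears in the top k results, else 0."""
    return int(correct_id in ranked_ids[:k])


def mean_recall_at_k(results, k=5):
    """Average recall@k over (ranked_ids, correct_id) pairs, as a percentage."""
    hits = sum(recall_at_k(ranked, correct, k) for ranked, correct in results)
    return 100.0 * hits / len(results)


# Hypothetical toy run: three questions, one correct email each.
toy_results = [
    (["e7", "e2", "e9", "e1", "e4", "e3"], "e2"),  # correct email at rank 2: hit
    (["e5", "e8", "e6", "e0", "e2", "e1"], "e1"),  # correct email at rank 6: miss
    (["e3", "e1", "e4", "e9", "e5", "e7"], "e5"),  # correct email at rank 5: hit
]
score = mean_recall_at_k(toy_results)
print(f"recall@5 = {score:.1f}%")
```

Two hits out of three questions gives roughly 66.7%. Note that rank 5 counts and rank 6 does not, which is why small ranking changes near the cutoff can move the number.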
The modern starting point. An embedding turns a piece of text into a long list of numbers — a “vector” — that captures its meaning. Two emails about the same thing land near each other, even if they use different words. We call this dense search because every number in the vector carries a little signal. Ask a question, embed it, return the nearest emails.
It’s good, and it’s the floor we build on. Its weakness: by capturing the gist, it blurs the specifics — a dozen emails about “contract terminations” look almost identical, so the one you want doesn’t stand out.
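As a rough illustration of the "embed it, return the nearest emails" step, here is a sketch that ranks emails by cosine similarity. The three-dimensional vectors are invented for the example (real embeddings are far longer and come from a model), and the function names are hypothetical. It also assumes no vector is all zeros.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_search(query_vec, email_vecs, top_k=5):
    """Return the ids of the top_k emails whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), email_id) for email_id, vec in email_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [email_id for _, email_id in scored[:top_k]]


# Made-up vectors standing in for real embeddings.
email_vecs = {
    "budget_approval": [0.9, 0.1, 0.0],
    "lunch_plans": [0.0, 0.2, 0.9],
    "hardware_quote": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
top = dense_search(query_vec, email_vecs, top_k=2)
print(top)
```

The two spending-related emails score almost identically (about 0.996 and 0.980), which is the blurring problem in miniature: close in meaning, hard to tell apart.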
Sparse search is the opposite idea: match the actual words. Old-school keyword search (think Ctrl-F or classic BM25) is sparse but dumb: it weights words with fixed counting statistics (how often a term appears in an email, how rare it is across the mailbox) and has no learned sense of which words matter. Learned sparse is smarter: the model has learned which words carry signal, so it locks onto the rare, high-value tokens (your project codenames, product SKUs, ticket numbers, acronyms) that a meaning-vector smears together. (One honest note: the model is pre-trained, not trained on your mailbox; it’s good at your jargon because jargon is rare and specific, exactly what exact-matching is for.)
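One way to picture sparse scoring: each text becomes a small dictionary of token weights, and the score is a dot product over the tokens the query and the email share. The sketch below uses hand-picked weights purely for illustration; a real learned-sparse model would produce them, and the names here are hypothetical.

```python
def sparse_score(query_weights, email_weights):
    """Dot product over the tokens that appear in both the query and the email."""
    return sum(
        weight * email_weights[token]
        for token, weight in query_weights.items()
        if token in email_weights
    )


# Invented weights: the rare codename gets a high weight, the filler word almost none.
query = {"falcon": 2.1, "budget": 0.8, "the": 0.01}
emails = {
    "falcon_signoff": {"falcon": 1.9, "budget": 0.7, "approved": 0.5},
    "generic_budget": {"budget": 0.9, "the": 0.02, "quarterly": 0.6},
}
scores = {email_id: sparse_score(query, weights) for email_id, weights in emails.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The email that shares the rare codename wins by a wide margin, even though both emails mention the budget.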
Run dense and learned-sparse together and you get hybrid search, the best of both: meaning and exact terms. On its own the bump is modest (+3.3 recall@5 points), but it’s the difference-maker on jargon-heavy questions, and it stacks.
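The post doesn't say how the dense and sparse results are combined, so treat the following as one common option rather than the method used here: reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. The constant `k=60` is the value usually quoted for RRF, and the email ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to an email's score."""
    scores = {}
    for ranking in rankings:
        for rank, email_id in enumerate(ranking, start=1):
            scores[email_id] = scores.get(email_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda email_id: (-scores[email_id], email_id))


# Hypothetical outputs of the two retrievers for the same question.
dense_ranking = ["e1", "e2", "e3", "e4"]
sparse_ranking = ["e3", "e1", "e5", "e2"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Emails that both retrievers rank well rise to the top, while an email only one retriever found still makes the list. Working from ranks also sidesteps the problem that dense and sparse scores live on different scales.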
The biggest single embedding-side win, and an easy one to miss. Half of email is context-free fragments: “Approved, go ahead.” “See attached.” “+1.” There’s nothing in that text to search — the meaning lives in the thread around it. So before indexing each email, we prepend a one-line summary that restores its topic and context (a fictional example: “Re: Project Falcon budget — finance signs off the extra hardware spend”). Now the terse reply has something to match.
This single move added +12.8 recall points — bigger than the embedder and reranker choices combined. It’s the first sign of the post’s real lesson: the wins are in understanding the conversation.
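The indexing side of this is tiny. Here is a sketch that assumes the one-line summary already exists; how it gets generated isn't covered in this post, so the code takes it as an input. The `Email` class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Email:
    email_id: str
    body: str
    context_summary: str = ""  # one-line summary of topic and thread context


def text_to_index(email):
    """Prepend the context summary so a terse reply has something to match."""
    if not email.context_summary:
        return email.body
    return f"{email.context_summary}\n{email.body}"


terse = Email(
    email_id="e42",
    body="Approved, go ahead.",
    context_summary="Re: Project Falcon budget: finance signs off the extra hardware spend",
)
indexed = text_to_index(terse)
print(indexed)
```

The stored email is untouched; only the text handed to the embedder and the sparse indexer changes, which is why the technique is passive at search time.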
Retrieval is fast but rough. A reranker is a slower, smarter model that takes the top ~20 candidates and re-reads each one together with the question, scoring how well it actually answers. You only run it on the shortlist, so the cost is contained.
It sharpens the top of the list (+2.5 recall@5 points), but with a twist worth knowing: a reranker judges each email on its own, so for questions that need piecing together several emails (“what’s the current status, including the agreed changes?”) it can actually demote the right answer, because no single email looks like the whole answer. Which sets up the last step.
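The shape of the rerank step can be sketched as follows. The scoring function here is a deliberately crude word-overlap stand-in, not a real reranker model; the point is the control flow, where only the shortlist gets re-scored and everything below it keeps its retrieval order. All names are hypothetical.

```python
def toy_overlap_score(question, text):
    """Stand-in scorer: fraction of question words that appear in the email text."""
    question_words = set(question.lower().split())
    text_words = set(text.lower().split())
    return len(question_words & text_words) / len(question_words)


def rerank(question, candidates, score_fn, shortlist_size=20):
    """Re-score only the top candidates; leave the tail in retrieval order."""
    shortlist = candidates[:shortlist_size]
    tail = candidates[shortlist_size:]
    rescored = sorted(
        shortlist,
        key=lambda cand: score_fn(question, cand["text"]),
        reverse=True,
    )
    return rescored + tail


question = "when is the falcon budget review"
candidates = [
    {"id": "e1", "text": "lunch on friday"},
    {"id": "e2", "text": "the falcon budget review is on monday"},
    {"id": "e3", "text": "budget spreadsheet attached"},
]
reranked_ids = [c["id"] for c in rerank(question, candidates, toy_overlap_score)]
print(reranked_ids)
```

Because each candidate is scored in isolation against the question, an email holding only part of the answer gets a partial score, which is the mechanism behind the multi-email weakness described above.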
Stop searching for the email. Find the conversation.
The leap. Every technique so far chases the one perfect email. But you almost never need the email — you need the conversation it’s in. Thread reconstruction flips the goal: retrieve any email from the relevant thread, then pull in the whole reply chain and read it.
Why it’s huge: finding the exact email is hard (it succeeds 61.7% of the time), but finding some email in the right conversation is easy (89%+), and the conversation contains the answer. You trade “find the needle” for “find the right haystack-section”, a far easier target, and it rescues exactly the multi-email questions reranking struggled with.
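Here is a sketch of both halves of that idea: expanding hits into whole threads, and scoring a hit at the thread level. The lookup tables and function names are hypothetical, and the thread-level check is my reading of "some email in the right conversation"; the precise definition lives in the methodology.

```python
def expand_to_threads(hit_ids, thread_of, emails_in_thread, max_threads=5):
    """Map retrieved emails to their threads (in rank order) and return full reply chains."""
    thread_ids = []
    for email_id in hit_ids:
        thread_id = thread_of[email_id]
        if thread_id not in thread_ids:
            thread_ids.append(thread_id)
        if len(thread_ids) == max_threads:
            break
    return [(thread_id, emails_in_thread[thread_id]) for thread_id in thread_ids]


def thread_hit(hit_ids, correct_id, thread_of, k=5):
    """True if any top-k hit belongs to the same thread as the correct email."""
    target_thread = thread_of[correct_id]
    return any(thread_of[email_id] == target_thread for email_id in hit_ids[:k])


# Invented mailbox: thread t1 has three emails, t2 and t3 have one each.
thread_of = {"e1": "t1", "e2": "t1", "e3": "t1", "e4": "t2", "e5": "t3"}
emails_in_thread = {"t1": ["e1", "e2", "e3"], "t2": ["e4"], "t3": ["e5"]}

hits = ["e4", "e3", "e5"]  # retrieval missed e1 but found its sibling e3
exact = "e1" in hits[:5]
by_thread = thread_hit(hits, "e1", thread_of)
threads = expand_to_threads(hits, thread_of, emails_in_thread)
print(exact, by_thread, threads)
```

In the toy case the exact email is a miss but the thread is a hit, and the expanded thread includes the email that retrieval never surfaced.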
Reranking helps you find the right email. Thread reconstruction means you don’t have to.
Step back and rank the techniques by how much each added (the chart up top). The result genuinely surprised me: the choice of embedding model — the thing everyone obsesses over — barely moved the needle. The two biggest levers were both about understanding the conversation: reconstructing the thread (active, at search time) and contextual summaries (passive, baked into the index). Together they dwarf everything else.
The wins are in understanding the conversation, not in picking a fancier embedding model.
Since I’d built the system to be backend-agnostic (swap the search models by config), I could run a clean side-by-side: my email-tuned, local bge-m3 hybrid against NVIDIA’s hosted retrieval stack (their NeMo Retriever embedding service plus their reranking service). Same questions, same everything else.
A fair note up front: NVIDIA’s stack is a general-purpose retriever — designed to search any documents well, not tuned for email. It’s a strong yardstick precisely because it’s good at the general job. So the question isn’t “is mine better” — it’s “how much does specialising for email buy you?” On email questions, the email-specialised hybrid does measurably better — even without a reranker:
| setup | recall@5 (%) |
|---|---|
| bge-m3 hybrid (mine, no rerank) | 61.7 |
| NVIDIA dense + rerank | 56.7 |
| NVIDIA dense | 51.9 |
I double-checked on public Enron-QA (360 more questions, independent data) — same ordering. So it’s not my mailbox.
Here’s the part that makes the point. I ran the identical comparison on the kind of task a general retriever is actually designed for: legal e-discovery (the public TREC Legal benchmark — “find every document relating to topic X,” judged by real lawyers). The order flips, and NVIDIA’s stack pulls clearly ahead:
| setup | P@10 (e-discovery) |
|---|---|
| NVIDIA dense + rerank | 0.50 |
| NVIDIA dense | 0.37 |
| bge-m3 hybrid + rerank (mine) | 0.30 |
| bge-m3 hybrid (mine) | 0.20 |
Same systems, opposite results, because the task is different. Email Q&A is pinpoint: one specific email holds the answer, often hinging on a specific name or ID, which is exactly where exact-term (sparse) matching shines. E-discovery is broad: hundreds of documents are relevant and you want them all, which is where semantic (dense) search plus a relevance reranker shines.
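The two tables also use different metrics, which reflects the same split. For reference, here is a minimal sketch of precision@10 under the usual definition (the share of the top 10 results that are relevant). Exactly how the benchmark judgments were applied isn't shown here, so the function and the data are illustrative only.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k results that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


# Invented e-discovery style case: many relevant documents, not just one.
ranked = [f"d{i}" for i in range(12)]        # d0 .. d11 in ranked order
relevant = {"d0", "d2", "d5", "d11", "d40"}  # d11 is outside the top 10, d40 was never retrieved
p10 = precision_at_k(ranked, relevant)
print(p10)
```

With one correct email per question, recall@5 only asks whether that email showed up. With many relevant documents, P@10 rewards filling the top of the list with them, which is a different job.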
There is no universally best retrieval recipe. The right one depends on the shape of the question. For email: hybrid + understand the conversation. For broad document discovery: dense + rerank.
For the readers who want to trust the numbers (you should always ask):
scripts/eval/).