Building email search that actually finds the answer · part 1: retrieval

Two techniques doubled my email search recall, and neither was a fancier model

I rebuilt email search from scratch and measured every technique on 360 real questions. Recall went from 46% to 93% — and the two biggest wins weren’t a fancier model. (I also used NVIDIA’s general-purpose stack as a yardstick — more on that below.)

step	recall@5	gain	what it does
plain dense	45.6%	the baseline	Match emails by meaning.
+ learned sparse	48.9%	+3.3	Also match your exact jargon: codenames, IDs, SKUs.
+ contextual summary	61.7%	+12.8	Give terse replies their topic & context first.
+ reranking	64.2%	+2.5	A smarter second pass re-scores the top hits.
+ thread reconstruction	93.3%	the leap ★	Find the whole conversation, not one email.

★ The final step changes the question from “find the exact email” to “find its conversation” — a different, easier (and more useful) measure. Detail and caveats below.

The surprise: what actually moved the needle

Ranked by how much each technique added. The fancy embedding model barely matters — understanding the conversation does.

technique	recall@5 points gained
thread reconstruction	+29 ★
contextual summary	+12.8
learned sparse	+3.3
reranking	+2.5
which embedder you pick	(no figure given)

★ thread reconstruction measured as thread-recall; see methodology.

The one-minute version. Email search usually means keyword matching, and it usually misses. Modern “meaning-based” search helps — but in my tests the embedding model you choose barely mattered. What mattered was understanding the conversation: writing a short context summary for each email before indexing it, and — the big one — not hunting for the single answer email at all, but finding its whole thread and reading it. Stack everything and the right answer is reachable 93% of the time, up from 46%. Along the way I used NVIDIA’s hosted retrieval stack as a yardstick — it’s a strong general-purpose retriever, built for broad search and not tuned for email — and the takeaway isn’t “who won.” It’s that an email-specialised recipe does better on email, while NVIDIA’s stack does better on the broad document-search task it’s actually built for. Task-fit, not brand.

What we’re measuring (and the one number to know)

Everything below is scored one way: you ask a question, the system returns a ranked list of emails, and we check whether the right one is near the top. The headline metric is recall@5 — how often the correct email lands in the top 5 results. 46% means “right answer in the top 5, just under half the time.” 93% means “almost always.”
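To make the metric concrete, here is a minimal sketch of how recall@5 can be computed. The function name, the email ids, and the toy rankings are made up for illustration; this is not the benchmark script from the repo.

```python
def recall_at_k(ranked_results, correct_ids, k=5):
    """Fraction of questions whose correct email id appears in the top k results."""
    hits = sum(
        1
        for ranked, correct in zip(ranked_results, correct_ids)
        if correct in ranked[:k]
    )
    return hits / len(correct_ids)


# Toy data (hypothetical): three questions, each with one known answer email.
ranked_results = [
    ["e7", "e2", "e9", "e1", "e4", "e3"],  # answer e1 is at rank 4: hit
    ["e5", "e8", "e6", "e2", "e0", "e3"],  # answer e3 is at rank 6: miss
    ["e4", "e1", "e0", "e9", "e7", "e2"],  # answer e4 is at rank 1: hit
]
correct_ids = ["e1", "e3", "e4"]

score = recall_at_k(ranked_results, correct_ids)
print(f"recall@5 = {score:.1%}")
```

Two of the three toy questions have their answer in the top 5, so the score is two thirds. The real benchmark does the same count over 360 questions.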

The test set: 360 real questions over a work mailbox, each with a known correct answer email. I also re-ran the whole thing on a public dataset (Enron emails) to make sure it wasn’t a quirk of my inbox — same result. Details at the end.

The build-up, one technique at a time

1 · Dense embeddings — search by meaning (45.6%)

The modern starting point. An embedding turns a piece of text into a long list of numbers — a “vector” — that captures its meaning. Two emails about the same thing land near each other, even if they use different words. We call this dense search because every number in the vector carries a little signal. Ask a question, embed it, return the nearest emails.

It’s good, and it’s the floor we build on. Its weakness: by capturing the gist, it blurs the specifics — a dozen emails about “contract terminations” look almost identical, so the one you want doesn’t stand out.
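The “embed the question, return the nearest emails” loop can be sketched in a few lines. The vectors below are hand-written three-number stand-ins for what a real embedding model would produce (the models in this post use 1024 numbers), and cosine similarity is assumed as the closeness measure.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def dense_search(query_vec, index, k=5):
    """Return the ids of the k emails whose vectors are closest to the query."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [email_id for email_id, _ in scored[:k]]


# Hypothetical toy index: email id -> embedding vector.
index = {
    "budget-approval": [0.9, 0.1, 0.0],
    "contract-termination": [0.1, 0.9, 0.1],
    "lunch-plans": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for an embedded question about budgets

top = dense_search(query_vec, index, k=2)
print(top)
```

A real system would replace the sorted scan with a vector index, but the ranking logic is the same.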

2 · Learned sparse — match your exact words (48.9%, +3.3)

Sparse search is the opposite idea: match the actual words. Old-school keyword search (think Ctrl-F or classic BM25) is sparse but dumb: at best it weights words by fixed counting statistics (how often a word appears in an email and how rare it is across the mailbox). Learned sparse is smarter: the model learned which words carry signal, so it locks onto the rare, high-value tokens (your project codenames, product SKUs, ticket numbers, acronyms) that a meaning-vector smears together. (One honest note: the model is pre-trained, not trained on your mailbox; it’s good at your jargon because jargon is rare and specific, exactly what exact-matching is for.)

Run dense and learned-sparse together and you get hybrid search — the best of both: meaning and exact terms. On its own the bump is modest (+3.3), but it’s the difference-maker on jargon-heavy questions, and it stacks.
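One common way to merge a dense ranking and a sparse ranking is reciprocal rank fusion (RRF). The post does not say which fusion method its hybrid uses, so treat this as an illustrative assumption; the ids and the constant k=60 are conventional placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, email_id in enumerate(ranking, start=1):
            scores[email_id] = scores.get(email_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings for the same question from the two retrievers.
dense_ranking = ["e1", "e2", "e3", "e4"]   # ranked by meaning
sparse_ranking = ["e3", "e1", "e5", "e2"]  # ranked by exact-term match

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Emails that both retrievers rank highly rise to the top, while an email only one of them found still gets a place further down.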

3 · Contextual summaries — give terse emails something to match (61.7%, +12.8)

The biggest single embedding-side win, and an easy one to miss. Half of email is context-free fragments: “Approved, go ahead.” “See attached.” “+1.” There’s nothing in that text to search — the meaning lives in the thread around it. So before indexing each email, we prepend a one-line summary that restores its topic and context (a fictional example: “Re: Project Falcon budget — finance signs off the extra hardware spend”). Now the terse reply has something to match.

This single move added 12.8 recall points, more than the embedder and reranker choices combined. It’s the first sign of the post’s real lesson: the wins are in understanding the conversation.
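The indexing side of this can be as small as the sketch below. How the summary itself gets written is not covered here, so it is simply passed in; the function name is hypothetical and the example text is the fictional one from above.

```python
def text_to_index(summary, body):
    """Build the text that gets embedded: context summary first, then the email."""
    return f"{summary}\n\n{body}"


# Fictional example: a terse reply that has nothing searchable on its own.
body = "Approved, go ahead."
summary = "Re: Project Falcon budget - finance signs off the extra hardware spend"

indexed_text = text_to_index(summary, body)
print(indexed_text)
```

The original email stays intact at the end of the indexed text; only what the embedder sees has changed.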

4 · Reranking — a smarter second opinion (64.2%, +2.5)

Retrieval is fast but rough. A reranker is a slower, smarter model that takes the top ~20 candidates and re-reads each one together with the question, scoring how well it actually answers. You only run it on the shortlist, so the cost is contained.

It sharpens the top of the list (+2.5) — but with a twist worth knowing: a reranker judges each email on its own, so for questions that need piecing together several emails (“what’s the current status, including the agreed changes?”) it can actually demote the right answer, because no single email looks like the whole answer. Which sets up the last step.
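The shape of a reranking pass, sketched with a toy scorer. A real reranker is a trained model that reads the question and the email together; the word-overlap function here is only a stand-in so the example runs on its own, and all names are hypothetical.

```python
def rerank(question, candidates, score_fn, shortlist=20):
    """Re-score only the first `shortlist` candidates; leave the rest untouched."""
    head, tail = candidates[:shortlist], candidates[shortlist:]
    rescored = sorted(head, key=lambda text: score_fn(question, text), reverse=True)
    return rescored + tail


def toy_score(question, text):
    """Stand-in scorer: number of words the question and the email share."""
    return len(set(question.lower().split()) & set(text.lower().split()))


candidates = [
    "lunch on friday?",
    "the falcon budget was approved by finance",
    "falcon kickoff notes",
]
reranked = rerank("who approved the falcon budget", candidates, toy_score, shortlist=2)
print(reranked)
```

Note that each candidate is scored alone against the question, which is exactly why a reranker can miss answers that are spread across several emails.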

5 · Thread reconstruction — stop hunting the email, find the conversation (93.3%)

Stop searching for the email. Find the conversation.

The leap. Every technique so far chases the one perfect email. But you almost never need the email — you need the conversation it’s in. Thread reconstruction flips the goal: retrieve any email from the relevant thread, then pull in the whole reply chain and read it.

Why it’s huge: finding the exact email is hard (it works 61.7% of the time), but finding some email in the right conversation is easy (89%+), and the conversation contains the answer. You trade “find the needle” for “find the right section of the haystack”, a far easier target, and it rescues exactly the multi-email questions reranking struggled with.
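A minimal sketch of the expansion step, assuming every email already carries a thread id (how threads are detected is not shown here, and the ids are invented). The toy case mirrors the argument above: the exact answer email is not retrieved, but its thread is.

```python
from collections import defaultdict


def build_threads(emails):
    """Group email ids by thread id, keeping the input order within each thread."""
    threads = defaultdict(list)
    for email_id, thread_id in emails:
        threads[thread_id].append(email_id)
    return threads


def expand_to_threads(hits, thread_of, threads):
    """Replace each retrieved email with its whole thread, skipping repeats."""
    seen, expanded = set(), []
    for email_id in hits:
        thread_id = thread_of[email_id]
        if thread_id not in seen:
            seen.add(thread_id)
            expanded.append(threads[thread_id])
    return expanded


# Hypothetical mailbox: (email id, thread id) pairs.
emails = [("e1", "t-falcon"), ("e2", "t-falcon"), ("e3", "t-lunch"), ("e4", "t-falcon")]
thread_of = dict(emails)
threads = build_threads(emails)

hits = ["e2", "e3"]  # the exact answer, e4, was not retrieved
context = expand_to_threads(hits, thread_of, threads)
found = any("e4" in thread for thread in context)  # but its thread was
print(context, found)
```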

Reranking helps you find the right email. Thread reconstruction means you don’t have to.

So what actually mattered?

Step back and rank the techniques by how much each added (the chart up top). The result genuinely surprised me: the choice of embedding model — the thing everyone obsesses over — barely moved the needle. The two biggest levers were both about understanding the conversation: reconstructing the thread (active, at search time) and contextual summaries (passive, baked into the index). Together they dwarf everything else.

The wins are in understanding the conversation, not in picking a fancier embedding model.

A yardstick: general-purpose vs email-specialised

Since I’d built the system to be backend-agnostic (swap the search models by config), I could run a clean side-by-side: my email-tuned, local bge-m3 hybrid against NVIDIA’s hosted retrieval stack (their NeMo Retriever embedding service plus their reranking service). Same questions, same everything else.

A fair note up front: NVIDIA’s stack is a general-purpose retriever — designed to search any documents well, not tuned for email. It’s a strong yardstick precisely because it’s good at the general job. So the question isn’t “is mine better” — it’s “how much does specialising for email buy you?” On email questions, the email-specialised hybrid does measurably better — even without a reranker:

setup	recall@5 (%)
bge-m3 hybrid (mine, no rerank)	61.7
NVIDIA dense + rerank	56.7
NVIDIA dense	51.9

I double-checked on public Enron-QA (360 more questions, independent data) — same ordering. So it’s not my mailbox.

The confirmation: on the job it’s built for, NVIDIA’s stack pulls ahead

Here’s the part that makes the point. I ran the identical comparison on the kind of task a general retriever is actually designed for: legal e-discovery (the public TREC Legal benchmark: “find every document relating to topic X,” judged by real lawyers). The score is P@10, the fraction of the top 10 results judged relevant. The order flips, and NVIDIA’s stack pulls clearly ahead:

setup	P@10 (e-discovery)
NVIDIA dense + rerank	0.50
NVIDIA dense	0.37
bge-m3 hybrid + rerank (mine)	0.30
bge-m3 hybrid (mine)	0.20

Same systems, opposite results — because the task is different. Email Q&A is pinpoint — one specific email answers, often hinging on a specific name or ID, which is exactly where exact-term (sparse) matching shines. E-discovery is broad — hundreds of documents are relevant and you want them all, which is where semantic (dense) search plus a relevance reranker shines.
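The two tasks are also scored differently, which is worth seeing side by side with recall@5. Precision at 10 (P@10) asks what fraction of the top 10 results are relevant, given that many documents can be. A small sketch with made-up document ids:

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are in the set of relevant documents."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


# Hypothetical ranking of 12 documents and the set judged relevant.
ranked = [f"d{i}" for i in range(1, 13)]
relevant = {"d1", "d3", "d4", "d8", "d10", "d12"}

p10 = precision_at_k(ranked, relevant)
print(p10)
```

Five of the top ten are relevant here (d12 falls outside the cut), so this toy ranking would score the same 0.50 as the best row in the table.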

There is no universally best retrieval recipe. The right one depends on the shape of the question. For email: hybrid + understand the conversation. For broad document discovery: dense + rerank.

If you build one of these, measure your chunks against the model’s maximum input length. Every embedding model has one, and if a chunk (plus any summary you prepend) is longer, it typically gets truncated silently: you lose the tail and never see an error. Limits vary wildly (one model I used caps at 512 tokens; another reads 8,192). Measure your real chunk lengths first, and during testing use a fail-loud setting so over-length chunks raise an error instead of quietly losing content. It takes five minutes and saves you from a misleading benchmark.
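If your library has no fail-loud switch, a check like the following does the same job before indexing. It is a sketch: the whitespace token counter is a rough stand-in (a real check should count with the embedding model’s own tokenizer), and the function names are hypothetical.

```python
def check_chunk_lengths(chunks, count_tokens, max_tokens):
    """Raise instead of letting over-length chunks be truncated later."""
    too_long = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            too_long.append((i, n))
    if too_long:
        raise ValueError(
            f"{len(too_long)} chunk(s) exceed {max_tokens} tokens "
            f"(index, length): {too_long}"
        )


def whitespace_tokens(text):
    """Rough stand-in; real tokenizers usually produce more tokens than this."""
    return len(text.split())


chunks = ["Approved, go ahead.", "word " * 600]  # second chunk is 600 tokens
try:
    check_chunk_lengths(chunks, whitespace_tokens, max_tokens=512)
    overflow_detected = False
except ValueError as err:
    overflow_detected = True
    print(err)
```

Remember to run the check on the text you actually embed, which includes any prepended summary.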

Methodology & honest caveats

For the readers who want to trust the numbers (you should always ask):

The test. 360 real questions over a work mailbox, each with a known correct answer email (a “hard label” — no subjective judging in the retrieval scores). Cross-checked on 360 public Enron-QA questions; same ordering.
Fair comparison. Both sides used the same vector size (1024) and the same single reranker, so the comparison isolates the recipe, not incidental advantages — and NVIDIA’s stack is general-purpose, used here as a reference, not as a system anyone tuned for email.
The 93% is a different (easier) measure. The cumulative climb up to 64.2% is “did we find the exact answer email.” The jump to 93.3% switches to “did we find the answer’s thread”, which is a legitimately easier target because a thread is a bigger thing to hit. That is the honest framing of thread reconstruction, and it is precisely the point: you read the conversation, not the needle. Whether the assembled thread leads to a correct answer is a separate measurement, one for a possible follow-up.
The first two steps use a no-summary baseline so each technique adds exactly one thing; the summary step is also where my “with-summary” production index begins (its numbers line up with the bake-off table). The contextual-summary lever was isolated by re-running the whole benchmark with summaries off — it was worth +11 to +15 recall points.
E-discovery caveat. The TREC Legal test is only 3 topics (low statistical power) and absolute scores are low (e-discovery is hard), but the effect is large and points in the same direction as NVIDIA’s own published results.
Reproducible. The benchmark scripts are in the repo (scripts/eval/).

Finding the right email is only half the job — the obvious next question is whether the system then answers correctly. That, and more, in the mailrag series — stay tuned, as time allows.

Built with mailrag — a self-hosted, privacy-first RAG-over-email tool. Results above are on the author’s private work corpus plus public Enron data; numbers are real and measured.