Product Bulletin
Building GraphRAG for Legislative Search
Turning a natural-language question into a grounded, citation-ready answer — vector search, graph expansion, and the four-tier design we tore down to two when it blew past a 29-second timeout.
By Ambro Quach
Part 1 of a 5-part series on the engineering behind Arc Radius, a platform that tracks US state legislation affecting LGBTQ+ youth. This post covers the retrieval layer — how we turn a natural-language question into a grounded, citation-ready context block for an LLM.
The problem
Arc Radius lets people ask questions like "which bills restrict gender-affirming healthcare access for minors?" and get back an answer grounded in actual legislative text. That's a retrieval problem before it's a generation problem: the LLM is only as good as the chunks we feed it.
Two things make legislative retrieval harder than a typical document-search setup.
First, bills are long and we chunk them, so a vector search that returns the single most similar chunk often hands the model a fragment stripped of its surrounding context — a definition without the clause it modifies, a penalty without the conduct it applies to. The retrieval needs to return not just the best chunk but enough of its neighborhood to be legible.
Second — and this is the constraint that shaped everything — the whole thing runs behind API Gateway, which enforces a hard 29-second integration timeout. Embed the query, search, expand, rerank, hydrate metadata, and hand off to the generation model, all inside 29 seconds, every time, or the request dies. That ceiling is the antagonist of this entire post.
The design as it stands
The retrieval path is five steps:
- Embed the query with Bedrock Titan v2 (1024-dimensional vectors).
- Search an approximate-nearest-neighbor (ANN) index over chunk embeddings in Neo4j Aura.
- Expand each hit to its sibling chunks: the chunks within two positions before and after it in the same document.
- Rerank in Python using state hints pulled from the query text.
- Assemble a metadata-rich context block (bill number, title, status, state, URL) for the LLM.
Here's the shape of it:
natural-language query
|
v
Bedrock Titan v2 ——embed——> 1024-dim query vector
|
v
Neo4j ANN index (chunkEmbeddingIndex, cosine)
|
v
graph expansion: each hit ——> sibling chunks (±2 in same document)
|
v
Python rerank by state hints ——> metadata hydration
|
v
context block ——> generation LLM
The interesting part is step 3, the expansion. After the ANN index returns its hits, a Cypher query walks from each hit back up to its parent Document, then back down to the sibling chunks within ±2 of the hit's position:
The seed chunk keeps its full vector score; each sibling inherits 90% of it. The UNWIND flattens everything into rows, and max(score) dedupes chunks that were reachable from more than one seed, keeping the highest score. If you think in SQL, that whole block is a self-join on chunk keyed by document_id with a window predicate on chunk_index, followed by a GROUP BY chunk_id with MAX(score).
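The scoring and dedup logic described above can be sketched in plain Python. This is an illustration of the technique, not the production Cypher: the function name, the tuple shape of `hits`, and the `chunk_counts` lookup are assumptions made for the example. Only the 0.9 decay, the ±2 window, and the max-score dedup come from the description.

```python
from collections import defaultdict

SIBLING_DECAY = 0.9   # siblings inherit 90% of the seed's score
SIBLING_WINDOW = 2    # +/-2 positions within the same document


def expand_siblings(hits, chunk_counts):
    """Expand ANN hits to their siblings and dedupe by max score.

    hits: list of (doc_id, chunk_index, score) tuples (hypothetical shape).
    chunk_counts: doc_id -> number of chunks in that document (hypothetical).
    Returns {(doc_id, chunk_index): best_score}.
    """
    best = defaultdict(float)
    for doc_id, idx, score in hits:
        lo = max(0, idx - SIBLING_WINDOW)
        hi = min(chunk_counts[doc_id] - 1, idx + SIBLING_WINDOW)
        for j in range(lo, hi + 1):
            # The seed keeps its full score; every sibling gets the decayed score.
            candidate = score if j == idx else score * SIBLING_DECAY
            # A chunk reachable from several seeds keeps its highest score.
            best[(doc_id, j)] = max(best[(doc_id, j)], candidate)
    return dict(best)


hits = [("bill-a", 3, 0.80), ("bill-a", 5, 0.60)]
expanded = expand_siblings(hits, {"bill-a": 10})
```

One consequence of the max-score rule shows up in this example: chunk 5 is a seed in its own right at 0.60, but as a sibling of the stronger seed at index 3 it scores about 0.72, and the higher value wins.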
One deliberate structural choice: we retrieve in two queries, not one. The hot ANN-plus-expansion query returns only (node, score) — nothing else. A second METADATA_CYPHER query then hydrates just the winning chunk IDs into full bill metadata. Keeping the hot path skinny means the expensive metadata join only ever runs over the small final result set, not over every candidate the index considered.
The tradeoff that defines this post
Here's the decision worth walking through in detail.
The retrieval used to expand much further. The original design had a four-tier decay: a seed chunk at 1.0, its siblings at 0.9, then topic-related chunks at 0.5 and same-state chunks at 0.3, reached by fanning out through Topic and State relationships in the graph. The idea was good on paper — surface thematically related bills from across the corpus, not just text from within the one bill you happened to hit.
It didn't survive contact with production. With 23,000+ chunks in the graph, fanning out through topic and state relationships pulled in chunks from every bill that shared a topic or a state. That expansion was combinatorially expensive and consistently blew past the 29-second API Gateway timeout. The feature that was supposed to make answers richer was instead making them not arrive at all.
So we tore out the bottom two tiers. The Cypher now implements only seed (1.0) and sibling (0.9). The cross-bill state signal didn't disappear — it moved out of the graph and into Python, as a reranking step:
This is the move we're most willing to defend. The graph traversal was expressive but unbounded; the Python rerank is cheaper, bounded, and tunable — those two constants are environment variables, so we can adjust how aggressively state matching matters without redeploying a line of Cypher. We traded some recall (the graph no longer surfaces cross-bill thematic neighbors in the main path) for the thing that actually mattered: answers that return inside the timeout, reliably.
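A minimal sketch of what a state-hint rerank along these lines could look like. The environment variable names, the function name, and the dict shape of a chunk are assumptions for illustration. The defaults of 1.2 and 0.8 are the boost and penalty multipliers this post cites.

```python
import os

# Hypothetical variable names; the defaults match the 1.2x boost and
# 0.8x penalty discussed in this post.
STATE_BOOST = float(os.environ.get("STATE_MATCH_BOOST", "1.2"))
STATE_PENALTY = float(os.environ.get("STATE_MISMATCH_PENALTY", "0.8"))


def rerank_by_state(chunks, state_hints):
    """Boost chunks whose state matches a hint and penalize the rest.

    chunks: list of dicts with at least 'score' and 'state' keys (assumed shape).
    state_hints: set of two-letter state codes extracted from the query.
    """
    if not state_hints:
        # No hints in the query: leave the scores alone and just sort.
        return sorted(chunks, key=lambda c: c["score"], reverse=True)
    reranked = []
    for chunk in chunks:
        factor = STATE_BOOST if chunk["state"] in state_hints else STATE_PENALTY
        reranked.append({**chunk, "score": chunk["score"] * factor})
    return sorted(reranked, key=lambda c: c["score"], reverse=True)


candidates = [
    {"id": "a", "state": "TX", "score": 0.70},
    {"id": "b", "state": "OH", "score": 0.65},
]
ranked = rerank_by_state(candidates, {"OH"})
```

With the default multipliers, the Ohio chunk moves ahead of the Texas chunk even though its raw vector score was lower. The work is one pass over the candidate list plus a sort, which is what makes it bounded in a way the graph fan-out was not.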
It's worth being honest about what this cost. The graph is barely used as a graph in this path anymore — the only traversal left is the Document -> Chunk sibling hop. If you squint, the main retrieval path is "vector search plus a self-join." That's a real architectural concession, and the roadmap section is about earning some of that expressiveness back.
A couple of smaller tradeoffs worth keeping in view:
- effective_search_ratio = 4. The ANN index over-fetches 4× the requested top_k before expansion and dedup run. Without this, sibling expansion and deduplication would thin the result set below top_k and we'd return fewer chunks than asked for. It's a small knob that quietly protects recall.
- Brute-force cosine in two side paths. When a user explicitly names a state, or when we need a single bill's most self-representative chunks, we skip the ANN index and compute cosine similarity directly in Cypher over the filtered subset. That's O(candidates) rather than sub-linear, which is exactly right when the candidate set is small (one state, or one bill), and exactly wrong if you ever pointed it at the full 23k chunks. Knowing which regime you're in is the whole game.
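Both knobs are easy to picture in a few lines of Python. This is a sketch under assumptions: the side paths described here compute cosine in Cypher, so the Python below only illustrates the cost model, and the function names and the `(chunk_id, vector)` candidate shape are invented for the example.

```python
import math

EFFECTIVE_SEARCH_RATIO = 4  # over-fetch factor from the post


def ann_fetch_size(top_k, ratio=EFFECTIVE_SEARCH_RATIO):
    """How many hits to request from the ANN index for a given top_k."""
    return top_k * ratio


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_top_k(query_vec, candidates, top_k):
    """Score every candidate: one cosine per candidate, so cost is O(candidates).

    candidates: list of (chunk_id, vector), already filtered down to one
    state or one bill (hypothetical shape).
    """
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


fetch = ann_fetch_size(5)
top = brute_force_top_k(
    [1.0, 0.0],
    [("x", [1.0, 0.0]), ("y", [0.0, 1.0]), ("z", [1.0, 1.0])],
    top_k=2,
)
```

The scan touches every candidate exactly once, which is fine for a few hundred chunks from one state and a poor fit for the full corpus.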
Where it's fragile
The thing we'd flag first in a code review is the state-hint extractor, because it's clever in a way that bites back. It pulls state references out of the raw query with a regex — a word-boundary match on two-letter uppercase tokens (\b[A-Z]{2}\b) against known state codes, plus case-insensitive matching on full state names.
The failure is in the abbreviation pass. An uppercased ordinary English word that happens to be a state code — IN (Indiana), OR (Oregon), OK (Oklahoma) — gets read as a state hint. A query like "bills IN effect OR pending" could quietly conjure Indiana and Oregon out of thin air and then penalize every chunk that isn't from those states. The right fix is a named-entity model, but that's heavier and slower, and on a path this latency-sensitive "heavier" is a real cost, not a free upgrade.
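The false positive is easy to reproduce. The sketch below is a reconstruction from the description above, not the production extractor: the function name is hypothetical and the state table is cut down to four entries for brevity.

```python
import re

# Abbreviated lookup for illustration; a real table would cover every state.
STATE_CODES = {"IN": "Indiana", "OR": "Oregon", "OK": "Oklahoma", "TX": "Texas"}

# Word-boundary match on two-letter uppercase tokens, as described above.
ABBREV_PATTERN = re.compile(r"\b[A-Z]{2}\b")


def extract_state_hints(query):
    """Return the set of state codes the query appears to mention."""
    # Pass 1: uppercase two-letter tokens that happen to be state codes.
    hints = {tok for tok in ABBREV_PATTERN.findall(query) if tok in STATE_CODES}
    # Pass 2: case-insensitive match on full state names.
    lowered = query.lower()
    for code, name in STATE_CODES.items():
        if re.search(rf"\b{re.escape(name.lower())}\b", lowered):
            hints.add(code)
    return hints


false_positive = sorted(extract_state_hints("bills IN effect OR pending"))
lowercase_query = sorted(extract_state_hints("bills in effect or pending"))
full_name = sorted(extract_state_hints("healthcare bills in texas"))
# false_positive  -> ['IN', 'OR']
# lowercase_query -> []
# full_name       -> ['TX']
```

The same sentence in lowercase produces no hints at all, so the bug only fires when a user capitalizes for emphasis, which makes it hard to spot in casual testing.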
The rest of the fragility is more mundane and mostly handled gracefully:
- Hybrid retrieval degrades silently to vector. We support an opt-in hybrid mode (dense vector + BM25 fulltext), but any exception in the hybrid path is caught, logged, and falls through to pure vector search. Best-effort enrichment, guaranteed baseline — the user never sees an error.
- State pre-filtering has a recall floor. When a named-state pre-filter returns fewer than 3 chunks, we fall back to an unfiltered search at 2× top_k. Naming a low-coverage state still returns answers instead of an empty result.
- The reranking constants are global. That 1.2× boost and 0.8× penalty are fixed multipliers applied everywhere. They're a blunt instrument compared to a calibrated additive bonus or a learned ranking model.
- The sibling window is fixed at ±2. Tightly-chunked documents might want more surrounding context; loosely-chunked ones might pull in noise. It doesn't adapt to chunk granularity.
- The 29-second ceiling hasn't gone away. We're comfortably under it today, but it's a function of corpus size. The design bought headroom, not immunity.
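The first two fallbacks in the list above share a shape: try the richer path, and drop to a guaranteed baseline when it fails or comes back thin. A minimal sketch, assuming the search functions are passed in as callables. All function and parameter names here are hypothetical; only the threshold of 3 and the 2× multiplier come from the list.

```python
import logging

logger = logging.getLogger("retrieval")

MIN_FILTERED_RESULTS = 3  # recall floor from the list above
FALLBACK_MULTIPLIER = 2   # unfiltered fallback searches at 2x top_k


def retrieve(query_vec, top_k, vector_search, hybrid_search=None):
    """Try hybrid retrieval if enabled; any failure falls through to vector."""
    if hybrid_search is not None:
        try:
            return hybrid_search(query_vec, top_k)
        except Exception:
            # Best-effort enrichment: log the failure, never surface it.
            logger.exception("hybrid retrieval failed; using vector search")
    return vector_search(query_vec, top_k)


def search_with_state_floor(query_vec, top_k, state,
                            filtered_search, unfiltered_search):
    """Use the state pre-filter unless it returns too few chunks."""
    results = filtered_search(query_vec, top_k, state)
    if len(results) < MIN_FILTERED_RESULTS:
        return unfiltered_search(query_vec, top_k * FALLBACK_MULTIPLIER)
    return results
```

Catching a bare `Exception` is deliberate in a sketch like this, since the point is that no hybrid failure reaches the user. The cost is that a persistently broken hybrid path is only visible in the logs.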
Roadmap
The honest through-line of this system is that we cut expressiveness to hit a latency budget, and the roadmap is about buying some of it back without reopening the wound that made us cut it.
The biggest item is restoring cross-bill thematic signal within the latency budget. The four-tier fan-out failed because it traversed unbounded relationships at query time. A precomputed RELATED_TO edge — materialized offline, capped to a fixed number of neighbors per bill — would let us reintroduce a bounded version of that signal as an O(1)-per-seed hop instead of a combinatorial explosion. The expensive part moves to write time, where the 29-second clock isn't running.
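To make the idea concrete, here is one way the offline materialization could work. Everything specific in it is an assumption: the post does not say how relatedness would be computed, so this sketch uses cosine similarity between per-bill vectors, and the cap of 2, the function names, and the dict shapes are all hypothetical.

```python
import heapq
import math

MAX_NEIGHBORS = 2  # hypothetical cap; the real value would be tuned


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_related_edges(bill_vectors, max_neighbors=MAX_NEIGHBORS):
    """Offline job: pick each bill's top-N most similar bills.

    bill_vectors: bill_id -> embedding (assumed to exist per bill).
    Returns bill_id -> list of (neighbor_id, similarity), capped at max_neighbors.
    """
    edges = {}
    for bill_id, vec in bill_vectors.items():
        scored = (
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in bill_vectors.items()
            if other_id != bill_id
        )
        # nlargest keeps only the cap, so query-time fan-out stays bounded.
        top = heapq.nlargest(max_neighbors, scored)
        edges[bill_id] = [(other_id, sim) for sim, other_id in top]
    return edges


bills = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.5, 0.5]}
related = build_related_edges(bills)
```

The pairwise comparison is quadratic in the number of bills, but it runs at write time. At query time each seed follows at most `max_neighbors` edges, which is the bounded hop the paragraph above describes.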
Second, replacing the brute-force state scan with index-native metadata filtering. Newer versions of Neo4j's vector index support filtering inside the ANN search itself. That would let the named-state path stay on the fast index instead of dropping to an O(candidates) cosine scan, and would make the whole "which regime am I in" tightrope unnecessary.
Third, hardening the state-hint extractor — most likely a lightweight NER pass, benchmarked against the latency budget before it ships, because on this path nothing is free until the clock says it is.
And finally, adaptive sibling windows keyed to chunk granularity, so densely- and sparsely-chunked documents each get the right amount of surrounding context instead of a one-size-fits-±2 rule.
Next in the series: an event-driven ingestion pipeline that choreographs three AWS Lambdas entirely through S3 — no orchestrator, no queue, just change-data-capture and object keys.