Product Bulletin

An Event-Driven Pipeline Choreographed Entirely Through S3

Three AWS Lambdas, no orchestrator, no queue. How change-data-capture and a handful of object keys keep a knowledge graph in sync with 50 states — and why at-least-once delivery was the right call.

By Ambro Quach

AWS LambdaCDCS3 ChoreographyEvent-Driven10 min read

Part 2 of a 5-part series on the engineering behind Arc Radius, a platform that tracks US state legislation affecting LGBTQ+ youth. This post covers ingestion — how new and changed bills flow from a legislative data API into a knowledge graph, on a daily cadence, with no orchestrator in the middle.


The problem

Legislation doesn't sit still. Bills get introduced, amended, advanced, and killed across all 50 states, and Arc Radius is only useful if its knowledge graph reflects that churn within a day. So we needed a pipeline that wakes up on a schedule, pulls whatever changed since yesterday, classifies it, embeds it, and writes it into Neo4j — without re-processing the entire historical corpus every single night.

That last clause is the real constraint. There are tens of thousands of bills. Most days, a handful change. A pipeline that reprocesses everything daily would burn money on redundant API calls, model inference, and embedding generation to produce a graph identical to yesterday's. The whole design hinges on one question: how does each stage know what's actually new?

There's a second constraint that shaped the architecture as much as the first. Each stage of this pipeline runs as an AWS Lambda, and Lambda has a hard 15-minute execution ceiling. A stage can be killed mid-run. So the pipeline can't assume any stage finishes — it has to be safe to crash and re-run at any moment.

The design: three Lambdas, no conductor

The pipeline is three stages, mapped to three Lambdas:

  1. Poll — pulls new/changed bills from the LegiScan API, lands raw CSVs in S3.
  2. Classify — runs each bill through a SageMaker LegalBERT endpoint (is this relevant? is it harmful or supportive?), tags issue categories, writes classified CSVs.
  3. Embed — fetches each bill's full text, chunks it, embeds every chunk with Bedrock Titan v2, and upserts the whole Bill → State/Session/Topic/Document → Chunk subgraph into Neo4j.

The thing that makes this architecture interesting is what isn't in it. There's no Step Functions state machine. No SQS queue. No Airflow DAG. There is no central orchestrator at all. EventBridge fires the first Lambda on a daily cron, and from there each stage hands off to the next purely through S3 objects:

EventBridge (daily cron) │ ▼ ┌──────────────────────────────────┐ │ POLL LAMBDA [COLLECT] │ │ LegiScan API → raw CSVs │ └──────────────────────────────────┘ │ writes raw/legiscan-incremental/*.csv │ │ S3 ObjectCreated event │ (prefix=raw/legiscan-incremental/, suffix=.csv) ▼ ┌──────────────────────────────────┐ │ CLASSIFY LAMBDA [TRANSFORM] │ │ SageMaker LegalBERT → labels │ └──────────────────────────────────┘ │ writes processed/classified/matches_*.csv │ │ S3 ObjectCreated event │ (prefix=processed/classified/) ▼ ┌──────────────────────────────────┐ │ EMBED LAMBDA [COLLECT+ │ │ TRANSFORM+STORE] │ │ text → chunks → Titan v2 → Neo4j │ └──────────────────────────────────┘ │ ▼ Neo4j Aura (graph + vector index)

The mechanism is simple once you see it: a stage announces "I'm done" by writing an object under the next stage's watched prefix. Poll writes to raw/legiscan-incremental/, which is exactly the prefix that triggers Classify. Classify writes to processed/classified/, which triggers Embed. The S3 prefix is the pipeline stage. The output object of one stage is the trigger of the next.

No file in this pipeline imports any other file. Each Lambda is a self-contained module that AWS invokes; the "call graph" of the system isn't in the code at all — it's an S3 choreography. S3 is simultaneously the message bus and the data store.

How the pipeline knows what changed

The constraint that started this whole post — how does each stage know what's new? — is answered by change-data-capture (CDC), and it's the connective tissue of the entire system.

LegiScan gives every bill a change_hash that updates whenever the bill changes. We lean on that. Under pipeline/metadata/ live three small JSON ledgers — one per stage:

known_bill_ids.json ← Poll's ledger classified_bills.json ← Classify's ledger embedded_bills.json ← Embed's ledger

Each ledger is a {bill_id: change_hash} map. Every Lambda reads its ledger at the start of a run and writes it back at the end. The rule each stage applies is the same: if a bill's change_hash matches what's in my ledger, I've already processed this exact version — skip it. A re-run of any stage skips every bill that hasn't actually changed since last time. The common key across all three ledgers is LegiScan's bill_id; the common value is its change_hash. That shared {bill_id: change_hash} contract is the only thing the three otherwise-independent Lambdas agree on.

This is also why the system has a set of one-time seed scripts. Before the Lambdas are ever enabled, the seed scripts initialize all three ledgers from the historical corpus. Without that, the very first scheduled run would see an empty ledger, conclude that every bill is new, and reprocess years of legislation in one shot — blowing straight through the 15-minute ceiling. Seeding makes the first real run a near-no-op instead of a catastrophe.

The tradeoff that defines this post

Here's the decision worth walking through in detail: the pipeline guarantees at-least-once delivery, not exactly-once — and that's deliberate.

Every stage saves its CDC state last — only after the work is done. The Poll Lambda, for instance, doesn't write its known_bill_ids ledger until after it has fetched all the bills. So if a Lambda hits the 15-minute timeout mid-run, the ledger was never updated, and the next run re-detects those bills as new and re-fetches them. Work is retried, never silently dropped.

The cost is real and worth naming: a timeout wastes the whole batch of work up to that point (there's a partial-progress story I'll come to), and the same bill can be processed more than once. So why accept duplicate processing instead of engineering exactly-once?

Because exactly-once across three independent stages, an external API, a model endpoint, and a database is enormously harder than making the work idempotent and letting it run more than once safely. And the terminal stage is already idempotent for free: the Embed Lambda writes to Neo4j with MERGE-based Cypher (upserts keyed on bill_id), so processing the same bill twice produces the exact same graph. Duplicate CSV rows are harmless because the write that consumes them deduplicates by key.

So the system gets at-least-once delivery, deduplicated downstream. That's the whole trade: instead of preventing duplicates, we make duplicates inconsequential. It's the same insight that makes AWS Lambda's own retry policy safe to rely on — when EventBridge re-invokes a failed handler, re-running the entire thing is fine precisely because every stage is idempotent. We didn't fight the at-least-once nature of the platform; we designed so it doesn't matter.

A few supporting decisions fall out of the same philosophy:

  • Checkpoint every 100 bills (Classify). Long-running stages save their ledger incrementally — every 100 bills rather than only at the end. With a 15-minute ceiling, this bounds the blast radius of a timeout: instead of losing a whole run's work, you lose at most the last sub-100 bills of progress. The checkpoint plus CDC is what makes partial runs cheap to resume.
  • A single API choke point (Poll). Every LegiScan call routes through one api_call function that centralizes auth, the 30-second HTTP timeout, error handling, and a throttle delay. One place to reason about rate limits and failures, rather than scattered call sites each getting it slightly wrong.
  • Avoiding the self-trigger loop. This is a subtle one. The Embed Lambda is triggered by objects appearing under the classified prefix — but Embed also needs to consume those objects. If it left them in place, or wrote its own output back under a watched prefix, it would re-trigger itself forever. So each stage moves its consumed inputs to a different prefix once done (copy_object + delete_object, because S3 has no atomic move). Archiving consumed CSVs under a new prefix is what prevents an infinite event loop.

Where it's fragile

The honest weak points all cluster around the same place: the pipeline trades efficiency and retry-safety for throughput, and at scale the throughput ceiling is the thing that bites.

  • Everything is serial and single-threaded. The Classify Lambda calls the SageMaker endpoint one bill at a time — invoke_endpoint per bill — which is O(N) sequential round-trips racing the 15-minute clock. At a rough few-bills-per-second, a single run caps out at a few thousand bills. On a normal incremental day that's fine; on a large backfill it isn't. The fix is batching the inference calls, but that's a non-trivial change to a deliberately simple design.
  • No explicit retry or backoff anywhere. Each Lambda has only a request-level timeout. A SageMaker ThrottlingException isn't retried in-line — it just counts as an error, and the bill gets picked up on the next scheduled run. That's acceptable because the at-least-once design makes a skipped bill self-healing, but it means transient errors quietly defer work to tomorrow rather than resolving it today.
  • A timeout wastes work. The save-state-last ordering is what makes crashes safe, but the flip side is there's no fine-grained checkpoint in the Poll stage — a timeout before the ledger write throws away the whole fetch. Classify's every-100 checkpoint mitigates this; Poll doesn't have an equivalent.
  • No orchestrator means no orchestrator's conveniences. The flip side of "no Step Functions" is no built-in retry policies, no visual run history, no centralized failure dashboard, no easy way to re-run a single stage in isolation. The simplicity is a real virtue, but observability is something you have to build yourself rather than getting for free.

Roadmap

The through-line here is we kept the pipeline radically simple and pushed complexity onto idempotency. The roadmap is about raising the throughput ceiling without giving up that simplicity.

The biggest item is batching inference in the Classify stage. Sending bills to SageMaker one at a time is the dominant cost against the 15-minute ceiling; batched invocation (or an async/queued inference pattern) would let a single run handle far more bills and make large backfills tractable instead of something you have to babysit across multiple runs.

Second, finer-grained checkpointing in the Poll stage, mirroring what Classify already does. A timeout shouldn't be able to discard an entire fetch; periodic ledger saves would bound Poll's blast radius the way the every-100 checkpoint bounds Classify's.

Third, lightweight observability. Going orchestrator-free was the right call for a pipeline this size, but the missing piece is visibility — even a simple run-summary written to S3 (bills seen, processed, skipped, errored, per stage) would replace the orchestrator's run history with something good enough to debug from, without reintroducing a heavyweight scheduler.

And finally, bounded retry with backoff for transient endpoint errors, so a SageMaker throttle resolves within the run instead of silently deferring to the next day — keeping the at-least-once safety net as a backstop rather than the first line of defense.


Next in the series: the model at the heart of the Classify stage — fine-tuning LegalBERT to read legislation, and the data leakage that made our first results look far better than they were.

More From Product Bulletin