Teaching a Model to Read Legislation — and to Clean Its Own Data

Part 3 of a 5-part series on the engineering behind Arc Radius, a platform that tracks US state legislation affecting LGBTQ+ youth. This post covers the classifier: how we decide whether a bill is relevant, whether it's harmful or supportive, and how the model ended up cleaning the very dataset it was trained on.

The problem

Roughly 746,000 bills move through US state legislatures in the window Arc Radius covers. A few thousand of them are relevant to LGBTQ+ youth. That's a needle-in-a-haystack ratio of about 360 to 1, and every relevant bill then needs a second judgment: is it harmful or supportive? A bill restricting healthcare access and a bill establishing anti-discrimination protections both surface on the same keyword search, and a worried parent needs to know which is which.

So there are really two classification problems stacked on top of each other. First, relevance — is this bill about LGBTQ+ issues at all? — which is fundamentally a text problem, since the signal lives in the bill's title and description. Second, stance — harmful or supportive? — which turns out to be less about the text and more about who sponsored it and how it was voted on.

And underneath both sits a quieter problem that ended up shaping the whole project: the training labels weren't clean to begin with.

The design: two stages, two very different models

We split the work to match the structure of the problem.

Stage 1 — relevance — is a fine-tuned LegalBERT. We started from nlpaueb/legal-bert-base-uncased, a BERT variant pretrained on legal text, and fine-tuned it to read a bill's title-plus-description and output a single probability: is this LGBTQ+ relevant? Legal language is its own dialect, so starting from a model that already speaks it beats fine-tuning general-purpose BERT.

Stage 2 — stance — is a logistic regression on political metadata. This is the part that surprises people. Predicting harmful-vs-supportive doesn't use the bill text at all. It uses six features: the sponsor party mix (how many R, D, and other sponsors), the state's R-sponsorship ratio, the dominant party, and the vote margins. A bill's stance correlates far more strongly with its political provenance than with its wording, so a simple linear model on six numbers does the job — no second transformer required.
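To give a sense of how small stage 2 is, a logistic regression over six numbers reduces to a weighted sum and a sigmoid. The sketch below is illustrative only: the feature names and their order are one plausible reading of the list above, and the weights and bias are made-up placeholders, not values from the trained model.

```python
import math

# Hypothetical feature order; the post names the features but not their encoding.
FEATURES = [
    "n_r_sponsors", "n_d_sponsors", "n_other_sponsors",
    "r_sponsor_ratio", "dominant_party_is_r", "vote_margin",
]

def p_harmful(x, weights, bias):
    """Logistic regression: sigmoid of bias plus the weighted sum of features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameters for illustration only (not learned from any data).
weights = [0.8, -0.9, 0.0, 1.5, 0.7, 0.2]
bias = -0.5

bill_a = [4, 0, 0, 1.0, 1, 0.3]   # toy feature vector
bill_b = [0, 5, 0, 0.0, 0, 0.4]   # toy feature vector
print(p_harmful(bill_a, weights, bias), p_harmful(bill_b, weights, bias))
```

The whole model is seven numbers (six weights and a bias), which is why it is cheap to retrain and easy to inspect.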

The two stages are gated: stance only runs on bills that stage 1 flagged as relevant, because "harmful or supportive" is undefined for a bill about agricultural zoning.

  bill (title + description)
          │
          ▼
  ┌─────────────────────────────┐
  │  STAGE 1: LegalBERT          │
  │  text → P(LGBTQ+ relevant)   │
  └─────────────────────────────┘
          │
     relevant? ──no──► "not applicable"
          │
         yes
          ▼
  ┌─────────────────────────────┐
  │  STAGE 2: LogisticRegression │
  │  6 political features →      │
  │  harmful vs supportive       │
  └─────────────────────────────┘
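In code, the gate in the diagram is a single early return. This is a minimal sketch, not the production code: the function and argument names are hypothetical, both models are passed in as plain callables, the 0.73 relevance cutoff is the operating point discussed later in this post, and the 0.5 stance cutoff is an assumption.

```python
from typing import Callable

RELEVANCE_THRESHOLD = 0.73  # stage 1 operating point from this post
STANCE_THRESHOLD = 0.5      # assumed default; the post does not state one

def classify_bill(
    text: str,
    political_features: list,
    relevance_model: Callable[[str], float],
    stance_model: Callable[[list], float],
) -> dict:
    """Run stage 1, and run stage 2 only if the bill clears the gate."""
    p_relevant = relevance_model(text)
    if p_relevant < RELEVANCE_THRESHOLD:
        # Stance is undefined for irrelevant bills, so stage 2 never runs.
        return {"relevant": False, "p_relevant": p_relevant,
                "stance": "not applicable"}
    p_harm = stance_model(political_features)
    stance = "harmful" if p_harm >= STANCE_THRESHOLD else "supportive"
    return {"relevant": True, "p_relevant": p_relevant, "stance": stance}
```

Passing the models in as callables keeps the gate logic testable with stubs, independent of LegalBERT or the regression weights.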

A few decisions inside stage 1 are worth pulling out, because they're where the real engineering happened.

Two-phase fine-tuning. Rather than unfreeze the whole transformer and hope, we trained in two phases. Phase 1 freezes the entire BERT backbone and trains only a small linear classification head on top of the [CLS] token — fast, cheap, and it gets the head into a sane region. Phase 2 then unfreezes just the top two of BERT's twelve encoder layers, at a learning rate 20× smaller, and lets them adapt. The trajectory tells the story: with the backbone frozen, accuracy plateaued around 75% and recall on the relevant class was a dismal 0.31 — the frozen model simply underfit. The moment phase 2 unfroze those top layers, accuracy jumped to ~91% and recall climbed to 0.80. The backbone's general legal knowledge was fine; it was the top layers that needed to learn our specific notion of relevance.
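The freeze schedule can be expressed as a rule over parameter names. The sketch below assumes Hugging Face style names (`bert.encoder.layer.<i>.` for encoder layers, `classifier.` for the head); the authors' head may be named differently. It also assumes the head keeps training in phase 2, and the phase 1 learning rate is a hypothetical value, since the post gives only the 20× ratio.

```python
import re

NUM_LAYERS = 12             # BERT-base encoder depth
TOP_K = 2                   # encoder layers unfrozen in phase 2
PHASE1_LR = 1e-3            # hypothetical; only the ratio comes from the post
PHASE2_LR = PHASE1_LR / 20  # "20x smaller"

_LAYER_RE = re.compile(r"encoder\.layer\.(\d+)\.")

def is_trainable(param_name: str, phase: int) -> bool:
    """Decide whether a parameter receives gradients in the given phase."""
    if param_name.startswith("classifier."):
        return True                   # assumed: the head trains in both phases
    if phase == 1:
        return False                  # phase 1: entire backbone frozen
    match = _LAYER_RE.search(param_name)
    # Phase 2: only the top TOP_K encoder layers (indices 10 and 11) unfreeze.
    return bool(match) and int(match.group(1)) >= NUM_LAYERS - TOP_K

def apply_phase(named_parameters, phase: int) -> None:
    """Set requires_grad on (name, param) pairs, e.g. model.named_parameters()."""
    for name, param in named_parameters:
        param.requires_grad = is_trainable(name, phase)
```

Under these assumptions, embeddings and encoder layers 0 through 9 stay frozen in both phases, which matches the observation that the backbone's general legal knowledge did not need to change.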

A 2:1 negative-to-positive sampling ratio, not the natural 360:1. You cannot train a useful classifier on the real class balance — a model that predicts "not relevant" every single time would score 99.7% accuracy and be completely worthless. So we undersampled the majority class hard, down to two negatives per positive, to make the problem learnable. The metric we cared about was never accuracy; it was recall and precision on the rare positive class.
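Both halves of that argument are a few lines of arithmetic. The sketch below first computes the accuracy of an always-negative model at 360:1, then shows one simple way to undersample to 2:1. The function names and the fixed seed are illustrative assumptions, not the project's actual sampling code.

```python
import random

def all_negative_accuracy(neg_per_pos: float) -> float:
    """Accuracy of a model that always predicts 'not relevant'."""
    return neg_per_pos / (neg_per_pos + 1)

def undersample(positives, negatives, ratio=2, seed=0):
    """Keep every positive and a random `ratio`-to-1 sample of the negatives."""
    rng = random.Random(seed)             # fixed seed for reproducible splits
    k = min(len(negatives), ratio * len(positives))
    sample = list(positives) + rng.sample(list(negatives), k)
    rng.shuffle(sample)                   # avoid all positives coming first
    return sample

print(round(all_negative_accuracy(360), 4))  # 0.9972
```

Undersampling changes the class prior the model sees, so its raw probabilities are not calibrated to the real 360:1 world, which is one more reason to choose the decision threshold empirically rather than default to 0.5.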

A 0.73 decision threshold, not 0.5. The default move is to call anything above 0.5 a positive. We didn't. A precision-recall sweep put the best-F1 threshold around 0.49, but we deliberately chose ~0.73 — a precision-leaning operating point. The reasoning: a false positive puts an irrelevant bill in front of users and erodes trust in the whole feed, while a false negative misses one bill among thousands. We'd rather miss a few relevant bills than cry wolf, so we paid recall to buy precision. At 0.73 the model runs about 0.93 precision and 0.80 recall on the relevant class.
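A threshold sweep of this kind is simple to reproduce. The sketch below uses toy scores and labels (made up for illustration, not the project's validation data) and shows how the best-F1 threshold and a precision-leaning threshold can differ. Selecting by a minimum precision target is only one way to pick such a point; the post does not say how 0.73 was chosen.

```python
def precision_recall_f1(scores, labels, threshold):
    """Precision, recall and F1 for the positive class at one threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def best_f1_threshold(scores, labels, candidates):
    """Candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])

def lowest_threshold_with_precision(scores, labels, candidates, min_precision):
    """Smallest candidate meeting the precision target, or None if none does."""
    for t in sorted(candidates):
        if precision_recall_f1(scores, labels, t)[0] >= min_precision:
            return t
    return None

# Toy data for illustration only.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
candidates = [0.25, 0.5, 0.75]
print(best_f1_threshold(scores, labels, candidates))
print(lowest_threshold_with_precision(scores, labels, candidates, 0.9))
```

On the toy data the best-F1 threshold is the lowest candidate, while the precision target pushes the cutoff to the highest one and gives up recall, which is the same trade described above.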

The decision that defines this post

Here's the part worth dwelling on: the model cleaned its own training data.

The relevance labels came from an ACLU-maintained tracker, joined to LegiScan's bill database. But the join key — (bill_number, year, state) — wasn't unique. A single ACLU entry could match multiple LegiScan bills, which meant the "positive" set was contaminated with bills that weren't actually relevant. Garbage in the labels is worse than garbage in the features, because the model learns to reproduce the garbage.

We could have hand-audited every ambiguous match. Instead we did something more interesting and more scalable. We pulled the duplicate-matched rows out of the training set entirely — 119 of them — and trained the relevance model on the clean remainder. Then we turned the freshly-trained model around on those 119 set-aside rows and let it score them. Of the 119, the model judged 50 genuinely relevant and 69 below the 0.73 threshold. Those 69 contaminating bills were dropped, and the cleaned positives became the canonical dataset — matched_lgbtq_bills.csv — that everything downstream consumes.
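The set-aside-then-score loop can be sketched in a few lines. This is an illustration under assumptions: rows are plain dicts with hypothetical field names, a duplicate is any row whose (bill_number, year, state) key appears more than once in the matched set, and `score` stands in for the trained relevance model.

```python
from collections import Counter

THRESHOLD = 0.73  # the relevance operating point from this post

def join_key(row):
    """The non-unique join key described above."""
    return (row["bill_number"], row["year"], row["state"])

def split_by_duplicate_key(rows):
    """Separate unambiguous matches from rows that share a join key."""
    counts = Counter(join_key(r) for r in rows)
    clean = [r for r in rows if counts[join_key(r)] == 1]
    ambiguous = [r for r in rows if counts[join_key(r)] > 1]
    return clean, ambiguous

def adjudicate(ambiguous, score, threshold=THRESHOLD):
    """Keep set-aside rows the trained model scores at or above the threshold."""
    kept, dropped = [], []
    for row in ambiguous:
        (kept if score(row) >= threshold else dropped).append(row)
    return kept, dropped
```

The order matters: the model is trained on `clean` only, so the rows it later adjudicates never influenced its weights.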

This is a bootstrap: the model is used to clean the data source that defines the model. It works because relevance learned from the uncontaminated majority generalizes well enough to adjudicate the contaminated minority — the noise was a small fraction of the labels, so the signal survived it. There's something satisfying about a classifier that earns its keep twice: once at inference, and once as a data-cleaning tool before it ever ships.

The other decision worth naming is train/serve parity. The stance model depends on derived political features — the R-sponsorship ratio, the dominant-party rule, the vote percentages — and those derivations have to be computed identically at training time and at serving time, or the model sees different inputs in production than it learned from. This is the classic train/serve skew trap. The fix was to centralize the feature math so the exact same derivations run in the training notebook and in the production inference container, with the feature contract explicitly aligned across both. The model container owns the feature engineering; the pipeline around it is a dumb pipe that just moves bills through.
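One common way to get that parity is a single pure function that both the notebook and the container import. The sketch below is an assumption about shape, not the project's code: the argument names, the tie-breaking rule for the dominant party, and the zero-division defaults are all hypothetical.

```python
def derive_stance_features(n_r: int, n_d: int, n_other: int,
                           yea: int, nay: int) -> dict:
    """Single source of truth for the derived political features."""
    total_sponsors = n_r + n_d + n_other
    total_votes = yea + nay

    # Share of sponsors that are R; 0.0 when there are no sponsors on record.
    r_ratio = n_r / total_sponsors if total_sponsors else 0.0

    # Hypothetical dominant-party rule: strict plurality, ties fall to "other".
    counts = {"R": n_r, "D": n_d, "other": n_other}
    top = max(counts.values())
    leaders = [party for party, c in counts.items() if c == top]
    dominant = leaders[0] if len(leaders) == 1 else "other"

    # Vote percentage; 0.0 when no vote has been recorded yet.
    yea_pct = yea / total_votes if total_votes else 0.0

    return {"r_ratio": r_ratio, "dominant_party": dominant, "yea_pct": yea_pct}
```

Edge cases such as ties and missing votes are exactly where two separately written implementations tend to diverge, so pinning them in one function removes the most likely source of skew.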

Where it's fragile

The honest weak points cluster around the rare class and the compounding structure of the two stages.

The supportive class is small and the model knows it. Harmful bills vastly outnumber supportive ones in the data (a reflection of the legislative reality the project exists to track). Stage 2 hits an F1 around 0.97 on harmful bills but only ~0.73 on supportive ones: there simply aren't many supportive examples to learn from, and recall on that class drops accordingly. The thing users might most want to hear about, a protective bill, is the thing the model is least confident about.

Errors compound across the gate. Because stance only runs on bills stage 1 flagged relevant, a relevance miss makes the correct stance label unreachable: you can't classify the stance of a bill you've already discarded. When we measured the full gated pipeline end-to-end, the rare supportive class degraded further (combined F1 around 0.47) precisely because two stages' worth of errors stack. Gating is the right structure, but it concentrates failure on the rare class.

The labels still trace to a single advocacy source. The bootstrap cleaned the duplicate contamination, but the ground-truth notion of "relevant" still originates from the ACLU tracker's coverage. Whatever that source systematically misses, the model never learns to catch.

Stance is inferred from politics, not text. Using sponsor party and vote margins to predict harm is a strong correlation, but it's a proxy. A harmful bill introduced by an unexpected sponsor mix, or a supportive bill with atypical political provenance, is exactly the case the metadata model will misread, because it never reads the bill.
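The compounding across the gate is multiplicative. As a rough sketch: a bill of a given class is labeled correctly end-to-end only if it passes stage 1 and then gets the right stance, so recalls multiply. The 0.80 below is the stage 1 recall reported earlier, applied here on the assumption that it holds for the class in question; the 0.70 stage 2 recall is a hypothetical number chosen for illustration.

```python
def end_to_end_recall(stage1_recall: float, stage2_recall: float) -> float:
    """Fraction of a class that survives the gate AND gets the right stance.

    Assumes stage2_recall is measured on bills that already passed stage 1.
    """
    return stage1_recall * stage2_recall

stage1 = 0.80   # relevance recall at the 0.73 threshold (from this post)
stage2 = 0.70   # hypothetical stance recall on the rare class
print(end_to_end_recall(stage1, stage2))
```

Whatever stage 2 achieves, stage 1's recall is a ceiling on the pipeline's recall for every class.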

Roadmap

The through-line is that we made a hard, imbalanced problem learnable by leaning on precision and structure. The roadmap is about shoring up the rare class and reducing single-source dependence.

The most valuable item is strengthening the supportive class, which is the consistent weak point across both stages. More supportive training examples would help directly; so would techniques aimed at imbalance — class-weighting tuned specifically for that class, targeted augmentation, or a calibrated threshold for the stance model the way relevance already has one.

Second, giving stage 2 access to the text. Predicting stance from political metadata alone leaves real signal on the table — the bill's actual language carries intent that sponsor counts can't. A stance model that reads the text (or blends text features with the political features) would catch the atypical-provenance cases the current proxy misses.

Third, broadening the label sources. The bootstrap cleaning was a good answer to noisy labels; the next step is addressing incomplete ones by reconciling multiple advocacy trackers rather than depending on a single source's coverage — the same entity-resolution problem the ingestion side already wrestles with, applied to the ground truth.

And finally, softening the gate. The hard relevance cutoff means a stage-1 miss is unrecoverable. Carrying a relevance probability forward into the stance decision, rather than a hard yes/no, would let borderline-relevant bills still receive a hedged stance instead of vanishing — keeping the structure that makes the pipeline sensible without letting it swallow the rare cases whole.
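One possible shape for a softer gate is sketched below. It is speculative and not something the current pipeline does: the function name, the `floor` value below which a bill is still dropped, and the `hedged` flag are all hypothetical design choices.

```python
def soft_gated_stance(p_relevant: float, p_harm: float,
                      floor: float = 0.30, gate: float = 0.73) -> dict:
    """Return a stance for borderline bills, flagged as hedged, instead of dropping them."""
    if p_relevant < floor:
        # Clearly irrelevant bills are still excluded.
        return {"stance": "not applicable", "hedged": False,
                "p_relevant": p_relevant}
    stance = "harmful" if p_harm >= 0.5 else "supportive"
    # Between floor and gate, the stance is reported but marked uncertain.
    return {"stance": stance, "hedged": p_relevant < gate,
            "p_relevant": p_relevant}
```

Bills above the existing 0.73 cutoff behave exactly as they do today, so a change like this would only add hedged results in the band the hard gate currently discards.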

Next in the series: how one FastAPI codebase runs two completely different ways — locally under uvicorn and in production on AWS Lambda — and switches its entire serving backend at runtime with feature flags.
