
Add calibration package checkpointing, target config, and hyperparameter CLI #538

Draft
baogorek wants to merge 28 commits into main from calibration-pipeline-improvements


@baogorek baogorek commented Feb 17, 2026

Fixes #533
Fixes #534
Fixes #558
Fixes #559

Summary

  • Calibration package checkpointing: --build-only saves the expensive matrix build as a pickle, --package-path loads it for fast re-fitting with different hyperparameters or target sets
  • Target config YAML: Declarative exclusion rules (target_config.yaml) replace hardcoded target filtering; checked-in config reproduces the junkyard's 22 excluded groups
  • Hyperparameter CLI flags: --beta, --lambda-l2, --learning-rate are now tunable from the command line and Modal runner
  • Modal runner improvements: Streaming subprocess output, support for new flags
  • Documentation: docs/calibration.md covers all workflows (single-pass, build-then-fit, package re-filtering, Modal, portable fitting)
  • At-large district naming fix: H5 filenames for at-large districts now use XX-01 (conventional 1-based) instead of XX-00
  • GCS staging fix: GCS uploads moved from staging phase to promotion phase, so both GCS and HuggingFace are updated together during promote
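
The build-then-fit split in the first bullet can be sketched as a plain pickle round trip. This is a minimal illustration, not the pipeline's code: the `save_package` and `load_package` helpers and the package keys are hypothetical, since the real package layout is not shown in this PR.

```python
import pickle
import tempfile
from pathlib import Path


def save_package(package: dict, path: Path) -> None:
    """Write the expensive build artifacts to disk once."""
    with open(path, "wb") as f:
        pickle.dump(package, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_package(path: Path) -> dict:
    """Reload a previously built package so fitting can skip the build."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical package contents, chosen only for illustration.
package = {
    "matrix_rows": [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
    "target_values": [30.0, 45.0],
    "target_names": ["target_a", "target_b"],
}

with tempfile.TemporaryDirectory() as tmp:
    package_path = Path(tmp) / "calibration_package.pkl"
    save_package(package, package_path)    # analogous to --build-only
    restored = load_package(package_path)  # analogous to --package-path
```

Because the package is reloaded unchanged, different hyperparameters or target sets can be tried without repeating the matrix build.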

Note: This branch includes commits from #537 (PUF impute) since the calibration pipeline depends on that work. The calibration-specific changes are in the top commit.

Test plan

  • pytest policyengine_us_data/tests/test_calibration/test_unified_calibration.py — CLI arg parsing tests
  • pytest policyengine_us_data/tests/test_calibration/test_target_config.py — target config filtering + package round-trip tests
  • Manual: make calibrate-build produces package, --package-path loads it and fits
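
As a rough picture of what the CLI arg parsing tests exercise, the flags named in this PR could be declared with `argparse` as below. The flag names come from the PR description; the `build_parser` function and the `None` defaults are assumptions, not the contents of unified_calibration.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the checkpointing and hyperparameter flags named in the PR."""
    parser = argparse.ArgumentParser(description="Calibration CLI sketch")
    parser.add_argument("--build-only", action="store_true")
    parser.add_argument("--package-path", default=None)
    # Defaults of None are placeholders; the real defaults are not shown here.
    parser.add_argument("--beta", type=float, default=None)
    parser.add_argument("--lambda-l2", type=float, default=None)
    parser.add_argument("--learning-rate", type=float, default=None)
    return parser


args = build_parser().parse_args(
    ["--package-path", "/tmp/calibration_package.pkl",
     "--beta", "0.65", "--lambda-l2", "1e-8"]
)
```

Note that argparse converts dashes to underscores, so `--lambda-l2` is read back as `args.lambda_l2`.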

🤖 Generated with Claude Code


@juaristi22 juaristi22 left a comment


Minor comments, but generally LGTM. I was also able to run the calibration job on Modal (after removing the ellipsis in unified_calibration.py)!

Small note: if I'm not mistaken, this PR addresses issue #534. It seems #310 was referenced there as something that would be addressed together, but this PR does not save calibration_log.csv among its outputs. Do we want to add it at this point?

@juaristi22 juaristi22 force-pushed the calibration-pipeline-improvements branch from 4c51b32 to 61523d8 Compare February 18, 2026 14:46
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 61523d8 to 6744481 Compare February 18, 2026 16:47
baogorek and others added 10 commits February 19, 2026 14:33
…ter CLI

- Add build-only mode to save calibration matrix as pickle package
- Add target config YAML for declarative target exclusion rules
- Add CLI flags for beta, lambda_l2, learning_rate hyperparameters
- Add streaming subprocess output in Modal runner
- Add calibration pipeline documentation
- Add tests for target config filtering and CLI arg parsing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Modal calibration runner was missing --lambda-l0 passthrough.
Also fix KeyError: Ellipsis when load_dataset() returns dicts
instead of h5py datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
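
The `KeyError: Ellipsis` failure is easy to reproduce: `obj[...]` reads a whole h5py dataset or NumPy array, but on a dict it is a lookup of the key `Ellipsis`. The sketch below uses a NumPy array as a stand-in for an h5py dataset, and the `{period: values}` dict shape and `read_values` helper are assumptions made for illustration.

```python
import numpy as np


def read_values(entry, period=None):
    """Return an array from a dataset-like object or a {period: values} dict."""
    if isinstance(entry, dict):
        # Dict-shaped entry: use the requested period, or the first one present.
        key = period if period is not None else next(iter(entry))
        return np.asarray(entry[key])
    # Dataset-like entry (h5py dataset, NumPy array): [...] reads everything.
    return np.asarray(entry[...])


dataset_like = np.array([1.0, 2.0, 3.0])  # stand-in for an h5py dataset
dict_like = {"2024": [1.0, 2.0, 3.0]}     # hypothetical dict-shaped return

try:
    dict_like[...]
except KeyError as err:
    failure = err.args[0]  # the Ellipsis object itself
```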
Upload a pre-built calibration package to Modal and run only the
fitting phase, skipping HuggingFace download and matrix build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Chunked training with per-target CSV log matching notebook format
- Wire --log-freq through CLI and Modal runner
- Create output directory if missing (fixes Modal container error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set verbose_freq=chunk so epoch counts don't reset each chunk
- Rename: diagnostics -> unified_diagnostics.csv,
  epoch log -> calibration_log.csv (matches dashboard expectation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
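
Chunked training with a running epoch counter can be sketched as follows. The `run_chunked` helper and the toy `train_chunk` callback are hypothetical; the point is only that the logged epoch is cumulative rather than restarting in each chunk.

```python
def run_chunked(total_epochs, log_freq, train_chunk):
    """Train in chunks of log_freq epochs, logging the cumulative epoch."""
    log = []
    epochs_done = 0
    while epochs_done < total_epochs:
        chunk = min(log_freq, total_epochs - epochs_done)
        loss = train_chunk(chunk)
        epochs_done += chunk  # cumulative, so the count never resets
        log.append({"epoch": epochs_done, "loss": loss})
    return log


# Toy trainer: the loss halves on every epoch.
state = {"loss": 1024.0}


def toy_train_chunk(n_epochs):
    for _ in range(n_epochs):
        state["loss"] *= 0.5
    return state["loss"]


log = run_chunked(total_epochs=5, log_freq=2, train_chunk=toy_train_chunk)
```

With five epochs and a log frequency of two, the last chunk is shorter than the others, which the `min` call handles.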
Instead of creating a new Microsimulation per clone (~3 min each,
22 hours for 436 clones), precompute values for all 51 states on
one sim object (~3 min total), then assemble per-clone values via
numpy fancy indexing (~microseconds per clone).

New methods: _build_state_values, _assemble_clone_values,
_evaluate_constraints_from_values, _calculate_target_values_from_values.
DEFAULT_N_CLONES raised to 436 for 5.2M record matrix builds.
Takeup re-randomization deferred to future post-processing layer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
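
The speedup rests on one indexing step: once a variable has been computed for every record under every state, each clone's values are a gather from that table. The array names and sizes below are made up for illustration and are not the builder's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_records, n_clones = 4, 6, 3

# Precomputed once: one variable's value for every record under every state.
state_values = rng.random((n_states, n_records))

# Hypothetical geography assignment: state index of each record in each clone.
clone_states = rng.integers(0, n_states, size=(n_clones, n_records))

# Fancy indexing: entry [c, r] is state_values[clone_states[c, r], r].
record_idx = np.arange(n_records)
clone_values = state_values[clone_states, record_idx]
```

The index arrays broadcast to shape `(n_clones, n_records)`, so no simulation is rerun per clone.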
- Modal runner: add --package-volume flag to read calibration package
  from a Modal Volume instead of passing 2+ GB as a function argument
- unified_calibration: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to prevent CUDA memory fragmentation during L0 backward pass
- docs/calibration.md: rewrite to lead with lightweight build-then-fit
  workflow, document prerequisites, and add volume-based Modal usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
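
For reference, PyTorch spells this allocator option `expandable_segments:True`, and it has to be in the environment before the first CUDA allocation. A minimal sketch, assuming it is set at the top of the entry script (where exactly unified_calibration.py sets it is not shown here):

```python
import os

# Set before torch initialises CUDA; simplest is before importing torch at all.
# setdefault keeps any value the caller already exported.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

alloc_conf = os.environ["PYTORCH_CUDA_ALLOC_CONF"]
```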
@baogorek baogorek force-pushed the calibration-pipeline-improvements branch from 59b27a8 to 0a0f167 Compare February 19, 2026 23:07
baogorek and others added 11 commits February 19, 2026 18:26
- target_config.yaml: exclude everything except person_count/age
  (~8,766 targets) to isolate fitting issues from zero-target and
  zero-row-sum problems in policy variables
- target_config_full.yaml: backup of the previous full config
- unified_calibration.py: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments
  to fix CUDA memory fragmentation during backward pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- apply_target_config: support 'include' rules (keep only matching
  targets) in addition to 'exclude' rules; geo_level now optional
- target_config.yaml: 3-line include config replaces 90-line exclusion
  list for age demographics (person_count with age domain, ~8,784 targets)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
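
Include and exclude rules of this kind can be pictured as key-by-key matching over target records. The sketch below is an assumption about the semantics, not the real `apply_target_config`: the `apply_rules` helper and the field names (`variable`, `domain`, `geo_level`) are hypothetical.

```python
def matches(target: dict, rule: dict) -> bool:
    """A rule matches when every key it specifies equals the target's value."""
    return all(target.get(key) == value for key, value in rule.items())


def apply_rules(targets, include=None, exclude=None):
    """Keep targets matching any include rule, then drop any excluded ones."""
    kept = list(targets)
    if include:
        kept = [t for t in kept if any(matches(t, r) for r in include)]
    if exclude:
        kept = [t for t in kept if not any(matches(t, r) for r in exclude)]
    return kept


targets = [
    {"variable": "person_count", "domain": "age", "geo_level": "district"},
    {"variable": "person_count", "domain": "age", "geo_level": "state"},
    {"variable": "snap", "domain": None, "geo_level": "state"},
]

# A rule that omits geo_level matches every geography level.
age_only = apply_rules(targets, include=[{"variable": "person_count", "domain": "age"}])
no_snap = apply_rules(targets, exclude=[{"variable": "snap"}])
```

A short include list like this is what lets a few lines replace a long exclusion list.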
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The roth_ira_contributions target has zero row sum (no CPS records),
making it impossible to calibrate. Remove it from target_config.yaml
so Modal runs don't waste epochs on an unachievable target.

Also adds `python -m policyengine_us_data.calibration.validate_package`
CLI tool for pre-upload package validation, with automatic validation
on --build-only runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Achievability analysis showed 9 district-level IRS dollar variables
have per-household values 5-27x too high in the extended CPS, making
them irreconcilable with count targets (needed_w ~0.04-0.2 vs ~26).
Drop salt, AGI, income_tax, dividend/interest vars, QBI deduction,
taxable IRA distributions, income_tax_positive, traditional IRA.

Add ACA PTC district targets (aca_ptc + tax_unit_count).

Save calibration package BEFORE target_config filtering so the full
matrix can be reused with different configs without rebuilding.

Also: population-based initial weights from age targets per CD,
cumulative epoch numbering in chunked logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
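
The commit messages do not define `needed_w`. One plausible reading, used here as an assumption, is the single uniform weight that would make a matrix row hit its target, that is, target divided by the row sum. Under that reading a zero row sum has no solution at all, and targets whose needed weights differ by orders of magnitude cannot be satisfied together. The numbers below are invented to mirror the magnitudes quoted above.

```python
def needed_weight(row, target):
    """Uniform weight w such that sum(row) * w == target; None if row sums to 0."""
    total = sum(row)
    if total == 0:
        return None  # no weights can reach a nonzero target
    return target / total


count_row = [1.0, 1.0, 1.0, 1.0]       # one household per record
dollar_row = [500.0, 300.0, 0.0, 0.0]  # per-household dollars, overstated
empty_row = [0.0, 0.0, 0.0, 0.0]       # no records carry the variable

w_count = needed_weight(count_row, target=104.0)
w_dollar = needed_weight(dollar_row, target=160.0)
w_empty = needed_weight(empty_row, target=50.0)
```

Here the count target wants weights near 26 while the dollar target wants weights near 0.2, which is the kind of conflict the achievability analysis describes.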
PUF cloning already happens upstream in extended_cps.py, so the
--puf-dataset flag in the calibration pipeline was redundant (and
would have doubled the data a second time). Removed the flag,
_build_puf_cloned_dataset function, and all related params.

Added 4 compatible national targets: child_support_expense,
child_support_received, health_insurance_premiums_without_medicare_part_b,
and rent (all needed_w 27-37, compatible with count targets at ~26).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek and others added 5 commits February 23, 2026 19:35
…builder

Pass raw calibration blocks (with "" for inactive) to the takeup
function instead of geography["block_geoid"] (which has fallback
blocks for inactive records). This ensures entity-per-block counts
match the matrix builder, producing identical RNG draw sequences.
Handle "" blocks safely in compute_block_takeup_for_entities.
Fix missing county_fips in TestDoubleGeographyForPuf tests.

Verified: X @ w ratio = 1.0000 for aca_ptc on CD 102.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
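
The verification quoted above compares the matrix-implied total with a directly computed total. A minimal sketch of that check, with made-up numbers standing in for the real matrix, weights, and simulation totals:

```python
import numpy as np

# Rows are targets, columns are records (toy values).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
w = np.array([10.0, 20.0, 5.0])

# Stand-in for totals computed directly from a weighted simulation.
direct_totals = np.array([20.0, 65.0])

# A ratio of 1.0 means the matrix and the simulation agree for that target.
ratio = (X @ w) / direct_totals
```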
Two clones of the same record could land in the same CD, causing
convert_blocks_to_stacked_format to keep only one clone's block while
convert_weights_to_stacked_format summed both weights. This produced a
~2.2% gap for takeup-dependent variables like SNAP.

Fix: per-clone draws with vectorized collision re-drawing. Also adds a
collision warning in convert_blocks_to_stacked_format as a safety net.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
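
Per-clone draws with vectorized collision re-drawing might look like the sketch below. It is an illustration under simplifying assumptions: districts are drawn uniformly, whereas the real assignment is presumably weighted, and the `draw_without_collisions` helper is hypothetical.

```python
import numpy as np


def draw_without_collisions(n_records, n_clones, n_districts, seed=0, max_rounds=100):
    """Draw a district per (clone, record) so no record repeats a district."""
    if n_clones > n_districts:
        raise ValueError("need at least as many districts as clones")
    rng = np.random.default_rng(seed)
    draws = np.empty((n_clones, n_records), dtype=np.int64)
    for clone in range(n_clones):
        draws[clone] = rng.integers(0, n_districts, size=n_records)
        for _ in range(max_rounds):
            # A record collides if this draw equals any earlier clone's district.
            collided = (draws[:clone] == draws[clone]).any(axis=0)
            if not collided.any():
                break
            # Re-draw only the colliding records, all at once.
            draws[clone, collided] = rng.integers(
                0, n_districts, size=int(collided.sum())
            )
        else:
            raise RuntimeError("collisions remained after max_rounds")
    return draws


draws = draw_without_collisions(n_records=50, n_clones=3, n_districts=5)
```

Each re-draw round touches only the records that still collide, so the loop usually ends after a few passes.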
@juaristi22

juaristi22 commented Feb 25, 2026

A couple questions on recent changes

  1. Post-cloning PUF imputation removed: is this permanent?

Commit 49a1f66 ("Remove redundant --puf-dataset flag, add national targets") removed the ability to run PUF cloning inside the calibration pipeline, with the rationale that PUF cloning already happens upstream in extended_cps.py.

However, PR #516 specifically designed the pipeline so that PUF + QRF imputation runs after cloning and geography assignment, so that each clone gets geographically-informed imputations (with state_fips as a QRF predictor). As Max described it:

Each geographic clone gets geographically-appropriate PUF tax imputations instead of identical national-average ones duplicated everywhere. State is now a QRF predictor — California clones get California-like tax distributions.

With the current flow (extended_cps.py runs PUF once on base records → calibration pipeline clones 10x), all clones of the same household share identical PUF-imputed values, losing that variability benefit.

Are we planning to bring back the post-cloning PUF imputation once the calibration pipeline is stabilized? Or has the approach changed?

  2. All target variable precomputation moved to county level: is this worth the large increase in computation?

In commit 02f8ad0, _build_county_values was introduced to handle county-dependent variables (specifically aca_ptc, since marketplace premiums vary by county). It ran alongside the existing _build_state_values which handled everything else via 51 state-level simulations — a two-tier design gated by COUNTY_DEPENDENT_VARS = {"aca_ptc"}.

Then in commit 40fb389 (a "checkpoint"), COUNTY_DEPENDENT_VARS was removed and all target variable precomputation was moved to _build_county_values. _build_state_values was demoted to only computing constraint variables.

This means the matrix builder now runs ~1,000-2,000 county-level simulations (one per unique county in the geography assignment) instead of 51 state-level simulations for variables like snap, household_count, etc. that don't depend on county. That is roughly a 20-40x increase in compute cost with no accuracy benefit for those variables.

Was this an intentional simplification, or a debugging shortcut that could be reverted to the two-tier approach? Restoring COUNTY_DEPENDENT_VARS and routing only aca_ptc (and any future county-dependent vars) through county precomputation would significantly reduce matrix build time.
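
The two-tier routing being proposed could be as small as a set-membership split. `COUNTY_DEPENDENT_VARS` is the name from the earlier commit; the `split_by_tier` helper is hypothetical.

```python
COUNTY_DEPENDENT_VARS = {"aca_ptc"}


def split_by_tier(target_vars):
    """Route each variable to state-level or county-level precomputation."""
    county_vars = sorted(v for v in target_vars if v in COUNTY_DEPENDENT_VARS)
    state_vars = sorted(v for v in target_vars if v not in COUNTY_DEPENDENT_VARS)
    return state_vars, county_vars


state_vars, county_vars = split_by_tier(["snap", "household_count", "aca_ptc"])
```

Only the variables in the county list would then need one simulation per county.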

The state and county precomputation loops reused one Microsimulation
object across all states, relying on get_calculated_variables +
delete_arrays to clear caches between iterations. This missed
intermediate variables (likely those with string-based adds/subtracts
parameter paths), causing stale values from earlier states to leak
into SNAP/ACA_PTC calculations for later states (~3-4% inflation).

Fix: create a fresh Microsimulation per state in _build_state_values,
and per state-group in _build_county_values. Within-state county
recalculation is clean (confirmed by debug_state_precomp.py Test D),
so counties sharing a state still share a sim.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
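
The failure mode is generic to any cached computation that is only partly cleared. The toy below is not the Microsimulation API; `ToySim`, its variables, and the rates are invented to show how a surviving intermediate value leaks into a later state, and why a fresh object per state avoids it.

```python
RATES = {"CA": 30, "TX": 10}  # made-up per-state parameter


class ToySim:
    """Stand-in simulation with a cache of computed variables."""

    def __init__(self, state):
        self.state = state
        self.cache = {}

    def rate(self):
        # Intermediate variable, cached on first use.
        if "rate" not in self.cache:
            self.cache["rate"] = RATES[self.state]
        return self.cache["rate"]

    def benefit(self):
        if "benefit" not in self.cache:
            self.cache["benefit"] = 10 * self.rate()
        return self.cache["benefit"]


def reused_sim(states):
    sim = ToySim(states[0])
    out = {}
    for s in states:
        sim.state = s
        sim.cache.pop("benefit", None)  # incomplete clear: "rate" survives
        out[s] = sim.benefit()
    return out


def fresh_sim_per_state(states):
    return {s: ToySim(s).benefit() for s in states}


stale = reused_sim(["CA", "TX"])
clean = fresh_sim_per_state(["CA", "TX"])
```

In the reused case the second state inherits the first state's rate, which is the leak the fix removes.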
@baogorek

@juaristi22 thank you for your thoughtful and excellent comments

  1. By the time this PR is done, I do want the calibration_log.csv to be saved once again! Actually, I do get it in this workflow that I run after make data:
  # 1. Build matrix locally (no GPU needed)
  source ~/envs/sep/bin/activate
  python -m policyengine_us_data.calibration.unified_calibration \
      --build-only \
      --skip-source-impute \
      --target-config policyengine_us_data/calibration/target_config.yaml \
      --package-output /tmp/calibration_package.pkl

  # 2. Push package to Modal volume
  modal volume put calibration-data /tmp/calibration_package.pkl calibration_package.pkl --force

  # 3. Fit on GPU from package
  modal run modal_app/remote_calibration_runner.py \
      --branch calibration-pipeline-improvements \
      --gpu T4 \
      --package-volume \
      --epochs 1000 \
      --beta 0.65 \
      --lambda-l0 1e-7 \
      --lambda-l2 1e-8 \
      --log-freq 500 \
      --target-config policyengine_us_data/calibration/target_config.yaml

That will fit the model on Modal and drop a calibration_log.csv right on your local drive. I know one of the issues was about actually storing it in an archive, and maybe that should be out of scope given this PR's complexity.

  2. You're absolutely right about the counties, and I've now made the per-county computation optional. I know that aca_ptc's formula does involve the county, so we may have to live with not getting perfect matches between X * w and sim.calculate("aca_ptc").sum(). But I saw the perfect ratios with your code, so I know we can do it.

  3. Yes, you're right that I lost a bit of the vision by not imputing new PUF values for every clone. That was a concession to the brutal difficulty of getting X * w to match sim.calculate().sum() and to the speed (which is still slow even after taking out the counties). From Claude:

    • Would it make X*w consistency harder? Yes, significantly. Right now the matrix builder can precompute values once per state and reuse them across all 436 clones — because every clone of household #5000 has the same underlying values regardless of which state it's assigned to. The only things that change per clone are geography inputs and takeup draws.
    • Issue created: Restore post-cloning PUF QRF re-imputation for geographic tax variation #560
