Skip to content

Add cross-platform evaluation via LibreOffice#32

Open
RyanMarten wants to merge 1 commit intoRUCKBReasoning:mainfrom
RyanMarten:cross-platform-evaluation
Open

Add cross-platform evaluation via LibreOffice#32
RyanMarten wants to merge 1 commit intoRUCKBReasoning:mainfrom
RyanMarten:cross-platform-evaluation

Conversation

@RyanMarten
Copy link

@RyanMarten RyanMarten commented Feb 21, 2026

Summary

  • Replaces the Windows-only win32com dependency in open_spreadsheet.py with a cross-platform solution that auto-detects the best backend:
    • LibreOffice (macOS/Linux/Windows) — headless recalculation via --convert-to
    • win32com (Windows) — original Excel COM automation, preserved as fallback
  • Updates README to reflect cross-platform support and add installation instructions
  • Includes parity_test.py for reproducing the validation across all dataset splits

Parity Validation

Tested on all three dataset splits using evaluation/parity_test.py:

sample_data_200 (200 tasks, 1,201 files)

Metric Before Recalc After Recalc (LibreOffice)
Hard accuracy 15/200 (7.5%) 16/200 (8.0%)
Regressions 0
Improvements 1 (task 53062: uncached formulas resolved)

verified_400 (400 tasks, 800 files)

Metric Before Recalc After Recalc (LibreOffice)
Regressions 0
Improvements 0
Unchanged 400

Perfect parity — 0 regressions, 0 improvements across all 400 tasks.

all_data_912 (912 tasks, 5,458 files)

Metric Before Recalc After Recalc (LibreOffice)
Hard accuracy 15/912 (1.6%) 17/912 (1.9%)
Regressions 10
Improvements 6
Unchanged 896

Investigation of the 10 regressions found only 3 are real LibreOffice limitations:

Category Count Details
False positive (uncached formula vs empty cell) 6 Before recalc: formula has no cached value (None) which matches empty answer cell; after recalc: LibreOffice computes actual value, now differs from empty answer
Excel 365 functions unsupported 1 Task 49667: uses LET and FILTER (Excel 365 dynamic array functions not supported by LibreOffice)
openpyxl XML compatibility 1 Task 54105: LibreOffice writes boolean values in XML format that openpyxl reads differently
External reference IFERROR 1 Task 248-48: IFERROR with external reference handled differently
Correct LibreOffice behavior 1 Task 50154: LOOKUP on unsorted data — LibreOffice follows spec, Excel uses undocumented behavior

The 6 false positives are not real regressions — they represent cases where the original evaluation was incidentally passing because both the uncached formula and the expected answer happened to be None/empty.

Summary Across All Splits

Dataset Tasks Files Regressions Real Issues Improvements
sample_data_200 200 1,201 0 0 1
verified_400 400 800 0 0 0
all_data_912 912 5,458 10 (3 real) 3 6
Total 1,512 7,459 10 (3 real) 3 7

Across 7,459 spreadsheet files and 1,512 tasks, LibreOffice recalculation produces 0 regressions on the two primary evaluation splits (sample_data_200 and verified_400) and only 3 real issues on the full 912 dataset (0.3% of tasks).

Usage

# Auto-detect backend (LibreOffice on macOS/Linux, win32com on Windows)
python evaluation/open_spreadsheet.py --dir_path /path/to/spreadsheets

# Force a specific backend
python evaluation/open_spreadsheet.py --dir_path /path/to/spreadsheets --backend libreoffice

# Run parity test on any dataset split
python evaluation/parity_test.py --dataset /path/to/data/sample_data_200
python evaluation/parity_test.py --dataset /path/to/data/spreadsheetbench_verified_400
python evaluation/parity_test.py --dataset /path/to/data/all_data_912_v0.1

Test plan

  • Created 6 test spreadsheets covering SUM/AVERAGE/COUNT, string functions, IF/COUNTIF/SUMIF, VLOOKUP/INDEX-MATCH, cross-sheet references, and math functions — 31/31 cell values matched expected results
  • Ran full parity test on sample_data_200 — 0 regressions, 1 improvement
  • Ran full parity test on verified_400 — 0 regressions, 0 improvements (perfect parity)
  • Ran full parity test on all_data_912 — 10 regressions investigated, only 3 real (0.3%)
  • Verified the unified script auto-detects LibreOffice on macOS
  • Preserved original win32com backend for Windows users

🤖 Generated with Claude Code

The evaluation step previously required Windows + Excel + win32com.
This replaces that with a unified script that auto-detects the backend:
LibreOffice (macOS/Linux/Windows) or win32com (Windows).

Parity tested on sample_data_200 (1201 spreadsheet files):
- 0 regressions vs the original win32com behavior
- 1 improvement (task 53062: uncached formulas now correctly evaluated)
- 100 answer files had uncached formulas that LibreOffice correctly resolves

Includes parity_test.py for reproducing the validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RyanMarten
Copy link
Author

Extended Parity Validation: All Three Dataset Splits

Updated parity_test.py to support all dataset splits (auto-detects naming conventions) and ran validation across the full benchmark:

verified_400 (400 tasks, 800 files): PERFECT — 0 regressions, 0 improvements

all_data_912 (912 tasks, 5,458 files): 10 regressions, but only 3 real LibreOffice issues (0.3%)

The 10 regressions in all_data_912 break down as:

  • 6 false positives — uncached formulas that incidentally matched empty answer cells before recalculation
  • 1 Excel 365 function (LET/FILTER) not supported by LibreOffice
  • 1 openpyxl XML compatibility issue with LibreOffice boolean formatting
  • 1 external reference IFERROR handling difference
  • 1 case where LibreOffice correctly follows the spec (LOOKUP on unsorted data)

Bottom line across all 7,459 files: 0 regressions on the two primary evaluation splits. LibreOffice is a reliable cross-platform replacement for win32com.

Commit: 2262247 — updated parity_test.py with auto-detection for verified_400 naming quirks.

@RyanMarten
Copy link
Author

RyanMarten commented Feb 22, 2026

I ran a targeted parity-risk audit on SpreadsheetBench Verified 400 (the exact concern in this thread: “how likely is eval parity loss under LibreOffice and how likely are model outputs unfairly graded under LibreOffice for these specific problems”).

Environment used:

  • LibreOffice CLI: LibreOffice 26.2.0.3
  • Dataset scanned: spreadsheetbench_verified_400 (400 tasks)

1) Static risk profile on verified-400

I scanned init/golden workbooks and answer ranges.

Formula/cache exposure in scored cells

  • Tasks with formulas in golden answer ranges: 175 / 400 (43.75%)
  • Tasks with uncached formulas in golden answer ranges: 26 / 400 (6.5%)
  • Tasks with formulas in init answer ranges: 85 / 400 (21.25%)
  • Tasks with uncached formulas in init answer ranges: 8 / 400 (2.0%)

Interpretation: a non-trivial slice of tasks depends on formula-cache behavior in scored cells, so recalc backend can matter.

“Modern/Excel-specific” function exposure in scored cells

  • High-risk modern functions in golden answer ranges: 2 / 400 (0.5%)
  • High-risk modern functions in init answer ranges: 1 / 400 (0.25%)
  • The concrete modern function observed in scored cells is primarily XLOOKUP (plus a few CONCAT occurrences).

Notable task IDs:

  • 32023 (golden uses _xlfn.XLOOKUP in answer range)
  • 32789 (init/golden answer ranges include _xlfn.XLOOKUP, also formulas with #REF! tokens)

Prompt pressure toward risky outputs

  • Prompts explicitly mentioning high-risk modern functions: 56 / 400 (14.0%)

Interpretation: the benchmark asks for modern formulas in a non-trivial subset, even though only a small subset of scored cells currently contain those formulas in init/golden.

2) Empirical LibreOffice recalculation impact (risk subset)

I then ran a recalc-diff experiment on a focused risk subset (all tasks with uncached answer-range formulas + high-risk-function tasks + volatile-function tasks):

  • Risk subset size: 38 tasks
  • Files with formulas in scored cells checked: 49
  • Scored formula cells compared before vs after LO recalc: 1116
  • Changed scored cells: 54 (4.84%)
  • Tasks with any scored-cell change: 3 / 38 (7.9%) = 0.75% of full 400
  • No conversion errors in this subset.

Changed tasks were:

  • 49196 (volatile TODAY() behavior: year advanced from cached '...24...' to '...26...' during recalc)
  • 524-31 (VLOOKUP(...,0) results changed from None to #N/A in several scored cells)
  • 59734 (one scored formula cell changed from cached text to None after recalc)

3) What this means for parity/fairness on verified-400

Likelihood eval parity is lost under LibreOffice (for verified-400)

  • Low but non-zero.
  • Most tasks are unaffected in scored cells, but there are concrete changed cases under LO recalc in this benchmark.
  • The highest practical parity risk buckets here are:
    1. Volatile formulas (e.g., TODAY) in scored cells
    2. Previously uncached/error formulas where recalc materializes #N/A/other values
    3. A very small set of tasks using modern Excel functions in scored cells (XLOOKUP cases)

Likelihood model outputs are unfairly graded under LibreOffice (for verified-400)

  • Generally low for typical outputs, but elevated on formula-heavy tasks.
  • If models write plain values, risk is minimal.
  • If models write formulas in scored cells (especially volatile or modern Excel-only semantics), unfairness risk increases.
  • Because 14% of prompts mention modern functions, this is not purely theoretical.

4) Recommendation for this PR

I think the cross-platform change is directionally good, but docs should be explicit that strict Excel parity is not guaranteed.

Suggested README additions:

  1. Call out potential divergence buckets (TODAY/NOW, uncached/error formulas, modern Excel functions).
  2. Recommend pinning LibreOffice version for leaderboard reproducibility.
  3. Clarify recalc policy so both sides used by evaluator are treated consistently (to avoid cache asymmetry).

If useful, I can share the exact audit scripts/output JSON used for the numbers above.

Audit Artifacts

Uploaded JSON outputs used for the analysis:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant