Add cross-platform evaluation via LibreOffice #32

RyanMarten wants to merge 1 commit into RUCKBReasoning:main from
Conversation
The evaluation step previously required Windows + Excel + win32com. This replaces that with a unified script that auto-detects the backend: LibreOffice (macOS/Linux/Windows) or win32com (Windows). Parity tested on sample_data_200 (1,201 spreadsheet files):

- 0 regressions vs the original win32com behavior
- 1 improvement (task 53062: uncached formulas now correctly evaluated)
- 100 answer files had uncached formulas that LibreOffice correctly resolves

Includes parity_test.py for reproducing the validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
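The backend auto-detection described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name and the fallback order are assumptions, and only standard detection mechanisms (an importable `win32com` on Windows, an `soffice`/`libreoffice` binary on PATH otherwise) are used.

```python
# Hypothetical sketch of backend auto-detection; names are illustrative,
# not the PR's actual implementation.
import platform
import shutil


def detect_backend() -> str:
    """Prefer win32com on Windows with Excel installed, otherwise fall
    back to headless LibreOffice if an soffice binary is on PATH."""
    if platform.system() == "Windows":
        try:
            import win32com.client  # noqa: F401  # requires pywin32 + Excel
            return "win32com"
        except ImportError:
            pass
    # LibreOffice ships the CLI entry point as `soffice` (sometimes `libreoffice`)
    if shutil.which("soffice") or shutil.which("libreoffice"):
        return "libreoffice"
    raise RuntimeError(
        "No spreadsheet backend found: install LibreOffice, or Excel + pywin32"
    )
```

On a machine with neither backend this raises rather than silently degrading, which keeps evaluation failures loud.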
Extended Parity Validation: All Three Dataset Splits

Updated:

- verified_400 (400 tasks, 800 files): PERFECT — 0 regressions, 0 improvements
- all_data_912 (912 tasks, 5,458 files): 10 regressions, but only 3 real LibreOffice issues (0.3%)

The 10 regressions in all_data_912 break down as:
Bottom line across all 7,459 files: 0 regressions on the two primary evaluation splits. LibreOffice is a reliable cross-platform replacement for win32com.
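For reference, the LibreOffice half of the backend swap amounts to a headless round-trip conversion, which rewrites the workbook with freshly computed formula results. The sketch below is an assumption about how the script drives LibreOffice, not the actual code; note also that whether Calc recalculates xlsx formulas on load can depend on its "Recalculation on File Load" setting.

```python
# Sketch of a headless LibreOffice recalculation round-trip (xlsx -> xlsx).
# Helper names and the temp-directory layout are illustrative.
import subprocess
import tempfile
from pathlib import Path


def libreoffice_cmd(xlsx_path: str, outdir: str) -> list:
    """Build the soffice command line for a headless xlsx -> xlsx convert."""
    return [
        "soffice", "--headless",
        "--convert-to", "xlsx",
        "--outdir", outdir,
        xlsx_path,
    ]


def recalc_with_libreoffice(xlsx_path: str) -> Path:
    """Convert the workbook in place-of-itself; the output file carries
    freshly computed (cached) formula results that openpyxl can read."""
    outdir = Path(tempfile.mkdtemp())
    subprocess.run(libreoffice_cmd(xlsx_path, str(outdir)), check=True, timeout=120)
    return outdir / Path(xlsx_path).name
```

Writing to a separate output directory avoids clobbering the original file, which matters when diffing before/after values.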
I ran a targeted parity-risk audit on SpreadsheetBench Verified 400 (the exact concern in this thread: “how likely is eval parity loss under LibreOffice and how likely are model outputs unfairly graded under LibreOffice for these specific problems”). Environment used:
1) Static risk profile on verified-400

I scanned init/golden workbooks and answer ranges.

Formula/cache exposure in scored cells
Interpretation: a non-trivial slice of tasks depends on formula-cache behavior in scored cells, so the recalc backend can matter.

“Modern/Excel-specific” function exposure in scored cells
Notable task IDs:
Prompt pressure toward risky outputs
Interpretation: the benchmark asks for modern formulas in a non-trivial subset, even though only a small subset of scored cells currently contain those formulas in init/golden.

2) Empirical LibreOffice recalculation impact (risk subset)

I then ran a recalc-diff experiment on a focused risk subset (all tasks with uncached answer-range formulas + high-risk-function tasks + volatile-function tasks):
Changed tasks were:
3) What this means for parity/fairness on verified-400

Likelihood eval parity is lost under LibreOffice (for verified-400)
Likelihood model outputs are unfairly graded under LibreOffice (for verified-400)
4) Recommendation for this PR

I think the cross-platform change is directionally good, but docs should be explicit that strict Excel parity is not guaranteed. Suggested README additions:
If useful, I can share the exact audit scripts/output JSON used for the numbers above.

Audit Artifacts

Uploaded JSON outputs used for the analysis:
Summary
- Replaces the win32com dependency in open_spreadsheet.py with a cross-platform solution that auto-detects the best backend: LibreOffice (macOS/Linux/Windows) or win32com (Windows)
- Uses headless LibreOffice (--convert-to) to recalculate and re-cache formula results
- Adds parity_test.py for reproducing the validation across all dataset splits

Parity Validation
Tested on all three dataset splits using evaluation/parity_test.py:

sample_data_200 (200 tasks, 1,201 files)

0 regressions vs the original win32com behavior; 1 improvement (task 53062: uncached formulas now correctly evaluated).
verified_400 (400 tasks, 800 files)
Perfect parity — 0 regressions, 0 improvements across all 400 tasks.
all_data_912 (912 tasks, 5,458 files)
Investigation of the 10 regressions found only 3 are real LibreOffice limitations:
- The uncached formula previously read as empty (None), which matches the empty answer cell; after recalc, LibreOffice computes the actual value, which now differs from the empty answer
- LET and FILTER (Excel 365 dynamic array functions not supported by LibreOffice)

The 6 false positives are not real regressions — they represent cases where the original evaluation was incidentally passing because both the uncached formula and the expected answer happened to be None/empty.

Summary Across All Splits
Across 7,459 spreadsheet files and 1,512 tasks, LibreOffice recalculation produces 0 regressions on the two primary evaluation splits (sample_data_200 and verified_400) and only 3 real issues on the full 912 dataset (0.3% of tasks).
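The comparison logic behind these regression/improvement counts can be sketched as follows. This is a minimal illustration and not the actual parity_test.py: it relies on the fact that openpyxl's `data_only=True` returns a formula's cached result (None when no result was ever cached), and the naive classification deliberately flags the None/empty false-positive pattern described above, just as the raw parity test did before investigation.

```python
# Illustrative regression/improvement classification; the real
# parity_test.py may differ in structure and naming.

def cached_value(xlsx_path, sheet, cell):
    """Read a cell's cached formula result (None if never cached)."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl
    wb = load_workbook(xlsx_path, data_only=True)  # cached results, not formula text
    try:
        return wb[sheet][cell].value
    finally:
        wb.close()


def classify(before, after, expected):
    """Compare a scored cell's value before/after LibreOffice recalc."""
    if before == expected and after != expected:
        # Also catches the false-positive pattern: before=None matching an
        # empty expected answer, with a real value computed after recalc.
        return "regression"
    if before != expected and after == expected:
        return "improvement"  # e.g. an uncached formula now evaluated correctly
    return "unchanged"
```

Running `classify` over every answer cell in a split and tallying the three labels reproduces the shape of the numbers reported above, with flagged regressions still needing manual triage into real issues vs. incidental passes.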
Usage
Test plan
- sample_data_200 — 0 regressions, 1 improvement
- verified_400 — 0 regressions, 0 improvements (perfect parity)
- all_data_912 — 10 regressions investigated, only 3 real (0.3%)

🤖 Generated with Claude Code