fix: Support CP2K 2025 output format for energy and forces (fixes #850)#947
fix: Support CP2K 2025 output format for energy and forces (fixes #850)#947newtontech wants to merge 4 commits intodeepmodeling:masterfrom
Conversation
…pmodeling#850) This commit adds support for parsing CP2K 2025 version output files: **Changes in CP2K 2025 format:** 1. Energy line format changed from: 'ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.): -7.997403996236343' to: 'ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725' 2. Forces output format changed from: 'ATOMIC FORCES in [a.u.]' table with ' Atom Kind Element X Y Z' header to: 'FORCES| Atomic forces [hartree/bohr]' with 'FORCES| Atom x y z |f|' prefix lines **Implementation:** - Detect CP2K 2025 format by checking for 'energy [hartree]' in the content - Parse energy from new '[hartree]' format - Parse forces from new 'FORCES|' prefixed lines - Maintain backward compatibility with CP2K 2023 format **Testing:** - Added test file for CP2K 2025 format (tests/cp2k/cp2k_2025_output/) - Added test case TestCp2k2025Output to verify parsing - Added regression test TestCp2k2023FormatStillWorks to ensure backward compatibility - All existing CP2K tests pass
for more information, see https://pre-commit.ci
📝 WalkthroughWalkthroughAdds CP2K 2025 output support to dpdata by detecting a "energy [hartree]" token and branching parsing logic to extract energies and forces from the new CP2K 2025 line formats while preserving CP2K 2023 behavior. New test fixtures and unit tests validate parsing and edge cases. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@dpdata/cp2k/output.py`:
- Around line 438-454: In the is_cp2k_2025 branch the fallback loop may leave
energy uninitialized if no numeric token is found; modify the block around
is_cp2k_2025 / parts / hartree_idx so that if neither the "[hartree]" lookup nor
the reversed parts numeric search sets energy you raise a clear exception or log
an error with context (e.g., include the original line string ii and parts)
instead of continuing silently; ensure the raised error (ValueError or custom)
describes that energy could not be parsed from the CP2K 2025 output so the later
assertion yields a helpful message.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
dpdata/cp2k/output.pytests/cp2k/cp2k_2025_output/cp2k_2025_outputtests/cp2k/cp2k_2025_output/deepmd/set.000/box.npytests/cp2k/cp2k_2025_output/deepmd/set.000/coord.npytests/cp2k/cp2k_2025_output/deepmd/set.000/energy.npytests/cp2k/cp2k_2025_output/deepmd/set.000/force.npytests/cp2k/cp2k_2025_output/deepmd/set.000/virial.npytests/cp2k/cp2k_2025_output/deepmd/type.rawtests/cp2k/cp2k_2025_output/deepmd/type_map.rawtests/test_cp2k_2025_output.py
| if is_cp2k_2025: | ||
| # Find the energy value after "[hartree]" | ||
| parts = ii.split() | ||
| try: | ||
| hartree_idx = parts.index("[hartree]") | ||
| energy = parts[hartree_idx + 1] | ||
| except (ValueError, IndexError): | ||
| # Fallback: try to find energy value in the line | ||
| for part in reversed(parts): | ||
| try: | ||
| float(part) | ||
| energy = part | ||
| break | ||
| except ValueError: | ||
| continue | ||
| else: | ||
| energy = ii.split()[8] |
There was a problem hiding this comment.
Potential uninitialized variable in fallback path.
If "[hartree]" is not found in parts and the fallback loop doesn't find any numeric value, energy will not be set for this line, but the code continues silently. While the assertion on line 491 will eventually catch this, the error message won't be helpful for debugging malformed CP2K 2025 output.
Consider adding explicit error handling:
Proposed fix
if is_cp2k_2025:
# Find the energy value after "[hartree]"
parts = ii.split()
try:
hartree_idx = parts.index("[hartree]")
energy = parts[hartree_idx + 1]
except (ValueError, IndexError):
# Fallback: try to find energy value in the line
for part in reversed(parts):
try:
float(part)
energy = part
break
except ValueError:
continue
+ else:
+ # Loop completed without finding energy
+ raise RuntimeError(f"Cannot parse energy from CP2K 2025 line: {ii}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dpdata/cp2k/output.py` around lines 438 - 454, In the is_cp2k_2025 branch the
fallback loop may leave energy uninitialized if no numeric token is found;
modify the block around is_cp2k_2025 / parts / hartree_idx so that if neither
the "[hartree]" lookup nor the reversed parts numeric search sets energy you
raise a clear exception or log an error with context (e.g., include the original
line string ii and parts) instead of continuing silently; ensure the raised
error (ValueError or custom) describes that energy could not be parsed from the
CP2K 2025 output so the later assertion yields a helpful message.
Merging this PR will not alter performance
|
|
please also address the review comments of the AI |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #947 +/- ##
==========================================
- Coverage 86.37% 86.30% -0.07%
==========================================
Files 86 86
Lines 8086 8105 +19
==========================================
+ Hits 6984 6995 +11
- Misses 1102 1110 +8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Add tests for energy parsing with extra whitespace - Add tests for FORCES| header line filtering (Atom x y z, Atomic forces) - Add integration test for CP2K 2025 format with LabeledSystem - Improve code coverage for CP2K 2025 format support
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/test_cp2k_2025_output.py (2)
175-203: Strengthen force-header edge tests with value assertions.Both tests only assert atom count (
shape[1] == 2). Adding one component check makes header filtering regressions easier to catch.Suggested enhancement
def test_cp2k2025_force_parsing_with_header_lines(self): @@ try: system = dpdata.LabeledSystem(fname, fmt="cp2k/output") self.assertEqual(system.data["forces"].shape[1], 2) + self.assertAlmostEqual(system.data["forces"][0][0][0], -2.94874881, places=5) finally: os.unlink(fname) @@ def test_cp2k2025_force_parsing_with_atomic_forces_line(self): @@ try: system = dpdata.LabeledSystem(fname, fmt="cp2k/output") self.assertEqual(system.data["forces"].shape[1], 2) + self.assertAlmostEqual(system.data["forces"][0][0][0], -2.94874881, places=5) finally: os.unlink(fname)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cp2k_2025_output.py` around lines 175 - 203, Update the two tests test_cp2k2025_force_parsing_with_header_lines and test_cp2k2025_force_parsing_with_atomic_forces_line to also assert actual force values (not just atom count) to catch header-filtering regressions: after creating the LabeledSystem (system = dpdata.LabeledSystem(...)) and before unlinking, fetch forces = system.data["forces"] and add assertions on specific components (e.g., forces[0,0] or forces[0,1] / forces[1,0], comparing to the expected parsed float values from create_cp2k_output_2025's forces_lines) so the tests validate both shape and numeric content.
1-207: Please run the targeted format test module before merge.Given this is a format-specific parser/test addition, run the focused test module as a pre-merge gate.
Based on learnings: For format-specific changes, run relevant test modules:
cd tests && python -m unittest test_<format>*.py; for core system changes runcd tests && python -m unittest test_system*.py test_multisystems.py.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_cp2k_2025_output.py` around lines 1 - 207, Summary: Reviewer requests running the format-specific tests for CP2K 2025 before merging. Run the targeted test module tests/test_cp2k_2025_output.py (which contains TestCp2k2025Output, TestCp2k2025EdgeCases, and helper create_cp2k_output_2025) locally with unittest (cd tests && python -m unittest test_cp2k_2025_output.py) and confirm all cases pass (energy/forces extraction and edge-case variants); if failures occur, fix the parser code paths referenced by fmt="cp2k/output" and the energy/force parsing logic, re-run the same command, and add this module to the CI pre-merge test pattern so format-specific tests are always executed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/test_cp2k_2025_output.py`:
- Around line 114-117: Replace truthiness checks with explicit None checks:
change "if energy_line:" to "if energy_line is not None:" and similarly change
any checks that currently use truthiness for override inputs to "is not None"
(e.g., the override input variables used in the second block around the f.write
call). Update both the energy_line branch and the override input branch (the
latter referenced in the comment as lines 127-130) so empty strings/empty lists
are treated as valid overrides instead of falling back to defaults.
- Around line 163-172: The test
test_cp2k2025_energy_parsing_with_extra_whitespace currently only checks for
presence of energies; add an explicit numeric assertion that the parsed energy
equals the expected value from the input string (-7.364190264587725) to ensure
correct parsing. Locate the call that constructs the system via
dpdata.LabeledSystem and replace or augment the non-null check with an assertion
using the test framework (e.g., self.assertAlmostEqual or self.assertEqual with
a tolerance) against the energy entry in system.data["energies"] (e.g., the
first energy element), referencing the helper create_cp2k_output_2025 for the
expected value. Ensure you account for array shape (use the correct index into
system.data["energies"]) and use a small tolerance like 1e-12 to avoid
floating-point comparison issues.
---
Nitpick comments:
In `@tests/test_cp2k_2025_output.py`:
- Around line 175-203: Update the two tests
test_cp2k2025_force_parsing_with_header_lines and
test_cp2k2025_force_parsing_with_atomic_forces_line to also assert actual force
values (not just atom count) to catch header-filtering regressions: after
creating the LabeledSystem (system = dpdata.LabeledSystem(...)) and before
unlinking, fetch forces = system.data["forces"] and add assertions on specific
components (e.g., forces[0,0] or forces[0,1] / forces[1,0], comparing to the
expected parsed float values from create_cp2k_output_2025's forces_lines) so the
tests validate both shape and numeric content.
- Around line 1-207: Summary: Reviewer requests running the format-specific
tests for CP2K 2025 before merging. Run the targeted test module
tests/test_cp2k_2025_output.py (which contains TestCp2k2025Output,
TestCp2k2025EdgeCases, and helper create_cp2k_output_2025) locally with unittest
(cd tests && python -m unittest test_cp2k_2025_output.py) and confirm all cases
pass (energy/forces extraction and edge-case variants); if failures occur, fix
the parser code paths referenced by fmt="cp2k/output" and the energy/force
parsing logic, re-run the same command, and add this module to the CI pre-merge
test pattern so format-specific tests are always executed.
| if energy_line: | ||
| f.write(energy_line + "\n") | ||
| else: | ||
| f.write( |
There was a problem hiding this comment.
Use explicit None checks for override inputs.
Line 114 and Line 127 use truthiness checks, so empty overrides ("" / []) cannot be tested and silently fall back to defaults.
Suggested fix
- if energy_line:
+ if energy_line is not None:
f.write(energy_line + "\n")
else:
f.write(
" ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725\n"
)
@@
- if forces_lines:
+ if forces_lines is not None:
for line in forces_lines:
f.write(line + "\n")
else:
f.write(
" FORCES| 1 -5.73440344E-02 2.95274914E-02 -1.50988167E-02 6.62433792E-02\n"
)Also applies to: 127-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_cp2k_2025_output.py` around lines 114 - 117, Replace truthiness
checks with explicit None checks: change "if energy_line:" to "if energy_line is
not None:" and similarly change any checks that currently use truthiness for
override inputs to "is not None" (e.g., the override input variables used in the
second block around the f.write call). Update both the energy_line branch and
the override input branch (the latter referenced in the comment as lines
127-130) so empty strings/empty lists are treated as valid overrides instead of
falling back to defaults.
| def test_cp2k2025_energy_parsing_with_extra_whitespace(self): | ||
| """Test energy parsing with extra whitespace around value (coverage for parsing robustness).""" | ||
| fname = self.create_cp2k_output_2025( | ||
| energy_line=" ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725 " | ||
| ) | ||
| try: | ||
| system = dpdata.LabeledSystem(fname, fmt="cp2k/output") | ||
| self.assertIsNotNone(system.data["energies"]) | ||
| self.assertEqual(system.data["forces"].shape[1], 2) | ||
| finally: |
There was a problem hiding this comment.
The whitespace-energy test does not validate the parsed energy value.
This test currently only checks non-null fields, so incorrect energy parsing could still pass. Add an explicit numeric assertion for Line 166 input.
Suggested fix
def test_cp2k2025_energy_parsing_with_extra_whitespace(self):
"""Test energy parsing with extra whitespace around value (coverage for parsing robustness)."""
fname = self.create_cp2k_output_2025(
energy_line=" ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725 "
)
try:
system = dpdata.LabeledSystem(fname, fmt="cp2k/output")
- self.assertIsNotNone(system.data["energies"])
+ self.assertAlmostEqual(
+ system.data["energies"][0], -200.3898256786414, places=5
+ )
self.assertEqual(system.data["forces"].shape[1], 2)
finally:
os.unlink(fname)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_cp2k_2025_output.py` around lines 163 - 172, The test
test_cp2k2025_energy_parsing_with_extra_whitespace currently only checks for
presence of energies; add an explicit numeric assertion that the parsed energy
equals the expected value from the input string (-7.364190264587725) to ensure
correct parsing. Locate the call that constructs the system via
dpdata.LabeledSystem and replace or augment the non-null check with an assertion
using the test framework (e.g., self.assertAlmostEqual or self.assertEqual with
a tolerance) against the energy entry in system.data["energies"] (e.g., the
first energy element), referencing the helper create_cp2k_output_2025 for the
expected value. Ensure you account for array shape (use the correct index into
system.data["energies"]) and use a small tolerance like 1e-12 to avoid
floating-point comparison issues.
This commit adds support for parsing CP2K 2025 version output files:
Changes in CP2K 2025 format:
Energy line format changed from: 'ENERGY| Total FORCE_EVAL ( QS ) energy (a.u.): -7.997403996236343' to: 'ENERGY| Total FORCE_EVAL ( QS ) energy [hartree] -7.364190264587725'
Forces output format changed from:
'ATOMIC FORCES in [a.u.]' table with ' Atom Kind Element X Y Z' header
to:
'FORCES| Atomic forces [hartree/bohr]' with 'FORCES| Atom x y z |f|' prefix lines
Implementation:
Testing:
Summary by CodeRabbit
New Features
Tests