[Evaluation] Fix red team status tracking, cache key mismatch, and evaluation error handling by slister1001 · Pull Request #45517 · Azure/azure-sdk-for-python

slister1001 · 2026-03-04T21:11:49Z

Fixes three bugs discovered during the red team SDK bug bash:

Bug 1 - Run status stuck at in_progress: _determine_run_status() now treats leftover pending and running entries as failed instead of in_progress. By the time this method runs the scan is finished, so pending entries (from skipped risk categories or Foundry execution failures) indicate failure, not ongoing work.
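The status rule described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the function name `determine_run_status`, the nested dictionary layout, and the `completed` status string are assumptions made for the example.

```python
from typing import Dict

# Assumption: statuses that count as failure once the scan has finished.
# "pending" and "running" are included because nothing can still be in flight.
FAILED_STATUSES = {"failed", "pending", "running"}


def determine_run_status(red_team_info: Dict[str, Dict[str, Dict[str, str]]]) -> str:
    """Return "failed" if any strategy/risk-category entry did not complete.

    The layout {strategy: {risk_category: {"status": ...}}} is hypothetical.
    """
    for risk_entries in red_team_info.values():
        for entry in risk_entries.values():
            if entry.get("status") in FAILED_STATUSES:
                return "failed"
    return "completed"
```

With this rule, a leftover `pending` entry can no longer hold the whole run at `in_progress`.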

Bug 2 - ungrounded_attributes silently skipped: _execute_attacks_with_foundry() now uses get_attack_objective_from_risk_category() to build the cache lookup key, matching the caching logic in _get_attack_objectives(). Previously, objectives were cached under isa but looked up under ungrounded_attributes, causing the category to appear to have 0 objectives despite the API returning 100.
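The mismatch can be reproduced in a few lines. The sketch below is illustrative only: `objective_name_for` is a hypothetical stand-in for `get_attack_objective_from_risk_category()`, the flat dictionary cache is an assumption, and the only mapping taken from this PR is `ungrounded_attributes` to `isa`.

```python
from typing import Dict, List

# Only this mapping is confirmed by the PR description; anything else passes through.
OBJECTIVE_NAME_BY_RISK = {"ungrounded_attributes": "isa"}


def objective_name_for(risk_category: str) -> str:
    """Hypothetical stand-in for get_attack_objective_from_risk_category()."""
    return OBJECTIVE_NAME_BY_RISK.get(risk_category, risk_category)


def cache_objectives(cache: Dict[str, List[str]], risk_category: str, objectives: List[str]) -> None:
    # Writer side: objectives are stored under the mapped name ("isa").
    cache[objective_name_for(risk_category)] = objectives


def lookup_raw(cache: Dict[str, List[str]], risk_category: str) -> List[str]:
    # Old reader side: uses the raw risk value, so the key never matches.
    return cache.get(risk_category, [])


def lookup_mapped(cache: Dict[str, List[str]], risk_category: str) -> List[str]:
    # Fixed reader side: builds the key the same way the writer does.
    return cache.get(objective_name_for(risk_category), [])


cache: Dict[str, List[str]] = {}
cache_objectives(cache, "ungrounded_attributes", ["objective-1", "objective-2"])
```

Here `lookup_raw(cache, "ungrounded_attributes")` returns an empty list while `lookup_mapped` returns both objectives, which mirrors the "0 objectives" symptom.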

Bug 3 - ServiceInvocationException inflating ASR: RAIServiceScorer now detects when the RAI evaluation service returns an error response (properties.outcome == 'error') and raises RuntimeError, causing PyRIT to treat the score as UNDETERMINED. Previously, the erroneous passed=False from error responses was incorrectly treated as attack success, inflating the protected_material ASR from 0% to 50%.
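A rough sketch of the error check follows. It assumes the evaluation result is a dictionary with a `passed` flag and a nested `properties.outcome` field, as the description suggests. The helper name `passed_or_raise` and the exact response shape are hypothetical, not the scorer's real interface.

```python
from typing import Any, Dict


def passed_or_raise(eval_result: Dict[str, Any]) -> bool:
    """Return the `passed` flag, or raise if the service reported an error.

    Assumption: error responses carry properties.outcome == "error" and a
    meaningless passed=False, so the flag must not be read in that case.
    """
    properties = eval_result.get("properties") or {}
    if properties.get("outcome") == "error":
        # Raising lets the caller record the score as undetermined
        # instead of counting it as a successful attack.
        raise RuntimeError("RAI evaluation service returned an error response")
    return bool(eval_result.get("passed", False))
```

Raising rather than returning a default keeps failed evaluations out of the ASR numerator.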

…r handling

Bug 1 - Status tracking: _determine_run_status now treats 'pending' and 'running' entries as 'failed' instead of 'in_progress'. By the time this method runs the scan is finished, so leftover 'pending' entries (from skipped risk categories or Foundry execution failures) indicate failure, not ongoing work.

Bug 2 - Cache key mismatch: _execute_attacks_with_foundry now uses get_attack_objective_from_risk_category() to build the cache lookup key, matching the caching logic in _get_attack_objectives. Previously, ungrounded_attributes objectives were cached under 'isa' but looked up under 'ungrounded_attributes', causing them to be silently skipped.

Bug 3 - Evaluation error handling: RAIServiceScorer now detects when the RAI evaluation service returns an error response (properties.outcome == 'error', e.g. ServiceInvocationException) and raises RuntimeError. This causes PyRIT to treat the score as UNDETERMINED instead of using the erroneous passed=False to incorrectly mark the attack as successful, which was inflating ASR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR fixes three bugs found during the red team SDK bug bash:

Run status stuck at in_progress: Treats leftover pending and running statuses as failed since the scan has already finished.
ungrounded_attributes silently skipped: Fixes a cache key mismatch by using get_attack_objective_from_risk_category() instead of the raw risk value for the baseline cache lookup key.
ServiceInvocationException inflating ASR: Detects error responses from the RAI evaluation service and raises RuntimeError so scores are marked as UNDETERMINED rather than being incorrectly treated as attack success.

Changes:

Updated _determine_run_status() to collapse pending/running into the failure set
Fixed cache key construction in _execute_attacks_with_foundry() to match the caching logic
Added error-outcome detection in RAIServiceScorer._score_piece_async() to prevent false attack-success counts

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `_result_processor.py` | Treats `pending`/`running` as terminal failures in `_determine_run_status()` |
| `_red_team.py` | Uses `get_attack_objective_from_risk_category()` for consistent cache key lookup |
| `_rai_scorer.py` | Detects `properties.outcome == "error"` and raises `RuntimeError` for undetermined scoring |
| `CHANGELOG.md` | Documents the three bug fixes |

slister1001 and others added 2 commits March 4, 2026 16:05

Add changelog entries for status tracking, cache key, and scoring fixes

3016c92

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 4, 2026 21:11

slister1001 requested a review from a team as a code owner March 4, 2026 21:11

Copilot AI reviewed Mar 4, 2026

github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Mar 4, 2026

Copilot started reviewing on behalf of slister1001 March 4, 2026 21:17

slister1001 wants to merge 2 commits into main from fix/redteam-bugbash-status-scoring-cache