
Update pre-recorded evals doc with Universal-3-Pro evaluation guidance #655

Open
devin-ai-integration[bot] wants to merge 2 commits into main from devin/1771624157-update-prerecorded-evals

Conversation


devin-ai-integration bot commented Feb 20, 2026

Expand pre-recorded evals doc with Universal-3-Pro guidance

Summary

Rewrites the pre-recorded audio evaluations page (/docs/evaluations/pre-recorded-audio) based on learnings from the prompt engineering webinar transcript and two open-source repos (prompt-seeker, aai-cli).

Key additions:

  • New metrics section: Introduces Semantic WER and LASER Score as recommended evaluation approaches for Universal-3-Pro, alongside the existing traditional metrics (which are preserved but reorganized under a "Traditional metrics" subsection)
  • Ground truth quality guidance: Warns that human-labeled datasets often contain errors that Universal-3-Pro now catches, with specific examples of common issues
  • Expanded evaluation process: Adds baseline establishment step, prompt crafting principles, dataset diversity table, and normalization caveats for [unclear]/[masked] tags
  • Prompt iteration section: Manual and automated approaches, with reference prompts and tables for what works vs. what to avoid
  • LLM-as-judge caveat: Warns that LLM judges can be misled by translated code-switching that reads well but isn't what was spoken
  • Open-source tools section: Documents aai-cli and prompt-seeker with usage examples
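For context on the metrics discussed above, the traditional word error rate that the new Semantic WER and LASER sections build on can be sketched as a token-level edit distance. This is a minimal illustration only, not the implementation used in the doc or in aai-cli/prompt-seeker:

```python
# Illustrative word-level WER: Levenshtein distance over whitespace tokens,
# normalized by reference length. Counts substitutions, insertions, deletions.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Note that this raw count is exactly why the insertions warning matters: every insertion raises the numerator even when the "error" is the model correctly transcribing words missing from the ground truth.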

Updates since last revision (Opus feedback)

Addressed all feedback items from @ryanseams's Opus review:

  • Renamed "Vibes vs metrics" to "Qualitative analysis" and moved it after Step 5 (Compare and calculate), before "Iterating on prompts"
  • Added forward references from Semantic WER and LASER descriptions to the open-source tools section with repo links
  • Added dataset size guidance: "at least 25 files", emphasizing diversity of audio conditions over length
  • Replaced reference prompts with two specific prompts: the current system prompt as the evaluation prompt, and a comparison prompt with [unclear] tags for identifying model uncertainty
  • Made insertions warning more actionable: now recommends auditing at least 20 insertions before reporting WER
  • Updated Artificial Analysis reference: linked to https://artificialanalysis.ai/speech-to-text with explanation about creating proprietary datasets with cleaned ground truths
  • Clarified cpWER S_spk definition: explicitly states it counts both word substitutions and correctly transcribed words assigned to the wrong speaker
  • Expanded LASER penalty explanation: added per-word penalty scale (0 = no penalty, 0.5 = minor, 1.0 = major) with examples for each tier
  • Made "examples in prompts" a Warning callout: strong callout explaining that listing specific words causes hallucinations
  • Updated intro CTA: mentions evaluation and prompt optimization help; removed duplicate closing CTA
  • Aligned all prompt language with the prompting guide
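As a purely hypothetical sketch of how the per-word penalty tiers described above (0 = no penalty, 0.5 = minor, 1.0 = major) might roll up into a single score. The actual LASER scoring is defined in the cited Parulekar & Jyothi paper; `aggregate_penalties` is an invented helper for illustration only:

```python
# Hypothetical aggregation of per-word penalties into a 0-1 score
# (1.0 = no penalized words). NOT the scoring from the LASER paper.
def aggregate_penalties(penalties: list[float]) -> float:
    allowed = {0.0, 0.5, 1.0}
    if any(p not in allowed for p in penalties):
        raise ValueError("each penalty must be 0, 0.5, or 1.0")
    if not penalties:
        return 1.0  # empty transcript comparison: nothing to penalize
    return 1.0 - sum(penalties) / len(penalties)

# Two clean words, one minor issue, one major issue:
print(aggregate_penalties([0.0, 0.0, 0.5, 1.0]))  # 1 - 1.5/4 = 0.625
```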

Review & Testing Checklist for Human

  • Verify evaluation prompt matches the current system prompt — the doc states this is the "current system prompt" from the prompting guide. If it drifts, this will be stale. Cross-check with /docs/pre-recorded-audio/prompting#system-prompts.
  • Confirm LASER paper citation is correct — cited as Parulekar & Jyothi, EMNLP 2025, linked to aclanthology.org/2025.emnlp-main.1257/. Verify the link resolves and the attribution is accurate.
  • Review the Warning callout about examples causing hallucinations — this is a strong claim ("causes hallucinations"). User requested this be emphatic, but verify it's appropriate for official docs.
  • Confirm linking to external repos (alexkroman/aai-cli and AssemblyAI-Solutions/prompt-seeker) is approved for official documentation.
  • Preview the rendered page via the Fern preview deployment to verify all changes render correctly (tables, Warning callouts, LaTeX formulas, code blocks, links).

Notes

  • Requested by @ryanseams
  • Link to Devin run
  • Existing traditional metrics content is fully preserved, just moved under #### headings within a new ### Traditional metrics subsection. This changes the in-page anchors, so any existing deep links to those sections will break.
  • fern check --warnings passes locally. Vale lint warnings are pre-existing heading capitalization issues across the repo, not introduced by this PR.
  • Required CI checks (fern-check, run) both pass.

Co-Authored-By: Ryan Seams <ryan.seams@gmail.com>