
Update pre-recorded evals doc with Universal-3-Pro evaluation guidance #655

Open
devin-ai-integration[bot] wants to merge 2 commits into main from devin/1771624157-update-prerecorded-evals

Conversation


devin-ai-integration bot commented Feb 20, 2026

Expand pre-recorded evals doc with Universal-3-Pro guidance

Summary

Rewrites the pre-recorded audio evaluations page (/docs/evaluations/pre-recorded-audio) based on learnings from the prompt engineering webinar transcript and two open-source repos (prompt-seeker, aai-cli).

Key additions:

  • New metrics section: Introduces Semantic WER and LASER Score as recommended evaluation approaches for Universal-3-Pro, alongside the existing traditional metrics (which are preserved but reorganized under a "Traditional metrics" subsection)
  • Ground truth quality guidance: Warns that human-labeled datasets often contain errors that Universal-3-Pro now catches, with specific examples of common issues
  • Expanded evaluation process: Adds baseline establishment step, prompt crafting principles, dataset diversity table, and normalization caveats for [unclear]/[masked] tags
  • Prompt iteration section: Manual and automated approaches, with reference prompts and tables for what works vs. what to avoid
  • LLM-as-judge caveat: Warns that LLM judges can be misled by translated code-switching that reads well but isn't what was spoken
  • Open-source tools section: Documents aai-cli and prompt-seeker with usage examples
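For context on the metrics discussed above, the traditional word error rate that the new Semantic WER and LASER sections build on can be sketched as a token-level edit distance. This is a minimal illustration only, not the implementation used in the doc or in aai-cli/prompt-seeker:

```python
# Illustrative word-level WER: Levenshtein distance over whitespace tokens,
# normalized by reference length. Counts substitutions, insertions, deletions.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Note that this raw count is exactly why the insertions warning matters: every insertion raises the numerator even when the "error" is the model correctly transcribing words missing from the ground truth.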

Updates since last revision (Opus feedback)

Addressed all feedback items from @ryanseams's Opus review:

  • Renamed "Vibes vs metrics" to "Qualitative analysis" and moved it after Step 5 (Compare and calculate), before "Iterating on prompts"
  • Added forward references from Semantic WER and LASER descriptions to the open-source tools section with repo links
  • Added dataset size guidance: "at least 25 files", emphasizing diversity of audio conditions over length
  • Replaced reference prompts with two specific prompts: the current system prompt as the evaluation prompt, and a comparison prompt with [unclear] tags for identifying model uncertainty
  • Made insertions warning more actionable: now recommends auditing at least 20 insertions before reporting WER
  • Updated Artificial Analysis reference: linked to https://artificialanalysis.ai/speech-to-text with explanation about creating proprietary datasets with cleaned ground truths
  • Clarified cpWER S_spk definition: explicitly states it counts both word substitutions and correctly transcribed words assigned to the wrong speaker
  • Expanded LASER penalty explanation: added per-word penalty scale (0 = no penalty, 0.5 = minor, 1.0 = major) with examples for each tier
  • Made "examples in prompts" a Warning callout: strong callout explaining that listing specific words causes hallucinations
  • Updated intro CTA: mentions evaluation and prompt optimization help; removed duplicate closing CTA
  • Aligned all prompt language with the prompting guide
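As a purely hypothetical sketch of how the per-word penalty tiers described above (0 = no penalty, 0.5 = minor, 1.0 = major) might roll up into a single score. The actual LASER scoring is defined in the cited Parulekar & Jyothi paper; `aggregate_penalties` is an invented helper for illustration only:

```python
# Hypothetical aggregation of per-word penalties into a 0-1 score
# (1.0 = no penalized words). NOT the scoring from the LASER paper.
def aggregate_penalties(penalties: list[float]) -> float:
    allowed = {0.0, 0.5, 1.0}
    if any(p not in allowed for p in penalties):
        raise ValueError("each penalty must be 0, 0.5, or 1.0")
    if not penalties:
        return 1.0  # empty transcript comparison: nothing to penalize
    return 1.0 - sum(penalties) / len(penalties)

# Two clean words, one minor issue, one major issue:
print(aggregate_penalties([0.0, 0.0, 0.5, 1.0]))  # 1 - 1.5/4 = 0.625
```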

Review & Testing Checklist for Human

  • Verify evaluation prompt matches the current system prompt — the doc states this is the "current system prompt" from the prompting guide. If it drifts, this will be stale. Cross-check with /docs/pre-recorded-audio/prompting#system-prompts.
  • Confirm LASER paper citation is correct — cited as Parulekar & Jyothi, EMNLP 2025, linked to aclanthology.org/2025.emnlp-main.1257/. Verify the link resolves and the attribution is accurate.
  • Review the Warning callout about examples causing hallucinations — this is a strong claim ("causes hallucinations"). User requested this be emphatic, but verify it's appropriate for official docs.
  • Confirm linking to external repos (alexkroman/aai-cli and AssemblyAI-Solutions/prompt-seeker) is approved for official documentation.
  • Preview the rendered page via the Fern preview deployment to verify all changes render correctly (tables, Warning callouts, LaTeX formulas, code blocks, links).

Notes

  • Requested by @ryanseams
  • Link to Devin run
  • Existing traditional metrics content is fully preserved, just moved under #### headings within a new ### Traditional metrics subsection. This changes the in-page anchors, so any existing deep links to those sections will break.
  • fern check --warnings passes locally. Vale lint warnings are pre-existing heading capitalization issues across the repo, not introduced by this PR.
  • Required CI checks (fern-check, run) both pass.

Co-Authored-By: Ryan Seams <ryan.seams@gmail.com>