Wisteria is in beta — these docs are evolving fast.

Setting up an oral question

Oral questions test what a learner can say aloud, which makes them useful for customer-facing roles, scripted procedures, and soft-skills assessment. They’re the question type that most sets Wisteria apart.

What a learner sees

  1. The question prompt.
  2. A mic button.
  3. They tap the button, speak, then tap stop.
  4. Wisteria sends the audio to OpenAI Whisper, which transcribes it.
  5. Wisteria sends the transcript + your model answer + keywords + threshold to Claude, which grades.
  6. The learner sees their transcript, a pass/fail, and a score.

Total time: about 5–10 seconds per question after they stop speaking.
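
The flow above can be sketched as a small pipeline. This is a minimal illustration and not Wisteria's actual code: `run_oral_attempt`, `OralResult`, and the two injected callables are hypothetical names, and the stand-in functions replace the real Whisper and Claude calls so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class OralResult:
    transcript: str
    score: float
    passed: bool


def run_oral_attempt(
    audio: bytes,
    model_answer: str,
    keywords: Sequence[str],
    threshold: float,
    transcribe: Callable[[bytes], str],
    grade: Callable[[str, str, Sequence[str]], float],
) -> OralResult:
    """Transcribe the audio, grade the transcript, compare against the threshold."""
    transcript = transcribe(audio)  # step 4: speech-to-text
    score = grade(transcript, model_answer, keywords)  # step 5: 0-100 score
    return OralResult(transcript, score, score >= threshold)  # step 6


# Stand-ins (assumptions) so the sketch runs without network calls.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_grade(transcript: str, model_answer: str, keywords: Sequence[str]) -> float:
    return 82.0


result = run_oral_attempt(
    audio=b"I greet the guest and check the booking system",
    model_answer="First, I greet them warmly and check the booking system.",
    keywords=["greet", "booking system"],
    threshold=70,
    transcribe=fake_transcribe,
    grade=fake_grade,
)
print(result)
```

Passing the transcription and grading steps in as callables keeps the sketch independent of any particular API client.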

Setting one up

On the Quiz tab of a module, click + Add question → pick Oral. The oral editor has four fields:

1. Question text

The prompt the learner reads before answering. Make it specific and unambiguous.

Good: “A guest arrives at reception and tells you they can’t find their booking. How do you handle the situation?”

Bad: “What’s good customer service?”

The first sets up a concrete scenario; the second invites an opinion.

2. Model answer

The full ideal response, written out. This is what Claude compares the transcript against semantically.

Don’t paste a script the learner is supposed to memorise — write what a good answer would naturally sound like, in your team’s voice.

Example:

First, I greet them warmly and apologise for the inconvenience. I ask for their name and check the booking system carefully; sometimes bookings are filed under a slightly different name or date. If I can’t find it, I tell them I’ll resolve it and offer them a seat in the lounge while I check with the reception manager. I always confirm their booking details once found.

3. Keywords

Three to six required words or phrases. These contribute 40% of the score (the other 60% is semantic similarity).

Good keywords are:

  • Specific (not “good service” — too vague)
  • Distinct (not “greet” AND “say hello” — too redundant)
  • Inevitable for a correct answer (a good response can’t avoid them)

For the example above, keywords might be: greet, apologise, booking system, name, manager, confirm.

4. Pass threshold

Default: 70%. Acceptable range: 50–95%.

  • 70% — reasonable bar for most content. Learners can pass with one minor miss.
  • 85% — high bar; learners need to nail most keywords AND have strong semantic match.
  • 50% — practice mode; useful for early drills where you want to give learners a sense of the task without strict gating.

We don’t recommend going below 50%; at that point the test isn’t gating anything.
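
To make the four fields concrete, here is a minimal sketch of a question definition with the limits described above (three to six distinct keywords, threshold between 50 and 95). The `OralQuestion` class, its field names, and the `problems` method are assumptions for illustration; they are not Wisteria's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OralQuestion:
    question_text: str
    model_answer: str
    keywords: List[str]
    pass_threshold: int = 70  # documented default

    def problems(self) -> List[str]:
        """Return a list of setup issues; an empty list means the fields look fine."""
        issues = []
        if not 3 <= len(self.keywords) <= 6:
            issues.append("use three to six keywords")
        lowered = [k.strip().lower() for k in self.keywords]
        if len(set(lowered)) != len(lowered):
            issues.append("keywords must be distinct")
        if not 50 <= self.pass_threshold <= 95:
            issues.append("threshold must be between 50 and 95")
        return issues


question = OralQuestion(
    question_text=(
        "A guest arrives at reception and tells you they can't find "
        "their booking. How do you handle the situation?"
    ),
    model_answer="First, I greet them warmly and apologise for the inconvenience.",
    keywords=["greet", "apologise", "booking system", "name", "manager", "confirm"],
)
print(question.problems())
```

This check only covers the countable rules. Whether a keyword is specific enough, or inevitable for a correct answer, still needs a human judgment.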

Scoring math

For a given transcript:

  1. Keyword coverage — count how many required keywords are present (case-insensitive, semi-fuzzy match). Score: (found_keywords / total_keywords) × 40.
  2. Semantic similarity — Claude (Haiku) compares the transcript with the model answer and returns a 0–100 alignment score. Score: claude_score × 0.60.
  3. Total — sum of the above, capped at 100.

If total ≥ threshold, the learner passes the question.
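
The arithmetic above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the semantic score is passed in as a plain number instead of coming from Claude, and the keyword match is a simple case-insensitive substring test standing in for the semi-fuzzy matcher, whose exact rules are not documented here.

```python
from typing import Sequence

KEYWORD_WEIGHT = 40      # points available for keyword coverage
SEMANTIC_WEIGHT = 0.60   # multiplier applied to the 0-100 semantic score


def keyword_points(transcript: str, keywords: Sequence[str]) -> float:
    """(found_keywords / total_keywords) x 40, using a substring match (assumption)."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    text = transcript.lower()
    found = sum(1 for keyword in keywords if keyword.lower() in text)
    return found / len(keywords) * KEYWORD_WEIGHT


def total_score(transcript: str, keywords: Sequence[str], claude_score: float) -> float:
    """Sum of keyword and semantic points, capped at 100."""
    total = keyword_points(transcript, keywords) + claude_score * SEMANTIC_WEIGHT
    return min(total, 100.0)


keywords = ["greet", "apologise", "booking system", "name", "manager", "confirm"]
transcript = (
    "I greet the guest, apologise, ask for their name "
    "and check the booking system."
)

# Four of six keywords are present; the semantic score of 80 is an example value.
score = total_score(transcript, keywords, claude_score=80)
print(round(score, 1), score >= 70)
```

In this example the keywords contribute 4/6 × 40 ≈ 26.7 points and the semantic score contributes 80 × 0.60 = 48, for a total of about 74.7, which passes at the default 70% threshold.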

Retries

A learner gets two immediate attempts at any oral question. If both fail:

  • The question is appended to the end of the quiz array as a copy flagged _isRetry.
  • When the learner reaches the retry, they get two fresh attempts on the same question.
  • If those also fail, the question is finally marked wrong in the results.

The retry-at-end pattern lets learners cool down and re-attempt after seeing other questions. Anxiety often makes first attempts worse than the learner’s actual ability.
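
The retry rules can be simulated with a queue. This is a sketch of the behaviour described above, not Wisteria's implementation: `run_quiz`, the dictionary layout, and the scripted outcomes are assumptions, and only the `_isRetry` flag name comes from the docs. For simplicity it treats every question as an oral question.

```python
from collections import deque
from typing import Callable, Dict, Sequence

ATTEMPTS_PER_VISIT = 2  # two attempts each time the learner reaches the question


def run_quiz(
    question_ids: Sequence[str],
    attempt_passes: Callable[[str], bool],
) -> Dict[str, bool]:
    """Return the final pass/fail per question after applying the retry rules."""
    queue = deque({"id": qid, "_isRetry": False} for qid in question_ids)
    results: Dict[str, bool] = {}
    while queue:
        item = queue.popleft()
        # any() stops at the first passing attempt, so a pass uses no extra attempts.
        passed = any(attempt_passes(item["id"]) for _ in range(ATTEMPTS_PER_VISIT))
        if passed:
            results[item["id"]] = True
        elif not item["_isRetry"]:
            # Both attempts failed: queue a retry copy at the end of the quiz.
            queue.append({"id": item["id"], "_isRetry": True})
        else:
            # The retry copy also failed twice: mark the question wrong.
            results[item["id"]] = False
    return results


# Scripted attempt outcomes per question (example data).
outcomes = {
    "q1": iter([True]),
    "q2": iter([False, False, True]),
    "q3": iter([False, False, False, False]),
}
results = run_quiz(["q1", "q2", "q3"], lambda qid: next(outcomes[qid]))
print(results)
```

Here q1 passes immediately, q2 fails twice and then passes on its retry, and q3 uses all four attempts and is marked wrong.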

Common mistakes setting up oral questions

Model answer too short

A two-sentence model answer doesn’t give Claude enough signal to grade semantic similarity. Aim for at least three sentences; three to five usually works well.

Model answer too rigid

If your model answer reads like a script (“Step 1: …. Step 2: ….”), Claude penalises learners for any deviation. Write a good answer, not the only answer.

Keywords too generic

Words like good, service, and customer appear in most answers and don’t discriminate. Use specific terms a correct answer requires.

Threshold too high

Setting the threshold to 95% means the learner needs near-perfect transcription and near-perfect keyword coverage. Whisper isn’t 100% accurate: accents, background noise, and technical terms can produce minor transcription errors that compound. 70–85% is usually right.
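
A quick calculation shows how little room a 95% threshold leaves. The sketch below applies the documented 40/60 weighting; the function name and the example inputs are assumptions chosen for illustration.

```python
def best_possible_score(
    total_keywords: int,
    missed_keywords: int,
    claude_score: float = 100.0,
) -> float:
    """Total score given the number of missed keywords and a semantic score."""
    found = total_keywords - missed_keywords
    return found / total_keywords * 40 + claude_score * 0.60


# One keyword out of six lost to a transcription error, perfect semantic match.
one_miss = best_possible_score(total_keywords=6, missed_keywords=1)

for threshold in (70, 85, 95):
    print(threshold, round(one_miss, 1), one_miss >= threshold)
```

With six keywords, a single missed keyword caps the total at about 93.3 even when the semantic score is a perfect 100, so the learner passes at 70% and 85% but fails at 95%. Likewise, full keyword coverage with a semantic score of 90 gives 40 + 54 = 94, which also falls short of 95%.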

Forgetting that learners speak fast, quietly, or with accents

Test the question yourself before publishing. Speak normally. Check the transcript. If Whisper consistently mishears a key term, consider rewording the model answer to use a more transcribable alternative.

When oral questions are the wrong choice

  • Quiet workplace — learners can’t speak aloud at work (or feel weird doing so). They’ll skip the quiz entirely.
  • Languages Whisper struggles with — Whisper performs best in English. Other languages work but accuracy varies. Test with native speakers before relying on it.
  • Questions where exact wording matters — use Fill in blank instead. Oral grades meaning, not syntax.

Cost

Grading one oral question costs about a tenth of a cent: Whisper transcription is cheap, and Claude Haiku is the low-cost Claude model. This cost is included in your subscription.

Audit trail

Every oral attempt records the audio (kept for 30 days for dispute resolution), the transcript, the score, and the pass/fail result. These are available in the audit log under quiz.attempt.oral.
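
As a rough picture of what such a record holds, here is a sketch of an audit entry with the 30-day audio window. The class and field names are hypothetical; only the event name quiz.attempt.oral, the recorded items, and the 30-day retention come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

AUDIO_RETENTION = timedelta(days=30)  # audio is kept for 30 days


@dataclass
class OralAttemptRecord:
    transcript: str
    score: float
    passed: bool
    recorded_at: datetime
    event: str = "quiz.attempt.oral"

    def audio_available(self, now: datetime) -> bool:
        """True while the audio is still inside its retention window."""
        return now < self.recorded_at + AUDIO_RETENTION


recorded = datetime(2025, 1, 1, tzinfo=timezone.utc)  # example timestamp
record = OralAttemptRecord(
    transcript="I greet the guest and check the booking system.",
    score=74.7,
    passed=True,
    recorded_at=recorded,
)
print(record.audio_available(recorded + timedelta(days=29)))
print(record.audio_available(recorded + timedelta(days=31)))
```

The transcript, score, and result stay on the record after the audio window closes, so a dispute raised later can still be checked against the text.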

Privacy

Audio is transmitted to OpenAI Whisper (US-based). Whisper doesn’t retain audio for training. Wisteria stores the audio in Supabase Storage for 30 days, then deletes it. Transcripts are retained indefinitely (they’re text and small).

If your privacy regime requires audio to never leave your jurisdiction, talk to us. There is currently no EU or regional Whisper option, but one is on the roadmap.
