(https://youtu.be/Jh7gqNhJv8s)
Overview
- Modes of speaking assessment
- Automated speech evaluation
+ Architecture
+ Automarker training and evaluation
+ Limitations
- Assessing the suitability of automated speaking tests
Modes of Speaking Assessment
Comparison between the two modes highlighted in green on the slide
An example of an automated speaking test is Linguaskill (e.g. the answerphone message task)
Architecture
Let's look at the main components of the system.
(1) Speech recogniser
(Yu & Deng, 2016) classic ASR ambiguity example: "recognize speech" vs. "wreck a nice beach"
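As a minimal sketch of what the speech recogniser step looks like in code, here is an off-the-shelf Hugging Face pipeline; the model name and the audio file path are placeholders, not the components actually used in Linguaskill.

```python
# Minimal sketch of running an off-the-shelf speech recogniser on a candidate's
# recorded response. The model name and audio path are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "response.wav" is a hypothetical recording of the candidate's answer.
result = asr("response.wav")
print(result["text"])  # e.g. "recognize speech" ... or, if it goes wrong, "wreck a nice beach"
```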
(2) WER
We evaluate the recogniser with the word error rate (WER) metric; a minimal sketch of the computation follows below.
(Knill, 2016) WER and accents
WER depends on the make-up of the data.
Variables other than accent also matter: text type (read vs. spontaneous) and proficiency level (CEFR A, B, C).
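As a concrete reference for the metric, a minimal sketch of WER as (substitutions + deletions + insertions) / reference length, computed with a word-level Levenshtein alignment; the example strings are the ones from the slide.

```python
# Word error rate (WER): (substitutions + deletions + insertions) / reference length,
# computed with a word-level Levenshtein alignment.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("recognize speech", "wreck a nice beach"))
# 2 substitutions + 2 insertions over 2 reference words = 2.0
```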
(3) Scoring features
(Pronunciation, Vocabulary, Grammar, Fluency, Coherence, Topic relevance)
An interesting example for coherence is formulaic sequences, e.g. "throw a party" (conventional) vs. "create a party" (unnatural).
Formulating an appropriate construct is important for the validity of the system.
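As an illustration only, a sketch of how per-dimension scoring features might be combined into an overall mark; the feature names follow the slide, but the weighted-sum model, the weights, and the 0-100 scale are invented and not the real automarker.

```python
# Illustrative sketch only: combining per-dimension scoring features into an
# overall mark with a simple weighted sum. Weights and scale are made up.
FEATURE_WEIGHTS = {
    "pronunciation": 0.20,
    "vocabulary": 0.20,
    "grammar": 0.20,
    "fluency": 0.15,
    "coherence": 0.15,
    "topic_relevance": 0.10,
}

def overall_score(features: dict[str, float]) -> float:
    """Each feature is assumed to be pre-scaled to 0-100."""
    return sum(FEATURE_WEIGHTS[name] * value for name, value in features.items())

print(overall_score({
    "pronunciation": 72, "vocabulary": 65, "grammar": 70,
    "fluency": 60, "coherence": 68, "topic_relevance": 80,
}))
```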
(4) Grammatical parse tree
This shows how the system structures each component of the utterance, i.e. how it evaluates grammar: the deeper the tree, the more complex the grammar.
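A toy sketch of the depth idea using NLTK trees; the bracketed parses are hand-written examples, not output of the real system.

```python
# Toy illustration: the deeper the parse tree, the more complex the grammar.
# The bracketed parses below are hand-written examples.
from nltk import Tree

simple = Tree.fromstring("(S (NP I) (VP (V threw) (NP (DT a) (NN party))))")
complex_ = Tree.fromstring(
    "(S (NP I) (VP (V think) (SBAR (IN that) (S (NP we) (VP (MD should) "
    "(VP (V throw) (NP (DT a) (NN party))))))))"
)

print(simple.height())    # shallower tree -> simpler grammar
print(complex_.height())  # deeper tree   -> more complex grammar
```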
(5) Training Data
+ Transcripts => produced by expert transcribers
+ Expert data could also be used to give feedback
Automarker Evaluation
How do we know how well the system performs?
(1) Human-machine agreement
(Exact agreement, Adjacent agreement (within 1 CEFR level diff), Mismarking (more than 1 CEFR level diff))
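A minimal sketch of computing these three agreement rates from paired human and automarker CEFR levels; the example pairs are invented.

```python
# Sketch: classifying human-machine agreement for paired CEFR levels.
# Exact agreement: same level; adjacent: within 1 level; mismarking: >1 level apart.
CEFR = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def agreement_rates(pairs: list[tuple[str, str]]) -> dict[str, float]:
    exact = adjacent = mismark = 0
    for human, machine in pairs:
        gap = abs(CEFR[human] - CEFR[machine])
        if gap == 0:
            exact += 1
        elif gap == 1:
            adjacent += 1
        else:
            mismark += 1
    n = len(pairs)
    return {"exact": exact / n, "adjacent": adjacent / n, "mismarking": mismark / n}

print(agreement_rates([("B1", "B1"), ("B2", "B1"), ("A2", "C1"), ("C1", "C1")]))
# {'exact': 0.5, 'adjacent': 0.25, 'mismarking': 0.25}
```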
(2) Scatterplot
+ Points above the identity line show where the automarker marks more leniently than the human rater
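A sketch of that scatterplot with made-up scores, just to show how the identity line is read.

```python
# Sketch of the human vs. automarker scatterplot; scores are invented.
# Points above the dashed identity line are marked more leniently by the machine.
import matplotlib.pyplot as plt

human =   [3.0, 4.5, 5.0, 6.5, 7.0, 8.0]
machine = [3.5, 4.0, 5.5, 6.5, 7.5, 7.5]

plt.scatter(human, machine)
plt.plot([0, 9], [0, 9], linestyle="--")  # identity line: machine == human
plt.xlabel("Human score")
plt.ylabel("Automarker score")
plt.show()
```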
(3) Sensitivity to non-English speech
(4) Automarker confidence => predict automarker reliability
Implementation
Hybrid Marking
(Cambridge English Hybrid Marking Model)
If the automarker is sufficiently confident, its score is released; if not, the response is re-examined by a human rater.
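A minimal sketch of that routing decision; the confidence threshold is an arbitrary placeholder, not the value used in practice.

```python
# Sketch of the hybrid marking decision: release the automarker's score when its
# confidence is high enough, otherwise send the response to a human rater.
CONFIDENCE_THRESHOLD = 0.8  # arbitrary placeholder

def route_response(auto_score: float, confidence: float) -> dict:
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"score": auto_score, "marked_by": "automarker"}
    return {"score": None, "marked_by": "human rater (re-examination required)"}

print(route_response(auto_score=5.5, confidence=0.93))  # released automatically
print(route_response(auto_score=5.5, confidence=0.42))  # routed to a human
```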
Automated Feedback for Spontaneous Speech
(Under development)
The word "cheaper" has lower confidence => tell the speaker 'you may need to make improvement for the word cheaper'
Limitations of Automated Speaking Tests
- Unable to assess dialogic, interactive speech
- Narrowed language construct
- Automarker reliability is highly dependent on the quality of the training data
- Reduced accuracy when audio quality is poor (e.g. noise, multiple speakers)
- Less robust than humans at detecting malpractice (e.g. off-topic remarks, jokes)
(Khabbazbashi et al., 2021)
Assessing the Suitability of Automated Speaking Tests
1. What data has the automarker been trained on?
2. How is the test administered in practice?
3. What speaking tasks are used in the test? (e.g. read-aloud vs spontaneous)
4. What scoring features are extracted to inform a score? (e.g. grammar, pronunciation)
5. What is the potential for cheating on the test?
6. What is the impact of the test on language learning?
7. Is there a good fit between the purpose and stakes of the assessment and the test being used?