Wong, J. H. M., Zhang, H., & Chen, N. F. (2023). Modelling Inter-Rater Uncertainty in Spoken Language Assessment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 2886–2898. https://doi.org/10.1109/taslp.2023.3297958
In a subjective task such as Spoken Language Assessment (SLA), the reference scores provided by different human raters may vary. A collection of annotated scores from multiple raters can be interpreted as an expression of data uncertainty. Previous studies often treat SLA as a classification or regression task, training and evaluating models against scalar reference scores computed from the multiple raters' scores, for example by majority voting. However, a scalar representation may not adequately capture the uncertainty expressed by the multiple rater scores. This paper proposes to reformulate this subjective task as a distribution fitting problem, where the model should aim to emulate the uncertainty expressed by the multiple raters. Toward this aim, the model is trained and evaluated by computing a distance between the model's output posterior and the distribution of reference scores from the multiple raters. Different methods to infer a scalar score from the model's output posterior are also considered. This paper also proposes to improve the match between the model and the SLA task by interpreting the model's outputs as parameters of a beta density function, to capture both uncertainty and score monotonicity. Finally, ensemble combination is investigated and a novel combination method is proposed, to marginalise out model uncertainty from the combined output distribution. These approaches are evaluated on the speechocean762 dataset and an in-house Tamil dataset.
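The distribution-fitting idea in the abstract can be illustrated with a minimal sketch. The paper's exact distance measure, binning, and model architecture are not specified here, so the following is an assumed setup: two model outputs are interpreted as the (alpha, beta) parameters of a beta density over a normalised score in (0, 1), that density is discretised onto score bins, and a KL divergence (one illustrative choice of distance) is computed against the empirical histogram of multiple raters' scores. The beta mean alpha/(alpha+beta) serves as one possible way to infer a scalar score from the posterior.

```python
import math

def beta_pdf(x, alpha, beta):
    """Beta density; two model outputs (alpha, beta) are interpreted as
    parameters of a distribution over a normalised score in (0, 1)."""
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x) - log_b)

def rater_histogram(scores, n_bins, lo=0.0, hi=1.0):
    """Empirical distribution over discretised score bins from multiple raters."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int((s - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def beta_bin_probs(alpha, beta, n_bins):
    """Discretise the beta density onto the same bins (midpoint approximation)."""
    probs = [beta_pdf((i + 0.5) / n_bins, alpha, beta) / n_bins for i in range(n_bins)]
    z = sum(probs)  # renormalise the midpoint approximation
    return [p / z for p in probs]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions, with smoothing.
    KL is only one possible distance between the two distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example: three raters scored an utterance on a 0-1 scale.
raters = [0.6, 0.7, 0.7]
p_ref = rater_histogram(raters, n_bins=10)
p_model = beta_bin_probs(alpha=7.0, beta=3.0, n_bins=10)  # assumed model outputs
loss = kl_divergence(p_ref, p_model)  # training/evaluation distance

# One way to infer a scalar score from the posterior: the beta mean.
mean_score = 7.0 / (7.0 + 3.0)  # alpha / (alpha + beta) = 0.7
```

The loss above would drive the predicted beta distribution toward the raters' empirical score distribution rather than toward a single collapsed scalar, which is the reformulation the abstract describes.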
No specific funding was received for this research.