The Speak & Improve Corpus 2025
Paper: The Speak & Improve Corpus 2025
Description: The Speak & Improve Corpus 2025 contains open (spontaneous) speaking tests by second language (L2) learners of English at various proficiency levels and from across the globe. Speaking tests were performed on the Speak & Improve speaking practice platform, on which automated feedback and marking are provided.
The Corpus will enable research into a range of language learning tasks, including assessing speaking proficiency and providing feedback on grammatical errors in a learner's speech. Additionally, the Corpus will support research into the underlying technology required for these tasks, including automatic speech recognition (ASR) of low-resource L2 learner English, disfluency detection and spoken grammatical error correction (GEC).
We built the Corpus from speaking tests submitted by individual users of Speak & Improve, using the 4 open-speaking parts of the test. Each of these parts consists of 1 or more questions prompted by the web application, to which users respond in a natural, open-speaking style. There are around 315 hours of recordings (over 45,000 utterances) of L2 English learner speech from over 7,000 original test submissions.
All the data in the Corpus has been annotated with an indicative CEFR level of holistic speaking proficiency by non-operational examiners. In addition, a subset of the Corpus, around 55 hours, has been manually transcribed, including disfluencies and correction of language errors. The data has been arranged into individual training, development and evaluation sets.
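To give a feel for the scale described above, the following minimal sketch derives a few indicative ratios from the quoted figures. The constants are the rounded values from the description ("around 315 hours", "over 45,000 utterances", "over 7,000 submissions", "around 55 hours" transcribed), so the results are rough estimates only; the function name `corpus_summary` is hypothetical and not part of any Corpus tooling.

```python
# Rounded figures quoted in the Corpus description; all are approximate
# ("around" / "over"), so every derived value is indicative only.
TOTAL_HOURS = 315
TRANSCRIBED_HOURS = 55
UTTERANCES = 45_000
SUBMISSIONS = 7_000


def corpus_summary(total_hours, transcribed_hours, utterances, submissions):
    """Return indicative ratios derived from the headline Corpus figures."""
    return {
        # Mean recording length per utterance, in seconds.
        "mean_utterance_seconds": total_hours * 3600 / utterances,
        # Mean number of utterances in one test submission.
        "utterances_per_submission": utterances / submissions,
        # Share of the audio (by duration) that was manually transcribed.
        "transcribed_fraction": transcribed_hours / total_hours,
    }


summary = corpus_summary(TOTAL_HOURS, TRANSCRIBED_HOURS, UTTERANCES, SUBMISSIONS)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On these rounded inputs, an utterance lasts roughly 25 seconds on average, a submission contributes about 6 utterances, and a little over a sixth of the audio is manually transcribed.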
For the Speak & Improve Challenge 2025, the training and development sets are released initially. All 3 sets will be released with annotations in the post-Challenge release of the Corpus. Please bookmark this page and check again in future, or contact us to be notified when a new version of the Corpus is published.
Data security: Please be aware of the problems of leaking benchmark datasets to LLMs (e.g. Balloccu et al., EACL 2024). Please only use this Corpus with LLMs hosted locally (e.g. after download via Hugging Face Transformers) or, if using LLMs via commercial APIs, with no retention of data for training.
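One way to keep Corpus data on a local machine is sketched below. It assumes the Hugging Face `transformers` package is installed and that model weights were already downloaded to a local directory. The helper name `load_local_model` and the example path are hypothetical. Setting `HF_HUB_OFFLINE` and passing `local_files_only=True` only prevents the libraries from contacting the Hugging Face Hub; it is an illustration, not a complete data-security procedure.

```python
import os

# Ask the Hugging Face libraries not to contact the Hub at all.
# This must be set before the libraries are imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_local_model(model_dir: str):
    """Load a causal LM and its tokenizer strictly from local files.

    `model_dir` is a hypothetical path to weights downloaded beforehand.
    """
    # Imported lazily so the offline flag above is already in place.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    return tokenizer, model


# Example call (path is a placeholder, not a real model location):
# tokenizer, model = load_local_model("/path/to/local/model")
```

With this setup, loading fails with an error rather than fetching anything remotely if the files are missing, which makes accidental network use easier to spot.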
Publication date: 2025
Keywords: Cambridge University Press & Assessment, learners of English as a second language, L2 speech, non-native speech, automatic speech recognition, spoken grammar error correction, language assessment and feedback
Authors and Contributors: Cambridge University Press & Assessment (2025). The Speak & Improve Corpus 2025. See dataset release paper for contributors.
Citing this paper: Kate Knill, Diane Nicholls, Mark J.F. Gales, Mengjie Qian, Pawel Stroinski (2025). The Speak & Improve Corpus 2025. Cambridge University Press & Assessment. https://doi.org/10.17863/CAM.114333
@article{sicorpus25,
author = {Kate Knill and Diane Nicholls and Mark J.F. Gales and Mengjie Qian and Pawel Stroinski},
year = {2025},
title = {{The Speak \& Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback}},
publisher = {Cambridge University Press \& Assessment},
url = {https://doi.org/10.17863/CAM.114333}
}
You may publish the results of research using this dataset. In any such publication, you must acknowledge use of the dataset in your research by citing Cambridge University Press & Assessment and the Authors and Contributors as shown.
We ask you to inform us of any such publications by emailing: support@speakandimprove.com
Please report any issues or problems in downloading the dataset by emailing: support@speakandimprove.com
Challenge Licence Agreement
By downloading this dataset and licence, this licence agreement (the “Agreement”) is entered into, effective this date, between you (the “Licensee”), and the Chancellor, Masters and Scholars of the University of Cambridge acting through its department Cambridge University Press & Assessment (the “Licensor”).
1. The challenge is an international shared challenge run in connection with the ISCA SLaTE Workshop 2025 (the “Challenge”).
2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee, nor shall the Licensee have any rights in the dataset other than the right to use the dataset in accordance with this Agreement.
3. The Licensor hereby grants the Licensee a non-exclusive, non-transferable right to use the licensed dataset for the sole purpose of the Challenge for non-commercial research and educational purposes only. The Licensee shall not sub-licence or assign the benefit or burden of this Agreement in whole or in part and agrees to permanently delete the dataset and destroy any original and/or hard copies of the same no later than 24 August 2025.
4. The Licensee shall expressly acknowledge and reference the Licensor when making use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the relevant publication(s) mentioned next to the download link.
5. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
6. The Licensor grants the Licensee this right to use the licensed dataset "as is". Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever. The Licensor has no liability for any loss or damage whatsoever sustained by Licensee as a result of the availability or use of or reliance on the dataset.
The Licensor shall not be liable for any indirect or consequential loss or damage or for any loss of or corruption of data, loss of programs, profit or goodwill (whether direct or indirect) arising out of or in connection with the access, availability, use of or reliance on the dataset.
7. The Licensee shall indemnify and hold the Licensor harmless against any loss or damage which it may suffer or incur as a result of the Licensee’s breach of any terms of this Agreement.
8. This Agreement constitutes the entire agreement between the parties and supersedes any previous agreement between the parties relating to its subject-matter. Each party acknowledges and agrees that, in entering into this Agreement, it does not rely on, and shall have no remedy in respect of, any statement, representation, warranty or understanding (whether negligently or innocently made) other than as expressly set out in this Agreement.
9. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.
You may download the Public Release of the Speak & Improve Corpus dataset if you agree to the licence above. Publications using this dataset must acknowledge and reference Cambridge University Press & Assessment as the source of the data.