CUP&A Public Research Dataset Releases

The Write & Improve Corpus 2024

Paper: The Write & Improve Corpus 2024

Description: The Write & Improve Corpus 2024 contains essays written by second language (L2) learners of English at various proficiency levels. Essays were submitted to the Write & Improve essay practice platform, on which automated error feedback and marking are provided. The learner can then revise and resubmit their essay in order to request new error feedback and marks.

We built the Corpus with sets of essays submitted by individual users of Write & Improve in response to a given prompt. There are 5050 of these ‘user-prompt sets’ written by 766 learners of L2 English. The final versions of these essay sets amount to 762K word tokens. In total there are 23K essays, including the earlier versions in each essay set.
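As a rough guide to the shape of the Corpus, the figures above imply the following averages. This is only a back-of-the-envelope sketch: the token and essay counts are the rounded values quoted above (762K and 23K), the variable names are illustrative, and it assumes one final version per user-prompt set.

```python
# Figures as quoted in the description; the last two are rounded there.
user_prompt_sets = 5050
learners = 766
final_version_tokens = 762_000  # "762K", rounded
total_essays = 23_000           # "23K", rounded

# Assumes exactly one final version per user-prompt set.
sets_per_learner = user_prompt_sets / learners
tokens_per_final_essay = final_version_tokens / user_prompt_sets
versions_per_set = total_essays / user_prompt_sets

print(f"{sets_per_learner:.1f} user-prompt sets per learner")       # 6.6
print(f"{tokens_per_final_essay:.0f} word tokens per final essay")  # 151
print(f"{versions_per_set:.1f} essay versions per set")             # 4.6
```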

We have annotated the first and final version in each essay set with grammatical errors, and certified, though non-operational, examiners have labelled each of the final essay versions with an indicative CEFR label. In this first release of the Corpus, we release the final versions of the essays only, as parallel texts in original and corrected form. This is the format required for the MultiGEC-2025 shared task. In a future release of the Corpus we will include more data and the CEFR labels. Please bookmark this page and check again in future, or contact us to be notified when a new version of the Corpus is published.
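To illustrate what parallel original and corrected texts make possible, the sketch below recovers token-level edits from one sentence pair using Python's standard `difflib` module. The sentence pair is invented for illustration and is not taken from the Corpus, `token_edits` is a hypothetical helper, and whitespace tokenisation is an assumption. It is not the annotation scheme used to build the Corpus.

```python
from difflib import SequenceMatcher


def token_edits(original: str, corrected: str):
    """Return (operation, original_tokens, corrected_tokens) for each differing span."""
    src = original.split()  # assumption: whitespace tokenisation
    tgt = corrected.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sentence pair, NOT taken from the Corpus.
original = "She go to school yesterday ."
corrected = "She went to school yesterday ."

edits = token_edits(original, corrected)
print(edits)  # [('replace', ['go'], ['went'])]
```

A plain sequence alignment like this only locates the spans that differ; it does not assign grammatical error types.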

Data security: Please be aware of the problems of leaking benchmark datasets to LLMs (e.g. Balloccu et al, EACL 2024). Please only use this Corpus with LLMs hosted locally (e.g. after download from Hugging Face Transformers) or with no retention of data for training if using LLMs via commercial APIs.

Publication date: 2024

Keywords: Cambridge University Press & Assessment, Common European Framework of Reference for Languages, CEFR, learners of English as a second language, essay writing, Write & Improve

Authors and Contributors: Cambridge University Press & Assessment (2024). The Write & Improve Corpus 2024. See dataset release paper for contributors.

Citing this paper:  Diane Nicholls, Andrew Caines, Paula Buttery (2024). The Write & Improve Corpus 2024. Cambridge University Press & Assessment. https://doi.org/10.17863/CAM.112997

@article{wicorpus24,
  author = {Diane Nicholls and Andrew Caines and Paula Buttery},
  year = {2024},
  title = {The {W}rite \& {I}mprove {C}orpus 2024: Error-annotated and {CEFR}-labelled essays by learners of {E}nglish},
  publisher = {Cambridge University Press \& Assessment},
  url = {https://doi.org/10.17863/CAM.112997}
}

You may publish the results of research using this dataset.  In any such publication you must acknowledge use of the dataset in your research by citing Cambridge University Press & Assessment and the Authors and Contributors as shown. 

We ask you to inform us of any such publications by emailing: support@englishlanguageitutoring.com   

Please report any issues or problems in downloading the dataset by emailing: support@englishlanguageitutoring.com

 

Licence Agreement 

  1. By downloading this dataset and licence, this licence agreement (the “Agreement”) is entered into, effective this date, between you (the “Licensee”), and the Chancellor, Masters and Scholars of the University of Cambridge acting through its department Cambridge University Press & Assessment (the “Licensor”).

  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee, nor shall the Licensee have any rights in the dataset other than the right to use the dataset in accordance with this Agreement.

  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes only. The Licensee shall not sub-licence or assign the benefit or burden of this Agreement in whole or in part.

  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.

  5. The Licensee shall expressly acknowledge and reference the Licensor when making use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the paper at the top of the dataset details page.

  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.

  7. The Licensor grants the Licensee this right to use the licensed dataset "as is". Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever. The Licensor has no liability for any loss or damage whatsoever sustained by Licensee as a result of the availability or use of or reliance on the dataset.

  8. The Licensor shall not be liable for any indirect or consequential loss or damage or for any loss of or corruption of data, loss of programs, profit or goodwill (whether direct or indirect) arising out of or in connection with the access, availability, use of or reliance on the dataset.

  9. The Licensee shall indemnify and hold the Licensor harmless against any loss or damage which it may suffer or incur as a result of the Licensee’s breach of any terms of this Agreement.

  10. This Agreement constitutes the entire agreement between the parties and supersedes any previous agreement between the parties relating to its subject-matter. Each party acknowledges and agrees that, in entering into this Agreement, it does not rely on, and shall have no remedy in respect of, any statement, representation, warranty or understanding (whether negligently or innocently made) other than as expressly set out in this Agreement.

  11. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.

 

You may download this dataset if you agree to the licence terms above and complete the following registration form.  Publications using this dataset must acknowledge and reference Cambridge University Press & Assessment as the source of the data.

Registration form

Name
Title