The 100 idiolects project

Uploaded: 2021-09-08
Languages: English
Collected from: 2020
To: 2021
Access category: Restricted


A multichannel corpus of 100 English speakers

Subject keywords: multimodal corpus, idiolect, genre analysis
Data types: Written, Spoken - transcript
Funders: Research England (UKRI)
Associated AIFL centres: Centre for Forensic Text Analysis (FTA)
License: N/A


This project collects language data from more than 100 individuals (undergraduate students at Aston University) in different media and contexts, and for different purposes and audiences. In total, 112 individuals provided samples of the following discourse types: oral interview, university essay, email, text message, image description with speech-to-text software, and business memo. In addition, 66 of the 112 individuals provided a handwritten text. All interviews and image description recordings were transcribed by the same person for consistency. The transcripts are fully verbatim and include hesitations, false starts, non-standard grammar, etc. All data use XML tags for identifying information, so that context is retained.

In addition to the full version of the 100 idiolects corpus, a limited version is also available. It uses only four of the seven discourse types: essays, emails, text messages, and business memos. This second version was created specifically for the PAN22 authorship verification task and is split into a training (calibration) dataset and a test dataset. For more details, please see

When requesting access to the corpus, please specify which version you are interested in.

Data Donors


Information: This dataset contains sensitive material or data that come from a third party and have some constraints on access and use. Users who wish to access this dataset must make a detailed application to FoLD and the researcher, as well as potentially gain additional agreement from an external organisation before they can be approved for access.

