The 100 idiolects project
A multichannel corpus of 100 English speakers
Data types: Written, Spoken - transcript
Funders: Research England (UKRI)
Associated AIFL centres: Centre for Forensic Text Analysis (FTA)
This project uses language data from more than 100 individuals, all undergraduate students at Aston University, produced in different media and contexts and for different purposes and audiences. In total, 112 individuals provided samples of the following discourse types: oral interview, university essay, email, text message, image description with speech-to-text software, and business memo. In addition, 66 of the 112 individuals provided a handwritten text. All interviews and image description recordings were transcribed by the same person for consistency. The transcripts are fully verbatim and include hesitations, false starts, non-standard grammar, etc. In all data, identifying information is marked with XML tags so that the context is retained.
In addition to the full version of the 100 idiolects corpus, a limited version is also available. It uses only four of the seven discourse types: essays, emails, text messages, and business memos. This second version was created specifically for the PAN22 authorship verification task and is split into a training (calibration) dataset and a test dataset. For more details, please see https://pan.webis.de/clef22/pan22-web/author-identification.html. When requesting access to the corpus, please specify which version you are interested in.