Wykop authorship corpus

Uploaded: 2021-09-08
Languages: Polish
Collected from: 2021
Collected to: 2021
Access category: Restricted
Email: Not available


Polish Reddit-like forum

Subject keywords: corpus, social media, Polish
Data types: Written
Funders: Research England (UKRI)
Associated AIFL centres: Centre for Forensic Text Analysis (FTA)
License: N/A


This dataset contains 12,730,802 written posts collected from Wykop, a Polish website structured similarly to Reddit. The posts were written by 143,516 different users of the website. The final corpus amounts to more than 345 million tokens. The data was collected and structured as an authorship analysis research corpus by the Aston Institute for Forensic Linguistics.



Information: This dataset contains sensitive material or data that come from a third party and have some constraints on access and use. Users who wish to access this dataset must make a detailed application to FoLD and the researcher, as well as potentially gain additional agreement from an external organisation before they can be approved for access.
