Catalogue of abusive language training data

Uploaded: 2022-02-09
Languages: English, Arabic, Bengali, Croatian, Danish, Estonian, Greek, Hindi, Indonesian, Latvian, Portuguese, Polish, Russian, Slovene, Ukrainian, French, Chinese, German, Spanish, Turkish
Collected from: 2020
To: 2022
Access category: External


A multilingual catalogue of large annotated datasets containing hate speech and abusive or offensive language (as defined by the authors).

Subject keywords: NLP, abusive language, online language
Data types: Written
Funders: N/A
Associated AIFL centres: Forensic Linguistic Databank (FoLD)
License: Non-Commercial Government Licence for public sector information


The resource catalogues datasets annotated for hate speech, online abuse, and offensive language in many languages. These datasets are intended for computational purposes, e.g. training a natural language processing system, but may be useful to other researchers as well.

Associated publication: Vidgen B, Derczynski L (2020) Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLoS ONE 15(12): e0243300.

Data Owners


The data is stored externally. Please follow the link below for access.