Description of the dataset Greek dialect corpus v1.0:

A collection of raw text from various Greek dialects. It contains data from the following dialects:

  • Cypriot Greek
  • Cretan Greek
  • Pontic Greek
  • Northern Greek
  • Part of the Modern Greek Wikipedia

The repository contains data collected from the web and other textual resources (blogs, websites, and theatrical plays, among others). The folder SMG_CG contains Twitter data in Standard Modern Greek and Cypriot Greek, originally collected by Hanna Sababa for her project A Classifier to Distinguish Between Cypriot Greek and Standard Modern Greek. We thank Mr Sfakianakis for providing his Cretan translations of a number of Ancient Greek tragedies and comedies. The folder all_dialects contains a zip file with all collected data, with minimal pre-processing and annotation for the respective dialect.

This dataset was created by Chatzikyriakidis, S., Kolokousis, I., Koula, C., Papadakis, D., & Sakellariou, E.

Download from Zenodo.

Suggested citation:

Stergios Chatzikyriakidis et al. (2024). StergiosChatzikyriakidis/Greek_dialect_corpus: v1.0 (v1.0) [Data set]. International Conference on Greek Linguistics (ICGL), Thessaloniki. Zenodo. https://doi.org/10.5281/zenodo.12704721

CC BY-NC-ND

This license enables reusers to copy and distribute the material in any medium or format in unadapted form only, 
for noncommercial purposes only, and only so long as attribution is given to the creator. 
CC BY-NC-ND includes the following elements:

 BY: credit must be given to the creator.
 NC: Only noncommercial uses of the work are permitted.
 ND: No derivatives or adaptations of the work are permitted.