Multilingual News Corpus

Stats

Download

1. Get Access to the Corpus

Since the news articles themselves are copyrighted by their original publishers, this corpus cannot be made publicly available. If you want access, send a short email to griesshaber@hdm-stuttgart.de.

2. I already have access

Download the newest release here

Classes

The corpus is split into the following classes:
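| Label         | Documents (EN) | Documents (DE) |
|---------------|---------------:|---------------:|
| Ausland       |          18460 |          28033 |
| Uncategorized |         330991 |         577217 |
| Ignore        |             51 |          13684 |
| Politik       |          10133 |          72905 |
| Wirtschaft    |           3355 |          56768 |
| Technologie   |           3675 |          30890 |
| Aktuell       |           2566 |           2383 |
| Finanzen      |           1921 |           3340 |
| Sport         |           4076 |          24932 |
| Lokal         |           3039 |          19251 |
| Lifestyle     |          15364 |          24655 |
| Kultur        |           1957 |          12079 |
| Sonstiges     |          27030 |          56224 |
| Total         |         421349 |         922361 |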
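
The class sizes are very uneven, which matters when the labels are used for training a classifier. As a rough sketch, the share of each class per language can be computed from the counts listed above. The names `CLASS_COUNTS` and `class_shares` are made up for this illustration and are not part of the corpus or its tooling; the shares are normalized by the sum of the per-class counts in each column.

```python
# Per-class document counts from the table above, as (EN, DE) pairs.
# The dictionary and function names are illustrative, not part of the corpus.
CLASS_COUNTS = {
    "Ausland": (18460, 28033),
    "Uncategorized": (330991, 577217),
    "Ignore": (51, 13684),
    "Politik": (10133, 72905),
    "Wirtschaft": (3355, 56768),
    "Technologie": (3675, 30890),
    "Aktuell": (2566, 2383),
    "Finanzen": (1921, 3340),
    "Sport": (4076, 24932),
    "Lokal": (3039, 19251),
    "Lifestyle": (15364, 24655),
    "Kultur": (1957, 12079),
    "Sonstiges": (27030, 56224),
}


def class_shares(counts, column):
    """Return each label's fraction of the documents in one column (0 = EN, 1 = DE)."""
    total = sum(row[column] for row in counts.values())
    return {label: row[column] / total for label, row in counts.items()}


en_shares = class_shares(CLASS_COUNTS, 0)
de_shares = class_shares(CLASS_COUNTS, 1)

# Print the classes ordered by their share of the German documents.
for label in sorted(CLASS_COUNTS, key=lambda name: de_shares[name], reverse=True):
    print(f"{label:<14}{en_shares[label]:>8.1%}{de_shares[label]:>8.1%}")
```

In both languages, Uncategorized makes up well over half of the documents, so a filtering or re-weighting step is likely needed before supervised training.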

Metainformation

The corpus was built by crawling German and English news sites. Article content and metadata were extracted with unfluff, so all metainformation is available exactly as unfluff extracted it.
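
To illustrate what working with this metainformation might look like, here is a minimal sketch. It assumes that each article is stored as a JSON object whose keys follow the field names unfluff documents for its output (such as `title`, `text`, `lang`, `date`, and `author`). The actual storage format of the release is not described here, so the sample record and the `summarize_article` helper are hypothetical.

```python
import json

# Hypothetical record: the keys follow unfluff's documented output fields,
# but the values and the JSON storage format are assumptions for illustration.
SAMPLE_RECORD = json.dumps({
    "title": "Example headline",
    "lang": "de",
    "date": "2017-01-01",
    "author": ["Jane Doe"],
    "text": "Dies ist ein Beispieltext.",
})


def summarize_article(raw_json):
    """Parse one article record and return a few basic facts about it."""
    record = json.loads(raw_json)
    # Extraction can leave fields empty or missing, so fall back to safe defaults.
    text = record.get("text") or ""
    authors = record.get("author") or []
    return {
        "title": record.get("title"),
        "lang": record.get("lang"),
        "date": record.get("date"),
        "n_authors": len(authors),
        "n_tokens": len(text.split()),  # whitespace tokens, a rough length measure
    }


summary = summarize_article(SAMPLE_RECORD)
print(summary)
```

Because the fields come straight from an automatic extractor, it is safer to treat every field as optional, as the helper does, than to assume each record is complete.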

Citing

If you find this corpus helpful in your research, consider citing this work:

@misc{Griesshaber2017,
  author = {Grie{\ss}haber, Daniel},
  title = {Multilingual News Corpus},
  year = {2017},
  publisher = {GitLab HdM Stuttgart},
  journal = {Git repository},
  howpublished = {\url{https://gitlab.mi.hdm-stuttgart.de/griesshaber/german-news-corpus}}
}
      

Author

Daniel Grießhaber (Twitter/GitHub).

Template

Solo by chibicode