Multilingual News Corpus

Stats

Download

1. Get Access to the Corpus

Since the news articles themselves are copyrighted by their original publishers, access to this corpus cannot be made public. If you want access, send a short email to griesshaber@hdm-stuttgart.de.

2. I already have access

Download the newest release here

Classes

The corpus is split into the following classes:
Label          Documents (EN)   Documents (DE)
Wirtschaft               2519            53769
Lokal                    2309            18037
Finanzen                  146            13339
Ausland                 16972            25455
Kultur                   1583            11714
Ignore                    399             3488
Aktuell                  2319             1253
Politik                  8822            72414
Technologie              3280            25956
Sport                    3540            24704
Lifestyle               13637            19476
Sonstiges               24180            53047
Total                   79706           322652
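
The class sizes differ considerably, which matters when the labels are used for training a classifier. The following is a minimal sketch that recomputes the totals and the per-class share for each language from the counts listed above. The names `CLASS_COUNTS`, `totals` and `shares` are made up for this illustration and are not part of the corpus release.

```python
# Document counts per class as (EN, DE), copied from the class table.
CLASS_COUNTS = {
    "Wirtschaft": (2519, 53769),
    "Lokal": (2309, 18037),
    "Finanzen": (146, 13339),
    "Ausland": (16972, 25455),
    "Kultur": (1583, 11714),
    "Ignore": (399, 3488),
    "Aktuell": (2319, 1253),
    "Politik": (8822, 72414),
    "Technologie": (3280, 25956),
    "Sport": (3540, 24704),
    "Lifestyle": (13637, 19476),
    "Sonstiges": (24180, 53047),
}

EN, DE = 0, 1  # column indices into the (EN, DE) tuples


def totals(counts):
    """Return the total number of documents as (EN, DE)."""
    en_total = sum(pair[EN] for pair in counts.values())
    de_total = sum(pair[DE] for pair in counts.values())
    return en_total, de_total


def shares(counts, column):
    """Return each class's fraction of all documents in one language column."""
    total = totals(counts)[column]
    return {label: pair[column] / total for label, pair in counts.items()}


if __name__ == "__main__":
    print("Totals (EN, DE):", totals(CLASS_COUNTS))
    # List the German classes from largest to smallest share.
    ranked = sorted(shares(CLASS_COUNTS, DE).items(), key=lambda kv: -kv[1])
    for label, share in ranked:
        print(f"{label:<12} {share:6.1%}")
```

The recomputed totals agree with the Total row of the table (79706 English and 322652 German documents).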

Metainformation

The corpus was built by crawling German and English news sites. unfluff was used to extract the article content, so all metainformation is available exactly as unfluff extracted it.
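
As a rough illustration of working with this metainformation, the sketch below parses one document into a small record type. It rests on assumptions that this page does not confirm: that each document is stored as a JSON object, and that the keys follow the field names unfluff documents for its output (`title`, `text`, `lang`, `date`, `author`, `publisher`, `canonicalLink`). The `Article` class and `parse_article` function are hypothetical helpers; check the actual release for the real layout.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical record holding a subset of unfluff-style fields."""
    title: str
    text: str
    lang: Optional[str] = None
    date: Optional[str] = None
    author: List[str] = field(default_factory=list)
    publisher: Optional[str] = None
    canonical_link: Optional[str] = None


def parse_article(raw: str) -> Article:
    """Parse one JSON-encoded document (assumed format) into an Article."""
    data = json.loads(raw)
    return Article(
        title=data.get("title", ""),
        text=data.get("text", ""),
        lang=data.get("lang"),
        date=data.get("date"),
        # Missing or null author lists become an empty list.
        author=list(data.get("author") or []),
        publisher=data.get("publisher"),
        canonical_link=data.get("canonicalLink"),
    )


if __name__ == "__main__":
    # Made-up sample document, not taken from the corpus.
    sample = '{"title": "Beispiel", "text": "Ein kurzer Text.", "lang": "de"}'
    print(parse_article(sample))
```

Fields that the extractor could not find are left as `None` or empty here, so downstream code should not assume that every document carries a date, author or publisher.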

Citing

If you find this corpus helpful in your research, consider citing this work:

@misc{Griesshaber2017,
  author = {Grie{\ss}haber, Daniel},
  title = {Multilingual News Corpus},
  year = {2017},
  publisher = {GitLab HdM Stuttgart},
  journal = {Git repository},
  howpublished = {\url{https://gitlab.mi.hdm-stuttgart.de/griesshaber/german-news-corpus}}
}
      

Author

Daniel Grießhaber (Twitter/GitHub).

Template

Solo by chibicode