Multilingual News Corpus

Download

1. Get Access to the Corpus

Because the news articles themselves are copyrighted by their original publishers, this corpus cannot be made publicly available. To request access, send a short email to griesshaber@hdm-stuttgart.de.

2. I already have access

Download the newest release here

Classes

The corpus is split into the following classes:
Label          Documents (EN)   Documents (DE)
Kultur                   1766            11829
Ignore                    454             3592
Lokal                    2663            18561
Lifestyle               14431            19483
Sonstiges               25519            54460
Politik                  9405            72712
Aktuell                  2439             1319
Wirtschaft               2901            54881
Sport                    3789            24800
Technologie              3463            27320
Ausland                 17671            26686
Finanzen                  168            13339
Total                   84669           328982
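
The class sizes are quite uneven, which matters when the labels are used for classification. The following is a minimal sketch that copies the counts from the table above into a dictionary, checks them against the Total row, and computes each class's share per language. The names COUNTS, totals and shares are made up for this illustration and are not part of the corpus release.

```python
# Per-class document counts copied from the table above: label -> (EN, DE).
COUNTS = {
    "Kultur": (1766, 11829),
    "Ignore": (454, 3592),
    "Lokal": (2663, 18561),
    "Lifestyle": (14431, 19483),
    "Sonstiges": (25519, 54460),
    "Politik": (9405, 72712),
    "Aktuell": (2439, 1319),
    "Wirtschaft": (2901, 54881),
    "Sport": (3789, 24800),
    "Technologie": (3463, 27320),
    "Ausland": (17671, 26686),
    "Finanzen": (168, 13339),
}

EN, DE = 0, 1  # column indices into the (EN, DE) tuples


def totals(counts):
    """Return the (EN, DE) document totals over all classes."""
    en_total = sum(row[EN] for row in counts.values())
    de_total = sum(row[DE] for row in counts.values())
    return en_total, de_total


def shares(counts, column):
    """Return each class's fraction of the documents in one language column."""
    total = sum(row[column] for row in counts.values())
    return {label: row[column] / total for label, row in counts.items()}


if __name__ == "__main__":
    print("Totals (EN, DE):", totals(COUNTS))
    for label, share in sorted(shares(COUNTS, DE).items(), key=lambda kv: -kv[1]):
        print(f"{label:<12} {share:6.1%} of German documents")
```

Shares like these can be used, for example, to derive class weights or to decide whether very small classes such as the English Finanzen documents should be merged or dropped for a given experiment.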

Metainformation

The corpus was built by crawling German and English news sites. Article content was extracted with unfluff, so all metainformation is provided exactly as unfluff extracted it.
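
As a rough illustration of working with this metainformation, the sketch below assumes that a document is available as a JSON object whose keys match fields that unfluff typically emits (title, author, date, lang, text, tags, canonicalLink). The on-disk format of the corpus is not described here, so the SAMPLE record and the load_document helper are hypothetical and may need adapting to the actual release.

```python
import json

# Hypothetical record: the keys mirror unfluff-style output fields, the values are invented.
SAMPLE = """
{
  "title": "Example headline",
  "author": ["Jane Doe"],
  "date": "2017-01-01",
  "lang": "en",
  "text": "Body text of the article.",
  "tags": [],
  "canonicalLink": "https://example.com/article"
}
"""


def load_document(raw):
    """Parse one JSON document and keep a few fields, tolerating missing keys."""
    doc = json.loads(raw)
    return {
        "title": doc.get("title", ""),
        "lang": doc.get("lang"),          # None if no language was detected
        "text": doc.get("text", ""),
        "authors": doc.get("author") or [],
    }


if __name__ == "__main__":
    print(load_document(SAMPLE))
```

Reading fields with defaults is a deliberate choice here: automatically extracted metadata is often incomplete, so individual documents may lack an author, a date, or a detected language.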

Citing

If you find this corpus helpful in your research, consider citing this work:

@misc{Griesshaber2017,
  author = {Grie{\ss}haber, Daniel},
  title = {Multilingual News Corpus},
  year = {2017},
  publisher = {GitLab HdM Stuttgart},
  journal = {Git repository},
  howpublished = {\url{https://gitlab.mi.hdm-stuttgart.de/griesshaber/german-news-corpus}}
}
      

Author

Daniel Grießhaber (Twitter/GitHub).

Template

Solo by chibicode