Multilingual News Corpus

Download

1. Get Access to the Corpus

Since the news articles themselves remain under the copyright of their original publishers, this corpus cannot be made publicly available. To request access, send a short email to griesshaber@hdm-stuttgart.de.

2. I already have access

Download the newest release here

Classes

The corpus is split into the following classes:
Label          Documents (EN)   Documents (DE)
Technologie              2881            22658
Lokal                    1613            16891
Aktuell                  1956             1134
Wirtschaft               1857            50904
Ausland                 15604            23121
Finanzen                   97            13339
Ignore                    344             3258
Kultur                   1214            11259
Sport                    3062            24024
Politik                  7725            70901
Sonstiges               21580            49229
Lifestyle               12155            19136
Total                   70088           305854
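The class sizes differ widely, both between classes and between the two languages, which matters when the labels are used for classification. The following is a minimal Python sketch that recomputes the totals and the per-class shares from the counts listed above. The names CLASS_COUNTS, totals and class_shares are illustrative only and are not part of the corpus release; the EN/DE split of each count is the one that adds up to the stated totals.

```python
# Document counts per class, copied from the class listing: (EN, DE).
CLASS_COUNTS = {
    "Technologie": (2881, 22658),
    "Lokal": (1613, 16891),
    "Aktuell": (1956, 1134),
    "Wirtschaft": (1857, 50904),
    "Ausland": (15604, 23121),
    "Finanzen": (97, 13339),
    "Ignore": (344, 3258),
    "Kultur": (1214, 11259),
    "Sport": (3062, 24024),
    "Politik": (7725, 70901),
    "Sonstiges": (21580, 49229),
    "Lifestyle": (12155, 19136),
}


def totals(counts):
    """Return the (EN, DE) document totals over all classes."""
    en = sum(pair[0] for pair in counts.values())
    de = sum(pair[1] for pair in counts.values())
    return en, de


def class_shares(counts, lang_index):
    """Return each class's fraction of one language's documents (0 = EN, 1 = DE)."""
    total = totals(counts)[lang_index]
    return {label: pair[lang_index] / total for label, pair in counts.items()}


if __name__ == "__main__":
    en_total, de_total = totals(CLASS_COUNTS)
    print(f"EN: {en_total} documents, DE: {de_total} documents")
    # List the English classes from largest to smallest share.
    ranked = sorted(class_shares(CLASS_COUNTS, 0).items(), key=lambda kv: -kv[1])
    for label, share in ranked:
        print(f"{label:<12} {share:6.1%}")
```

Shares like these can serve as a starting point for class weights or for stratified sampling when splitting the corpus into training and test sets.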

Metainformation

The corpus was built by crawling German and English news sites. Article content and metadata were extracted with unfluff, so every document carries its metadata exactly as unfluff produced it.
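As a rough illustration of working with unfluff-style metadata, the Python sketch below picks a few of the fields that unfluff returns for a parsed article (title, date, author, publisher, lang, description, canonicalLink, tags, text). How the corpus stores its documents on disk is not described here, so the JSON record format, the function name summarize_record and the sample record are all assumptions made for this example.

```python
import json

# A subset of the field names unfluff returns for a parsed article.
# ASSUMPTION: the on-disk format of the corpus is not documented here;
# one JSON object per document is used purely for illustration.
UNFLUFF_FIELDS = (
    "title", "date", "author", "publisher", "lang",
    "description", "canonicalLink", "tags", "text",
)


def summarize_record(raw_json):
    """Parse one hypothetical JSON record and keep only the listed fields.

    Fields missing from the record are returned as None.
    """
    record = json.loads(raw_json)
    return {field: record.get(field) for field in UNFLUFF_FIELDS}


# Hypothetical record, made up for this example (not taken from the corpus).
example = json.dumps({
    "title": "Example headline",
    "lang": "de",
    "author": ["Jane Doe"],
    "text": "Example body text.",
})

summary = summarize_record(example)
print(summary["title"], summary["lang"], summary["author"])
```

Using record.get means that documents for which unfluff could not extract a field, such as a missing publication date, do not raise an error.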

Citing

If you find this corpus helpful in your research, consider citing this work:

@misc{Griesshaber2017,
  author = {Grie{\ss}haber, Daniel},
  title = {Multilingual News Corpus},
  year = {2017},
  publisher = {GitLab HdM Stuttgart},
  journal = {Git repository},
  howpublished = {\url{https://gitlab.mi.hdm-stuttgart.de/griesshaber/german-news-corpus}}
}

Author

Daniel Grießhaber (Twitter/GitHub).

Template

Solo by chibicode