Multilingual News Corpus

Stats

Download

1. Get Access to the Corpus

Since the news articles themselves are copyrighted by their original publishers, this corpus cannot be made publicly available. To request access, send a short email to griesshaber@hdm-stuttgart.de.

2. I already have access

Download the newest release here

Classes

The corpus is split into the following classes:
Label          Documents (EN)   Documents (DE)
Politik                  8428            72165
Sport                    3379            24653
Wirtschaft               2258            52975
Finanzen                  133            13339
Aktuell                  2234             1227
Ignore                    365             3427
Lokal                    2048            17644
Ausland                 16516            24613
Technologie              3124            24904
Kultur                   1452            11613
Sonstiges               23206            51916
Lifestyle               13119            19458
Total                   76262           317934
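
The class sizes differ widely, which is worth knowing before training or evaluating on this corpus. The following is a minimal Python sketch that turns the counts listed above into per-class shares for each language. The names COUNTS and class_shares are made up for this illustration and are not part of the corpus or its tooling.

```python
# Document counts per class, copied from the class table: (EN, DE).
COUNTS = {
    "Politik": (8428, 72165),
    "Sport": (3379, 24653),
    "Wirtschaft": (2258, 52975),
    "Finanzen": (133, 13339),
    "Aktuell": (2234, 1227),
    "Ignore": (365, 3427),
    "Lokal": (2048, 17644),
    "Ausland": (16516, 24613),
    "Technologie": (3124, 24904),
    "Kultur": (1452, 11613),
    "Sonstiges": (23206, 51916),
    "Lifestyle": (13119, 19458),
}


def class_shares(counts, column):
    """Return {label: fraction of all documents} for column 0 (EN) or 1 (DE)."""
    total = sum(pair[column] for pair in counts.values())
    return {label: pair[column] / total for label, pair in counts.items()}


if __name__ == "__main__":
    en = class_shares(COUNTS, 0)
    de = class_shares(COUNTS, 1)
    for label in COUNTS:
        print(f"{label:<12} EN {en[label]:6.1%}   DE {de[label]:6.1%}")
```

The per-column sums of COUNTS match the Total row (76262 English and 317934 German documents), which is a quick way to check that the counts were copied correctly.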

Metainformation

The corpus was built by crawling German and English news sites. The article content and metadata were extracted with unfluff, so all metainformation is available exactly as unfluff extracted it.
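
As a rough illustration of what this metainformation looks like, the Python sketch below reads a single record and picks out a few fields. The field names (title, date, author, publisher, lang, text, tags, canonicalLink) follow unfluff's documented output. Everything else is an assumption: the on-disk format of the corpus is not described here, so the sketch simply assumes one JSON object per article, and the sample values are placeholders rather than corpus data. The helper summarize is hypothetical.

```python
import json

# Hypothetical record: field names follow unfluff's output format,
# the values are placeholders and NOT taken from the corpus.
SAMPLE = """{
  "title": "Example headline",
  "date": "2017-01-01",
  "author": ["Jane Doe"],
  "publisher": "Example News",
  "lang": "en",
  "text": "Body of the article ...",
  "tags": ["example"],
  "canonicalLink": "https://news.example/article"
}"""


def summarize(record):
    """Pick a few unfluff fields, tolerating fields the extractor left empty."""
    return {
        "title": record.get("title") or "",
        "lang": record.get("lang") or "unknown",
        "authors": record.get("author") or [],
        "n_chars": len(record.get("text") or ""),
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```

Reading fields with .get and a fallback is a deliberate choice: an extractor cannot always find an author or a date on a news page, so a consumer should not assume every field is filled.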

Citing

If you find this corpus helpful in your research, consider citing this work:

@misc{Griesshaber2017,
  author = {Grie{\ss}haber, Daniel},
  title = {Multilingual News Corpus},
  year = {2017},
  publisher = {GitLab HdM Stuttgart},
  journal = {Git repository},
  howpublished = {\url{https://gitlab.mi.hdm-stuttgart.de/griesshaber/german-news-corpus}}
}
      

Author

Daniel Grießhaber (Twitter/GitHub).

Template

Solo by chibicode