WikiReading Dataset

Extract information from Wikipedia articles (WikiReading dataset repackaged).

This is the dataset described in:

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL), pages 1535–1545.

Here, the WikiReading dataset was reformatted as TSV files; no other changes were made.

Some files are very large. Run the get-data.sh script to download all the data. (Note that the large files are not needed for the evaluation itself.)

File format

All files are given in the TSV format (for the output and expected files, the TSV extension is kept only for compatibility with other challenges; they actually contain just a single column).

Each line in the in.tsv.xz files represents a single instance and consists of the following tab-separated fields (see the sketch after this list):

  • the label whose value is to be predicted,
  • MD5 sum of the article,
  • contents of the Wikipedia article.
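
To illustrate, here is a minimal Python sketch of parsing such a line, assuming the three tab-separated fields listed above; the helper name parse_instance is made up for illustration:

```python
import lzma

def parse_instance(line):
    # Tab-separated fields: the label whose value is to be predicted,
    # the MD5 sum of the article, and the article contents.
    label, md5sum, article = line.rstrip("\n").split("\t", 2)
    return label, md5sum, article

# Read the compressed training input line by line.
with lzma.open("train/in.tsv.xz", mode="rt", encoding="utf-8") as f:
    for line in f:
        label, md5sum, article = parse_instance(line)
        # ... use (label, article) here ...
```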

The output to be generated (and the expected output) is the name of the label, a colon, and the predicted value with spaces replaced with underscores. You can give more than one value (separate them with spaces).
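
As an illustration, a minimal Python sketch of producing such an output line; the label and value shown are invented, and multiple values are assumed here to share a single label prefix:

```python
def format_output(label, values):
    # One line of output: the label name, a colon, then the predicted
    # value(s) with spaces replaced by underscores; multiple values are
    # separated with spaces (assumed here to share one "label:" prefix).
    return label + ":" + " ".join(v.replace(" ", "_") for v in values)

print(format_output("country", ["United States of America"]))
# -> country:United_States_of_America
```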

Directory structure

  • README.md — this file
  • config.txt — configuration file
  • get-data.sh — script for downloading large files
  • train/ — directory with training data
  • train/in.tsv.xz — train set (input)
  • train/expected.tsv.xz — train set (output)
  • dev-0/ — directory with dev (test) data
  • dev-0/in.tsv.xz — input data for the dev set
  • dev-0/expected.tsv.xz — expected (reference) data for the dev set
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set