arXiv tables challenge

Key information extraction for scientific tables. The task is to guess the <mask> token in a text, based on table images and the surrounding text context.

The dataset is based on arXiv documents.

Note that images are stored with git-annex. To get the file contents, run:

./get-annexed-files.sh

Metrics

The task will be evaluated using the following metrics:

  • Accuracy (main metric) — ...

Evaluation

Once you have generated out.tsv files (in the same format as the expected.tsv files), you can evaluate them with GEval:

wget https://gonito.net/get/bin/geval
chmod u+x geval
./geval -t dev-0
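For a quick sanity check before running GEval, the sketch below writes an out.tsv file and scores it against expected.tsv. It assumes that Accuracy means the fraction of lines that match the reference exactly and that each prediction is already one complete output row; GEval remains the authoritative scorer. The helper names `write_out_tsv`, `read_lines` and `exact_match_accuracy` are hypothetical and not part of this repository.

```python
def write_out_tsv(path, predictions):
    """Write one prediction per line, without any quoting."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for prediction in predictions:
            text = str(prediction)
            # A raw newline or carriage return would split the row in two.
            if "\n" in text or "\r" in text:
                raise ValueError(f"prediction contains a line break: {text!r}")
            handle.write(text + "\n")


def read_lines(path):
    """Read a file as a list of lines, splitting on '\\n' only."""
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\n") for line in handle]


def exact_match_accuracy(expected_path, out_path):
    """Fraction of lines in out_path identical to those in expected_path.

    Assumption: this mirrors the Accuracy metric as plain exact match.
    """
    expected = read_lines(expected_path)
    predicted = read_lines(out_path)
    if len(expected) != len(predicted):
        raise ValueError(
            f"line count mismatch: {len(expected)} expected, "
            f"{len(predicted)} predicted"
        )
    if not expected:
        return 0.0
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)
```

A call such as `exact_match_accuracy("dev-0/expected.tsv", "dev-0/out.tsv")` would then give a rough local estimate to compare with the GEval result.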

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • images/ — images to be processed, referenced in TSV files
  • in-header.tsv — one-line TSV file with column names for input data
  • out-header.tsv — one-line TSV file with column names for the output data
  • train/ — directory with hand-annotated gold-standard OCR train data
  • train/in.tsv — input data for the train set
  • train/expected.tsv — expected (reference) data for the train set
  • dev-0/ — directory with hand-annotated gold-standard OCR dev data
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with hand-annotated gold-standard OCR test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)

Note that we mean TSV, not CSV files. In particular, double quotes are not considered special characters here, so set quoting to QUOTE_NONE when using the Python csv module:


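    import csv

    # QUOTE_NONE makes the reader treat double quotes as ordinary characters.
    with open('file.tsv', 'r', encoding='utf-8', newline='') as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        for item in reader:
            pass  # item is a list with the column values of one row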
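The same reader settings can be combined with the header files to get named columns. The sketch below pairs the names from a one-line header file such as in-header.tsv with each row of a data file such as train/in.tsv. The function name `read_split` is hypothetical, and the sketch assumes that every row has as many fields as the header has names.

```python
import csv


def read_split(header_path, data_path):
    """Yield one dict per data row, keyed by the names in the header file."""
    with open(header_path, encoding="utf-8", newline="") as header_file:
        header_reader = csv.reader(
            header_file, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        columns = next(header_reader)

    with open(data_path, encoding="utf-8", newline="") as data_file:
        reader = csv.reader(data_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Double quotes stay in the values untouched because of QUOTE_NONE.
            yield dict(zip(columns, row))
```

For example, `for record in read_split("in-header.tsv", "train/in.tsv"): ...` would iterate over the train set with each field addressable by its column name.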

Metadata

Tags: eng, document-understanding