Arxiv tables challenge
Key information extraction for scientific tables.
Guess the <mask> token in the text, based on table images and the surrounding text context.
The dataset is based on arXiv documents.
Note that images are stored with git-annex. To get the file contents, run:
./get-annexed-files.sh
Metrics
The task will be evaluated using the following metrics:
Accuracy (main metric) — ...
Evaluation
You can carry out the evaluation with GEval once you have generated out.tsv files (in the same format as the expected.tsv files):
wget https://gonito.net/get/bin/geval
chmod u+x geval
./geval -t dev-0
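For a quick sanity check before running GEval, accuracy can be approximated locally. The sketch below is not GEval's implementation: it assumes accuracy means the fraction of lines where the prediction exactly matches the reference, that the value to compare sits in the first column, and that files are UTF-8. The helper names read_column and accuracy are hypothetical.

```python
import csv


def read_column(path):
    """Read the first column of a TSV file, treating double quotes as literal."""
    # Assumption: files are UTF-8 encoded.
    with open(path, 'r', newline='', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        # csv.reader yields an empty list for a blank line.
        return [row[0] if row else '' for row in reader]


def accuracy(expected, predicted):
    """Fraction of positions where the prediction equals the reference exactly."""
    if len(expected) != len(predicted):
        raise ValueError('expected and predicted have different line counts')
    if not expected:
        return 0.0
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)
```

With both files in place, a call such as accuracy(read_column('dev-0/expected.tsv'), read_column('dev-0/out.tsv')) gives a rough number to compare against what GEval reports.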
Directory structure
README.md - this file
config.txt - GEval configuration file
images/ - images to be processed, referenced in TSV files
in-header.tsv - one-line TSV file with column names for input data
out-header.tsv - one-line TSV file with column names for the output data
train/ - directory with hand-annotated gold-standard OCR train data
train/in.tsv - input data for the train set
train/expected.tsv - expected (reference) data for the train set
dev-0/ - directory with hand-annotated gold-standard OCR dev data
dev-0/in.tsv - input data for the dev set
dev-0/expected.tsv - expected (reference) data for the dev set
test-A/ - directory with hand-annotated gold-standard OCR test data
test-A/in.tsv - input data for the test set
test-A/expected.tsv - expected (reference) data for the test set (hidden)
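The layout above suggests pairing each row of an in.tsv file with the column names from in-header.tsv. A minimal sketch of that, assuming UTF-8 files and a single header line; the helper names read_tsv, attach_header and load_split are hypothetical, and the actual column names are whatever in-header.tsv contains:

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a TSV file into a list of rows, treating double quotes as literal."""
    # Assumption: files are UTF-8 encoded.
    with open(path, 'r', newline='', encoding='utf-8') as f:
        return list(csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE))


def attach_header(header, rows):
    """Turn each row into a dict keyed by the column names in header."""
    return [dict(zip(header, row)) for row in rows]


def load_split(root, split):
    """Load the input rows of one split (e.g. 'train', 'dev-0', 'test-A')."""
    root = Path(root)
    header = read_tsv(root / 'in-header.tsv')[0]
    rows = read_tsv(root / split / 'in.tsv')
    return attach_header(header, rows)
```

For example, load_split('.', 'dev-0') would return one dictionary per line of dev-0/in.tsv when run from the repository root.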
Note that we mean TSV, not CSV files. In particular, double quotes are not considered special characters here! For example, set quoting to QUOTE_NONE in the Python csv module:
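import csv

# TSV, not CSV: disable quote handling so double quotes stay literal.
with open('file.tsv', 'r', newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
    for item in reader:
        pass  # item is a list with one string per column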
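To see why the quoting setting matters, the sketch below parses the same line twice: once with the csv module's default quoting and once with QUOTE_NONE. The sample line and the helper name parse_line are made up for illustration.

```python
import csv
import io


def parse_line(line, **reader_options):
    """Parse a single tab-separated line with the given csv.reader options."""
    return next(csv.reader(io.StringIO(line), delimiter='\t', **reader_options))


# A line with three tab-separated columns, where double quotes are plain text.
sample = '"x\ty"\tz\n'

# Default quoting treats the quotes as field delimiters and swallows a tab.
default_fields = parse_line(sample)

# QUOTE_NONE keeps the quotes as ordinary characters and splits on every tab.
literal_fields = parse_line(sample, quoting=csv.QUOTE_NONE)

print(default_fields)  # ['x\ty', 'z']
print(literal_fields)  # ['"x', 'y"', 'z']
```

With default quoting the line collapses to two columns, which would silently misalign the data; with QUOTE_NONE all three columns are preserved.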
Metadata
Tags: eng, document-understanding