HWR challenge for index cards (only recognition, no detection)

Handwriting Recognition for Polish index cards.

The data set is based on the index cards from Korpus Frazeologiczny Języka Polskiego.

This challenge covers recognition only: the lines of handwriting to recognize are given as input, so no detection step is needed.

Metrics

The task will be evaluated using the following metrics:

  • CER (Character Error Rate) — the Levenshtein distance between the reference text and the system output, divided by the total number of characters in the reference.
  • WER (Word Error Rate) — the equivalent of CER for words (the number of words inserted, substituted and deleted, divided by the total number of words in the reference).
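
The two metrics above can be sketched in a few lines of Python. This is only an illustration of the definitions, assuming a non-empty reference and whitespace tokenization for words; the function names are made up for this example, and GEval's own normalization and averaging over the data set may differ.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    # Character-level distance divided by the reference length.
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    # Same computation on word lists (assumed: split on whitespace).
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions score a single reference/hypothesis pair; for official scores, rely on GEval as described in the next section.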

Evaluation

You can carry out the evaluation using GEval once you have generated out.tsv files (in the same format as the expected.tsv files):

wget https://gonito.net/get/bin/geval
chmod u+x geval
./geval -t dev-0

Directory structure

  • README.md — this file
  • config.txt — GEval configuration file
  • images/ — images to be processed, referenced in TSV files
  • in-header.tsv — one-line TSV file with column names for input data
  • out-header.tsv — one-line TSV file with column names for the output data
  • train/ — directory with hand-annotated gold-standard OCR train data
  • train/in.tsv — input data for the train set
  • train/expected.tsv — expected (reference) data for the train set
  • dev-0/ — directory with hand-annotated gold-standard OCR dev data
  • dev-0/in.tsv — input data for the dev set
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • test-A/ — directory with hand-annotated gold-standard OCR test data
  • test-A/in.tsv — input data for the test set
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)

Note that we mean TSV, not CSV files. In particular, double quotes are not considered special characters here! When using the Python csv module, set quoting to QUOTE_NONE:


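    import csv

    # Assuming UTF-8-encoded files; newline='' lets the csv module handle line endings.
    with open('file.tsv', 'r', encoding='utf-8', newline='') as tsvfile:
        reader = csv.reader(tsvfile, delimiter='\t', quoting=csv.QUOTE_NONE)
        for item in reader:
            pass  # item is the list of column values for one line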

Downloading image files

Image files are kept using git-annex. If you need them, install git-annex and run ./annex-get-all.sh.

Format of the test sets

The input file (in.tsv) consists of image file names.
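
As a rough sketch, one split can be loaded by pairing each input line with its reference. The helper name `load_split` is hypothetical. The code assumes UTF-8 files, that the image file name is in the first column of in.tsv, and that line i of expected.tsv corresponds to line i of in.tsv.

```python
import csv
import os


def load_split(split_dir, with_expected=True):
    """Return (image_name, expected_text_or_None) pairs for one split."""
    in_path = os.path.join(split_dir, 'in.tsv')
    with open(in_path, encoding='utf-8', newline='') as f:
        reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        names = [row[0] for row in reader if row]  # assumed: name in column 0

    if not with_expected:  # e.g. test-A, where expected.tsv is hidden
        return [(name, None) for name in names]

    exp_path = os.path.join(split_dir, 'expected.tsv')
    with open(exp_path, encoding='utf-8') as f:
        expected = [line.rstrip('\n') for line in f]

    if len(expected) != len(names):
        raise ValueError('in.tsv and expected.tsv differ in length')
    return list(zip(names, expected))
```

For example, `load_split('dev-0')` would give the pairs for the dev set, and `load_split('test-A', with_expected=False)` only the image names.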

Submission format

Each entry in expected.tsv contains the entire text to be recognized, compressed into one line. To achieve the best possible results, format the submitted out.tsv in the same way, i.e. don't forget to escape backslashes and newlines:

def encode_text(t):
    return t.replace('\\', '\\\\').replace('\n', '\\n')
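
A possible way to use this when producing a submission is sketched below. `decode_text` and `write_out_tsv` are hypothetical helpers, not part of the challenge tooling, and UTF-8 output is an assumption. The decoder scans character by character because two chained `replace` calls would misread an escaped backslash followed by `n`.

```python
def encode_text(t):
    # Same escaping as above: backslashes first, then newlines.
    return t.replace('\\', '\\\\').replace('\n', '\\n')


def decode_text(s):
    """Inverse of encode_text (hypothetical helper for inspecting references)."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s):
            nxt = s[i + 1]
            # '\\n' becomes a newline; an escaped backslash becomes one backslash.
            out.append('\n' if nxt == 'n' else nxt)
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)


def write_out_tsv(path, texts):
    """Write one encoded recognition result per line (assumed UTF-8)."""
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        for text in texts:
            f.write(encode_text(text) + '\n')
```

For example, `write_out_tsv('dev-0/out.tsv', predictions)` would write a list of recognized texts, in the same order as the lines of dev-0/in.tsv.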