GLUE-LM-GAP

GLUE-LM-GAP is an LM-GAP challenge based on the GLUE benchmark.

The LM-GAP challenge is the task of predicting a token between the left and right context of a text. Sometimes the left or right context may not exist. The predicted token and the left/right contexts are not tokenized, which means the predicted token may have punctuation attached to it.

This challenge is based on GLUE benchmark data from the Python datasets library.

Each example in the datasets is a separate element of a GLUE benchmark dataset, e.g. questions and answers form separate examples - they are not concatenated into a single example.

Examples in the datasets with 3 or fewer tokens were filtered out.

Challenge version: 2.0.

Dataset structure

There are 3 datasets:

  • train set in the train directory - it is the original train set from the GLUE benchmark but contains 90% of the train data; the rest of the data is in the validation set,
  • validation set in the dev-0 directory - it is the original train set from the GLUE benchmark but contains 10% of the train data; the rest of the data is in the train set,
  • test set in the test-A directory - this dataset comes without expected tokens; it is the original validation set from the GLUE benchmark.

IMPORTANT! All data are compressed with the xz tool. The highest compression level (the -9 flag) was used.

Structure of each dataset

  • in.tsv.xz - input TSV file (\t - tab is the column separator), where the first column is the dataset name, the second column is the left context and the third column is the right context - column names are listed in the in-header.tsv file; each line is a separate example (see the reading sketch after this list),
  • expected.tsv.xz - expected token in the gap (not available for the test set); this is a one-column file - the column name is listed in the out-header.tsv file; each line is a separate example,
  • raw_data.txt.xz - raw text, each line is a separate line of text,
  • dataset_data.json.xz - datasets JSON data (may be used with the Python datasets library) - it is a list of dictionaries with the keys:
      • dataset_name - a dictionary with the dataset name, (optionally) the dataset configuration and the dataset split name,
      • metric - the metric configuration from the Python evaluate library,
      • feature_definitions - the definition of the data column types from the Python datasets library,
      • data - the dataset from the Python datasets library.
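A minimal reading sketch in Python, using only the standard lzma and json modules. The file layout follows the description above; how the feature_definitions and data values are serialized inside dataset_data.json.xz is not specified here, so treat that part as an assumption and inspect the file before relying on it.

import json
import lzma

# Read the input examples; each row is [dataset name, left context, right context].
with lzma.open("train/in.tsv.xz", "rt", encoding="utf-8") as f:
    examples = [line.rstrip("\n").split("\t") for line in f]

# Read the accompanying dataset metadata.
with lzma.open("train/dataset_data.json.xz", "rt", encoding="utf-8") as f:
    dataset_data = json.load(f)

for entry in dataset_data:
    # Keys as described above; feature_definitions and data are intended
    # for use with the Python datasets library.
    print(entry["dataset_name"], entry["metric"])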

Structure of output file

The output file should be saved with the out prefix (in the dataset directory). Each line of the output file should contain the predicted tokens with their probabilities, separated by space characters. A token and its probability should be separated by the ":" (colon) character. Additionally, each line should contain a special probability for the rest/unknown tokens (written as :PROBABILITY, where PROBABILITY is the probability mass assigned to all remaining tokens) - without this entry the final score can go to infinity (if the predicted tokens are incorrect).

An example output file:

$ cat dev-0/out.tsv
skills:0.78 qualifications:0.12 qualities:0.09 :0.01
replace:0.52 substitute:0.32 surpass:0.15 :0.01

The example contains 2 lines, each with the 3 best tokens (plus the probability for the rest/unknown tokens).
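For illustration, lines in this format could be produced with a short Python snippet like the one below; the predictions themselves are made up, only the token:probability layout matters.

predictions = [
    [("skills", 0.78), ("qualifications", 0.12), ("qualities", 0.09)],
    [("replace", 0.52), ("substitute", 0.32), ("surpass", 0.15)],
]

with open("dev-0/out.tsv", "w", encoding="utf-8") as f:
    for tokens in predictions:
        # Probability mass left over for the rest/unknown tokens.
        rest = max(1.0 - sum(p for _, p in tokens), 1e-6)
        f.write(" ".join(f"{tok}:{p:.4f}" for tok, p in tokens) + f" :{rest:.4f}\n")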

Multiple output files

It is possible to save multiple output files in a dataset directory. To do that, each output file should be saved as out-TAGS.tsv, where TAGS is a list of key-value pairs separated by the , (comma) character. Within each pair, the key and value should be separated by the = (equals sign) character.

For example, out-model=roberta-base,method=MLM.tsv contains 2 tags: model with the value roberta-base and method with the value MLM. Tag keys and values should not contain the , (comma) character.
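For illustration, such a file name could be built from a dictionary of tags (the tag names below are just examples):

tags = {"model": "roberta-base", "method": "MLM"}
file_name = "out-" + ",".join(f"{key}={value}" for key, value in tags.items()) + ".tsv"
# file_name == "out-model=roberta-base,method=MLM.tsv"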

Data statistics

The table below shows the number of lines in each dataset:

| Dataset name   | train   | dev-0  | test-A |
|----------------|---------|--------|--------|
| glue-cola      | 7240    | 804    | 967    |
| glue-mnli      | 686797  | 76311  | 38170¹ |
| glue-mrpc      | 6604    | 732    | 816    |
| glue-qnli      | 188008  | 20891  | 10890  |
| glue-qqp       | 649262  | 72140  | 80210  |
| glue-rte       | 4449    | 496    | 553    |
| glue-sst2      | 43928   | 4879   | 869    |
| glue-stsb      | 10317   | 1146   | 2997   |
| glue-wnli      | 1104    | 122    | 140    |
| Total examples | 1597709 | 177521 | 135612 |

¹ - the glue-mnli `test-A` set is split into `glue-mnli-matched` (matched version) with 18_920 samples and `glue-mnli-mismatched` (mismatched version) with 19_250 samples.

Evaluation

Evaluation can be done with the geval tool.

For example, the dev-0 evaluation can be run with:

geval -t dev-0

Evaluation for each dataset

To evaluate each dataset separately, use the config-extended.txt configuration file: replace config.txt with it and run the evaluation command again.
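For example, something like the following could be used (assuming evaluation is run from the challenge root directory; you may want to back up the original config.txt first):

$ cp config-extended.txt config.txt
$ geval -t dev-0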