RetroC-En temporal classification challenge
Guess the publication year of an English text from the Chronicling America collection (1836-1922).
Only newspapers in English were considered.
For instance, you are expected to guess the publication year of this 500-word text:
and he pledges himself to the frieuds of liberal and judicious education throughout the United Slates that he will produce a work which shall be in every respect worthy of their attention ana patronage. There is a period in the progress from early childhood to maturity, and that by no means a short one, during which the expanding minds of the vuuntr aro seeking in every direction tor frcT'Usot'ul Knowledge, as well as Iutellcctual Entertain- inent.C0 Every book, paper or pamphlet whieh promises eithw, is eagerly read, and every circle te socttfty of a literary or scientific cast is earnestly sought During this period the young person is not sa idticd with that kind of instruction which is given to more children. Something more elevated something nearer the studies and pursuits of active life is required. A mend always at hand who could point out the proper studies to be pursued, he true methods ot development is Literature and Science, the best course of Reading, the surest process ot Investigation, the most recent authori ties hi Experimental, and the most learned in His torical research a friend who could relieve the dryness of abstract truth by a familiar anecdote, narrative or illustration who could scatter a few roses of literature in the rugged paths of severe science, would indeed be invuluable. Such a fiiend not one Youth in a thousand, of either sex, can have. There is no tolerable substi tute to be found in any book we might say in any library. It is proposed in some measure to supply the want of such a friend iu the YOUNG PEO PLE'S BOOK. One of the leading objects of the work will be to point out and illustrate by practical examples the Proper method of in the various departments ot Literature and Art, to suggest an propnate departments ot study and inquiry, to prescribe courses of reeding, and to indicate the progress which may be made in the Science, so far as the limits ot the work will allow. 
The forms into which the different branches of instruction and entertainment will be thrown, will be regulated by the particular object in view at the time, and the clsss or readerB always addressed. Essays, Narratives, Anecdotes, Tales, Historical Rminiscenses and Sketclies, Critiques, Natural History, Antiquities AND Travels, Biographical Notes and Poems, Will all in turn become the vehicles of intellectual developemeut and entertaitiment. The aid of the Arts of Painting and Engraving will bo invoked, and every sulject susceptible of graphic illustration will be accompanied by WELL EXECUTED PICTURES. Arrangements have been made for receiving, and tho publisher is now in tho actual receipt of peri odical publications of a similar design with that of the xuuau rwi'L.us uvuk, irom France, Germany and other parts of the continent of Europe. From these publications, and from the choicest parts or iureign educational literature in its various departments, translations will be made uf bucIi articles as will serve to promote the maiD design
(Yes, there might be a lot of OCR noise there!)
The perfect answer for this text is 1842.5959 (a year with a fraction representing a specific day; August 6th, 1842 in this example). You may also return non-integer numbers: for instance, if you are sure that the text was published in 1935 but have no idea on which day, the optimal decision is to return 1935.5.
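The example value 1842.5959 is consistent with placing a date at the middle of its day within the year, i.e. (day of year - 0.5) / (days in year). The sketch below assumes that convention; it is inferred from the single example above rather than stated by the challenge, and the function name `date_to_fractional_year` is hypothetical.

```python
import calendar
from datetime import date


def date_to_fractional_year(d: date) -> float:
    """Map a date to year + fraction, ASSUMING the mid-day convention."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday  # 1-based: January 1st is day 1
    return d.year + (day_of_year - 0.5) / days_in_year


# August 6th, 1842 is day 218 of a 365-day year: 217.5 / 365 = 0.59589...
print(round(date_to_fractional_year(date(1842, 8, 6)), 4))  # 1842.5959
```

Under the same assumption, a text known only by its year is best placed at the middle of that year, which matches the advice to return a value ending in .5.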
The metric is root mean squared error (and mean absolute error as a secondary metric).
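For a quick local sanity check, both metrics can be computed in a few lines. This is only an illustrative sketch using the standard definitions; the function names `rmse` and `mae` are hypothetical, and official scores should come from GEval with the provided config.txt.

```python
import math


def rmse(expected, predicted):
    """Root mean squared error between two equally long sequences of years."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    squared = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(expected, predicted):
    """Mean absolute error between two equally long sequences of years."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)
```

Because RMSE squares each error, a few predictions that miss by decades weigh much more heavily than under MAE.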
README.md - this file
config.txt - GEval configuration file
train/ - directory with training data
train/train.tsv.xz - train set (compressed with xz, not gzip!)
dev-0/ - directory with dev (test) data from sources different from the ones in the train set
dev-0/in.tsv - input text for the dev set
dev-0/expected.tsv - expected data for the dev set (publication years)
dev-0/meta.tsv.xz - metadata (do not use while testing)
test-A/ - directory with test data
test-A/in.tsv - input text for the test set
test-A/expected.tsv - expected data for the test set (hidden)
test-A/meta.tsv.xz - hidden metadata
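Since the train set and metadata are xz-compressed, they can be read directly with Python's standard `lzma` module without unpacking them first. The sketch below is a minimal example; the helper name `iter_tsv_xz` is hypothetical, and no column layout is assumed because this README does not describe the columns of these files.

```python
import lzma


def iter_tsv_xz(path):
    """Yield each line of an xz-compressed TSV file as a list of string fields."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")
```

For example, `next(iter_tsv_xz("train/train.tsv.xz"))` returns the fields of the first row, which is a convenient way to inspect the column layout before writing any further parsing code.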
Structure of data sets
The dev and test sets are balanced by year (or at least balancing was attempted; for some years there was not enough material).
Format of the test sets
The input file is just a list of ~500-word-long text snippets, one per line.
The expected.tsv file is a list of publication years (with fractions), one per line.
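Because the two files are aligned line by line, a dev set can be loaded by reading them in parallel. This is a minimal sketch under that assumption; the helper name `load_examples` is hypothetical.

```python
def load_examples(in_path, expected_path):
    """Pair each text snippet with its expected fractional publication year."""
    with open(in_path, encoding="utf-8") as f:
        texts = [line.rstrip("\n") for line in f]
    with open(expected_path, encoding="utf-8") as f:
        years = [float(line) for line in f if line.strip()]
    # The files are expected to be aligned: line i of one matches line i of the other.
    if len(texts) != len(years):
        raise ValueError(
            f"line count mismatch: {len(texts)} texts vs {len(years)} years"
        )
    return list(zip(texts, years))
```

Calling `load_examples("dev-0/in.tsv", "dev-0/expected.tsv")` would then give a list of `(text, year)` pairs; the explicit length check guards against silently misaligned files.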
Format of the output files
For each input line, a publication year should be given (in the same format as in the
expected.tsv files). The name of the output files is