Sentiment by emoticons challenge

Give the probability of a positive sentiment for a short Polish text.

The corpus was created using emoticons: a positive emoticon (e.g. :-)) was assumed to indicate positive sentiment, and a negative emoticon (e.g. :-() negative sentiment. Each emoticon was then replaced by <EMOTICON>, so the challenge is, in effect, to guess the sentiment at the place where an emoticon was used.
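
The labelling heuristic described above can be sketched as follows. This is only an illustration, not the script that was used to build the corpus: the emoticon lists, the function name `label_and_mask`, and the decision to skip texts containing both kinds of emoticon are all assumptions.

```python
import re

# Hypothetical emoticon inventories; the real lists used for the corpus are not given.
POSITIVE = (":-)", ":)")
NEGATIVE = (":-(", ":(")

_EMOTICON = re.compile("|".join(re.escape(e) for e in POSITIVE + NEGATIVE))


def label_and_mask(text):
    """Return (masked_text, label) or None if no usable emoticon is found.

    label is 1 for positive and 0 for negative sentiment.
    """
    found = _EMOTICON.findall(text)
    if not found:
        return None
    has_positive = any(e in POSITIVE for e in found)
    has_negative = any(e in NEGATIVE for e in found)
    if has_positive and has_negative:
        # Assumption: mixed signals are ambiguous, so the text is skipped.
        return None
    masked = _EMOTICON.sub("<EMOTICON>", text)
    return masked, 1 if has_positive else 0
```

For instance, `label_and_mask("Super dzień :-)")` yields the masked text `Super dzień <EMOTICON>` together with the label 1.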

The data sets were prepared from the Common Crawl corpus. The classes are balanced (50%/50%).

Log loss is used as the metric.
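
As a rough guide to how the metric behaves, here is a minimal sketch of binary log loss in plain Python. It assumes the usual definition (mean negative natural-log likelihood of the true class) with predictions clipped away from 0 and 1; the exact clipping and logarithm base used by the official evaluator are not stated here, so treat `eps` and the function name `log_loss` as assumptions.

```python
import math


def log_loss(expected, predicted, eps=1e-15):
    """Mean binary log loss.

    expected  -- iterable of gold labels (0 or 1)
    predicted -- iterable of predicted probabilities of class 1
    eps       -- assumed clipping constant to avoid log(0)
    """
    expected = list(expected)
    predicted = list(predicted)
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute log loss of an empty set")
    total = 0.0
    for y, p in zip(expected, predicted):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(expected)


# Always answering 0.5 is the natural baseline for balanced classes.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

Under these assumptions the constant 0.5 baseline scores ln 2 (about 0.693), so a useful model should score below that. Confident wrong answers are penalised heavily, which is why outputting hard 0/1 values is usually a bad idea with this metric.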

Classes

  • 1 — positive sentiment
  • 0 — negative sentiment

Directory structure

  • README.md — this file
  • config.txt — configuration file
  • train/ — directory with training data
  • train/in.tsv.xz — train set - input (compressed using xz)
  • train/expected.tsv — train set - expected output
  • train/meta.tsv.xz — metadata (for reference only; do not use during training)
  • dev-0/ — directory with dev (test) data
  • dev-0/in.tsv — input data for the dev set (text fragments)
  • dev-0/expected.tsv — expected (reference) data for the dev set
  • dev-0/meta.tsv — metadata (not used during testing)
  • test-A/ — directory with test data
  • test-A/in.tsv — input data for the test set (text fragments)
  • test-A/expected.tsv — expected (reference) data for the test set (hidden)
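
Given the layout above, the files can be loaded with the Python standard library alone. The sketch below assumes that `in.tsv` and `expected.tsv` are UTF-8, line-aligned (line i of one corresponds to line i of the other), and that `expected.tsv` holds a single 0/1 label per line; none of this is spelled out above, so check it against the actual files. The helper names and the output file name used in the usage note are hypothetical.

```python
import lzma
from pathlib import Path


def resolve(directory, name):
    """Return the path to `name` in `directory`, plain or xz-compressed."""
    plain = Path(directory) / name
    compressed = Path(directory) / (name + ".xz")
    if plain.exists():
        return plain
    if compressed.exists():
        return compressed
    raise FileNotFoundError(f"neither {plain} nor {compressed} exists")


def read_lines(path):
    """Read a text file line by line, decompressing transparently if it ends in .xz."""
    path = Path(path)
    opener = lzma.open if path.suffix == ".xz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(directory):
    """Load (texts, labels) for a split such as train/ or dev-0/."""
    texts = read_lines(resolve(directory, "in.tsv"))
    labels = [int(value) for value in read_lines(resolve(directory, "expected.tsv"))]
    if len(texts) != len(labels):
        raise ValueError("in.tsv and expected.tsv are not line-aligned")
    return texts, labels


def write_predictions(path, probabilities):
    """Write one probability of the positive class per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for p in probabilities:
            handle.write(f"{p:.6f}\n")
```

A typical use would be `texts, labels = load_split("dev-0")` followed by `write_predictions("dev-0/out.tsv", [0.5] * len(texts))`, where `out.tsv` is an assumed output name rather than one listed above. Because `resolve` accepts both the plain and the `.xz` variant of a file, the same code works for every split.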