TiCC Colloquium: Antske Fokkens

What: Beyond the f-score: Evaluation in NLP
Where: AZ 414
When: Wednesday, 30 November 2016, 12:30 - 13:30 hours


Shared tasks and (shared) corpora have proven themselves highly valuable for NLP. They have allowed us to evaluate our methods and compare them to others, helping us, our readers and reviewers to assess the quality of our methods. A downside of the widespread approach of comparing results on a gold dataset is that it is relatively common practice to draw conclusions based on the highest numbers without looking into what is behind them. However, what goes wrong and why can be highly relevant for end applications and, especially given the well-known difficulties with reproducing results, looking into the details of how and why results improve (or not) is important. In this talk, I will propose two ways of looking 'beyond the f-score': intrinsic evaluation, shown through a use case investigating error propagation in parsing, and extrinsic evaluation in digital humanities studies.

