00:25:04 Dominik Macháček: We cannot hear Tom well.
00:25:18 Dominik Macháček: It was better with the face mask a moment ago.
00:26:10 Jakub Náplava: Agree, the sound is not good.
00:26:41 David: It seems it is coming from another mic...
00:28:44 Stanislava Gráf: Sound seems better now.
00:28:56 Pavel Pecina: Sound's OK.
00:29:09 Josef Jon: Much better now.
00:32:06 Zoom 001: Thanks for the feedback 😊
01:06:46 Zoom 001: If you have any questions for the presenter, please write them in the chat and I will read them to the presenter.
01:07:23 Dominik Lukes (iPad): What is the unit of evaluation for these metrics? BLEU is on the sentence level, I think, but some of the issues may appear on the text level, e.g. anaphora.
01:12:36 Dominik Lukes (iPad): Is there any testing being done on the usability of translations? E.g., what sort of misunderstandings have actual MT outputs led to? For example, two translations may have the same score but one gives the opposite meaning - BERT famously not being all that good at “not”.
01:14:31 Mateusz Krubiński: In the Freitag paper that you mentioned, they report that COMET performs best when computing the correlation with MQM labels. If I understood correctly, your human scores are the DA ones, right? If so, using the DA labels + pairwise accuracy gives a very similar ordering of metrics as MQM + Pearson; any comments on that?
01:18:06 Mateusz Krubiński: Thanks :)
01:21:19 Mateusz Krubiński: As for the systems, how many of them are incremental improvements, and how many are actually independently trained?