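Evaluate the quality of generated text along multiple dimensions, such as consistency and fluency.
Dataset constructed by the authors, with 30K pseudo-labeled samples per dimension: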
we first design specific rules for several commonly evaluated dimensions to construct pseudo data, and then combine them to train the evaluator.
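To make the idea of rule-based pseudo data concrete, here is a minimal sketch. The specific rule (shuffling word order to produce a disfluent negative sample) and the function name `make_pseudo_pairs` are assumptions for illustration only; these notes do not list the authors' actual rules.

```python
import random


def make_pseudo_pairs(texts, seed=0):
    """Build (text, label) pseudo samples for one dimension.

    Label 1 marks the original text, label 0 a perturbed copy.
    The perturbation rule is hypothetical: shuffle the word order.
    """
    rng = random.Random(seed)
    samples = []
    for text in texts:
        samples.append((text, 1))
        tokens = text.split()
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        # Keep the negative only if the shuffle actually changed the order.
        if shuffled != tokens:
            samples.append((" ".join(shuffled), 0))
    return samples
```

Pseudo samples built this way for each dimension could then be pooled into a single training set, which is the "combine them" step described in the quote above.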
Task formats: summarization and dialogue.
Experimental validation: the metrics compared against include BLEU, METEOR, ROUGE, BERTScore, ...
Human-annotated data: to verify that the proposed evaluator is qualified, we need to calculate its correlations with human scores on each benchmark.
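The correlation check can be sketched as follows, assuming that Pearson and Spearman coefficients are the measures of interest (the notes do not say which coefficients are used). The helper names `pearson`, `ranks`, and `spearman` are illustrative, and the code uses only the standard library.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Here `x` would hold the evaluator's scores and `y` the human scores for the same outputs in one benchmark; a higher correlation suggests the evaluator tracks human judgment more closely.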
Train the evaluator for 1-3 epochs. Supervised method.
Conditional text generation (for example, machine translation): the goal is to generate a hypothesis h = (h_1, ..., h_m) based on a given source text s = (s_1, ..., s_n).
Metrics either require human judgments to train (i.e., supervised metrics): COMET [57], BLEURT [63]; or are human-judgment-free (i.e., unsupervised): BLEU [51], ROUGE-1, ROUGE-2, ROUGE-L, CHRF [53], PRISM [66], MoverScore [77], BERTScore [76].
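As an illustration of a human-judgment-free metric, the sketch below computes a simplified ROUGE-1 (unigram overlap) between a hypothesis and a single reference. It assumes lowercasing and whitespace tokenization; real ROUGE implementations add options such as stemming, so the numbers will not match an official toolkit exactly. The function name `rouge_1` is illustrative.

```python
from collections import Counter


def rouge_1(hypothesis, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared unigram.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

No training data or human scores are involved: the score depends only on the hypothesis and the reference, which is what makes this family of metrics unsupervised.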
Datasets (these were created for specific areas from 2015 onward):
TASK | Datasets | Description |
SUM | NER 18 | 60 articles |
MT | WMT 19 | |
Factuality | Rank19 | 373 triples of a source sentence with two summary sentences, one correct and one incorrect. |
Factuality | QAGS20 | 235 test outputs on the CNNDM dataset from [16] and 239 test outputs on the XSUM dataset [48] from BART fine-tuned on XSUM |
Data to Text | BAGEL | 202 samples, each consisting of one meaning representation, multiple references, and utterances generated by different systems |
Train the evaluator.
Datasets: WMT 18 and WMT 19
Task format: machine translation and image captioning
No training.