import tensorflow as tf

# log_softmax converts logits into log-probabilities over the label classes.
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
# The one-hot mask selects -log p(true class) for each example.
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)  # average over the batch
Since the BERT code was originally released for TensorFlow 1.x, this loss function was not obvious to me when reading the code. After checking, it is the negative log-likelihood (equivalently, categorical cross-entropy): because one_hot_labels is 1 only at the true class, per_example_loss reduces to -log p(y_i | x_i), and loss is the mean of these values over the batch.
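To convince myself, the same value can be reproduced with TensorFlow's built-in sparse categorical cross-entropy. A minimal sketch, assuming TensorFlow 2.x with eager execution; the toy logits and labels here are made up purely for illustration:

import numpy as np
import tensorflow as tf  # assumes TensorFlow 2.x (eager execution)

# Toy batch for illustration: 2 examples, 3 label classes (made-up values).
logits = tf.constant([[2.0, 0.5, -1.0],
                      [0.1, 1.5, 0.3]])
labels = tf.constant([0, 1])
num_labels = 3

# BERT's manual negative log-likelihood, exactly as in the snippet above.
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
manual_loss = tf.reduce_mean(per_example_loss)

# The built-in sparse categorical cross-entropy gives the same number.
builtin_loss = tf.reduce_mean(
    tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True))

np.testing.assert_allclose(manual_loss.numpy(), builtin_loss.numpy(), rtol=1e-6)
print(float(manual_loss))  # about 0.339 for this toy batch

Passing from_logits=True lets the library fuse the softmax into the cross-entropy, which is more numerically stable than converting to probabilities first and then taking the log.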