Multimodal Knowledge QA: MMCoQA: Conversational Question Answering over Text, Tables, and Images

Paper: MMCoQA: Conversational Question Answering over Text, Tables, and Images

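Core of the paper

The paper studies how to conduct multi-turn conversations over multimodal information, covering image, text, and table data.
Several design questions come up along the way: how should the encoder encode each modality, how are scores computed, and which parts can reuse existing models?
The method is an end-to-end question answering architecture: the input is a question and the output is an answer span, given by its start and end positions.
The overall pipeline has three parts: question understanding, multimodal evidence retrieval, and answer extraction.
[Figure: results reported in the paper; image not preserved.]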

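Preface

Existing question answering work can be divided into knowledge-base QA, document-based QA, and community QA.
Most current research targets QA over a single knowledge source.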

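Work in the paper

Dataset construction

1. Generate potential conversations
Step 1: for each sentence in the question pool, identify the entities in the question and in the answer.
Step 2: randomly select questions from the question pool that contain the same entity.
2. Decompose complex questions
Each complex question is decomposed into several sub-questions. The dataset provides the question type (which indicates the decomposition logic and the target number of sub-questions) and the intermediate answers.
3. Rewrite the decomposed questions as a multi-turn conversation.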
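Step 1 above (grouping questions by shared entity) can be sketched as follows. This is only an illustration: the `entities` field, the function names, and the toy question pool are hypothetical and are not taken from the paper or its code, and entity recognition itself is assumed to have been done already.

```python
import random
from collections import defaultdict


def build_entity_index(question_pool):
    """Map each entity to the indices of the questions that mention it."""
    index = defaultdict(list)
    for i, item in enumerate(question_pool):
        for entity in item["entities"]:
            index[entity].append(i)
    return index


def sample_conversation(question_pool, entity, max_turns, seed=0):
    """Randomly pick up to max_turns questions that share `entity`."""
    index = build_entity_index(question_pool)
    candidates = index.get(entity, [])
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    chosen = rng.sample(candidates, min(max_turns, len(candidates)))
    return [question_pool[i] for i in chosen]


# Toy pool (hypothetical data, for illustration only)
pool = [
    {"question": "Who directed Film A?", "answer": "Director X",
     "entities": {"Film A", "Director X"}},
    {"question": "When was Director X born?", "answer": "1950",
     "entities": {"Director X"}},
    {"question": "Where is City B?", "answer": "Country C",
     "entities": {"City B", "Country C"}},
]
conversation = sample_conversation(pool, "Director X", max_turns=2)
```

The sampled questions would then be the raw material for steps 2 and 3, where complex questions are decomposed and rewritten as conversation turns.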

Data quality is ensured through five steps: training, annotation, checking, modification, and re-checking.

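Dataset analysis

Question analysis: question length
Answer analysis: modality of the answer
Modality analysis: which modalities a question needs in order to be answered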

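Model (the MAE model)

MAE splits the MMCoQA task into three steps: conversational question understanding, multimodal evidence retrieval, and adaptive answer extraction.

Conversational question understanding

Encoder.
The question is encoded with BERT.
The remaining corpus, from which answers are drawn, is processed per modality:
Text: BERT.
Tables: linearized by rows.
Images: ResNet.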
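Row-wise table linearization can be sketched as below. The exact string format used by the authors is not shown in these notes, so the "column is value" template, the separators, and the function name `linearize_table_by_rows` are assumptions made for illustration.

```python
def linearize_table_by_rows(header, rows, cell_sep=" | ", row_sep=" ; "):
    """Flatten a table into one string, row by row.

    Each cell is paired with its column name so that the text encoder
    still sees which column a value belongs to.
    """
    linearized_rows = []
    for row in rows:
        cells = [f"{col} is {val}" for col, val in zip(header, row)]
        linearized_rows.append(cell_sep.join(cells))
    return row_sep.join(linearized_rows)


# Toy table (hypothetical data)
header = ["Year", "Title"]
rows = [["1999", "Film A"], ["2003", "Film B"]]
text = linearize_table_by_rows(header, rows)
# text == "Year is 1999 | Title is Film A ; Year is 2003 | Title is Film B"
```

The resulting string can then be tokenized like an ordinary passage.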

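(In the encoder code, tables, passages, questions, and answers are all encoded with BERT through the function retriever_convert_example_to_feature. Images are first transformed to tensors and then passed through the function _image_transform, although this step does not look like ResNet.)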

```
import numpy as np

# Query features: BERT inputs for the question
query_feature_dict = {'query_input_ids': np.asarray(query_feature.input_ids),
                      'query_token_type_ids': np.asarray(query_feature.token_type_ids),
                      'query_attention_mask': np.asarray(query_feature.attention_mask),
                      'qid': qas_id}

if entry['question_type'] == "text":
    # Text evidence: real BERT inputs, plus an all-zero image placeholder
    passage_feature_dict = {'passage_input_ids': np.asarray(passage_feature.input_ids),
                            'passage_token_type_ids': np.asarray(passage_feature.token_type_ids),
                            'passage_attention_mask': np.asarray(passage_feature.attention_mask),
                            'retrieval_label': passage_feature.label,
                            'example_id': example_id,
                            'image_input': np.zeros([3, 512, 512])}
```
```
if entry['question_type'] == "image":
    # Image evidence: the text inputs are all-zero placeholders and the image tensor carries the content
    passage_feature_dict = {'passage_input_ids': np.asarray([0] * self._passage_max_seq_length),
                            'passage_token_type_ids': np.asarray([0] * self._passage_max_seq_length),
                            'passage_attention_mask': np.asarray([0] * self._passage_max_seq_length),
                            'retrieval_label': 1,
                            'example_id': example_id,
                            'image_input': img}
```
```
# Table evidence: the table goes through the same BERT passage fields, with an all-zero image placeholder
table_id = entry['table_id']

passage_feature_dict = {'passage_input_ids': np.asarray(table_feature.input_ids),
                        'passage_token_type_ids': np.asarray(table_feature.token_type_ids),
                        'passage_attention_mask': np.asarray(table_feature.attention_mask),
                        'retrieval_label': table_feature.label,
                        'example_id': example_id,
                        'image_input': np.zeros([3, 512, 512])}
```
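```
# Image encoder: ResNet-101 with pretrained weights
image_encoder = torchvision.models.resnet101(pretrained=True)

# Separate BERT instances for the query side and the passage side
self.query_encoder = BertModel(config)

self.passage_encoder = BertModel(config)
```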
Multimodal evidence retrieval

self._modality_dict={'text':0,'table':0,'image':1}

In the model code, BertForOrconvqaGlobal contains a modality classifier:
self.modality_detection=nn.Linear(config.proj_size, 3)

When the task is solved in the extractive QA paradigm, a linear head produces start and end logits for every token:
self.qa_outputs = nn.Linear(config.hidden_size, config.num_qa_labels)
qa_logits = self.qa_outputs(sequence_output)
start_logits, end_logits = qa_logits.split(1, dim=-1)
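Once start and end logits are available, an answer span still has to be chosen from them. A common way to do this, sketched here in plain Python, is to pick the pair (start, end) with the highest summed logit subject to start <= end and a length cap. The function name `best_span`, the `max_answer_len` parameter, and the toy logits are assumptions for illustration; the notes do not show how the authors decode spans.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e].

    Only spans with s <= e and at most max_answer_len tokens are considered.
    """
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy logits for a 4-token passage (hypothetical values)
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [0.0, 0.5, 1.5, 3.0]
start, end, score = best_span(start_logits, end_logits, max_answer_len=2)
```

With the length cap of 2 tokens, the sketch selects tokens 1 to 2; without the cap it would prefer the longer span ending at token 3.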

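Maximum inner product search
The encoder parameters are frozen, and the similarity between the question embedding and each knowledge-base item is computed as an inner product. The top-N items are then selected.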
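A minimal sketch of top-N retrieval by inner product is shown below. It scores every item exhaustively in plain Python, which is fine for illustration; a large collection would normally use a dedicated vector index instead. The item ids, vectors, and function names are hypothetical.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_n_by_inner_product(query_vec, item_vecs, n):
    """Exhaustive maximum inner product search: score every item, keep the n best."""
    scored = [(inner_product(query_vec, vec), item_id)
              for item_id, vec in item_vecs.items()]
    return heapq.nlargest(n, scored)


# Precomputed item embeddings (hypothetical values); with frozen encoders
# these can be computed once and reused for every question.
item_vecs = {
    "passage_0": [1.0, 0.0],
    "table_0": [0.5, 0.5],
    "image_0": [0.0, 1.0],
}
query_vec = [2.0, 1.0]
top = top_n_by_inner_product(query_vec, item_vecs, n=2)
```

Because text, tables, and images are embedded into one shared space, a single search of this kind can return evidence of any modality.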

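Adaptive answer extraction

First, a classifier over the question selects the modality in which the answer is most likely to appear.
Then an extraction model is built for each of the three modalities.
TextExtractor: "TextExtractor predicts an answer span by computing two scores for each token in a passage in P_r to be the start token and the end token."
TableExtractor: "we concatenate the question text to the linearized table sequence, and encode them using BERT." The start and end tokens are computed in the same way.

ImageExtractor: "We extract the visual feature v_i for an image with the ResNet, and append the question text with all the answers in the answer set as a text sequence." The start and end tokens are computed here as well.

The final answer score has three components: the retrieval score, the modality score, and the answer extraction score.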
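How the three components are merged is not spelled out in these notes. The sketch below assumes a plain sum, which is only one possible choice; the field names, the function `rank_candidates`, and the numbers are hypothetical.

```python
def rank_candidates(candidates):
    """Sort candidate answers by the sum of their three component scores.

    Summation is an assumption of this sketch, not a confirmed detail.
    """
    def total(c):
        return c["retrieval_score"] + c["modality_score"] + c["extraction_score"]

    return sorted(candidates, key=total, reverse=True)


# Toy candidates (hypothetical values)
candidates = [
    {"answer": "span from passage", "retrieval_score": 1.2,
     "modality_score": 0.3, "extraction_score": 2.0},
    {"answer": "span from table", "retrieval_score": 0.9,
     "modality_score": 1.5, "extraction_score": 2.4},
]
best = rank_candidates(candidates)[0]
```

In this toy case the table candidate wins even though its retrieval score is lower, because the modality and extraction scores outweigh the difference.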

loss = loss.mean() # mean() to average on multi-gpu parallel (not distributed) training
retriever_loss = retriever_loss.mean()
reader_loss = reader_loss.mean()
qa_loss = qa_loss.mean()
rerank_loss = rerank_loss.mean()

> QA loss computation

qa_loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = qa_loss_fct(start_logits, start_positions)
end_loss = qa_loss_fct(end_logits, end_positions)
qa_loss = (start_loss + end_loss) / 2
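To make the arithmetic concrete, here is a small pure-Python sketch of the same computation: a mean cross-entropy that skips targets equal to an ignore index, applied to start and end positions and then averaged. It illustrates the formula only and does not replace `CrossEntropyLoss`; the toy logits, the choice of 3 as `ignored_index`, and returning 0.0 when every target is ignored are assumptions of this sketch.

```python
import math


def cross_entropy(logits_batch, targets, ignore_index=None):
    """Mean negative log-likelihood over examples whose target is not ignore_index."""
    losses = []
    for logits, target in zip(logits_batch, targets):
        if target == ignore_index:
            continue  # this example contributes nothing to the loss
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[target])
    return sum(losses) / len(losses) if losses else 0.0


# Two examples over a 3-token sequence; the second has no answer in range,
# so its positions are set to ignored_index (hypothetical values).
ignored_index = 3
start_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
end_logits = [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]
start_positions = [0, 3]
end_positions = [2, 3]

start_loss = cross_entropy(start_logits, start_positions, ignored_index)
end_loss = cross_entropy(end_logits, end_positions, ignored_index)
qa_loss = (start_loss + end_loss) / 2
```

Only the first example contributes here, so both losses equal log(e^2 + 2) - 2 and the averaged QA loss has the same value.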

> Modality loss computation

self.modality_detection=nn.Linear(config.proj_size, 3)
modality_loss_fct =CrossEntropyLoss()
modality_loss = modality_loss_fct(modality_logits, modality_labels)
Original article: https://blog.csdn.net/Hekena/article/details/126401777
