We construct and release a new fake news dataset with social context, named MC-Fake, which contains 27,155 news events across 5 topics, together with their social context: 5 million posts, 2 million users, and an induced social graph with 0.2 billion edges.
We propose a novel Post-User Interaction Network (PSIN), which applies a divide-and-conquer strategy to model the heterogeneous relations. Specifically, we integrate the post-post, user-user and post-user subgraphs with three variants of Graph Attention Networks based on their intrinsic characteristics. Additionally, we employ an adversarial topic discriminator to learn topic-agnostic features for veracity classification.
We evaluate our proposed model on the curated dataset in two settings: in-topic split and out-of-topic split. The superior results of our model in both settings reveal the effectiveness of the proposed method.
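The two evaluation settings can be illustrated with a minimal sketch. It assumes each event is a dictionary with a `topic` field; the function names, the 20% test ratio, and the held-out topic are hypothetical choices for illustration, not the protocol reported by the authors.

```python
import random
from collections import defaultdict


def in_topic_split(events, test_ratio=0.2, seed=0):
    """Split each topic separately, so train and test share the same topics."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for event in events:
        by_topic[event["topic"]].append(event)

    train, test = [], []
    for topic in sorted(by_topic):
        group = list(by_topic[topic])
        rng.shuffle(group)
        n_test = int(len(group) * test_ratio)  # hypothetical ratio
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def out_of_topic_split(events, held_out_topics):
    """Hold out whole topics, so test topics are never seen during training."""
    held = set(held_out_topics)
    train = [e for e in events if e["topic"] not in held]
    test = [e for e in events if e["topic"] in held]
    return train, test


# Toy data: 10 events per topic, alternating veracity labels.
TOPICS = ["Politics", "Entertainment", "Health", "Covid-19", "Syria War"]
events = [
    {"id": i, "topic": TOPICS[i % 5], "label": "F" if i % 2 else "R"}
    for i in range(50)
]

in_train, in_test = in_topic_split(events)
out_train, out_test = out_of_topic_split(events, ["Health"])
```

In the in-topic setting every topic appears on both sides of the split, whereas the out-of-topic setting tests whether the learned features transfer to a topic that was absent from training.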
2 Related work
2.1 Fake News Datasets
BuzzFeedNews specializes in political news published on Facebook during the 2016 U.S. Presidential Election.
LIAR collects 12.8K short statements with manual labels from a political fact-checking website.
FA-KES consists of 804 articles about the Syrian war.
CREDBANK contains about 1,000 news events and 60 million tweets, labeled via Amazon Mechanical Turk.
Twitter15 contains 778 reported events between March 2015 and December 2015, with 1 million posts from 500k users.
FakeNewsNet is a data repository with news content and related posts, containing political and entertainment news fact-checked by PolitiFact and GossipCop.
FakeHealth is collected from the healthcare information review website Health News Review. It contains over 2,000 news articles, 500k posts and 27k user profiles, along with user networks.
CoAID collects 1,896 news articles, 183,654 related user engagements, and 516 social platform posts about COVID-19, along with ground-truth labels.
FakeCovid is a multilingual cross-domain dataset of 5,182 fact-checked news articles on COVID-19 from 92 different fact-checking websites.
MM-COVID is a multilingual and multidimensional COVID-19 fake news data repository, containing 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in 6 different languages.
A news event can be considered as a heterogeneous graph with two types of nodes (post and user) and three types of edges (post-post, user-user and user-post), as shown in Figure 2.
In this paper's dataset, each $T_i$ has a topic label $y_i^C \in \{\text{Politics}, \text{Entertainment}, \text{Health}, \text{Covid-19}, \text{Syria War}\}$ and a ground-truth veracity label $y_i^V \in \{F, R\}$ (i.e., fake news or real news).
Problem objective: Problem 1. Given the training set $\mathcal{T}_{train} = \{T_{train}, Y^V_{train}, Y^C_{train}\}$ and the testing set $\mathcal{T}_{test} = \{T_{test}\}$, how can we learn a classifier $f: T_i \to y_i^V$ from $\mathcal{T}_{train}$ and then predict the veracity labels $Y_{test}$ for $\mathcal{T}_{test}$?
The graph structure is divided into three parts: the post propagation tree, the user social graph, and the post-user interaction graph.
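As a rough illustration of this division, the sketch below groups the typed edges of one event into the three subgraphs. The node encoding `(type, index)`, the function name `split_by_relation`, and the edge semantics in the comments (reply, follow, authorship) are assumptions made for the example, not details taken from the source.

```python
from collections import defaultdict

# Map (source node type, target node type) to the subgraph it belongs to.
RELATION_TO_SUBGRAPH = {
    ("post", "post"): "post_propagation_tree",
    ("user", "user"): "user_social_graph",
    ("post", "user"): "post_user_interaction_graph",
    ("user", "post"): "post_user_interaction_graph",
}


def split_by_relation(edges):
    """Group edges of a heterogeneous event graph by relation type."""
    subgraphs = defaultdict(list)
    for src, dst in edges:
        name = RELATION_TO_SUBGRAPH[(src[0], dst[0])]
        subgraphs[name].append((src, dst))
    return dict(subgraphs)


# Toy event: nodes are (type, index) pairs.
edges = [
    (("post", 0), ("post", 1)),  # assumed: post 1 responds to post 0
    (("post", 0), ("post", 2)),
    (("user", 0), ("user", 1)),  # assumed: a follow relation
    (("user", 0), ("post", 0)),  # assumed: user 0 wrote post 0
    (("user", 1), ("post", 1)),
]
parts = split_by_relation(edges)
```

Keeping the three edge lists separate is what allows each subgraph to be processed by its own model variant before the results are combined.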
The overall framework is as follows:
4.1 Hybrid Node Feature Encoder
For the $i$-th event $T_i$, the node set is $\{p_1^i, p_2^i, \dots, p_{M_i}^i, u_1^i, u_2^i, \dots, u_{N_i}^i\}$, and each node has textual features and meta features. The meta features of posts and users are as follows:
$a \in \mathbb{R}^d$ is a parameter vector
$W = [W_s \,\|\, W_d]$, where $W_s$ and $W_d$ are parameter matrices that project source nodes and target nodes
$e_{ij}$ and $\alpha_{ij}$ are the unnormalized and normalized attention scores
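To make the roles of these symbols concrete, here is a minimal pure-Python sketch of one attention step. The source does not show the scoring equation itself, so the form used here is an assumption: the score `e_ij` is computed as `a · LeakyReLU(W_s h_j + W_d h_i)` for a source node `j` and a target node `i`, and `alpha_ij` is a softmax over the sources of `i`. The helper names and the LeakyReLU slope of 0.2 are likewise hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention(h_target, h_sources, W_s, W_d, a):
    """Return (unnormalized scores e_ij, normalized weights alpha_ij).

    Assumed scoring form, not confirmed by the source:
        e_ij = a . LeakyReLU(W_s h_j + W_d h_i)
    """
    proj_target = matvec(W_d, h_target)
    scores = []
    for h_j in h_sources:
        proj_source = matvec(W_s, h_j)
        hidden = [leaky_relu(s + t) for s, t in zip(proj_source, proj_target)]
        scores.append(sum(a_k * z_k for a_k, z_k in zip(a, hidden)))

    # Softmax over the source nodes, shifted by the max for numerical stability.
    max_score = max(scores)
    exps = [math.exp(e - max_score) for e in scores]
    total = sum(exps)
    return scores, [x / total for x in exps]


# Toy example with d = 2 and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
e, alpha = attention(
    h_target=[0.0, 0.0],
    h_sources=[[1.0, 0.0], [0.0, 0.0]],
    W_s=identity,
    W_d=identity,
    a=[1.0, 1.0],
)
```

Applying `W_s` to the source and `W_d` to the target and summing the results is equivalent to applying the concatenated matrix `W = [W_s || W_d]` to the stacked pair of node features, which is why the two matrices can be described as one parameter.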
PPC_RNN+CNN [23]: A fake news detection approach combining RNN and CNN, which learns the fake news representations through the characteristics of users in the news propagation path.
RvNN [25]: A tree-structured recursive neural network with GRU units that learns the propagation structure.
Bi-GCN [4]: A GCN-based rumour detection model using bi-directional GCN to represent the propagation structure.
PLAN [17]: A post-level attention model that incorporates tree structure information in the Transformer network.
FANG [28]: A graphical fake news detection model based on the interaction between users, news, and sources. We remove the source network modeling part for fair evaluation.
RGCN [33]: The relational graph convolutional network keeps a distinct linear projection weight for each edge type.
HGT [13]: Heterogeneous Graph Transformer leverages node- and edge-type dependent parameters to characterize the heterogeneous attention over each edge.
PSIN: Our proposed Post-User Interaction Network.
PSIN(-T): PSIN without the adversarial topic discriminator. We compare it with other baselines to demonstrate the superiority of our network architecture.