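Chapter 4, "Cluster membership changes", of the Raft dissertation "CONSENSUS: BRIDGING THEORY AND PRACTICE" supports two ways of changing cluster membership: changing one server at a time, and changing several servers at once through joint consensus.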
Raft's safety depends on there being at most one leader in any given term. If servers could be added to or removed from a running cluster without any restriction, more than one leader could be elected. The figure illustrates this: (Switching directly from one configuration to another can be unsafe because different servers will switch at different times. In this example, the cluster grows from three servers to five. Unfortunately, there is a point in time where two different leaders can be elected for the same term, one with a majority of the old configuration ($C_{old}$) and another with a majority of the new configuration ($C_{new}$))
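In the figure, green marks the old cluster configuration ($C_{old}$) and blue marks the new one ($C_{new}$). If the new configuration is simply put into effect, the servers do not all switch at the same moment. During that window, the servers still on $C_{old}$ and the servers already on $C_{new}$ can each gather a majority of their own configuration, so the cluster can end up with two different leaders in the same term.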
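The three-to-five scenario from the figure can be replayed with a few lines of Go. This is a minimal sketch, not etcd code: the helper `canWin` and the server IDs are made up for illustration, and it only counts votes against whichever configuration the candidate currently believes in.

```go
package main

import "fmt"

// votesNeeded returns the size of a strict majority of n voters.
func votesNeeded(n int) int { return n/2 + 1 }

// canWin reports whether the given votes form a majority of the
// configuration that the candidate currently uses.
func canWin(votes []int, config []int) bool {
	member := make(map[int]bool, len(config))
	for _, id := range config {
		member[id] = true
	}
	count := 0
	for _, id := range votes {
		if member[id] {
			count++
		}
	}
	return count >= votesNeeded(len(config))
}

func main() {
	cOld := []int{1, 2, 3}
	cNew := []int{1, 2, 3, 4, 5}

	// Servers 1 and 2 still use cOld; servers 3, 4 and 5 already use cNew.
	// The two vote sets are disjoint, so no server votes twice in the term.
	votesForServer1 := []int{1, 2}    // counted against cOld
	votesForServer5 := []int{3, 4, 5} // counted against cNew

	fmt.Println(canWin(votesForServer1, cOld)) // true
	fmt.Println(canWin(votesForServer5, cNew)) // true
}
```

Both candidates win in the same term even though every server voted only once, which is exactly the unsafe situation the restriction on membership changes has to rule out.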
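In etcd implementations before 3.5, a cluster membership change could only change one node at a time; recent etcd versions can change the configuration of several nodes in one operation. In etcd, cluster membership changes are carried by two versions of log entry, ConfChange and ConfChangeV2, defined as follows: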
type ConfChange struct {
	Type    ConfChangeType `protobuf:"varint,2,opt,name=type,enum=raftpb.ConfChangeType" json:"type"`
	NodeID  uint64         `protobuf:"varint,3,opt,name=node_id,json=nodeId" json:"node_id"`
	Context []byte         `protobuf:"bytes,4,opt,name=context" json:"context,omitempty"`
	// NB: this is used only by etcd to thread through a unique identifier.
	// Ideally it should really use the Context instead. No counterpart to
	// this field exists in ConfChangeV2.
	ID uint64 `protobuf:"varint,1,opt,name=id" json:"id"`
}
type ConfChangeV2 struct {
	Transition ConfChangeTransition `protobuf:"varint,1,opt,name=transition,enum=raftpb.ConfChangeTransition" json:"transition"`
	Changes    []ConfChangeSingle   `protobuf:"bytes,2,rep,name=changes" json:"changes"`
	Context    []byte               `protobuf:"bytes,3,opt,name=context" json:"context,omitempty"`
}
The ConfChange struct carries a single-node change ("One Server Config Change"). Its ConfChangeType takes one of the values ConfChangeAddNode, ConfChangeRemoveNode, ConfChangeUpdateNode and ConfChangeAddLearnerNode, whose names describe what they do. A single-node membership change adds or removes exactly one node at a time, which keeps the cluster safe: two leaders cannot be elected in the same term. The guarantee holds because, when two configurations differ by a single node, any majority of the old configuration and any majority of the new configuration must share at least one node. The figure illustrates this: (The addition and removal of a single server from an even- and an odd-sized cluster. In each figure, the blue rectangle shows a majority of the old cluster, and the red rectangle shows a majority of the new cluster. In every single-server membership change, an overlap between any majority of the old cluster and any majority of the new cluster is preserved, as needed for safety. For example in (b), a majority of the old cluster must include two of the left three servers, and a majority of the new cluster must include three of the servers in the new cluster, of which at least two must come from the old cluster.)
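The overlap property can also be checked by brute force. The sketch below is an illustration only (the names `mask`, `isMajority` and `majoritiesAlwaysOverlap` are made up here, and server IDs are assumed to be below 64): it enumerates every majority of the old configuration and every majority of the new one and checks that each pair shares a server.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mask builds a bitmask with one bit per server ID (IDs assumed to be 0..63).
func mask(ids ...uint) uint64 {
	var m uint64
	for _, id := range ids {
		m |= uint64(1) << id
	}
	return m
}

// isMajority reports whether subset contains a strict majority of config.
func isMajority(subset, config uint64) bool {
	return bits.OnesCount64(subset&config) > bits.OnesCount64(config)/2
}

// majoritiesAlwaysOverlap checks every pair (majority of oldCfg, majority of
// newCfg) and reports whether each pair shares at least one server.
func majoritiesAlwaysOverlap(oldCfg, newCfg uint64) bool {
	// (a - 1) & oldCfg walks through all non-empty subsets of oldCfg.
	for a := oldCfg; a != 0; a = (a - 1) & oldCfg {
		if !isMajority(a, oldCfg) {
			continue
		}
		for b := newCfg; b != 0; b = (b - 1) & newCfg {
			if isMajority(b, newCfg) && a&b == 0 {
				return false // two disjoint majorities: two leaders possible
			}
		}
	}
	return true
}

func main() {
	fmt.Println(majoritiesAlwaysOverlap(mask(1, 2, 3), mask(1, 2, 3, 4)))    // true: add one
	fmt.Println(majoritiesAlwaysOverlap(mask(1, 2, 3, 4), mask(1, 2, 3)))    // true: remove one
	fmt.Println(majoritiesAlwaysOverlap(mask(1, 2, 3), mask(1, 2, 3, 4, 5))) // false: add two
}
```

The last case is the three-to-five change: servers {1, 2} are a majority of the old configuration and {3, 4, 5} are a majority of the new one, and the two sets are disjoint.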
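The figure covers every case of adding a node to, or removing a node from, an even-sized or odd-sized cluster. In each case any majority of the old configuration intersects any majority of the new one, and since each server casts at most one vote per term, two leaders cannot be elected. With that theory in place, here is the full flow of a single-node change. When a membership change is issued, the new configuration is submitted as a special kind of log entry, as follows:
1. Serialize the configuration change into log data, mark the entry type as a configuration change, and submit it to the leader.
2. On receiving the entry, the leader records its index as the pending (uncommitted) configuration-change index, then handles it like any other entry: it appends the entry to its local log and broadcasts it to the other nodes. Once a majority has acknowledged it, the entry is agreed upon and can be committed.
3. When such an entry is committed, the leader clears the pending configuration-change index.
4. After the commit, the old and new configurations are compared: nodes to be added join the cluster, and nodes to be removed shut down.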
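Note that only one configuration-change entry may be in flight at any time. This is enforced when the leader receives such an entry: it checks whether the pending configuration-change index is empty, and accepts the new entry only if it is.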
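The pending-index check can be sketched as follows. Everything here (`leader`, `propose`, `commit`, `errConfChangePending`) is a hypothetical, heavily simplified model written for this explanation, not etcd's implementation; in particular it returns an error for a second in-flight change, which is only one possible way to refuse it.

```go
package main

import (
	"errors"
	"fmt"
)

type entryKind int

const (
	entryNormal entryKind = iota
	entryConfChange
)

type entry struct {
	kind entryKind
	data string
}

var errConfChangePending = errors.New("a configuration change is already in flight")

// leader is a hypothetical model of the leader's log bookkeeping.
type leader struct {
	log []entry
	// pendingConfIndex is the 1-based log index of the uncommitted
	// configuration change; 0 means there is none.
	pendingConfIndex int
}

// propose appends e to the log and returns its index. A configuration change
// is refused while an earlier one is still uncommitted.
func (l *leader) propose(e entry) (int, error) {
	if e.kind == entryConfChange && l.pendingConfIndex != 0 {
		return 0, errConfChangePending
	}
	l.log = append(l.log, e)
	index := len(l.log)
	if e.kind == entryConfChange {
		l.pendingConfIndex = index
	}
	return index, nil
}

// commit records that all entries up to index are committed and clears the
// pending marker once the configuration change is covered.
func (l *leader) commit(index int) {
	if l.pendingConfIndex != 0 && index >= l.pendingConfIndex {
		l.pendingConfIndex = 0
	}
}

func main() {
	l := &leader{}
	first, _ := l.propose(entry{kind: entryConfChange, data: "add node 4"})
	_, err := l.propose(entry{kind: entryConfChange, data: "add node 5"})
	fmt.Println(err) // a configuration change is already in flight

	l.commit(first)
	_, err = l.propose(entry{kind: entryConfChange, data: "add node 5"})
	fmt.Println(err) // <nil>
}
```

Normal entries are never blocked by the marker; only a second configuration change has to wait until the first one commits.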
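The ConfChangeV2 struct is used for multi-node changes through joint consensus ("Joint Consensus"). ConfChangeTransition selects how the joint configuration is entered and left. ConfChangeTransitionAuto uses the simple protocol when possible and otherwise falls back to ConfChangeTransitionJointImplicit; most applications will want this option. ConfChangeTransitionJointImplicit uses joint consensus unconditionally and leaves the joint configuration automatically (by proposing a zero configuration change); it suits applications that want to minimize the time spent in the joint configuration and do not store the joint configuration in the state machine (outside of InitialState). ConfChangeTransitionJointExplicit uses joint consensus and stays in the joint configuration until the application proposes a no-op configuration change; it suits applications that want explicit control over the transitions, for example to carry a custom payload via the Context field. How these options play out in practice is explained in detail later.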
const (
	// Automatically use the simple protocol if possible, otherwise fall back
	// to ConfChangeJointImplicit. Most applications will want to use this.
	ConfChangeTransitionAuto ConfChangeTransition = 0
	// Use joint consensus unconditionally, and transition out of them
	// automatically (by proposing a zero configuration change).
	//
	// This option is suitable for applications that want to minimize the time
	// spent in the joint configuration and do not store the joint configuration
	// in the state machine (outside of InitialState).
	ConfChangeTransitionJointImplicit ConfChangeTransition = 1
	// Use joint consensus and remain in the joint configuration until the
	// application proposes a no-op configuration change. This is suitable for
	// applications that want to explicitly control the transitions, for example
	// to use a custom payload (via the Context field).
	ConfChangeTransitionJointExplicit ConfChangeTransition = 2
)
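Besides single-node changes, it is sometimes necessary to submit a change that affects several nodes at once. As described earlier, applying such a change directly can break the cluster's safety, that is, allow more than one leader at the same time. A multi-node change therefore has to go through joint consensus. Under joint consensus the old and new configurations are in force together: an entry can be committed only with a majority of the old configuration and, separately, a majority of the new configuration. Because neither configuration can decide on its own, two leaders cannot arise. The flow is as follows: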
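The leader receives a membership change request; call the new node set $C_{new}$ and the current node set $C_{old}$. It first creates a configuration-change entry for the joint configuration $C_{old,new}$, which combines both sets, and runs it through the normal log commit flow. This entry commits only once it has been replicated to a majority of $C_{old}$ and a majority of $C_{new}$. As soon as the $C_{old,new}$ entry is committed, the leader creates a second configuration-change entry that contains only $C_{new}$ and again runs it through the normal commit flow; this entry needs only a majority of $C_{new}$. Once the $C_{new}$ entry is committed, the cluster operates under the new configuration, and nodes that are not part of $C_{new}$ are removed.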
As shown, a joint-consensus change takes two commits: first the joint configuration $C_{old,new}$, then the new configuration $C_{new}$. The figure shows that safety holds in each of these phases: (Timeline for a configuration change using joint consensus. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the $C_{old,new}$ configuration entry in its log and commits it to $C_{old,new}$ (a majority of $C_{old}$ and a majority of $C_{new}$). Then it creates the $C_{new}$ entry and commits it to a majority of $C_{new}$. There is no point in time in which $C_{old}$ and $C_{new}$ can both make decisions independently.)
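The commit rule for the joint phase, "a majority of the old configuration and a majority of the new configuration", can be written down directly. This is a minimal sketch with made-up helper names (`voters`, `hasMajority`, `jointQuorum`), not the quorum code used by etcd.

```go
package main

import "fmt"

// voters builds a set of server IDs.
func voters(ids ...uint64) map[uint64]bool {
	s := make(map[uint64]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// hasMajority reports whether acks contains a strict majority of cfg.
func hasMajority(cfg, acks map[uint64]bool) bool {
	n := 0
	for id := range cfg {
		if acks[id] {
			n++
		}
	}
	return n > len(cfg)/2
}

// jointQuorum is the rule while the joint configuration is in effect:
// both configurations must agree independently.
func jointQuorum(cOld, cNew, acks map[uint64]bool) bool {
	return hasMajority(cOld, acks) && hasMajority(cNew, acks)
}

func main() {
	cOld := voters(1, 2, 3)
	cNew := voters(1, 2, 3, 4, 5)

	fmt.Println(jointQuorum(cOld, cNew, voters(3, 4, 5))) // false: only 1 of 3 in cOld
	fmt.Println(jointQuorum(cOld, cNew, voters(1, 2)))    // false: only 2 of 5 in cNew
	fmt.Println(jointQuorum(cOld, cNew, voters(1, 2, 4))) // true: 2 of 3 and 3 of 5
}
```

The first two acknowledgement sets are the ones that produced two leaders when the switch was made directly; under the joint rule neither of them is enough on its own.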
The etcd community wanted ConfChange and ConfChangeV2 to be handled in a uniform way, so a Go file under raft\raftpb defines a common interface for the two structs.
// ConfChangeI abstracts over ConfChangeV2 and (legacy) ConfChange to allow treating them in a unified manner.
type ConfChangeI interface {
	AsV2() ConfChangeV2
	AsV1() (ConfChange, bool)
}
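Note first that the Changes field of ConfChangeV2, a slice of ConfChangeSingle, holds the individual single-node operations (ConfChangeAddNode, ConfChangeRemoveNode, ConfChangeUpdateNode and ConfChangeAddLearnerNode). A ConfChangeV2 can therefore carry out several single-node operations together, and a ConfChange can be wrapped as a ConfChangeV2 with a single element. The methods of ConfChangeI reflect this.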
// ConfChangeSingle is an individual configuration change operation. Multiple such operations can be carried out atomically via a ConfChangeV2.
type ConfChangeSingle struct {
	Type   ConfChangeType `protobuf:"varint,1,opt,name=type,enum=raftpb.ConfChangeType" json:"type"`
	NodeID uint64         `protobuf:"varint,2,opt,name=node_id,json=nodeId" json:"node_id"`
}
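ConfChange.AsV2 converts a ConfChange into a ConfChangeV2 by placing its Type and NodeID in a one-element Changes slice and copying Context; the ID field has no counterpart in ConfChangeV2 and is dropped. ConfChange.AsV1 returns the ConfChange unchanged, together with true.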
// AsV2 returns a V2 configuration change carrying out the same operation.
func (c ConfChange) AsV2() ConfChangeV2 {
	return ConfChangeV2{
		Changes: []ConfChangeSingle{{
			Type:   c.Type,
			NodeID: c.NodeID,
		}},
		Context: c.Context,
	}
}
// AsV1 returns the ConfChange and true.
func (c ConfChange) AsV1() (ConfChange, bool) { return c, true }
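ConfChangeV2.AsV2 returns the ConfChangeV2 unchanged. ConfChangeV2.AsV1 cannot express a V2 change as a V1 change, so it returns an empty ConfChange and false.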
// AsV2 is the identity.
func (c ConfChangeV2) AsV2() ConfChangeV2 { return c }
// AsV1 returns ConfChange{} and false.
func (c ConfChangeV2) AsV1() (ConfChange, bool) { return ConfChange{}, false }
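To see how a caller benefits from the interface, the sketch below re-declares trimmed copies of the types (protobuf tags and the Transition field are left out so the example compiles on its own) and adds a hypothetical helper, `describe`, that accepts either version.

```go
package main

import "fmt"

type ConfChangeType int32

const (
	ConfChangeAddNode ConfChangeType = iota
	ConfChangeRemoveNode
	ConfChangeUpdateNode
	ConfChangeAddLearnerNode
)

// Trimmed copies of the raftpb types, for illustration only.
type ConfChangeSingle struct {
	Type   ConfChangeType
	NodeID uint64
}

type ConfChange struct {
	Type    ConfChangeType
	NodeID  uint64
	Context []byte
	ID      uint64
}

type ConfChangeV2 struct {
	Changes []ConfChangeSingle
	Context []byte
}

type ConfChangeI interface {
	AsV2() ConfChangeV2
	AsV1() (ConfChange, bool)
}

func (c ConfChange) AsV2() ConfChangeV2 {
	return ConfChangeV2{
		Changes: []ConfChangeSingle{{Type: c.Type, NodeID: c.NodeID}},
		Context: c.Context,
	}
}
func (c ConfChange) AsV1() (ConfChange, bool)   { return c, true }
func (c ConfChangeV2) AsV2() ConfChangeV2       { return c }
func (c ConfChangeV2) AsV1() (ConfChange, bool) { return ConfChange{}, false }

// describe is a hypothetical helper: it works on the V2 view of any change
// and only uses AsV1 to find out which version it was given.
func describe(c ConfChangeI) string {
	_, isV1 := c.AsV1()
	return fmt.Sprintf("v1=%t changes=%d", isV1, len(c.AsV2().Changes))
}

func main() {
	fmt.Println(describe(ConfChange{Type: ConfChangeAddNode, NodeID: 4})) // v1=true changes=1
	fmt.Println(describe(ConfChangeV2{Changes: []ConfChangeSingle{
		{Type: ConfChangeAddNode, NodeID: 4},
		{Type: ConfChangeRemoveNode, NodeID: 1},
	}})) // v1=false changes=2
}
```

Code that only needs to apply a change can always call AsV2 and ignore which version was proposed.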
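MarshalConfChange takes a ConfChangeI and first tries AsV1. If that succeeds (the argument is a ConfChange), it marshals the ConfChange and uses the entry type EntryConfChange; otherwise (the argument is a ConfChangeV2) it marshals the result of AsV2 and uses EntryConfChangeV2. It returns the entry type, the serialized bytes and any error.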
// MarshalConfChange calls Marshal on the underlying ConfChange or ConfChangeV2
// and returns the result along with the corresponding EntryType.
func MarshalConfChange(c ConfChangeI) (EntryType, []byte, error) {
	var typ EntryType
	var ccdata []byte
	var err error
	if ccv1, ok := c.AsV1(); ok {
		typ = EntryConfChange
		ccdata, err = ccv1.Marshal()
	} else {
		ccv2 := c.AsV2()
		typ = EntryConfChangeV2
		ccdata, err = ccv2.Marshal()
	}
	return typ, ccdata, err
}
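The entry type is returned alongside the bytes because the reader of the log needs it to pick the right struct when decoding. The round trip below illustrates that idea with stand-ins: the structs are reduced to node IDs, `encoding/json` replaces the protobuf Marshal and Unmarshal methods, and `marshalConfChange` and `decodeConfChange` are hypothetical names, so this is a sketch of the pattern rather than the raftpb API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfChange
	EntryConfChangeV2
)

// Reduced stand-ins for the raftpb structs.
type ConfChange struct{ NodeID uint64 }
type ConfChangeV2 struct{ NodeIDs []uint64 }

type ConfChangeI interface {
	AsV2() ConfChangeV2
	AsV1() (ConfChange, bool)
}

func (c ConfChange) AsV2() ConfChangeV2         { return ConfChangeV2{NodeIDs: []uint64{c.NodeID}} }
func (c ConfChange) AsV1() (ConfChange, bool)   { return c, true }
func (c ConfChangeV2) AsV2() ConfChangeV2       { return c }
func (c ConfChangeV2) AsV1() (ConfChange, bool) { return ConfChange{}, false }

// marshalConfChange mirrors the V1-first logic, with JSON as the encoding.
func marshalConfChange(c ConfChangeI) (EntryType, []byte, error) {
	if v1, ok := c.AsV1(); ok {
		data, err := json.Marshal(v1)
		return EntryConfChange, data, err
	}
	data, err := json.Marshal(c.AsV2())
	return EntryConfChangeV2, data, err
}

// decodeConfChange uses the entry type to choose the struct to decode into
// and hands back the unified V2 view.
func decodeConfChange(typ EntryType, data []byte) (ConfChangeV2, error) {
	switch typ {
	case EntryConfChange:
		var v1 ConfChange
		if err := json.Unmarshal(data, &v1); err != nil {
			return ConfChangeV2{}, err
		}
		return v1.AsV2(), nil
	case EntryConfChangeV2:
		var v2 ConfChangeV2
		err := json.Unmarshal(data, &v2)
		return v2, err
	default:
		return ConfChangeV2{}, fmt.Errorf("entry type %d is not a configuration change", typ)
	}
}

func main() {
	typ, data, _ := marshalConfChange(ConfChange{NodeID: 4})
	cc, _ := decodeConfChange(typ, data)
	fmt.Println(typ, cc.NodeIDs) // 1 [4]
}
```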
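ConfChangeV2 provides two methods, EnterJoint (which decides whether joint consensus is needed) and LeaveJoint. EnterJoint returns two booleans. The second is true if and only if this configuration change will use joint consensus, which is the case when it contains more than one change or when joint consensus was requested explicitly. The first can be true only when the second is, and indicates whether the joint state will be left automatically. LeaveJoint returns true if the configuration change leaves a joint configuration, which is the case when the ConfChangeV2 is the zero value, with the possible exception of the Context field.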
// EnterJoint returns two bools. The second bool is true if and only if this
// config change will use Joint Consensus, which is the case if it contains more
// than one change or if the use of Joint Consensus was requested explicitly.
// The first bool can only be true if second one is, and indicates whether the
// Joint State will be left automatically.
func (c ConfChangeV2) EnterJoint() (autoLeave bool, ok bool) {
	// NB: in theory, more config changes could qualify for the "simple"
	// protocol but it depends on the config on top of which the changes apply.
	// For example, adding two learners is not OK if both nodes are part of the
	// base config (i.e. two voters are turned into learners in the process of
	// applying the conf change). In practice, these distinctions should not
	// matter, so we keep it simple and use Joint Consensus liberally.
	if c.Transition != ConfChangeTransitionAuto || len(c.Changes) > 1 {
		// Use Joint Consensus.
		var autoLeave bool
		switch c.Transition {
		case ConfChangeTransitionAuto:
			autoLeave = true
		case ConfChangeTransitionJointImplicit:
			autoLeave = true
		case ConfChangeTransitionJointExplicit:
		default:
			panic(fmt.Sprintf("unknown transition: %+v", c))
		}
		return autoLeave, true
	}
	return false, false
}
// LeaveJoint is true if the configuration change leaves a joint configuration.
// This is the case if the ConfChangeV2 is zero, with the possible exception of
// the Context field.
func (c ConfChangeV2) LeaveJoint() bool {
	// NB: c is already a copy.
	c.Context = nil
	return proto.Equal(&c, &ConfChangeV2{})
}
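The combinations are easier to see side by side. The sketch below re-implements the two methods on a trimmed stand-in struct so that it runs without the raftpb package: Changes holds plain node IDs instead of ConfChangeSingle values, and LeaveJoint compares fields directly instead of calling proto.Equal. Both simplifications are assumptions made for this example.

```go
package main

import "fmt"

type ConfChangeTransition int32

const (
	ConfChangeTransitionAuto          ConfChangeTransition = 0
	ConfChangeTransitionJointImplicit ConfChangeTransition = 1
	ConfChangeTransitionJointExplicit ConfChangeTransition = 2
)

// ConfChangeV2 is a trimmed stand-in for the raftpb struct.
type ConfChangeV2 struct {
	Transition ConfChangeTransition
	Changes    []uint64 // node IDs, standing in for []ConfChangeSingle
	Context    []byte
}

// EnterJoint follows the same decision table as the method shown above.
func (c ConfChangeV2) EnterJoint() (autoLeave bool, ok bool) {
	if c.Transition == ConfChangeTransitionAuto && len(c.Changes) <= 1 {
		return false, false // simple protocol, no joint configuration
	}
	switch c.Transition {
	case ConfChangeTransitionAuto, ConfChangeTransitionJointImplicit:
		return true, true
	case ConfChangeTransitionJointExplicit:
		return false, true
	default:
		panic(fmt.Sprintf("unknown transition: %+v", c))
	}
}

// LeaveJoint is true for the zero change; Context is deliberately ignored.
func (c ConfChangeV2) LeaveJoint() bool {
	return c.Transition == ConfChangeTransitionAuto && len(c.Changes) == 0
}

func main() {
	cases := []struct {
		name string
		cc   ConfChangeV2
	}{
		{"auto, one change", ConfChangeV2{Changes: []uint64{4}}},
		{"auto, two changes", ConfChangeV2{Changes: []uint64{4, 5}}},
		{"implicit, one change", ConfChangeV2{Transition: ConfChangeTransitionJointImplicit, Changes: []uint64{4}}},
		{"explicit, one change", ConfChangeV2{Transition: ConfChangeTransitionJointExplicit, Changes: []uint64{4}}},
		{"zero change with context", ConfChangeV2{Context: []byte("payload")}},
	}
	for _, tc := range cases {
		autoLeave, joint := tc.cc.EnterJoint()
		fmt.Println(tc.name, "| joint:", joint, "| autoLeave:", autoLeave, "| leaveJoint:", tc.cc.LeaveJoint())
	}
}
```

Only the explicit transition enters the joint configuration without leaving it automatically; in that case the application later proposes the zero change, optionally with a Context, to leave.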