Towards Real-Time Multi-Object Tracking (JDE)


    In this post we continue our study of common models and algorithms in multi-object tracking (MOT). This time the protagonist is JDE. JDE is not the name of a single model but the collective name for a class of tracking algorithms, short for

    Jointly learns the Detector and Embedding model (JDE)

    What does that mean? The multi-object tracking models we discussed previously, such as SORT and DeepSORT, follow the MOT paradigm that dominated from roughly 2015 to 2018: tracking-by-detection.

    Because this paradigm is intuitive and delivers solid tracking accuracy, it was the mainstream approach during that period. It first runs a detector to obtain the bounding boxes of the objects in the frame, and then matches the same object across consecutive frames using the motion pattern of the boxes (motion features) and the appearance of the objects inside them (usually a low-dimensional vector, called an embedding, extracted by a ReID network), thereby achieving multi-object tracking.

    This paradigm splits MOT into two steps:

    • object detection
    • feature extraction and data association

    Because detection and feature extraction are carried out separately, this family of methods is also known as SDE

    Separate Detection and Embedding (SDE)

    The biggest drawback of SDE is speed: since object detection and (appearance) feature extraction run as two separate stages, the overall runtime inevitably suffers.
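
    The contrast can be sketched in a few lines of pseudo-Python. Everything below is a hypothetical stand-in (none of these callables come from the paper's code); the only point is the number of forward passes per frame.

    def sde_track_frame(frame, detector, reid_net, crop):
        # SDE: one detector forward pass, then one ReID forward pass per detected box
        boxes = detector(frame)
        embeddings = [reid_net(crop(frame, box)) for box in boxes]
        return boxes, embeddings

    def jde_track_frame(frame, jde_net):
        # JDE: a single forward pass yields the boxes and their embeddings together
        return jde_net(frame)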

    To address this shortcoming, the authors proposed the JDE paradigm. Paper and code:

    Paper: Towards Real-Time Multi-Object Tracking

    Code: Zhongdao/Towards-Realtime-MOT

    1 Background of JDE

    [Figure: comparison of the SDE, two-stage, and JDE paradigms from the paper]

    The authors compare the earlier approaches:

    1) SDE (Separate Detection and Embedding) keeps the two stages completely apart: the detector and the ReID module are independent, first detect, then re-identify (conceptually similar to the two-step design of Fast R-CNN). The two parts do not interfere with each other, which gives high accuracy but long runtimes.
    2) The two-stage category is also trained jointly, but instead of running ReID on the final detections it computes the embedding from the region proposals produced inside a two-stage detector, so part of the feature computation is shared and the cost drops. However, two-stage detectors are not fast to begin with, so the overall speed still falls short of real time.

    2 JDE's network structure and loss functions

    Since the authors learn the object embedding on top of a one-stage detector (the code uses the classic YOLOv3 model), the JDE paradigm should simply add an extra branch to the detector's output (head) to learn the object embedding.

    Sure enough, that is exactly what the authors do. The structure diagram given in the original paper is shown below.

    [Figure: JDE architecture and prediction head from the original paper]

    This diagram outlines the overall idea: the prediction head gets an extra branch that outputs the embedding, and the losses are then combined with a **multi-task learning** scheme. At first glance it looks very simple, but on closer thought the problem is not quite that simple.
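
    Before looking at the real code, here is a minimal sketch of my own (not the repo's implementation) of what "one extra branch on the head" amounts to: on top of the shared feature map, one convolution produces the usual YOLO detection channels and another produces a dense embedding map, concatenated along the channel dimension, which is exactly the layout that YOLOLayer later splits apart. The input channel count is a placeholder.

    import torch
    import torch.nn as nn

    class JDEHeadSketch(nn.Module):
        def __init__(self, in_ch=128, nA=4, nC=1, emb_dim=512):   # in_ch is a made-up placeholder
            super().__init__()
            # nA*(nC+5) = 4*6 = 24 channels: per anchor, 4 box deltas + 2 fg/bg logits (with nC=1)
            self.det_branch = nn.Conv2d(in_ch, nA * (nC + 5), kernel_size=3, padding=1)
            # one emb_dim-dimensional embedding per feature-map location
            self.emb_branch = nn.Conv2d(in_ch, emb_dim, kernel_size=3, padding=1)

        def forward(self, feat):
            # concatenated along channels: the first 24 channels are detection outputs,
            # the remaining 512 are the embedding map (the p_cat that YOLOLayer receives)
            return torch.cat([self.det_branch(feat), self.emb_branch(feat)], dim=1)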

    Based on the source code, I redrew the diagram myself:

    [Figure: my redrawn diagram of the JDE prediction head, based on the source code]

    The first question is: how is the embedding actually learned?

    Ideally, the same object is locked onto by the same tracking label across frames (i.e., it keeps the same track ID). The only supervision we have is these label indices (the same object keeps the same track ID, different objects have different track IDs).

    So during training the network has to turn the embedding into sufficiently strong semantic information, i.e., the embedding should make it easy to tell which track ID a detected object belongs to. This calls for the standard classification recipe (treat every track ID as a class), so the authors add a fully connected layer that maps the embedding to track-ID classification logits.

    The second question is: how many output nodes should this fully connected layer have?

    As just mentioned, every track ID is treated as a class, but the number of track IDs is huge, in principle unbounded. So how is the number of output nodes chosen? Going through the code, it uses 14455 output nodes, which is the total number of track IDs in the training set.

    So the authors' structure diagram actually omits this part; let us fill it in by hand.

    Note: at test time there is no mapping from the embedding to the 14455 track-ID logits at all; that classification layer only plays a role during training.
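
    As a minimal sketch (my own, matching the values used in the code but not copied from it), the missing piece is just a fully connected layer from the 512-dimensional embedding to one logit per training-set track ID, trained with cross-entropy and discarded at inference:

    import torch.nn as nn

    emb_dim, num_track_ids = 512, 14455                 # values used in the JDE code
    id_classifier = nn.Linear(emb_dim, num_track_ids)   # the FC layer omitted from the diagram
    id_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)   # same loss type as self.IDLoss below

    # training:   lid = id_loss_fn(id_classifier(embedding), track_id_targets)
    # inference:  the classifier is discarded; only the raw embedding is kept and
    #             compared across frames (cosine similarity) during association.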

    With that, there are no big questions left. The YOLOv3 architecture will be familiar to anyone who knows object detection, so we skip the FPN/backbone definition and go straight to the prediction-head code to confirm that the analysis above holds.

    The prediction head is defined in the YOLOLayer class in model.py:

    class YOLOLayer(nn.Module):
        def __init__(self, anchors, nC, nID, nE, img_size, yolo_layer):
            super(YOLOLayer, self).__init__()
            self.layer = yolo_layer
            nA = len(anchors)
            self.anchors = torch.FloatTensor(anchors)
            self.nA = nA  # number of anchors (4)
            print('nA',nA)
            self.nC = nC  # number of classes (1)
            print('nC', nC)
            self.nID = nID # number of identities, 14455
            print('nID', nID)
            self.img_size = 0
            self.emb_dim = nE # 512
            print('nE', nE)
            self.shift = [1, 3, 5]
    
            self.SmoothL1Loss  = nn.SmoothL1Loss() #  for bounding box regression
            self.SoftmaxLoss = nn.CrossEntropyLoss(ignore_index=-1) # foreground and background classification
            self.CrossEntropyLoss = nn.CrossEntropyLoss()
            self.IDLoss = nn.CrossEntropyLoss(ignore_index=-1) # loss of embedding
            self.s_c = nn.Parameter(-4.15*torch.ones(1))  # -4.15
            self.s_r = nn.Parameter(-4.85*torch.ones(1))  # -4.85
            self.s_id = nn.Parameter(-2.3*torch.ones(1))  # -2.3
            
            self.emb_scale = math.sqrt(2) * math.log(self.nID-1) if self.nID>1 else 1
    

    The constructor defines a few parameters and the loss functions; I have annotated the key values in the comments to make them easier to follow. Four loss objects are defined, but only three are actually used:

    • self.SmoothL1Loss for bounding-box regression
    • self.SoftmaxLoss for foreground/background classification
    • self.IDLoss for the embedding (track-ID) loss

    As the paper explains, JDE is trained as a multi-task problem, so the three losses are summed using the task-independent uncertainty scheme, whose weights are learned automatically. The corresponding learnable parameters are defined in the code above:

    • self.s_c
    • self.s_r
    • self.s_id

    This is also reflected in the figure above. The core code of this part is as follows:

        def forward(self, p_cat,  img_size, targets=None, classifier=None, test_emb=False):
            p, p_emb = p_cat[:, :24, ...], p_cat[:, 24:, ...]  # first 24 channels = nA*(nC+5) = 4*(1+5) detection outputs; the rest is the 512-dim embedding map
            nB, nGh, nGw = p.shape[0], p.shape[-2], p.shape[-1]
    
            if self.img_size != img_size:
                create_grids(self, img_size, nGh, nGw)
    
                if p.is_cuda:
                    self.grid_xy = self.grid_xy.cuda()
                    self.anchor_wh = self.anchor_wh.cuda()
    
            p = p.view(nB, self.nA, self.nC + 5, nGh, nGw).permute(0, 1, 3, 4, 2).contiguous()  # prediction
            
            p_emb = p_emb.permute(0,2,3,1).contiguous()
            p_box = p[..., :4]
            p_conf = p[..., 4:6].permute(0, 4, 1, 2, 3)  # Conf
    
            # Training
            if targets is not None:
                if test_emb:
                    tconf, tbox, tids = build_targets_max(targets, self.anchor_vec.cuda(), self.nA, self.nC, nGh, nGw)
                else:
                    tconf, tbox, tids = build_targets_thres(targets, self.anchor_vec.cuda(), self.nA, self.nC, nGh, nGw)
                tconf, tbox, tids = tconf.cuda(), tbox.cuda(), tids.cuda()
                mask = tconf > 0
    
                # Compute losses
                nT = sum([len(x) for x in targets])  # number of targets
                nM = mask.sum().float()  # number of anchors (assigned to targets)
                nP = torch.ones_like(mask).sum().float()
                if nM > 0:
                    lbox = self.SmoothL1Loss(p_box[mask], tbox[mask])
                else:
                    FT = torch.cuda.FloatTensor if p_conf.is_cuda else torch.FloatTensor
                    lbox, lconf =  FT([0]), FT([0])
                lconf =  self.SoftmaxLoss(p_conf, tconf)
                lid = torch.Tensor(1).fill_(0).squeeze().cuda()
                emb_mask,_ = mask.max(1)
                
                # For convenience we use max(1) to decide the id, TODO: more reasonable strategy
                tids,_ = tids.max(1) 
                tids = tids[emb_mask]
                embedding = p_emb[emb_mask].contiguous()
                embedding = self.emb_scale * F.normalize(embedding)
                nI = emb_mask.sum().float()
                
                if  test_emb:
                    if np.prod(embedding.shape)==0  or np.prod(tids.shape) == 0:
                        return torch.zeros(0, self.emb_dim+1).cuda()
                    emb_and_gt = torch.cat([embedding, tids.float()], dim=1)
                    return emb_and_gt
                
                if len(embedding) > 1:
                    logits = classifier(embedding).contiguous()
                    lid =  self.IDLoss(logits, tids.squeeze())
    
                # Sum loss components
                loss = torch.exp(-self.s_r)*lbox + torch.exp(-self.s_c)*lconf + torch.exp(-self.s_id)*lid + \
                       (self.s_r + self.s_c + self.s_id)
                loss *= 0.5
    
                return loss, loss.item(), lbox.item(), lconf.item(), lid.item(), nT
    
            else:
                p_conf = torch.softmax(p_conf, dim=1)[:,1,...].unsqueeze(-1)
                p_emb = F.normalize(p_emb.unsqueeze(1).repeat(1,self.nA,1,1,1).contiguous(), dim=-1)
                #p_emb_up = F.normalize(shift_tensor_vertically(p_emb, -self.shift[self.layer]), dim=-1)
                #p_emb_down = F.normalize(shift_tensor_vertically(p_emb, self.shift[self.layer]), dim=-1)
                p_cls = torch.zeros(nB,self.nA,nGh,nGw,1).cuda()               # Temp
                p = torch.cat([p_box, p_conf, p_cls, p_emb], dim=-1)
                #p = torch.cat([p_box, p_conf, p_cls, p_emb, p_emb_up, p_emb_down], dim=-1)
                p[..., :4] = decode_delta_map(p[..., :4], self.anchor_vec.to(p))
                p[..., :4] *= self.stride
    
                return p.view(nB, -1, p.shape[-1])
    

    It is a bit lengthy, so let's go through it piece by piece.

    (1) Splitting the forward prediction

    First, the output feature map is split into three parts:

    • p_emb, containing the embedding information
    • p_box, containing the bounding-box regression values
    • p_conf, containing the foreground/background confidence
            p_emb = p_emb.permute(0,2,3,1).contiguous()
            p_box = p[..., :4]
            p_conf = p[..., 4:6].permute(0, 4, 1, 2, 3)  # Conf
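
    To make the split concrete, here is a small self-contained shape walkthrough (the grid sizes nGh/nGw are made up; the other values come from __init__ above):

    import torch

    nB, nA, nC, emb_dim, nGh, nGw = 2, 4, 1, 512, 19, 34
    p_cat = torch.randn(nB, nA * (nC + 5) + emb_dim, nGh, nGw)      # what the head outputs
    p, p_emb = p_cat[:, :24, ...], p_cat[:, 24:, ...]
    p = p.view(nB, nA, nC + 5, nGh, nGw).permute(0, 1, 3, 4, 2).contiguous()
    p_emb = p_emb.permute(0, 2, 3, 1).contiguous()
    print(p[..., :4].shape)                            # p_box : (nB, nA, nGh, nGw, 4)
    print(p[..., 4:6].permute(0, 4, 1, 2, 3).shape)    # p_conf: (nB, 2, nA, nGh, nGw)
    print(p_emb.shape)                                 # p_emb : (nB, nGh, nGw, emb_dim)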
    

    This corresponds to the following part of the diagram above:

    [Figure: the corresponding split of the prediction-head output in the diagram]

    (2) Splitting the supervision

    Next, the matching supervision targets are built from the ground-truth labels:

            # Training
            if targets is not None:
                if test_emb:
                    tconf, tbox, tids = build_targets_max(targets, self.anchor_vec.cuda(), self.nA, self.nC, nGh, nGw)
                else:
                    tconf, tbox, tids = build_targets_thres(targets, self.anchor_vec.cuda(), self.nA, self.nC, nGh, nGw)
                tconf, tbox, tids = tconf.cuda(), tbox.cuda(), tids.cuda()
                mask = tconf > 0
    

    As we know, to compute the losses efficiently it is best to lay the supervision out in the same format as the network output. Here tconf = 1 marks the anchors that are closest to a ground-truth object. Note that at test time, NMS (non-maximum suppression) still has to be applied to the predicted boxes to remove duplicates.
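
    As a small standalone illustration of that NMS step (using torchvision's built-in operator rather than the repo's own non_max_suppression helper):

    import torch
    from torchvision.ops import nms

    boxes = torch.tensor([[10., 10., 50., 80.],
                          [12., 11., 51., 82.],      # near-duplicate of the first box
                          [200., 40., 260., 120.]])  # (x1, y1, x2, y2)
    scores = torch.tensor([0.9, 0.8, 0.7])
    keep = nms(boxes, scores, iou_threshold=0.4)      # indices of the boxes to keep
    print(keep)                                       # tensor([0, 2]): the duplicate is dropped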

    (3) Box-regression loss and foreground/background classification loss

    Next, the box-regression loss and the foreground/background classification loss are computed.

                # Compute losses
                nT = sum([len(x) for x in targets])  # number of targets
                nM = mask.sum().float()  # number of anchors (assigned to targets)
                nP = torch.ones_like(mask).sum().float()
                if nM > 0:
                    lbox = self.SmoothL1Loss(p_box[mask], tbox[mask])
                else:
                    FT = torch.cuda.FloatTensor if p_conf.is_cuda else torch.FloatTensor
                    lbox, lconf =  FT([0]), FT([0])
                lconf =  self.SoftmaxLoss(p_conf, tconf)
                lid = torch.Tensor(1).fill_(0).squeeze().cuda()
                emb_mask,_ = mask.max(1)
    

    The key detail here is how the mask is set: the anchors assigned to targets are marked in the mask, so only the values at the positions of the target-assigned anchors are used when computing the loss.
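
    A tiny standalone example (with made-up shapes) of this boolean masking:

    import torch
    import torch.nn as nn

    p_box = torch.randn(2, 4, 8, 8, 4)     # (batch, anchors, grid_h, grid_w, 4 box values)
    tbox  = torch.randn(2, 4, 8, 8, 4)
    tconf = torch.zeros(2, 4, 8, 8, dtype=torch.long)
    tconf[0, 1, 3, 3] = 1                  # pretend exactly one anchor was assigned a target
    mask = tconf > 0
    lbox = nn.SmoothL1Loss()(p_box[mask], tbox[mask])   # loss over the assigned anchors only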

    (4) Computing the embedding loss

    Next comes the embedding loss. Note that this is where the previously mentioned fully connected layer is used to obtain the embedding's high-level semantics (the track ID), after which the cross-entropy loss commonly used for classification is applied.

    The code:

                # For convenience we use max(1) to decide the id, TODO: more reasonable strategy
                tids,_ = tids.max(1) 
                tids = tids[emb_mask]
                embedding = p_emb[emb_mask].contiguous()
                embedding = self.emb_scale * F.normalize(embedding)
                nI = emb_mask.sum().float()
                
                if  test_emb:
                    if np.prod(embedding.shape)==0  or np.prod(tids.shape) == 0:
                        return torch.zeros(0, self.emb_dim+1).cuda()
                    emb_and_gt = torch.cat([embedding, tids.float()], dim=1)
                    return emb_and_gt
                
                if len(embedding) > 1:
                    logits = classifier(embedding).contiguous()
                    lid =  self.IDLoss(logits, tids.squeeze())
    

    (5) The total loss

    Finally, following the formula

    $$\mathcal{L}_{\text{total}}=\sum_{i}^{M}\sum_{j}\frac{1}{2}\left(\frac{1}{e^{s_{j}^{i}}}\mathcal{L}_{j}^{i}+s_{j}^{i}\right)$$

    the three losses above are summed, where the $s_{j}^{i}$ are the learnable per-task parameters (self.s_r, self.s_c, self.s_id in the code). The implementation:

                loss = torch.exp(-self.s_r)*lbox + torch.exp(-self.s_c)*lconf + torch.exp(-self.s_id)*lid + \
                       (self.s_r + self.s_c + self.s_id)
                loss *= 0.5
    

    That concludes the walkthrough of the prediction head, and with it the main body of JDE. For the remaining details, have a look at the original paper and the code and explore further.

    3 Matching

    1. How to match

    Matching is performed based on the embeddings output by the prediction head.

    2. Variables

    Tracks: a recorded object instance (e.g., the same person across frames).

    Detections: foreground objects (e.g., people) detected in the current frame, each with a bounding box and an embedding.

    Tracks in different states are kept in four separate lists:

    1. Activated: the object recorded by a track appears in the current frame, so the track is in the Activated state.

    2. Refound: a previously Lost track whose person appears again in the current frame (refind_stracks in the code).

    3. Lost: tracks in the Lost state that have not yet been removed.

    4. Removed: tracks that have been discarded.

    3. Matching rules

    When the next frame arrives, how do we find the detection that matches each Activated track?

    1. For the Activated tracks, predict the object's position in the current frame with a Kalman filter.

    2. Compute the appearance affinity matrix AE between the Activated tracks and the detections via cosine similarity, and the motion affinity matrix AM via the Mahalanobis distance. Fuse AE and AM into the final cost matrix and run the Hungarian algorithm on it to obtain the best track-to-detection assignment (a minimal sketch of this step follows right after this list).

    3. Matched tracks that were already Activated stay Activated; unmatched new detections each start a new Activated track; matched tracks that were in the Lost state move to the Refound state.

    4. Tracks that failed to match are matched again using an IoU distance metric.

    5. In this IoU matching round, successfully matched tracks become Activated and the rest become Lost.
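
    Here is a minimal, self-contained sketch (not the repo's matching module) of step 2: build a cosine-distance cost matrix between track embeddings and detection embeddings and solve it with the Hungarian algorithm. In JDE the cost is additionally fused with a Kalman/Mahalanobis motion term (matching.fuse_motion), which is omitted here for brevity.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_embs, det_embs, thresh=0.7):
        # cosine distance = 1 - cosine similarity; rows = tracks, cols = detections
        t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
        d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
        cost = 1.0 - t @ d.T
        rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
        matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= thresh]
        matched_t = {r for r, _ in matches}
        matched_d = {c for _, c in matches}
        u_track = [r for r in range(len(t)) if r not in matched_t]
        u_detection = [c for c in range(len(d)) if c not in matched_d]
        return matches, u_track, u_detection

    matches, u_track, u_detection = associate(np.random.rand(3, 512), np.random.rand(4, 512))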

    The detailed matching code:

    class JDETracker(object):
        def __init__(self, opt, frame_rate=30):
            self.opt = opt
            self.model = Darknet(opt.cfg, nID=14455)
            # load_darknet_weights(self.model, opt.weights)
            self.model.load_state_dict(torch.load(opt.weights, map_location='cpu')['model'], strict=False)
            self.model.cuda().eval()
    
            self.tracked_stracks = []  # type: list[STrack]
            self.lost_stracks = []  # type: list[STrack]
            self.removed_stracks = []  # type: list[STrack]
    
            self.frame_id = 0
            self.det_thresh = opt.conf_thres
            self.buffer_size = int(frame_rate / 30.0 * opt.track_buffer)
            self.max_time_lost = self.buffer_size
    
            self.kalman_filter = KalmanFilter()
    
        def update(self, im_blob, img0):
            """
            Processes the image frame and finds bounding box(detections).
    
            Associates the detection with corresponding tracklets and also handles lost, removed, refound and active tracklets
    
            Parameters
            ----------
            im_blob : torch.float32
                      Tensor of shape depending upon the size of image. By default, shape of this tensor is [1, 3, 608, 1088]
    
            img0 : ndarray
                   ndarray of shape depending on the input image sequence. By default, shape is [608, 1080, 3]
    
            Returns
            -------
            output_stracks : list of Strack(instances)
                             The list contains information regarding the online_tracklets for the recieved image tensor.
    
            """
            # Lists that collect tracks in their different states for the current frame
            self.frame_id += 1
            activated_starcks = []      # for storing active tracks, for the current frame
            refind_stracks = []         # Lost Tracks whose detections are obtained in the current frame
            lost_stracks = []           # The tracks which are not obtained in the current frame but are not removed.(Lost for some time lesser than the threshold for removing)
            removed_stracks = []
    
            t1 = time.time()
            ''' Step 1: Network forward, get detections & embeddings'''
            with torch.no_grad():
                pred = self.model(im_blob)
            # pred is tensor of all the proposals (default number of proposals: 54264). Proposals have information associated with the bounding box and embeddings
            pred = pred[pred[:, :, 4] > self.opt.conf_thres]
            # pred now has lesser number of proposals. Proposals rejected on basis of object confidence score
            if len(pred) > 0:
                dets = non_max_suppression(pred.unsqueeze(0), self.opt.conf_thres, self.opt.nms_thres)[0].cpu()
                # Final proposals are obtained in dets. Information of bounding box and embeddings also included
                # Next step changes the detection scales
                scale_coords(self.opt.img_size, dets[:, :4], img0.shape).round()
                '''Detections is list of (x1, y1, x2, y2, object_conf, class_score, class_pred)'''
                # class_pred is the embeddings.
    
                detections = [STrack(STrack.tlbr_to_tlwh(tlbrs[:4]), tlbrs[4], f.numpy(), 30) for
                              (tlbrs, f) in zip(dets[:, :5], dets[:, 6:])]
            else:
                detections = []
    
            t2 = time.time()
            # print('Forward: {} s'.format(t2-t1))
    
            ''' Add newly detected tracklets to tracked_stracks'''
            unconfirmed = []
            tracked_stracks = []  # type: list[STrack]
            for track in self.tracked_stracks:
                if not track.is_activated:
                    # previous tracks which are not active in the current frame are added in unconfirmed list
                    unconfirmed.append(track)
                    # print("Should not be here, in unconfirmed")
                else:
                    # Active tracks are added to the local list 'tracked_stracks'
                    tracked_stracks.append(track)
    
            ''' Step 2: First association, with embedding'''
            # Combining currently tracked_stracks and lost_stracks
            strack_pool = joint_stracks(tracked_stracks, self.lost_stracks)
            # Predict the current location with the Kalman filter (motion state update)
            STrack.multi_predict(strack_pool, self.kalman_filter)
    
            # appearance affinity matrix
            dists = matching.embedding_distance(strack_pool, detections)
            # dists = matching.gate_cost_matrix(self.kalman_filter, dists, strack_pool, detections)
            # motion affinity matrix
            dists = matching.fuse_motion(self.kalman_filter, dists, strack_pool, detections)
            # The dists is the list of distances of the detection with the tracks in strack_pool
            # Hungarian algorithm for matching
            matches, u_track, u_detection = matching.linear_assignment(dists, thresh=0.7)
            # The matches is the array for corresponding matches of the detection with the corresponding strack_pool
    
            for itracked, idet in matches:
                # itracked is the id of the track and idet is the detection
                track = strack_pool[itracked]
                det = detections[idet]
                if track.state == TrackState.Tracked:
                    # If the track is active, add the detection to the track
                    track.update(detections[idet], self.frame_id)
                    activated_starcks.append(track)
                else:
                    # We have obtained a detection from a track which is not active, hence put the track in refind_stracks list
                    track.re_activate(det, self.frame_id, new_id=False)
                    refind_stracks.append(track)
    
            # None of the steps below happen if there are no undetected tracks.
            ''' Step 3: Second association, with IOU'''
            detections = [detections[i] for i in u_detection]
            # detections is now a list of the unmatched detections
            r_tracked_stracks = [] # This is container for stracks which were tracked till the
            # previous frame but no detection was found for it in the current frame
            for i in u_track:
                if strack_pool[i].state == TrackState.Tracked:
                    r_tracked_stracks.append(strack_pool[i])
            dists = matching.iou_distance(r_tracked_stracks, detections)
            matches, u_track, u_detection = matching.linear_assignment(dists, thresh=0.5)
            # matches is the list of detections which matched with corresponding tracks by IOU distance method
            for itracked, idet in matches:
                track = r_tracked_stracks[itracked]
                det = detections[idet]
                if track.state == TrackState.Tracked:
                    track.update(det, self.frame_id)
                    activated_starcks.append(track)
                else:
                    track.re_activate(det, self.frame_id, new_id=False)
                    refind_stracks.append(track)
            # Same process done for some unmatched detections, but now considering IOU_distance as measure
    
            for it in u_track:
                track = r_tracked_stracks[it]
                if not track.state == TrackState.Lost:
                    track.mark_lost()
                    lost_stracks.append(track)
            # If no detections are obtained for tracks (u_track), the tracks are added to lost_tracks list and are marked lost
    
            '''Deal with unconfirmed tracks, usually tracks with only one beginning frame'''
            detections = [detections[i] for i in u_detection]
            dists = matching.iou_distance(unconfirmed, detections)
            matches, u_unconfirmed, u_detection = matching.linear_assignment(dists, thresh=0.7)
            for itracked, idet in matches:
                unconfirmed[itracked].update(detections[idet], self.frame_id)
                activated_starcks.append(unconfirmed[itracked])
    
            # The tracks which are yet not matched
            for it in u_unconfirmed:
                track = unconfirmed[it]
                track.mark_removed()
                removed_stracks.append(track)
    
            # after all these confirmation steps, if a new detection is found, it is initialized for a new track
            """ Step 4: Init new stracks"""
            for inew in u_detection:
                track = detections[inew]
                if track.score < self.det_thresh:
                    continue
                track.activate(self.kalman_filter, self.frame_id)
                activated_starcks.append(track)
    
            """ Step 5: Update state"""
            # If the tracks are lost for more frames than the threshold number, the tracks are removed.
            for track in self.lost_stracks:
                if self.frame_id - track.end_frame > self.max_time_lost:
                    track.mark_removed()
                    removed_stracks.append(track)
            # print('Remained match {} s'.format(t4-t3))
    
            # Update the self.tracked_stracks and self.lost_stracks using the updates in this step.
            self.tracked_stracks = [t for t in self.tracked_stracks if t.state == TrackState.Tracked]
            self.tracked_stracks = joint_stracks(self.tracked_stracks, activated_starcks)
            self.tracked_stracks = joint_stracks(self.tracked_stracks, refind_stracks)
            # self.lost_stracks = [t for t in self.lost_stracks if t.state == TrackState.Lost]  # type: list[STrack]
            self.lost_stracks = sub_stracks(self.lost_stracks, self.tracked_stracks)
            self.lost_stracks.extend(lost_stracks)
            self.lost_stracks = sub_stracks(self.lost_stracks, self.removed_stracks)
            self.removed_stracks.extend(removed_stracks)
            self.tracked_stracks, self.lost_stracks = remove_duplicate_stracks(self.tracked_stracks, self.lost_stracks)
    
            # get scores of lost tracks
            output_stracks = [track for track in self.tracked_stracks if track.is_activated]
    
            logger.debug('===========Frame {}=========='.format(self.frame_id))
            logger.debug('Activated: {}'.format([track.track_id for track in activated_starcks]))
            logger.debug('Refind: {}'.format([track.track_id for track in refind_stracks]))
            logger.debug('Lost: {}'.format([track.track_id for track in lost_stracks]))
            logger.debug('Removed: {}'.format([track.track_id for track in removed_stracks]))
            # print('Final {} s'.format(t5-t4))
            return output_stracks
    

    The steps in the code above correspond to the matching rules listed earlier.
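
    For completeness, a hypothetical per-frame driver loop (the dataloader and the attributes accessed on each track are assumptions based on the code above, not an exact copy of the repo's demo script) would look roughly like this:

    tracker = JDETracker(opt)   # opt must provide cfg, weights, conf_thres, nms_thres, track_buffer, ...
    for frame_id, (im_blob, img0) in enumerate(dataloader):   # im_blob: preprocessed tensor, img0: original frame
        online_targets = tracker.update(im_blob, img0)
        for t in online_targets:                              # each STrack carries a stable track_id and a tlwh box
            print(frame_id, t.track_id, t.tlwh)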

    4 Summary

    Tracking-by-detection is a very common MOT paradigm, but to balance tracking speed and accuracy the field has gradually moved away from it toward approaches that couple detection with embedding extraction. JDE, introduced here, uses a single network to output both the bounding boxes in the frame and the embeddings of the objects inside them, which speeds up MOT.

    It is worth noting, however, that JDE only produces the boxes and the embeddings jointly; the association itself still relies on the Kalman filter and the Hungarian algorithm afterwards. Overall, it remains a two-stage process of detection followed by matching.

    5 References

    Reference code and posts:

    [1] https://github.com/Zhongdao/Towards-Realtime-MOT
    [2] https://zhuanlan.zhihu.com/p/243290960
