• [Tech Tracking] SAM (Segment Anything Model) Code Walkthrough and Architecture Diagram: Image Encoder


      Paper: Segment Anything
      Code: https://github.com/facebookresearch/segment-anything

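    1. Using SAM

      The official demo is polished, but you can only inspect the intermediate steps once the model runs locally. Following the referenced article, this section takes the official dog image and the sam_vit_b_01ec64.pth checkpoint, supplies a single point prompt, and segments the dog.

      (1) Dog image:

    [Image: the official dog example image]

      (2) Code to run: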

    import cv2
    import matplotlib.pyplot as plt
    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor
    import torch
    
    
    def show_mask(mask, ax, random_color=False):
        if random_color:
            color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
        else:
            color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
        h, w = mask.shape[-2:]
        mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
        ax.imshow(mask_image)
        return mask_image
    
    
    def show_points(coords, labels, ax, marker_size=375):
        pos_points = coords[labels == 1]
        neg_points = coords[labels == 0]
        ax.scatter(pos_points[:, 0], pos_points[:, 1], color='green', marker='*', s=marker_size, edgecolor='white',
                   linewidth=1.25)
        ax.scatter(neg_points[:, 0], neg_points[:, 1], color='red', marker='*', s=marker_size, edgecolor='white',
                   linewidth=1.25)
    
    
    sam_checkpoint = "./sam_vit_b_01ec64.pth"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model_type = "vit_b"
    
    sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
    sam.to(device=device)
    predictor = SamPredictor(sam)
    
    image = cv2.imread("./test image/image dog.jpg")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    
    predictor.set_image(image)
    
    input_point = np.array([[1300, 800]])
    input_label = np.array([1])
    
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    show_points(input_point, input_label, plt.gca())
    plt.axis('off')
    plt.show()
    
    masks, scores, logits = predictor.predict(
        point_coords=input_point,
        point_labels=input_label,
        multimask_output=True,
    )
    print(scores)
    index = np.argmax(scores)
    
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    show_mask(masks[index], plt.gca())
    show_points(input_point, input_label, plt.gca())
    plt.title(f"Mask {index + 1}, Score: {scores[index]:.3f}", fontsize=18)
    plt.axis('off')
    plt.show()
    

      (3) Output:

    [Image: input image with the point prompt marked]
    [Image: predicted mask overlaid on the image]

    2. Image Encoder Code Walkthrough

    (1) The set_image function

    Location: [segment_anything/predictor.py --> SamPredictor class --> set_image function]
    Purpose: preprocess the image (resize, convert to a Tensor, reorder channels), then call set_torch_image

    In this example, the dog image passed in as image has shape $[H, W, C] = [1365, 2048, 3]$.

    def set_image(
        self,
        image: np.ndarray,
        image_format: str = "RGB",
    ) -> None:
    
        assert image_format in [
            "RGB",
            "BGR",
        ], f"image_format must be in ['RGB', 'BGR'], is {image_format}."
        if image_format != self.model.image_format:
            image = image[..., ::-1]
    
        # Transform the image to the form expected by the model
        # input image: ndarray -> (H, W, 3) = (1365, 2048, 3)
        # input_image: ndarray -> (round(H*1024/W), 1024, 3) = (683, 1024, 3)
        input_image = self.transform.apply_image(image)  # resize, keeping the aspect ratio, so the long side is 1024
    
        # Convert to a tensor: input_image_torch: tensor -> [683, 1024, 3]
        input_image_torch = torch.as_tensor(input_image, device=self.device)
        # Reorder channels and add a batch dim: input_image_torch: tensor -> [1, 3, 683, 1024]
        input_image_torch = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :]
        # Call set_torch_image with input_image_torch and the original image size (1365, 2048)
        self.set_torch_image(input_image_torch, image.shape[:2])
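The (683, 1024) size in the comments can be checked by hand. The sketch below reproduces, in plain Python, the rounding rule that the resize transform is understood to apply: scale so the long side is 1024, then round each side to the nearest integer. The function name `resized_shape` is made up for this illustration and is not part of the library.

```python
def resized_shape(old_h: int, old_w: int, long_side: int = 1024) -> tuple:
    """Return (new_h, new_w) after scaling so the longer side equals long_side."""
    scale = long_side / max(old_h, old_w)
    # Adding 0.5 before truncating rounds to the nearest integer.
    new_h = int(old_h * scale + 0.5)
    new_w = int(old_w * scale + 0.5)
    return new_h, new_w


print(resized_shape(1365, 2048))  # (683, 1024)
```

The exact ratio gives 682.5 rows, which is why the traced height is 683 rather than 682.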
    

    (2) The set_torch_image function

    Location: [segment_anything/predictor.py --> SamPredictor class --> set_torch_image function]
    Purpose: preprocess the image and call image_encoder to compute the image embedding

       def set_torch_image(
            self,
            transformed_image: torch.Tensor,
            original_image_size: Tuple[int, ...],
        ) -> None:
            
            assert (
                len(transformed_image.shape) == 4
                and transformed_image.shape[1] == 3
                and max(*transformed_image.shape[2:]) == self.model.image_encoder.img_size
            ), f"set_torch_image input must be BCHW with long side {self.model.image_encoder.img_size}."
            self.reset_image()
    
            self.original_size = original_image_size   # original image size (H, W) = (1365, 2048)
            self.input_size = tuple(transformed_image.shape[-2:])  # resized input size (683, 1024)
            # transformed_image.size(): [1, 3, round(H*1024/W), 1024] -> normalized and padded to a square
            input_image = self.model.preprocess(transformed_image)   # input_image.size(): [1, 3, 1024, 1024]
            self.features = self.model.image_encoder(input_image)   # self.features.size(): [1, 256, 64, 64]
            self.is_image_set = True
    

    (3) The preprocess function

    Location: [segment_anything/modeling/sam.py --> Sam class --> preprocess function]
    Purpose: normalize the image and pad it to a square

     def preprocess(self, x: torch.Tensor) -> torch.Tensor:
            """Normalize pixel values and pad to a square input."""
            # Normalize with a fixed per-channel mean and std. The values match the standard
            # ImageNet statistics (mean 0.485/0.456/0.406, std 0.229/0.224/0.225) scaled by 255.
            # pixel_mean=[123.675, 116.28, 103.53], pixel_std=[58.395, 57.12, 57.375]
            x = (x - self.pixel_mean) / self.pixel_std
    
            # Pad
            h, w = x.shape[-2:]  # input size h=683, w=1024
            # The image encoder expects a 1024 x 1024 input
            padh = self.image_encoder.img_size - h  # 1024-683=341
            padw = self.image_encoder.img_size - w  # 1024-1024=0
            x = F.pad(x, (0, padw, 0, padh))  # zero-pad on the right and bottom, x.size=[1, 3, 1024, 1024]
            return x
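Both the normalization constants and the padding amounts are easy to verify. The sketch below is plain Python and only checks the arithmetic: it rescales the ImageNet statistics to the 0-255 pixel range and computes the right/bottom padding for the dog image. The names `IMAGENET_MEAN`, `IMAGENET_STD` and `pad_amounts` are introduced here for illustration only.

```python
# ImageNet per-channel statistics for inputs in [0, 1] (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Rescale to the [0, 255] pixel range that preprocess works in.
pixel_mean = [round(m * 255, 3) for m in IMAGENET_MEAN]
pixel_std = [round(s * 255, 3) for s in IMAGENET_STD]


def pad_amounts(h: int, w: int, img_size: int = 1024) -> tuple:
    """Return (pad_right, pad_bottom) needed to reach img_size x img_size."""
    return img_size - w, img_size - h


print(pixel_mean)              # [123.675, 116.28, 103.53]
print(pixel_std)               # [58.395, 57.12, 57.375]
print(pad_amounts(683, 1024))  # (0, 341)
```

Because padding is added only on the right and bottom, pixel coordinates in the resized image stay valid in the padded one.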
    

    (4) The ImageEncoderViT class

    Location: [segment_anything/modeling/image_encoder.py --> ImageEncoderViT class]
    Purpose: compute the image embedding; it consists of three parts: patch_embed, blocks, and neck

    class ImageEncoderViT(nn.Module):
        def __init__(
            self,
            img_size: int = 1024,
            patch_size: int = 16,
            in_chans: int = 3,
            embed_dim: int = 768,
            depth: int = 12,
            num_heads: int = 12,
            mlp_ratio: float = 4.0,
            out_chans: int = 256,
            qkv_bias: bool = True,
            norm_layer: Type[nn.Module] = nn.LayerNorm,
            act_layer: Type[nn.Module] = nn.GELU,
            use_abs_pos: bool = True,
            use_rel_pos: bool = False,
            rel_pos_zero_init: bool = True,
            window_size: int = 0,
            global_attn_indexes: Tuple[int, ...] = (),
        ) -> None:
            """
            Args:
                img_size (int): Input image size.
                patch_size (int): Patch size.
                in_chans (int): Number of input image channels.
                embed_dim (int): Patch embedding dimension.
                depth (int): Depth of ViT.
                num_heads (int): Number of attention heads in each ViT block.
                mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
                qkv_bias (bool): If True, add a learnable bias to query, key, value.
                norm_layer (nn.Module): Normalization layer.
                act_layer (nn.Module): Activation layer.
                use_abs_pos (bool): If True, use absolute positional embeddings.
                use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
                rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
                window_size (int): Window size for window attention blocks.
                global_attn_indexes (list): Indexes for blocks using global attention.
            """
            super().__init__()
            self.img_size = img_size  # input image size: 1024
    
            # Split the image into patches
            self.patch_embed = PatchEmbed(
                kernel_size=(patch_size, patch_size),  # conv kernel size (16, 16)
                stride=(patch_size, patch_size),  # conv stride (16, 16)
                in_chans=in_chans,  # input image channels = 3
                embed_dim=embed_dim,  # patch embedding dim = 768
            )
    
            # Positional embedding
            self.pos_embed: Optional[nn.Parameter] = None
            if use_abs_pos:
                # Initialize absolute positional embedding with pretrain image size.
                self.pos_embed = nn.Parameter(
                    torch.zeros(1, img_size // patch_size, img_size // patch_size, embed_dim)
                )  # learnable parameter [1, 64, 64, 768]
    
            # Transformer blocks
            self.blocks = nn.ModuleList()
            for i in range(depth):
                block = Block(
                    dim=embed_dim,  # embedding dim = 768
                    num_heads=num_heads,  # number of attention heads = 12
                    mlp_ratio=mlp_ratio,  # MLP hidden-dim expansion factor = 4
                    qkv_bias=qkv_bias,  # bias on the qkv linear layer = True
                    norm_layer=norm_layer,  # normalization layer: nn.LayerNorm
                    act_layer=act_layer,  # activation layer: nn.GELU
                    use_rel_pos=use_rel_pos,  # add relative positional embeddings (class default False; build_sam.py passes True)
                    rel_pos_zero_init=rel_pos_zero_init,  # zero-initialize relative positional parameters = True
                    # in sam_vit_b, global_attn_indexes=encoder_global_attn_indexes=[2, 5, 8, 11]
                    # window_size across the 12 blocks: [14,14,0,14,14,0,14,14,0,14,14,0]
                    window_size=window_size if i not in global_attn_indexes else 0,
                    input_size=(img_size // patch_size, img_size // patch_size),  # input size (64, 64)
                )
                )
                self.blocks.append(block)
    
            # Output neck
            self.neck = nn.Sequential(
                nn.Conv2d(
                    embed_dim,
                    out_chans,
                    kernel_size=1,
                    bias=False,
                ),
                LayerNorm2d(out_chans),
                nn.Conv2d(
                    out_chans,
                    out_chans,
                    kernel_size=3,
                    padding=1,
                    bias=False,
                ),
                LayerNorm2d(out_chans),
            )
    
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # input x.size(): [1, 3, 1024, 1024]
            x = self.patch_embed(x)  # [1, 64, 64, 768]
            # Add the positional embedding
            if self.pos_embed is not None:
                x = x + self.pos_embed  # [1, 64, 64, 768]
            # Attention blocks
            for blk in self.blocks:
                x = blk(x)  # [1, 64, 64, 768]
    
            x = self.neck(x.permute(0, 3, 1, 2))  # output x.size(): [1, 256, 64, 64]
    
            return x
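The per-block window sizes and the shape trace in the comments follow directly from the constructor arguments. The sketch below recomputes them in plain Python, assuming the sam_vit_b settings quoted in the comments (window_size=14, global attention at blocks 2, 5, 8, 11). The helper names `block_window_sizes` and `encoder_shapes` are hypothetical.

```python
def block_window_sizes(depth, window_size, global_attn_indexes):
    """Window size used by each block; 0 means global attention."""
    return [0 if i in global_attn_indexes else window_size for i in range(depth)]


def encoder_shapes(img_size=1024, patch_size=16, embed_dim=768, out_chans=256):
    """Trace tensor shapes through patch_embed, blocks and neck for batch size 1."""
    grid = img_size // patch_size
    return {
        "input": (1, 3, img_size, img_size),
        "patch_embed": (1, grid, grid, embed_dim),
        "blocks": (1, grid, grid, embed_dim),
        "neck": (1, out_chans, grid, grid),
    }


print(block_window_sizes(12, 14, (2, 5, 8, 11)))
# [14, 14, 0, 14, 14, 0, 14, 14, 0, 14, 14, 0]
print(encoder_shapes()["neck"])  # (1, 256, 64, 64)
```

Every third block attends globally over the full 64 x 64 grid, while the others attend inside 14 x 14 windows.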
    

    ①patch_embed

    class PatchEmbed(nn.Module):
        def __init__(
            self,
            kernel_size: Tuple[int, int] = (16, 16),
            stride: Tuple[int, int] = (16, 16),
            padding: Tuple[int, int] = (0, 0),
            in_chans: int = 3,
            embed_dim: int = 768,
        ) -> None:
            
            super().__init__()
    
            self.proj = nn.Conv2d(
                in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding
            )
    
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.proj(x)  # [1, 3, 1024, 1024] -> [1, 768, 64, 64]
            # B C H W -> B H W C
            x = x.permute(0, 2, 3, 1)  # [1, 64, 64, 768]
            return x
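Since the kernel size equals the stride, the convolution visits each 16 x 16 patch exactly once, with no overlap. The standard convolution output-size formula confirms the 64 x 64 grid; the sketch below is plain Python, and `conv_out_size` is a name chosen for this example.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


grid = conv_out_size(1024, kernel=16, stride=16)
print(grid, grid * grid)  # 64 4096

# The neck's 3x3 convolution with padding 1 and stride 1 keeps the grid size.
print(conv_out_size(64, kernel=3, stride=1, padding=1))  # 64
```

So a 1024 x 1024 image becomes 4096 patch tokens, each a 768-dimensional vector.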
    

    ②Block

    class Block(nn.Module):
        def __init__(
            self,
            dim: int,
            num_heads: int,
            mlp_ratio: float = 4.0,
            qkv_bias: bool = True,
            norm_layer: Type[nn.Module] = nn.LayerNorm,
            act_layer: Type[nn.Module] = nn.GELU,
            use_rel_pos: bool = False,
            rel_pos_zero_init: bool = True,
            window_size: int = 0,
            input_size: Optional[Tuple[int, int]] = None,
        ) -> None:
            
            super().__init__()
            self.norm1 = norm_layer(dim)   # normalization layer: nn.LayerNorm
            # Attention module
            self.attn = Attention(
                dim,
                num_heads=num_heads,
                qkv_bias=qkv_bias,
                use_rel_pos=use_rel_pos,
                rel_pos_zero_init=rel_pos_zero_init,
                input_size=input_size if window_size == 0 else (window_size, window_size),
            )
    
            self.norm2 = norm_layer(dim)  # normalization layer: nn.LayerNorm
            # MLP module, mlp_ratio=4, act_layer=nn.GELU
            self.mlp = MLPBlock(embedding_dim=dim, mlp_dim=int(dim * mlp_ratio), act=act_layer)
    
            self.window_size = window_size  # window size: 14 (windowed) or 0 (global)
    
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            shortcut = x  # [1, 64, 64, 768]
            x = self.norm1(x)
            # Window partition
            if self.window_size > 0:
                H, W = x.shape[1], x.shape[2]  # H=64, W=64
                x, pad_hw = window_partition(x, self.window_size)  # x.size(): [25, 14, 14, 768], pad_hw=(70, 70)
    
            x = self.attn(x)  # windowed blocks: [25, 14, 14, 768]; global blocks: [1, 64, 64, 768]
            # Reverse window partition
            if self.window_size > 0:
                x = window_unpartition(x, self.window_size, pad_hw, (H, W))  # [1, 64, 64, 768]
    
            x = shortcut + x   # residual connection
            x = x + self.mlp(self.norm2(x))  # [1, 64, 64, 768]
    
            return x
    

    window_partition function: partition into non-overlapping windows

    def window_partition(x: torch.Tensor, window_size: int) -> Tuple[torch.Tensor, Tuple[int, int]]:
        
        B, H, W, C = x.shape  # [1, 64, 64, 768]
    
        pad_h = (window_size - H % window_size) % window_size  # height padding needed = 6
        pad_w = (window_size - W % window_size) % window_size  # width padding needed = 6
        if pad_h > 0 or pad_w > 0:
            x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))  # padded to [1, 70, 70, 768]
        Hp, Wp = H + pad_h, W + pad_w   # Hp=70, Wp=70
    
        # Reshape to [1, 5, 14, 5, 14, 768]
        x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
        # [25, 14, 14, 768]
        windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)  
        return windows, (Hp, Wp)
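The padding and window counts in the comments come from simple modular arithmetic. The sketch below recomputes them in plain Python for the 64 x 64 grid and a window size of 14; `window_layout` is a hypothetical helper, not a library function.

```python
def window_layout(h: int, w: int, window_size: int):
    """Return ((pad_h, pad_w), (padded_h, padded_w), number_of_windows)."""
    pad_h = (window_size - h % window_size) % window_size
    pad_w = (window_size - w % window_size) % window_size
    hp, wp = h + pad_h, w + pad_w
    num_windows = (hp // window_size) * (wp // window_size)
    return (pad_h, pad_w), (hp, wp), num_windows


print(window_layout(64, 64, 14))  # ((6, 6), (70, 70), 25)
print(window_layout(56, 56, 14))  # ((0, 0), (56, 56), 16)
```

The outer `% window_size` is what makes the padding 0 rather than 14 when the grid is already a multiple of the window size.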
    

    Attention class: multi-head attention

    class Attention(nn.Module):
    
        def __init__(
            self,
            dim: int,
            num_heads: int = 8,
            qkv_bias: bool = True,
            use_rel_pos: bool = False,
            rel_pos_zero_init: bool = True,
            input_size: Optional[Tuple[int, int]] = None,
        ) -> None:
            
            super().__init__()
            self.num_heads = num_heads  # number of heads = 12
            head_dim = dim // num_heads  # 768/12=64
            self.scale = head_dim**-0.5  # 0.125
    
            self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)  # (768, 768*3)
            self.proj = nn.Linear(dim, dim)
    
            self.use_rel_pos = use_rel_pos
            if self.use_rel_pos:
                assert (
                    input_size is not None
                ), "Input size must be provided if using relative positional encoding."
                # initialize relative positional embeddings
                self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
                self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))
    
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, H, W, _ = x.shape  # windowed block: B=25, H=14, W=14 (a global block sees B=1, H=W=64)
            # qkv with shape (3, B, nHead, H * W, C)
            # [25,14,14,768]->[25,14,14,2304]->[25,14*14,3,12,64]->[3,25,12,196,64]
            qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
            # q, k, v with shape (B * nHead, H * W, C)=[25*12,14*14,64]=[300,196,64]
            q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)
    
            attn = (q * self.scale) @ k.transpose(-2, -1)  # [300,196,196]
            # Add the relative positional encoding
            if self.use_rel_pos:
                attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))
    
            attn = attn.softmax(dim=-1)  # [300,196,196]
            # [300,196,196]->[300,196,64]->[25,12,14,14,64]->[25,14,14,12,64]->[25,14,14,768]
            x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)
            x = self.proj(x)  # [25,14,14,768]
    
            return x
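Stripped of batching, heads and the relative position term, the core of forward is scaled dot-product attention: scores are `scale * q·k`, a softmax over keys turns them into weights, and the output is the weighted sum of values. The sketch below is a minimal single-head version in plain Python on nested lists; it is an illustration of the technique, not the library's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """q: [n_q][d], k: [n_k][d], v: [n_k][d_v] -> output [n_q][d_v]."""
    scale = len(q[0]) ** -0.5  # same role as self.scale = head_dim ** -0.5
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out


q = [[1, 0], [0, 1]]
k = [[0, 0], [0, 0]]  # identical keys -> uniform attention weights
v = [[2, 4], [6, 8]]
print(single_head_attention(q, k, v))  # [[4.0, 6.0], [4.0, 6.0]]
```

With identical keys every query weights the values equally, so each output row is the mean of the value rows.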
    

    Fetching the relative positional encoding:

    def get_rel_pos(q_size: int, k_size: int, rel_pos: torch.Tensor) -> torch.Tensor:
    
        max_rel_dist = int(2 * max(q_size, k_size) - 1)  # 27
        # Interpolate rel pos if needed.
        if rel_pos.shape[0] != max_rel_dist:
            # Interpolate rel pos.
            rel_pos_resized = F.interpolate(
                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
                size=max_rel_dist,
                mode="linear",
            )
            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
        else:
            rel_pos_resized = rel_pos  # [27,64]
    
        # Scale the coords with short length if shapes for q and k are different.
        # size[14,1]:[0,1,2,3,4,5,6,7,8,9,10,11,12,13]
        q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
        # size[1,14]:[0,1,2,3,4,5,6,7,8,9,10,11,12,13]
        k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
        # size[14,14]: relative position index; 0 at the top-right, 26 at the bottom-left, constant along each diagonal (13 on the main diagonal)
        relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)
    
        return rel_pos_resized[relative_coords.long()]  # [14,14,64]
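To see the index pattern concretely, the sketch below rebuilds relative_coords in plain Python using the same formula as the function above, with nested lists instead of tensors and `int()` in place of `.long()`. The 4 x 4 case is printed to keep the output small; the 14 x 14 case follows the same pattern with values from 0 to 26.

```python
def relative_coords(q_size: int, k_size: int):
    """Index into the relative position table for every (query, key) pair."""
    q_scale = max(k_size / q_size, 1.0)
    k_scale = max(q_size / k_size, 1.0)
    offset = (k_size - 1) * k_scale  # shifts the smallest difference to index 0
    return [
        [int(q * q_scale - k * k_scale + offset) for k in range(k_size)]
        for q in range(q_size)
    ]


for row in relative_coords(4, 4):
    print(row)
# [3, 2, 1, 0]
# [4, 3, 2, 1]
# [5, 4, 3, 2]
# [6, 5, 4, 3]
```

Each entry depends only on the offset between query and key, so all pairs at the same offset share one row of the learned table.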
    

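    The relative_coords index matrix looks like this:
    [Image: the 14 x 14 relative_coords index matrix]
    Adding the relative positional encoding: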

    def add_decomposed_rel_pos(
        attn: torch.Tensor,
        q: torch.Tensor,
        rel_pos_h: torch.Tensor,
        rel_pos_w: torch.Tensor,
        q_size: Tuple[int, int],
        k_size: Tuple[int, int],
    ) -> torch.Tensor:
        
        q_h, q_w = q_size  # (14,14)
        k_h, k_w = k_size  # (14,14)
        # rel_pos_h=rel_pos_w=[27,64]
        Rh = get_rel_pos(q_h, k_h, rel_pos_h)  # relative positional encoding along height (14,14,64)
        Rw = get_rel_pos(q_w, k_w, rel_pos_w)  # relative positional encoding along width (14,14,64)
    
        B, _, dim = q.shape   # B=300, dim=64
        r_q = q.reshape(B, q_h, q_w, dim)  # [300, 14, 14, 64]
        rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)  # [300,14,14,14]
        rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)  # [300,14,14,14]
    
        # rel_h[:, :, :, :, None]=[300,14,14,14,1], rel_w[:, :, :, None, :]=[300,14,14,1,14]
        # attn=[300,196,196]->[300,14,14,14,14]->[300,196,196]
        attn = (
            attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]
        ).view(B, q_h * q_w, k_h * k_w)
    
        return attn
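The "decomposed" part is the final broadcast add: for one query position, the 2D bias over the key grid is the sum of a per-row term and a per-column term. The sketch below shows that for a single query in plain Python, with made-up numbers and hypothetical helper names `decomposed_bias` and `flatten`.

```python
def decomposed_bias(rel_h, rel_w):
    """bias[kh][kw] = rel_h[kh] + rel_w[kw] for one query position."""
    return [[rh + rw for rw in rel_w] for rh in rel_h]


def flatten(grid):
    """Row-major flatten, matching the final view to k_h * k_w."""
    return [value for row in grid for value in row]


rel_h = [1, 2]        # one score per key row (k_h = 2)
rel_w = [10, 20, 30]  # one score per key column (k_w = 3)
bias = decomposed_bias(rel_h, rel_w)
print(bias)           # [[11, 21, 31], [12, 22, 32]]
print(flatten(bias))  # [11, 21, 31, 12, 22, 32]
```

For a 14 x 14 window this needs two tables of 27 rows each, rather than one joint table with 27 * 27 = 729 rows.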
    

    window_unpartition function: restore the original feature-map size

    def window_unpartition(
        windows: torch.Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int]
    ) -> torch.Tensor:
        
        Hp, Wp = pad_hw  # (70,70)
        H, W = hw  # (64,64)
        B = windows.shape[0] // (Hp * Wp // window_size // window_size)  # B=1
        # [1,5,5,14,14,768]
        x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
        x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)  # [1,70,70,768]
    
        if Hp > H or Wp > W:
            x = x[:, :H, :W, :].contiguous()  # drop the padding: [1,64,64,768]
        return x
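Partitioning and unpartitioning are inverses once the padding is cropped away. The sketch below demonstrates the round trip on a small 2D grid using nested lists, with no batch or channel dimension; `partition` and `unpartition` are simplified stand-ins written for this example, not the library functions.

```python
def partition(grid, ws, pad_value=0):
    """Pad on the right/bottom, then cut into ws x ws windows in row-major order."""
    h, w = len(grid), len(grid[0])
    hp = h + (ws - h % ws) % ws
    wp = w + (ws - w % ws) % ws
    padded = [row + [pad_value] * (wp - w) for row in grid]
    padded += [[pad_value] * wp for _ in range(hp - h)]
    windows = [
        [row[left:left + ws] for row in padded[top:top + ws]]
        for top in range(0, hp, ws)
        for left in range(0, wp, ws)
    ]
    return windows, (hp, wp)


def unpartition(windows, ws, pad_hw, hw):
    """Stitch windows back into the padded grid, then crop to the original size."""
    hp, wp = pad_hw
    h, w = hw
    per_row = wp // ws
    padded = []
    for y in range(hp):
        row = []
        for win_col in range(per_row):
            window = windows[(y // ws) * per_row + win_col]
            row.extend(window[y % ws])
        padded.append(row)
    return [row[:w] for row in padded[:h]]


grid = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5 x 5 feature map
windows, pad_hw = partition(grid, 2)
print(len(windows), pad_hw)                              # 9 (6, 6)
print(unpartition(windows, 2, pad_hw, (5, 5)) == grid)   # True
```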
    

    MLP module:

    class MLPBlock(nn.Module):
        def __init__(
            self,
            embedding_dim: int,
            mlp_dim: int,
            act: Type[nn.Module] = nn.GELU,
        ) -> None:
            super().__init__()
            self.lin1 = nn.Linear(embedding_dim, mlp_dim)
            self.lin2 = nn.Linear(mlp_dim, embedding_dim)
            self.act = act()
    
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.lin2(self.act(self.lin1(x)))
    

    3. Drawing the ImageEncoderViT Structure

    (1) Printed structure

    ImageEncoderViT(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (blocks): ModuleList(
        (0-11): 12 x Block(
          (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (attn): Attention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          (mlp): MLPBlock(
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (act): GELU(approximate='none')
          )
        )
      )
      (neck): Sequential(
        (0): Conv2d(768, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): LayerNorm2d()
        (2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (3): LayerNorm2d()
      )
    )
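The printed structure is enough to count the weights held by the listed modules. The sketch below does that arithmetic in plain Python, assuming each normalization layer (including LayerNorm2d) holds one weight and one bias per channel. It deliberately leaves out tensors that do not appear in the printout, such as pos_embed and the relative position tables, and all helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    return n_in * n_out + (n_out if bias else 0)


def conv_params(c_in, c_out, k, bias=True):
    return c_in * c_out * k * k + (c_out if bias else 0)


def norm_params(dim):
    return 2 * dim  # weight and bias


def block_params(dim=768, mlp_dim=3072):
    return (
        norm_params(dim)                 # norm1
        + linear_params(dim, 3 * dim)    # attn.qkv
        + linear_params(dim, dim)        # attn.proj
        + norm_params(dim)               # norm2
        + linear_params(dim, mlp_dim)    # mlp.lin1
        + linear_params(mlp_dim, dim)    # mlp.lin2
    )


patch_embed = conv_params(3, 768, 16)
blocks = 12 * block_params()
neck = (
    conv_params(768, 256, 1, bias=False) + norm_params(256)
    + conv_params(256, 256, 3, bias=False) + norm_params(256)
)

print(patch_embed)                  # 590592
print(block_params())               # 7087872
print(neck)                         # 787456
print(patch_embed + blocks + neck)  # 86432512
```

The 12 blocks account for nearly all of the listed weights, and within each block the MLP holds about two thirds of them.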
    

    (2) Structure diagram

    [Image: ImageEncoderViT structure diagram]

  • Original article: https://blog.csdn.net/qq_43426908/article/details/132939732