Point cloud data, typically acquired by LiDAR and other 3D scanning devices, is a collection of spatial points that usually carry (X, Y, Z) position information and may also include RGB color and intensity values, making it a multi-dimensional, complex data set. Compared with 2D images, 3D point clouds provide rich geometric, shape, and scale information and are less affected by changes in illumination or occlusion by other objects.
Point clouds can be acquired through four main techniques, and point clouds obtained under different sensing principles differ in their representation characteristics and range of applications.
LiDAR point cloud data is the set of spatial points scanned by a 3D LiDAR device. Every point contains X, Y, Z coordinates, and some points additionally carry color, reflectance intensity, and echo-count information.
Acquisition principle of LiDAR data: a vehicle-mounted laser scanning system emits laser pulses into the surroundings and collects the reflected signals; through data acquisition, integrated navigation, and point cloud solving, the precise spatial position of each point is computed.
The PCD format is the point cloud file format commonly used by the PCL library. A PCD file usually consists of two parts: a header describing the file and the point cloud data itself.
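For reference, a minimal ASCII PCD header might look like this (the FIELDS line varies with the attributes actually stored; the counts here are illustrative):

# .PCD v0.7 - Point Cloud Data file format
VERSION 0.7
FIELDS x y z intensity
SIZE 4 4 4 4
TYPE F F F F
COUNT 1 1 1 1
WIDTH 1000
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 1000
DATA ascii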
The PLY format is a 3D mesh model data format developed at Stanford University. Many well-known model datasets in graphics, such as the Stanford 3D Scanning Repository and the Georgia Tech large geometric model archive, were originally distributed in this format.
Others: .bin and .txt formats.
https://arxiv.org/pdf/1710.07368.pdf
Because point clouds are sparse and irregular, ordinary 2D CNNs cannot process them directly; the data must first be converted into a CNN-friendly structure. SqueezeSeg uses spherical projection to convert the point cloud into a front-view (range) image.
The KITTI dataset's 64-beam LiDAR is used, so the front view has H = 64. Constrained by the dataset's annotations, only the 90° frontal field of view is considered, divided into 512 cells, so W = 512. Each point has 5 features: its 3D coordinates (x, y, z), reflectance intensity i, and range r (distance to the sensor), so the processed input tensor has shape 64 x 512 x 5.
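As an illustration, here is a minimal NumPy sketch of such a spherical projection (a hypothetical helper; the FOV constants are KITTI-like values assumed for illustration, and SqueezeSeg's exact binning formula may differ):

import numpy as np

def spherical_project(points, intensity, H=64, W=512,
                      fov_up=np.radians(2.0), fov_down=np.radians(-24.8),
                      h_fov=np.radians(45.0)):
    """Project an (N, 3) point cloud onto an H x W x 5 front-view image
    with channels (x, y, z, intensity, range)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)                  # range from the sensor
    yaw = np.arctan2(y, x)                              # azimuth; 0 = straight ahead
    pitch = np.arcsin(z / np.maximum(r, 1e-8))          # elevation
    # Bin angles into pixel coordinates (column 0 = leftmost ray).
    u = ((h_fov - yaw) / (2 * h_fov) * W).astype(int)
    v = ((fov_up - pitch) / (fov_up - fov_down) * H).astype(int)
    img = np.zeros((H, W, 5), dtype=np.float32)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # If several points land in one cell, the last write wins; a real
    # implementation would keep the closest point instead.
    img[v[valid], u[valid]] = np.stack(
        [x[valid], y[valid], z[valid], intensity[valid], r[valid]], axis=1)
    return img

The SqueezeSeg network itself can then be re-implemented in PyTorch as follows: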
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv(nn.Module):
    """Conv2d + ReLU; the default stride (1,2) downsamples along W only."""
    def __init__(self, inputs, outputs, kernel_size=3, stride=(1,2), padding=1):
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(inputs, outputs, kernel_size=kernel_size, stride=stride, padding=padding)

    def forward(self, x):
        return F.relu(self.conv(x))

class MaxPool(nn.Module):
    """Max pooling that, with stride (1,2), halves W while keeping H."""
    def __init__(self, kernel_size=3, stride=(1,2), padding=(1,0)):
        super(MaxPool, self).__init__()
        self.pool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, padding=padding, ceil_mode=True)

    def forward(self, x):
        return self.pool(x)
class Fire(nn.Module):
    def __init__(self, inputs, out_channels1, out_channelsex1x1, out_channelsex3x3):
        super(Fire, self).__init__()
        self.conv1x1 = Conv(inputs, out_channels1, kernel_size=1, stride=1, padding=0)           # squeeze: e.g. 64 --> 16
        self.ex1x1 = Conv(out_channels1, out_channelsex1x1, kernel_size=1, stride=1, padding=0)  # expand: 16 --> 64
        self.ex3x3 = Conv(out_channels1, out_channelsex3x3, kernel_size=3, stride=1, padding=1)  # expand: 16 --> 64

    def forward(self, x):
        squeezed = self.conv1x1(x)  # compute the squeeze once instead of twice
        return torch.cat([self.ex1x1(squeezed), self.ex3x3(squeezed)], 1)  # concatenate to 128 channels
class Deconv(nn.Module):
    def __init__(self, inputs, out_channels, kernel_size, stride, padding=0):
        super(Deconv, self).__init__()
        self.deconv = nn.ConvTranspose2d(inputs, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)

    def forward(self, x):
        return F.relu(self.deconv(x))

class FireDeconv(nn.Module):  # W --> W x 2, H unchanged
    def __init__(self, inputs, out_channels, out_channelsex1x1, out_channelsex3x3):
        super(FireDeconv, self).__init__()
        self.conv1x1 = Conv(inputs, out_channels, 1, 1, 0)                     # squeeze, e.g. FireDeconv(512, 64, 128, 128)
        self.deconv = Deconv(out_channels, out_channels, [1,4], [1,2], [0,1])  # upsample W by 2
        self.ex1x1 = Conv(out_channels, out_channelsex1x1, 1, 1, 0)
        self.ex3x3 = Conv(out_channels, out_channelsex3x3, 3, 1, 1)

    def forward(self, x):
        x = self.conv1x1(x)
        x = self.deconv(x)
        return torch.cat([self.ex1x1(x), self.ex3x3(x)], 1)
class SqueezeSeg(nn.Module):
    def __init__(self):
        super(SqueezeSeg, self).__init__()
        # encoder
        self.conv1 = Conv(5, 64, 3, (1,2), 1)    # 1, 64, 64, 256   stride 2 along W, stride 1 along H
        self.conv1_skip = Conv(5, 64, 1, 1, 0)   # first skip connection
        self.pool1 = MaxPool(3, (1,2), (1,0))    # 1, 64, 64, 128   downsample by 2 along W only
        self.fire2 = Fire(64, 16, 64, 64)        # 1, 128, 64, 128
        self.fire3 = Fire(128, 16, 64, 64)       # 1, 128, 64, 128
        self.pool3 = MaxPool(3, (1,2), (1,0))    # 1, 128, 64, 64
        self.fire4 = Fire(128, 32, 128, 128)     # 1, 256, 64, 64
        self.fire5 = Fire(256, 32, 128, 128)     # 1, 256, 64, 64
        self.pool5 = MaxPool(3, (1,2), (1,0))    # 1, 256, 64, 32
        self.fire6 = Fire(256, 48, 192, 192)     # 1, 384, 64, 32
        self.fire7 = Fire(384, 48, 192, 192)     # 1, 384, 64, 32
        self.fire8 = Fire(384, 64, 256, 256)     # 1, 512, 64, 32
        self.fire9 = Fire(512, 64, 256, 256)     # 1, 512, 64, 32
        # decoder
        self.firedeconv10 = FireDeconv(512, 64, 128, 128)  # 1, 256, 64, 64
        self.firedeconv11 = FireDeconv(256, 32, 64, 64)    # 1, 128, 64, 128
        self.firedeconv12 = FireDeconv(128, 16, 32, 32)    # 1, 64, 64, 256
        self.firedeconv13 = FireDeconv(64, 16, 32, 32)     # 1, 64, 64, 512
        self.drop = nn.Dropout2d()
        self.conv14 = nn.Conv2d(64, 4, kernel_size=3, stride=1, padding=1)  # 1, 4, 64, 512
        # self.bf = BilateralFilter(mc, stride=1, padding=(1,2))
        # self.rc = RecurrentCRF(mc, stride=1, padding=(1,2))
    def forward(self, x):
        # encoder
        out_c1 = self.conv1(x)
        out = self.pool1(out_c1)
        out_f3 = self.fire3(self.fire2(out))
        out = self.pool3(out_f3)
        out_f5 = self.fire5(self.fire4(out))
        out = self.pool5(out_f5)
        out = self.fire9(self.fire8(self.fire7(self.fire6(out))))
        # decoder: upsample and fuse with encoder features via element-wise addition
        out = torch.add(self.firedeconv10(out), out_f5)
        out = torch.add(self.firedeconv11(out), out_f3)
        out = torch.add(self.firedeconv12(out), out_c1)
        out = self.drop(torch.add(self.firedeconv13(out), self.conv1_skip(x)))
        out = self.conv14(out)
        # bf_w = self.bf(x[:, :3, :, :])
        # out = self.rc(out, lidar_mask, bf_w)
        return out
if __name__ == "__main__":
    x = torch.randn(1, 5, 64, 512)  # arranged as (B, C, H, W)
    print(x.shape)
    conv1 = Conv(5, 64, 3, (1,2), 1)
    conv1_skip = Conv(5, 64, 1, 1, 0)
    pool1 = MaxPool(3, (1,2), (1,0))
    fire2 = Fire(64, 16, 64, 64)
    fire3 = Fire(128, 16, 64, 64)
    pool3 = MaxPool(3, (1,2), (1,0))
    fire4 = Fire(128, 32, 128, 128)
    fire5 = Fire(256, 32, 128, 128)
    pool5 = MaxPool(3, (1,2), (1,0))
    fire6 = Fire(256, 48, 192, 192)
    fire7 = Fire(384, 48, 192, 192)
    fire8 = Fire(384, 64, 256, 256)
    fire9 = Fire(512, 64, 256, 256)
    # decoder
    firedeconv10 = FireDeconv(512, 64, 128, 128)
    firedeconv11 = FireDeconv(256, 32, 64, 64)
    firedeconv12 = FireDeconv(128, 16, 32, 32)
    firedeconv13 = FireDeconv(64, 16, 32, 32)
    drop = nn.Dropout2d()
    conv14 = nn.Conv2d(64, 4, kernel_size=3, stride=1, padding=1)  # 4 output classes
    # Shape check only: conv1_skip and drop are omitted from the loop
    # because they do not alter the tensor shapes printed below.
    for layer in [conv1, pool1, fire2, fire3, pool3, fire4, fire5, pool5,
                  fire6, fire7, fire8, fire9,
                  firedeconv10, firedeconv11, firedeconv12, firedeconv13, conv14]:
        x = layer(x)
        print(x.shape)
# Output:
# torch.Size([1, 5, 64, 512])
# torch.Size([1, 64, 64, 256])
# torch.Size([1, 64, 64, 128])
# torch.Size([1, 128, 64, 128])
# torch.Size([1, 128, 64, 128])
# torch.Size([1, 128, 64, 64])
# torch.Size([1, 256, 64, 64])
# torch.Size([1, 256, 64, 64])
# torch.Size([1, 256, 64, 32])
# torch.Size([1, 384, 64, 32])
# torch.Size([1, 384, 64, 32])
# torch.Size([1, 512, 64, 32])
# torch.Size([1, 512, 64, 32])
# torch.Size([1, 256, 64, 64])
# torch.Size([1, 128, 64, 128])
# torch.Size([1, 64, 64, 256])
# torch.Size([1, 64, 64, 512])
# torch.Size([1, 4, 64, 512])
https://arxiv.org/pdf/1809.08495v1.pdf
Overall network architecture:
“Together with other improvements such as focal loss, batch normalization and a LiDAR mask channel, SqueezeSegV2 sees accuracy improvements of 6.0% to 8.6% in various pixel categories over the original SqueezeSeg.” (The paper does not report inference speed.)
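For reference, a minimal PyTorch sketch of focal loss in its standard multi-class form (following Lin et al.; SqueezeSegV2's exact variant and weighting may differ):

import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """logits: (B, C, H, W); target: (B, H, W) integer class labels."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, target, reduction='none')  # per-pixel -log(p_t)
    p_t = torch.exp(-ce)                              # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()         # down-weight easy pixels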
https://arxiv.org/pdf/2004.01803.pdf
“On the SemanticKITTI benchmark, SqueezeSegV3 outperforms all previously published methods by at least 3.7 mIoU with comparable inference speed, demonstrating the effectiveness of spatially-adaptive convolution.”
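To illustrate the idea, here is a simplified sketch of a spatially-adaptive convolution block (one reading of SqueezeSegV3's SAC: per-pixel attention computed from the raw range image re-weights the features before a standard convolution; the paper defines several SAC variants that differ in where the adaptation is applied, so this is an assumption-laden approximation):

import torch
import torch.nn as nn

class SpatiallyAdaptiveConv(nn.Module):
    def __init__(self, in_raw, in_feat, out_feat):
        super().__init__()
        # Attention map predicted from the raw input, not from the features.
        self.att = nn.Sequential(
            nn.Conv2d(in_raw, in_feat, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(in_feat, out_feat, kernel_size=3, padding=1)

    def forward(self, raw, feat):
        # raw: (B, in_raw, H, W) raw range image; feat: (B, in_feat, H, W)
        return self.conv(self.att(raw) * feat)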
Overall network architecture
Highlights:
Contributions:
Algorithm steps:
Step 1: Spherical projection
Step 2: Semantic segmentation
Step 3: Point cloud reconstruction
Step 4: Post-processing (a KNN sketch follows this list)
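A minimal NumPy sketch of such KNN post-processing (assuming the RangeNet++-style scheme: for each 3D point, gather candidates from a window around its range-image pixel, keep the k nearest in range within a cutoff, and majority-vote their labels; the names and parameters here are illustrative, and the original implementation is a GPU kernel with additional weighting options):

import numpy as np

def knn_postprocess(range_img, label_img, u, v, r, k=5, win=5, cutoff=1.0):
    """range_img/label_img: (H, W) projected range and 2D labels;
    u, v, r: per-point column, row, and range for all N points."""
    H, W = range_img.shape
    half = win // 2
    out = np.empty(len(r), dtype=label_img.dtype)
    for i in range(len(r)):
        row, col = int(v[i]), int(u[i])
        rows = slice(max(row - half, 0), min(row + half + 1, H))
        cols = slice(max(col - half, 0), min(col + half + 1, W))
        dists = np.abs(range_img[rows, cols] - r[i]).ravel()
        labels = label_img[rows, cols].ravel()
        near = np.argsort(dists)[:k]           # k nearest neighbors in range
        near = near[dists[near] < cutoff]      # drop neighbors that are too far
        if near.size == 0:
            out[i] = label_img[row, col]       # fall back to the point's own pixel
        else:
            out[i] = np.bincount(labels[near]).argmax()  # majority vote
    return out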
Network architecture
During training, the network is optimized end-to-end using stochastic gradient descent and a weighted cross-entropy loss:
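Assuming the RangeNet++ formulation, the loss weights each class by the inverse log of its frequency so that rare classes contribute more:

$$L = -\sum_{c=1}^{C} w_c \, y_c \log(\hat{y}_c), \qquad w_c = \frac{1}{\log(f_c + \epsilon)}$$

where $f_c$ is the frequency of class $c$ in the training set.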
Point cloud reconstruction approach: the (u, v) image coordinates of each input point are retained during projection, so the predicted 2D label at that pixel can be mapped straight back to the original 3D point.
Comparison with RangeNet++: both approaches first build a range image, then segment it with a dedicated convolutional network, map the 2D segmentation results back onto the 3D point cloud, and finally refine the result with KNN.
Point cloud preprocessing: SalsaNext handles the point cloud the same way as RangeNet++, using a range image as input. The notable difference is that the full 360° sweep is used here.
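For the full 360° sweep, the azimuth binning typically becomes (as in RangeNet++)

$$u = \left\lfloor \frac{1}{2}\Big(1 - \frac{\operatorname{atan2}(y, x)}{\pi}\Big) \cdot W \right\rfloor$$

so the whole revolution, rather than a 90° frontal window, is mapped onto the image width.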
Network architecture
Comparison with SalsaNet:
Survey summary
Data representation: all of the methods above first map the 3D LiDAR point cloud into a 2D LiDAR image via spherical projection and then run a 2D convolutional network for segmentation; they differ only in the projection formula.
Network structure: all of the methods above adopt an encoder-decoder architecture with skip connections; the improvements generally refine individual convolution modules within this overall framework, or add tricks in the pre-/post-processing stages.
Strategy comparison:
Accuracy comparison:
Inference speed comparison:
Note: GPU compute power differs across papers, so only the experimental numbers reported in each paper are summarized here.