A Hand-Written PyTorch Implementation of YOLOv3

YOLOv3

Single-object detection: Minions

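The dataset stores each box as its top-left and bottom-right corner points.
The network outputs:
1. c: 0 or 1, whether the image contains a target
2. x1, y1: the top-left corner of the box
3. x2, y2: the bottom-right corner; together the two points define the regression box
4. cls: which class the target belongs to
Each output is compared with its label to get a loss, and the sum of the losses is back-propagated.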

The YOLOv3 object detection model

How YOLO detects multiple objects

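An image may contain several targets, so the single-object scheme above no longer works, and YOLO uses a different logic. The image is divided into a grid, and for every cell the network predicts c, x1, ..., cls. In the figure above, the cells ticked in red are judged to contain a target and their predictions are kept; cells without a tick are judged to be empty.
Targets of different sizes are handled with a pyramid of grids at three resolutions (the image-pyramid idea).
For a 416*416 input, the image is divided evenly into:
1. 13*13: large targets
2. 26*26: medium targets
3. 52*52: small targets
YOLO labels a box by its center point plus w and h, whereas the single-object detector above used the top-left and bottom-right corners.
1. Top-left + bottom-right: if either corner is off, the whole box is wrong.
2. Center point: once the center is fixed, w and h are learned to determine the width and height of the box. W, H and the center point do not affect each other.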
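The two box formats carry the same information, so converting between them is simple arithmetic. A minimal sketch (the function names are illustrative, not from the original code):

```python
def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) corners to (cx, cy, w, h)."""
    w = x2 - x1
    h = y2 - y1
    cx = x1 + w / 2
    cy = y1 + h / 2
    return cx, cy, w, h


def center_to_corners(cx, cy, w, h):
    """Inverse conversion, handy when drawing the box."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


box = corners_to_center(100, 50, 300, 250)
print(box)  # (200.0, 150.0, 200, 200)
```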

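Center-point coordinates

YOLO sorts objects into large, medium, and small, with three box shapes for each size, which gives 9 prior (anchor) boxes. The top-left corner of each grid cell is used as an anchor point.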

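Drawing boxes at the anchor points

Boxes are drawn at every anchor point and compared with the ground-truth box by IoU (intersection over union); the IoU decides whether a target is present there.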
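For reference, IoU for two axis-aligned boxes can be computed as follows. This is a generic sketch with boxes given as (x1, y1, x2, y2); the helper name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle (may be empty)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping in a 5x5 square: 25 / 175
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```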
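To compute the offset between a target's true center and its anchor point, first work out which grid cell the center falls in, take that cell's top-left corner as the anchor point, and then regress the offset from it. Training only has to learn this offset.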
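A small sketch of that lookup, assuming a 416*416 input and a center given in pixels (the function name is illustrative): dividing by the cell size gives the cell index as the integer part and the offset inside the cell as the fractional part.

```python
import math

IMAGE_SIZE = 416  # assumed network input size


def cell_and_offset(cx, cy, grid_size, image_size=IMAGE_SIZE):
    """Return (row, col, offset_x, offset_y) of a center point on a grid."""
    stride = image_size / grid_size      # pixels per cell, e.g. 32 for 13*13
    off_x, col = math.modf(cx / stride)  # fractional part, integer part
    off_y, row = math.modf(cy / stride)
    return int(row), int(col), off_x, off_y


print(cell_and_offset(200, 120, 13))  # (3, 6, 0.25, 0.75)
```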
How does a 3*3 convolution downsample? It uses a stride of 2, so the output is 1/2 the size of the input.
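The halving follows from the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. A quick check with kernel 3, stride 2 and padding 1 (the padding value is the one used by the downsampling layer in the code below):

```python
def conv_out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution on an n-wide input."""
    return (n + 2 * padding - kernel) // stride + 1


size = 416
sizes = []
for _ in range(5):  # the network downsamples five times
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [208, 104, 52, 26, 13]
```

The last three values are the 52, 26 and 13 grids used for detection.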

DarkNet53 network structure

Implementing YOLOv3 by hand

Building the network

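Define the convolution block, the residual block, and the convolutional set block (several convolutions in a row, mixing 1x1 and 3x3).
Define the downsampling module (a convolution with kernel size 3 and stride 2) and upsampling by interpolation.
The YOLOv3 network itself is a composition of these small modules.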

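import torch
from torch import nn
from torch.nn import functional

# Convolution block: Conv2d + BatchNorm + LeakyReLU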
class ConvolutionalLayer(nn.Module):
    def __init__(self,in_channels,out_channels,kernel_size,stride,padding,bias=False):
        super(ConvolutionalLayer,self).__init__()
        self.sub_module = nn.Sequential(
            nn.Conv2d(in_channels,out_channels,kernel_size,stride,padding,bias = bias),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(),
        )
    
    def forward(self,x):
        return self.sub_module(x)

# Residual block: 1x1 then 3x3 convolution, added back to the input
class ResidualLayer(nn.Module):
    def __init__(self,in_channels,out_channels):
        super(ResidualLayer,self).__init__()
        self.sub_module = nn.Sequential(
            ConvolutionalLayer(in_channels,out_channels,1,1,0),
            ConvolutionalLayer(out_channels,in_channels,3,1,1),
        )
    def forward(self,x):
        return self.sub_module(x)+x

# Convolutional set block: five alternating 1x1 and 3x3 convolutions
class ConvolutionalSetLayer(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super(ConvolutionalSetLayer, self).__init__()

        self.sub_module = torch.nn.Sequential(
            ConvolutionalLayer(in_channels, out_channels, 1, 1, 0),
            ConvolutionalLayer(out_channels, in_channels, 3, 1, 1),

            ConvolutionalLayer(in_channels, out_channels, 1, 1, 0),
            ConvolutionalLayer(out_channels, in_channels, 3, 1, 1),

            ConvolutionalLayer(in_channels, out_channels, 1, 1, 0),
        )

    def forward(self, x):
        return self.sub_module(x)
    

# Downsampling: 3x3 convolution with stride 2 halves the spatial size
class DownSamplingLayer(nn.Module):
    def __init__(self,in_channels,out_channels):
        super(DownSamplingLayer,self).__init__()
        self.sub_module = nn.Sequential(
            ConvolutionalLayer(in_channels,out_channels,3,2,1),
        )
    def forward(self,x):
        return self.sub_module(x)
    
# Upsampling: nearest-neighbor interpolation doubles the spatial size
class UpSamplingLayer(nn.Module):
    def __init__(self):
        super(UpSamplingLayer,self).__init__()
        
    def forward(self,x):
        return functional.interpolate(x,scale_factor=2,mode='nearest')
    

class Yolo_V3_Net(nn.Module):
    def __init__(self):
        super(Yolo_V3_Net,self).__init__()

        self.trunk_52 = nn.Sequential(
            ConvolutionalLayer(3,32,3,1,1),
            DownSamplingLayer(32,64),

            ResidualLayer(64,32),

            DownSamplingLayer(64,128),

            ResidualLayer(128,64),
            ResidualLayer(128,64),

            DownSamplingLayer(128,256),

            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
            ResidualLayer(256,128),
        )

        self.trunk_26 = nn.Sequential(
            DownSamplingLayer(256,512),

            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
            ResidualLayer(512,256),
        )

        self.trunk_13 = nn.Sequential(
            DownSamplingLayer(512,1024),

            ResidualLayer(1024,512),
            ResidualLayer(1024,512),
            ResidualLayer(1024,512),
            ResidualLayer(1024,512),
        )

        self.convset_13 = nn.Sequential(
            ConvolutionalSetLayer(1024,512)
        )

        self.dection_13 =  nn.Sequential(
            ConvolutionalLayer(512,1024,3,1,1),
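            nn.Conv2d(1024,45,1,1,0)  # note: 45 = 3 anchors x 15 values per anchor, while the 8-value labels used later would give 3 x 8 = 24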
        )

        self.up_13_to_26 = nn.Sequential(
            ConvolutionalLayer(512,256,3,1,1),
            UpSamplingLayer()
        )

        self.convset_26 = nn.Sequential(
            ConvolutionalSetLayer(768,256)
        )

        self.dection_26 =  nn.Sequential(
            ConvolutionalLayer(256,512,3,1,1),
            nn.Conv2d(512,45,1,1,0)
        )

        self.up_26_to_52 = nn.Sequential(
            ConvolutionalLayer(256,128,3,1,1),
            UpSamplingLayer()
        )

        self.convset_52 = nn.Sequential(
            ConvolutionalSetLayer(384,128)
        )

        self.dection_52 =  nn.Sequential(
            ConvolutionalLayer(128,256,3,1,1),
            nn.Conv2d(256,45,1,1,0)
        )

    def forward(self,x):
        h_52 = self.trunk_52(x)
        h_26 = self.trunk_26(h_52)
        h_13 = self.trunk_13(h_26)

        convset_13_out = self.convset_13(h_13)
        detection_13_out = self.dection_13(convset_13_out)
        up_13_to_26_out = self.up_13_to_26(convset_13_out)
        cat_13_to_26 = torch.cat((up_13_to_26_out,h_26),dim=1)

        convset_26_out = self.convset_26(cat_13_to_26)
        detection_26_out = self.dection_26(convset_26_out)
        up_26_to_52_out = self.up_26_to_52(convset_26_out)
        cat_26_to_52 = torch.cat((up_26_to_52_out,h_52),dim=1)

        convset_52_out = self.convset_52(cat_26_to_52)
        detection_52_out = self.dection_52(convset_52_out)

        return detection_13_out,detection_26_out,detection_52_out
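The channel counts 768, 384 and 45 in the network above are not arbitrary; they follow from the concatenations and from the number of values predicted per anchor. A short arithmetic sketch (the helper name is illustrative):

```python
def head_channels(num_anchors, num_classes):
    """Output channels of a detection head: (conf + 4 box values + classes) per anchor."""
    return num_anchors * (5 + num_classes)


# Upsampled 13-branch (256 channels) concatenated with the 26*26 trunk output (512)
cat_26 = 256 + 512
# Upsampled 26-branch (128 channels) concatenated with the 52*52 trunk output (256)
cat_52 = 128 + 256

print(cat_26, cat_52)        # 768 384
print(head_channels(3, 10))  # 45, the value used in the code above
print(head_channels(3, 3))   # 24, what three classes would require
```

So the 45-channel heads carry 15 values per anchor. Only the first 8 are tied to the 8-value labels described later, which is worth keeping in mind when adapting the class count.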

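Building the dataset

Parse the xml annotations and write one line per image to data.txt: the image file name, followed by the class, center point, width and height of each target. This makes the data easy to load later.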

import math
import os
import xml.etree.ElementTree as et

class_num={
    'person':0,
    'horse':1,
    'bicycle':2,
}
script_dir = os.path.dirname(os.path.realpath(__file__))
xml_dir = os.path.join(script_dir, 'data/image_voc')
xml_filenames=os.listdir(xml_dir) # all file names in the folder

# 'a' appends: delete an existing data.txt before re-running to avoid duplicate lines
with open(script_dir+'/data.txt','a') as f:
    for xml_filename in xml_filenames:
        xml_filename_path = os.path.join(xml_dir,xml_filename)
        tree = et.parse(xml_filename_path) # parse the xml annotation
        root = tree.getroot()
        filename = root.find('filename')
        names = root.findall('object/name')
        boxes = root.findall('object/bndbox')
        data=[]
        data.append(filename.text)
        for name,box in zip(names,boxes):
            cls = class_num[name.text]
            # bndbox children are read by position: xmin, ymin, xmax, ymax
            x1,y1,x2,y2 = (int(box[i].text) for i in range(4))
            # center point (rounded down with math.floor), then width and height
            cx,cy = math.floor((x1+x2)/2),math.floor((y1+y2)/2)
            w,h = x2-x1,y2-y1
            data.extend([cls,cx,cy,w,h])
        # one line per image: filename cls cx cy w h [cls cx cy w h ...]
        _str=''
        for i in data:
            _str=_str+str(i)+' '
        f.write(_str+'\n')
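The script above reads the bndbox children by position, which relies on the xml listing xmin, ymin, xmax, ymax in that order. A variant sketch that looks the tags up by name instead, run here on a made-up annotation string (the sample xml and the function name are illustrative, not taken from the dataset):

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC-style annotation used only for this example
SAMPLE_XML = """
<annotation>
  <filename>000001.jpg</filename>
  <object>
    <name>person</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>horse</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""

class_num = {'person': 0, 'horse': 1, 'bicycle': 2}


def xml_to_line(xml_text):
    """Build one data.txt-style line: filename cls cx cy w h ..."""
    root = ET.fromstring(xml_text.strip())
    fields = [root.findtext('filename')]
    for obj in root.findall('object'):
        cls = class_num[obj.findtext('name')]
        bb = obj.find('bndbox')
        x1, y1, x2, y2 = (int(bb.findtext(tag))
                          for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        fields += [cls, (x1 + x2) // 2, (y1 + y2) // 2, x2 - x1, y2 - y1]
    return ' '.join(str(v) for v in fields)


print(xml_to_line(SAMPLE_XML))
# 000001.jpg 0 121 305 147 131 1 180 255 344 486
```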

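Create the dataset and build the DataLoader:

init: read the data.txt produced in the previous step.

len: return the number of samples.

__getitem__(self, index):

Fetch the record at the given index and split the values after the file name into groups of 5; each group (cls, cx, cy, w, h) describes one target.
Resize the image to 416*416, filling the non-square part with black.
Convert the image data to a tensor.
Build the anchor-box labels. Pay attention to the data conversions here; they are the most intricate part.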
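The "groups of 5" step can be sketched as below. This is an assumed reading of the data.txt line format written by the script earlier, not the original dataset code, and `parse_line` is an illustrative name.

```python
def parse_line(line):
    """Split one data.txt line into the file name and a list of (cls, cx, cy, w, h)."""
    parts = line.split()  # split() also swallows the trailing space and newline
    filename = parts[0]
    values = [int(v) for v in parts[1:]]
    boxes = [tuple(values[i:i + 5]) for i in range(0, len(values), 5)]
    return filename, boxes


name, boxes = parse_line('000001.jpg 0 121 305 147 131 1 180 255 344 486 \n')
print(name)   # 000001.jpg
print(boxes)  # [(0, 121, 305, 147, 131), (1, 180, 255, 344, 486)]
```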

Utility module: utils

Resize the image to 416*416. When resizing, pad with black first and then scale.

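from PIL import Image

def make_416_image(path):
    # Pads the image to a square with black. The result is not yet 416*416,
    # so the caller still has to resize it, e.g. with .resize((416, 416)).
    img=Image.open(path)
    w,h=img.size[0],img.size[1]
    temp=max(h,w) # side length of the square canvas
    mask=Image.new(mode='RGB',size=(temp,temp),color=(0,0,0)) # black canvas
    mask.paste(img,(0,0)) # paste at the top-left, so box coordinates are unchanged
    return mask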
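Because the image is pasted at the top-left of the square canvas, padding does not move any box; only the later resize changes coordinates, by a single scale factor. A sketch of that bookkeeping, assuming the factor is 416 divided by the longer side (this appears to be what the `case` parameter in the detector below refers to, but that is an assumption):

```python
def pad_and_scale(w, h, target=416):
    """Side of the padded square and the scale factor to reach target*target."""
    side = max(w, h)
    return side, target / side


def to_416(x, y, scale):
    """Original-image pixel -> 416*416 pixel."""
    return x * scale, y * scale


def to_original(x, y, scale):
    """416*416 pixel -> original-image pixel."""
    return x / scale, y / scale


side, scale = pad_and_scale(500, 375)
print(side, scale)  # 500 0.832
```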

Anchor configuration file: config

DATA_HEIGHT = 416
DATA_WIDTH = 416

CLASS_NUM = 3

anchors = {
    13: [[270, 254], [291, 179], [162, 304]],
    26: [[175, 222], [112, 235], [175, 140]],
    52: [[81, 118], [53, 142], [44, 28]]
}

ANCHORS_AREA = {
    13: [x * y for x, y in anchors[13]],
    26: [x * y for x, y in anchors[26]],
    52: [x * y for x, y in anchors[52]],
}
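With these anchors, a label for one box on one grid can be encoded roughly as below. The layout [conf, offset_x, offset_y, log(w/anchor_w), log(h/anchor_h), one-hot class] is inferred from how the detector later decodes predictions; using the area ratio as the confidence target is an assumption, suggested by the presence of ANCHORS_AREA, and `encode_box` is an illustrative name.

```python
import math

ANCHORS_13 = [[270, 254], [291, 179], [162, 304]]
CLASS_NUM = 3


def encode_box(cls, cx, cy, w, h, grid_size, anchors, image_size=416):
    """Encode one box (in 416*416 pixels) as (row, col, per-anchor label vectors)."""
    off_x, col = math.modf(cx * grid_size / image_size)
    off_y, row = math.modf(cy * grid_size / image_size)
    one_hot = [1.0 if i == cls else 0.0 for i in range(CLASS_NUM)]
    labels = []
    for aw, ah in anchors:
        # ASSUMPTION: confidence target = smaller area / larger area
        conf = min(w * h, aw * ah) / max(w * h, aw * ah)
        labels.append([conf, off_x, off_y,
                       math.log(w / aw), math.log(h / ah)] + one_hot)
    return int(row), int(col), labels


row, col, labels = encode_box(0, 208, 208, 270, 254, 13, ANCHORS_13)
print(row, col)   # 6 6
print(labels[0])  # [1.0, 0.5, 0.5, 0.0, 0.0, 1.0, 0.0, 0.0]
```

Each vector has 5 + CLASS_NUM = 8 values, which matches the target shape [2, 13, 13, 3, 8] printed in the training script.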

train

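Training:

Check whether a GPU is available
Import the network model
Load the dataset into a DataLoader
Load previously trained weights if the file exists; otherwise start without them
After fetching a batch, move it to the GPU
Feed the data through the network and compute the loss between the output and the label target:
- Permute the output axes so the later computations line up with the labels
- Separate the positive samples (mask_obj) from the negative samples (mask_noobj)
- BCELoss computes the binary (objectness) loss. Apply sigmoid first: BCELoss expects values between 0 and 1, and this normalization also avoids exploding gradients and numerical precision problems
- MSELoss computes the box regression loss; note that only positive samples are used here
- CrossEntropyLoss computes the multi-class loss (which class the target is); this also uses positive samples only
- The three losses are added with fixed weights to give the loss for one scale
Add the losses of the 13, 26 and 52 scales to get the total loss
Zero the gradients, back-propagate, and update the weights (the usual three steps)
Print the loss for the epoch
Save the weights after every epoch
Visualize the training process with tensorboard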
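To make the weighting concrete, here is a scalar sketch in plain Python of the sigmoid plus binary cross-entropy step and of the weighted sum. The weights follow the loss function shown below (c for objectness, the remainder split equally); the numbers and helper names are made up for illustration.

```python
import math


def sigmoid(x):
    """Squash a raw network output into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y):
    """Binary cross-entropy for one probability p and one label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def total_loss(loss_p, loss_box, loss_cls, c):
    """Weighted sum: c for objectness, (1 - c) split between box and class."""
    return c * loss_p + (1 - c) * 0.5 * loss_box + (1 - c) * 0.5 * loss_cls


loss_p = bce(sigmoid(0.0), 1.0)  # logit 0 -> p = 0.5 -> loss = ln 2
print(round(loss_p, 4))                                # 0.6931
print(round(total_loss(loss_p, 0.2, 0.4, c=0.7), 4))   # 0.5752
```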

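import os

import torch
from torch import nn, optim
from torch.utils.data import DataLoader
# Yolo_V3_Net and YoloDataSet are the network and dataset classes described above

def loss_fn(output, target,c):
    output = output.permute(0, 2, 3, 1) # move channels last: N,45,13,13 ==> N,13,13,45
    output = output.reshape(output.size(0), output.size(1), output.size(2), 3, -1) # N,13,13,3,15

    mask_obj = target[..., 0] > 0 # N,13,13,3; positive samples
    mask_noobj = target[..., 0] == 0 # negative samples (not used below)

    loss_p_fun=nn.BCELoss() # binary (objectness) loss
    loss_p=loss_p_fun(torch.sigmoid(output[...,0]),target[...,0]) # uses both positive and negative samples

    loss_box_fun=nn.MSELoss() # box regression loss, positive samples only
    loss_box=loss_box_fun(output[mask_obj][...,1:5],target[mask_obj][...,1:5])

    loss_segment_fun=nn.CrossEntropyLoss() # multi-class loss, positive samples only
    loss_segment = loss_segment_fun(output[mask_obj][...,5:],torch.argmax(target[mask_obj][...,5:],dim=1))

    # c weights the objectness loss; the remainder is split equally between box and class losses
    loss = c * loss_p + (1-c)*0.5*loss_box+ (1-c)*0.5*loss_segment
    return loss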

if __name__ =='__main__':
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # train on CUDA when available
    dataset = YoloDataSet()
    data_Loader = DataLoader(dataset,batch_size=2,shuffle=True)

    weight_path= 'params/net597.pt' # checkpoint to resume from
    net = Yolo_V3_Net().to(device)
    if os.path.exists(weight_path):
        # map_location lets GPU-saved weights load on a CPU-only machine
        net.load_state_dict(torch.load(weight_path, map_location=device))
    os.makedirs('params', exist_ok=True) # make sure the save folder exists

    opt = optim.Adam(net.parameters())

    epoch = 0
    while True: # runs until interrupted manually
        for target_13, target_26, target_52, img_data in data_Loader:
            # move the batch to the device
            target_13, target_26, target_52, img_data = target_13.to(device), target_26.to(device), target_52.to(device), img_data.to(device)
            # print(target_13.shape) # torch.Size([2, 13, 13, 3, 8])

            output_13, output_26, output_52 = net(img_data)
            loss_13 = loss_fn(output_13.float(), target_13.float(), 0.7)
            loss_26 = loss_fn(output_26.float(), target_26.float(), 0.7)
            loss_52 = loss_fn(output_52.float(), target_52.float(), 0.7)

            loss = loss_13 + loss_26 + loss_52
            opt.zero_grad()
            loss.backward()
            opt.step() # zero gradients, back-propagate, update

            print(epoch,loss.item())

        torch.save(net.state_dict(), f'params/net{epoch}.pt')
        print(f'{epoch} saved')
        epoch+=1

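Detection pipeline

The image goes through the network, and the output has 3 parts (13, 26, 52):

First discard predictions with low confidence, for example anything with confidence below 0.5.
Compute the index and the offset of each detected center point: the index says which grid cell it is in, and the offset says how far it lies from the top-left corner of that cell.
Map the result back to the 416*416 image, and then back to the original image.
W and H: obtained from the anchor box scaled by the predicted ratio.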
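The four steps above can be sketched for a single prediction in plain Python. The formulas mirror the get_true_position method shown further down; here `stride` stands for its `t` argument, and `case` is assumed to be the scale factor from the original image to 416*416. The function name `decode` is illustrative.

```python
import math


def decode(row, col, anchor, pred, stride, case):
    """Decode one prediction into (conf, cx, cy, w, h) in original-image pixels.

    pred is (conf_logit, offset_x, offset_y, log_w_ratio, log_h_ratio).
    """
    conf_logit, tx, ty, tw, th = pred
    conf = 1.0 / (1.0 + math.exp(-conf_logit))  # sigmoid
    cx = (col + tx) * stride / case             # cell index + offset, in pixels
    cy = (row + ty) * stride / case
    w = anchor[0] * math.exp(tw) / case         # anchor scaled by predicted ratio
    h = anchor[1] * math.exp(th) / case
    return conf, cx, cy, w, h


# Cell (6, 6) on the 13*13 grid (stride 32), first 13-scale anchor, no rescaling
print(decode(6, 6, (270, 254), (0.0, 0.5, 0.5, 0.0, 0.0), stride=32, case=1.0))
# (0.5, 208.0, 208.0, 270.0, 254.0)
```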

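init:

Check whether CUDA is available; if so, run the network on the GPU.
Load the trained weights.

forward:

Get the network output.
Call get_index_and_bias to get the indices and offsets of the center points.
Call get_true_position to turn the anchors, indices and offsets into boxes (center, width, height) on the original image, together with the class.

get_index_and_bias: get the indices and offsets of the center points:

Permute and reshape the output so that it has the shape N H W 3 8.
Build a mask of the entries whose confidence exceeds the threshold.
Return the position indices of those entries, and their offsets.

get_true_position: from the anchors, indices and offsets, compute the boxes (center, width, height) on the original image, together with the class.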

import os

import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class Detector(nn.Module):
    def __init__(self):
        super(Detector, self).__init__()
        self.weight_path= 'params/net178.pt'
        self.net = Yolo_V3_Net().to(device)
        if os.path.exists(self.weight_path):
            self.net.load_state_dict(torch.load(self.weight_path, map_location=device))

        self.net.eval() # inference mode: BatchNorm uses its running statistics

    def forward(self, input, thresh, anchors,case):
        output_13, output_26, output_52 = self.net(input)
        idxs_13, bias_13 = self.get_index_and_bias(output_13, thresh)
        boxes_13 = self.get_true_position(idxs_13, bias_13, 32, anchors[13],case)

        idxs_26, bias_26 = self.get_index_and_bias(output_26, thresh)
        boxes_26 = self.get_true_position(idxs_26, bias_26, 16, anchors[26],case)

        idxs_52, bias_52 = self.get_index_and_bias(output_52, thresh)
        boxes_52 = self.get_true_position(idxs_52, bias_52, 8, anchors[52],case)

        return torch.cat([boxes_13,boxes_26,boxes_52],dim=0)

    def get_index_and_bias(self,output,thresh):
        output = output.permute(0, 2, 3, 1) # N C H W ==> N H W C
        output = output.reshape(output.size(0), output.size(1), output.size(2), 3, -1) # N H W 3 (values per anchor)

        # Channel 0 is the confidence logit; sigmoid makes thresh a probability, as in training
        mask = torch.sigmoid(output[...,0])>thresh # N H W 3
        index = mask.nonzero() # (n, row, col, anchor) indices where the mask is True
        bias = output[mask]

        return index,bias

    def get_true_position(self,index,bias,t,anchors,case): # case: scale between the original image and 416*416
        anchors = torch.tensor(anchors, dtype=torch.float32, device=bias.device)
        a = index[:,3] # anchor index of each kept box
        cy = (index[:,1].float()+bias[:,2].float())*t/case # row + y offset, times the stride t (32, 16 or 8), back to the original image
        cx = (index[:,2].float()+bias[:,1].float())*t/case # column + x offset, restored the same way

        w = anchors[a,0]*torch.exp(bias[:,3])/case
        h = anchors[a,1]*torch.exp(bias[:,4])/case

        p = bias[:,0]
        cls_p = bias[:,5:]
        cls_index = torch.argmax(cls_p,dim=1)

        # cast the class index to float so every column shares one dtype
        return torch.stack([torch.sigmoid(p),cx,cy,w,h,cls_index.float()],dim=1)

Full code: https://github.com/cauccliu/YOLOv3Pytorch

https://cauccliu.github.io/2024/04/22/YOLOv3/

Author

Liuchang

Posted on

April 22, 2024
