TensorFlow2—YOLOv2

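I have been studying YOLO for a while and kept banging my head against YOLOv3. In the end I decided to work through YOLOv2 properly first. The theory is not hard to follow, but for someone with a weak background like me the implementation is still somewhat difficult. Enough preamble; let's start with the ideas behind the YOLO algorithm.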

Tensorflow-YoloV2

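1. YOLOv2 paper review
2. tf2-yolov2 implementation
- 2.1 Training data preprocessing
  - 2.1.1 Resizing all input images to one size
  - 2.1.2 Parsing the XML files
  - 2.1.3 Reading the images (image preprocessing)
  - 2.1.4 Ground-truth label formatting (single image)
  - 2.1.5 Ground-truth label formatting (batch of images)
  - 2.1.6 Building the model
  - 2.1.7 Weight initialization
  - 2.1.8 Computing the loss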

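1. YOLOv2 paper review

How YOLOv2 improves on YOLOv1:

1. Batch Normalization. Batch Normalization speeds up convergence and also acts as a regularizer, which reduces overfitting. In YOLOv2 a Batch Normalization layer follows every convolutional layer, and dropout is no longer used. With Batch Normalization, YOLOv2's mAP improves by 2.4%.

2. High Resolution Classifier. Most detection models first pretrain their backbone (the CNN feature extractor) on the ImageNet classification dataset. For historical reasons, ImageNet classifiers mostly take 224x224 images as input, a fairly low resolution that does not suit detection. YOLOv1 therefore pretrains a 224x224 classifier, raises the resolution to 448x448 and fine-tunes at that resolution on the detection dataset. Switching resolution abruptly, however, makes it hard for the detector to adapt quickly. YOLOv2 adds an intermediate step: the classification network is fine-tuned on ImageNet at 448x448 for 10 epochs, so the model is already used to high-resolution input before it is fine-tuned on the detection dataset. With the high-resolution classifier, YOLOv2's mAP improves by about 4%.

3. Convolutional With Anchor Boxes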

2. tf2-yolov2 implementation

2.1 Training data preprocessing

2.1.1 Resizing all input images to one size


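# -*- coding: utf-8 -*-
import os

import cv2


def rebuild(path_src, path_dst, width, height):
    """
    Resize every .jpg/.png image in path_src and save it under path_dst.

    :param path_src: relative path of the source image directory
    :param path_dst: relative path of the output image directory
    :param width: target width in pixels
    :param height: target height in pixels
    :return: None
    """
    i = 1
    image_names = os.listdir(path_src)
    for image in image_names:
        if image.endswith('.jpg') or image.endswith('.png'):
            # os.path.join adds the path separator that plain
            # string concatenation would miss
            img_path = os.path.join(path_src, image)
            save_path = os.path.join(path_dst, image)
            img = cv2.imread(img_path)
            resize_img = cv2.resize(img, (width, height))
            cv2.imwrite(save_path, resize_img)
            print("Resized image", i, ":", save_path)
            i = i + 1


if __name__ == "__main__":
    # relative path of the source images; an absolute path also works
    path_src = "data/train/image"
    # relative path for the resized images; an absolute path also works
    path_dst = "data/train/image_new"
    width = 512
    height = 512
    rebuild(path_src, path_dst, width, height)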

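2.1.2 Parsing the XML files

All label information for an image is stored in an .xml file (generated when annotating with labelImg). The file contains the source image path, the image file name, the object positions (top-left and bottom-right corners, which together form a rectangle) and the class names. We need the image path, the object positions and the class names, so these have to be extracted from the XML label files.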


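import tensorflow as tf
import os, glob
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # TensorFlow log level: hide INFO and WARNING messages, show only errors
import numpy as np
from tensorflow import keras, losses
tf.random.set_seed(2234)  # fix the random seeds so every run starts from the same state
np.random.seed(2234)
import xml.etree.ElementTree as ET
# XML parsing library; used below to return each image path together with
# the position and class of every labelled object

# parse the XML files
def prase_annotation(img_dir, ann_dir, labels):
    # img_dir: directory holding the images
    # ann_dir: directory holding the XML files
    # labels: tuple of class names; objects are stored by name in the XML, so each name is mapped to a number
    # labels: ("sugarbeet", "weed")  sugarbeet is encoded as 1, weed as 2, and 0 is reserved for background
    # the information of all images is kept in one list
    imgs_info = []  # list holding the information of every image
    max_boxes = 0   # largest number of objects found in any single image
    # iterate over all annotation files; each one corresponds to one image
    for ann in os.listdir(ann_dir):
        # ann is the file name of one XML file in ann_dir
        tree = ET.parse(os.path.join(ann_dir, ann))  # parse this XML file
        img_info = dict()  # dictionary holding the content of this annotation file
        img_info["object"] = []
        boxes_counter = 0  # number of objects in this annotation file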
        for elem in tree.iter():  # walk every node of the XML tree
            if "filename" in elem.tag:   # elem.tag is the tag name of the node
                # read the file name and store the image path in the dictionary
                img_info["filename"] = os.path.join(img_dir, elem.text)  # elem.text is the text content of the node
            if "width" in elem.tag:
                img_info["width"] = int(elem.text)
                assert img_info["width"] == 512

            if "height" in elem.tag:
                img_info["height"] = int(elem.text)
                assert img_info["height"] == 512

            if "object" in elem.tag or "part" in elem.tag:  # read one bounding box
                # x1-y1-x2-y2-label
                object_info = [0, 0, 0, 0, 0]  # container for one box
                boxes_counter += 1
                for attr in list(elem):  # iterate over the child nodes
                    if "name" in attr.tag:
                        label = labels.index(attr.text) + 1  # index plus 1, because 0 is background
                        object_info[4] = label

                    if "bndbox" in attr.tag:  # bounding-box coordinates
                        for pos in list(attr):
                            if "xmin" in pos.tag:
                                object_info[0] = int(pos.text)
                            if "ymin" in pos.tag:
                                object_info[1] = int(pos.text)
                            if "xmax" in pos.tag:
                                object_info[2] = int(pos.text)
                            if "ymax" in pos.tag:
                                object_info[3] = int(pos.text)

                img_info["object"].append(object_info)   # one entry per box found

        imgs_info.append(img_info)
        # each entry holds filename, width, height and box info; the list length is the number of samples
        # at this point everything in this XML file has been extracted
        # the boxes are later stored as a matrix (N, 5) = (max_objects_num, 5)
        # update the maximum after each image
        if boxes_counter > max_boxes:
            max_boxes = boxes_counter
    # max_boxes is now the largest box count of any image
    # [b, 40, 5]: first dim is the number of images, second is the largest number of objects in one image,
    # third holds 5 values (four coordinates, then the label)
    boxes = np.zeros([len(imgs_info), max_boxes, 5])
    imgs = []
    for i, img_info in enumerate(imgs_info):
        # [N, 5]
        img_boxes = np.array(img_info["object"])   # img_boxes.shape: [N, 5]
        boxes[i, :img_boxes.shape[0]] = img_boxes
        imgs.append(img_info["filename"])   # path of the image file
        # print(img_info["filename"], boxes[i, :5])
        # imgs: image paths   boxes: [b, 40, 5]
    return imgs, boxes
    # imgs is a list with the path of every image
    # boxes is a 3-D array with the positions and classes of the objects in every image

# test
# obj_names = ["sugarbeet","weed"]
# imgs, boxes = prase_annotation("data/train/image", "data/train/annotation", obj_names)

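prase_annotation returns imgs and boxes. imgs is a list with the path of every image; boxes is a three-dimensional array with the positions and classes of all objects in every image, so its shape is [b, max_boxes, 5].

b: number of images

max_boxes: the largest number of objects in any single image. For example, if image A has 3 objects, image B has 4 and image C has 10, max_boxes is 10.

5: x_min, y_min, x_max, y_max, label (the name field in the XML).

max_boxes exists so that the information from all label files fits in a single array. Images contain different numbers of objects, and without a fixed second dimension the labels could not be stacked. When an image has fewer than max_boxes objects, say image A has 3 and max_boxes is 10, it is treated as having 10: the first three rows keep their real values and the remaining seven are all zeros. This is why boxes is initialized with np.zeros().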
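
The padding idea can be tried in isolation. The following is a minimal sketch, not the article's parser: pad_boxes and the two label lists are hypothetical names and values chosen for illustration.

```python
import numpy as np


def pad_boxes(per_image_boxes):
    """Pack variable-length box lists into one zero-padded array [b, max_boxes, 5]."""
    max_boxes = max(len(b) for b in per_image_boxes)
    boxes = np.zeros([len(per_image_boxes), max_boxes, 5])
    for i, img_boxes in enumerate(per_image_boxes):
        if img_boxes:  # an image without objects keeps its all-zero rows
            boxes[i, :len(img_boxes)] = np.array(img_boxes)
    return boxes


# hypothetical labels: image A has 1 object, image B has 3
image_a = [[10, 20, 110, 220, 1]]
image_b = [[0, 0, 50, 50, 2], [60, 60, 90, 120, 1], [200, 210, 300, 400, 2]]
padded = pad_boxes([image_a, image_b])
print(padded.shape)         # shape is [b, max_boxes, 5]
print(padded[0, 1:].sum())  # padding rows of image A are all zeros
```

Rows that are entirely zero can later be recognized as padding, which is what the w * h > 0 test in section 2.1.4 relies on.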

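2.1.3 Reading the images (image preprocessing)

Training needs the pixel content of the images, not their paths, so each image has to be read from its path. For reading images, the 日月光华 TensorFlow 2.0 course is a useful reference.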


import imgaug as ia
from imgaug import augmenters as iaa

obj_names = ["sugarbeet", "weed"]  # class names, as in the test call of section 2.1.2

def preprocess(img, img_boxes):
    # img: path of one image, as a string tensor
    # img_boxes: matrix of shape [40, 5]
    img = tf.io.read_file(img)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, img_boxes

# build a dataset object
def get_dataset(img_dir, ann_dir, batchsize):
    # returns a tf.data.Dataset
    # imgs is iterable: [b], each element is the path of one image
    # boxes: [b, 40, 5]
    imgs, boxes = prase_annotation(img_dir, ann_dir, obj_names)
    dataset = tf.data.Dataset.from_tensor_slices((imgs, boxes))
    # training needs the image data behind each path, not the path itself
    # input pipeline
    dataset = dataset.shuffle(1000).map(preprocess).batch(batchsize).repeat()
    # with repeat() the dataset never ends and keeps cycling
    return dataset

train_dataset = get_dataset("data/train/image", "data/train/annotation", batchsize=12)

# data augmentation
# after a batch is loaded, every image is transformed and its boxes are transformed with it
def augmentation_generator(yolo_dataset):
    for batch in yolo_dataset:
        # conversion tensor->numpy
        img = batch[0].numpy()
        boxes = batch[1].numpy()
        # conversion bbox numpy->ia object
        ia_boxes = []
        for i in range(img.shape[0]):
            ia_bbs = [ia.BoundingBox(x1=bb[0],
                                       y1=bb[1],
                                       x2=bb[2],
                                       y2=bb[3]) for bb in boxes[i]
                      if (bb[0] + bb[1] + bb[2] + bb[3] > 0)]
            ia_boxes.append(ia.BoundingBoxesOnImage(ia_bbs, shape=(512, 512)))
        # data augmentation
        seq = iaa.Sequential([
            iaa.Fliplr(0.5),
            iaa.Flipud(0.5),
            iaa.Multiply((0.4, 1.6)), # change brightness
            #iaa.ContrastNormalization((0.5, 1.5)),
            #iaa.Affine(translate_px={"x": (-100,100), "y": (-100,100)}, scale=(0.7, 1.30))
            ])
        #seq = iaa.Sequential([])
        seq_det = seq.to_deterministic()
        img_aug = seq_det.augment_images(img)
        img_aug = np.clip(img_aug, 0, 1)
        boxes_aug = seq_det.augment_bounding_boxes(ia_boxes)
        # conversion ia object -> bbox numpy
        for i in range(img.shape[0]):
            boxes_aug[i] = boxes_aug[i].remove_out_of_image().clip_out_of_image()
            for j, bb in enumerate(boxes_aug[i].bounding_boxes):
                boxes[i,j,0] = bb.x1
                boxes[i,j,1] = bb.y1
                boxes[i,j,2] = bb.x2
                boxes[i,j,3] = bb.y2
        # conversion numpy->tensor
        batch = (tf.convert_to_tensor(img_aug), tf.convert_to_tensor(boxes))
        #batch = (img_aug, boxes)
        yield batch
aug_train_dataset=augmentation_generator(train_dataset)

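The images are read with TensorFlow's own tf.io.read_file, so no Python for loop over individual files is needed, and decoded with tf.image.decode_png. If your training images are JPEG, use tf.image.decode_jpeg instead. tf.image.convert_image_dtype(x, tf.float32) normalizes the values to [0, 1] and converts the data to tf.float32 in one step.

To make training more convenient we build a TensorFlow input pipeline (a tf.data.Dataset) that holds the decoded image data together with the labels. This also keeps every image paired with its own labels, so images and labels cannot get mixed up.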
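
To see what the normalization step amounts to, here is a small NumPy sketch of the scaling that tf.image.convert_image_dtype applies to uint8 input (division by 255). to_float01 is a hypothetical helper used only for illustration; it is not part of TensorFlow.

```python
import numpy as np


def to_float01(img_uint8):
    """Scale uint8 pixels to float32 in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0


pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = to_float01(pixels)
print(scaled.dtype, scaled.min(), scaled.max())
```

This is also why the augmentation code clips the augmented images with np.clip(img_aug, 0, 1): brightness scaling can push values outside the normalized range.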

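2.1.4 Ground-truth label formatting (single image)

Once a single image can be processed, a batch is handled by calling the same function repeatedly. The output of an object detector is not a two-dimensional tensor; this YOLOv2 model outputs a five-dimensional tensor [batch, 16, 16, 5, 7], while our ground-truth labels are [batch, max_boxes, 5]. The two shapes do not match, so they cannot be compared directly and the loss cannot be computed. To compare the ground truth with the network prediction, the ground-truth labels have to be reshaped.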

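Before reshaping the ground-truth labels, we need to know what the YOLOv2 loss consists of.

The YOLOv2 loss has three parts:
1. Coordinate loss: x, y, w, h
2. Class loss: class, set according to your own labels
3. Confidence loss: confidence, the IoU between the anchors and the ground-truth boxes

For the loss, four variables have to be prepared in advance: the ground-truth mask (detector_mask), the five-dimensional ground-truth label tensor (matching_gt_box), the converted three-dimensional ground-truth label tensor (gt_boxes_grid), and a five-dimensional tensor that contains only the classes (matching_classes_oh).

The code is as follows: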


IMGSIZE = 512
GRIDSIZE = 16
ANCHORS = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]  # predefined anchors


def process_true_boxes(gt_boxes, anchors):  # write the boxes of one image into the grid
    # 512 // 16 = 32, i.e. a downsampling factor of 32
    scale = IMGSIZE // GRIDSIZE
    # reshape the anchors to a [5, 2] matrix, one anchor (w, h) per row
    anchors = np.array(anchors).reshape((5, 2))

    # build the empty grids
    # detector_mask marks, for each of the 5 anchors in every grid cell,
    # whether that anchor is responsible for an object
    detector_mask = np.zeros([GRIDSIZE, GRIDSIZE, 5, 1])
    # for a responsible anchor we also store the matched ground-truth box: x-y-w-h-label
    # this is the ground truth laid out on the [16, 16, 5] output grid,
    # which is compared with the prediction to compute the loss
    matching_gt_box = np.zeros([GRIDSIZE, GRIDSIZE, 5, 5])  # the first 5 indexes the anchors
    # [N, 5] x1-y1-x2-y2-l => x-y-w-h-l
    # i.e. convert top-left / bottom-right corners to a center point plus width and height,
    # because the network predicts centers, widths and heights, not corners
    gt_boxes_grid = np.zeros(gt_boxes.shape)
    # tensor => numpy, which is easier to iterate over
    gt_boxes = gt_boxes.numpy()

    for i, box in enumerate(gt_boxes):   # [40, 5]
        # box: [5], x1-y1-x2-y2-l
        # 512 => 16: divide by the downsampling factor to map image coordinates onto the feature map
        # center, width and height on the 16x16 grid
        x = ((box[0] + box[2]) / 2) / scale
        y = ((box[1] + box[3]) / 2) / scale
        w = (box[2] - box[0]) / scale
        h = (box[3] - box[1]) / scale
        # write the converted center, width and height into row i
        gt_boxes_grid[i] = np.array([x, y, w, h, box[4]])
        # gt_boxes_grid is the ground truth in the converted format for one image;
        # it is used for the confidence loss in the loss section
        if w * h > 0:
            best_anchor = 0
            best_iou = 0
            for j in range(5):
                # IoU between the ground-truth box and each of the 5 anchors; keep the best anchor
                # both boxes are treated as sharing a corner, so the intersection is
                # the smaller width times the smaller height
                interct = np.minimum(w, anchors[j, 0]) * np.minimum(h, anchors[j, 1])
                union = w * h + (anchors[j, 0] * anchors[j, 1]) - interct
                iou = interct / union

                if iou > best_iou:
                    best_anchor = j  # remember the index of the better anchor
                    best_iou = iou
            if best_iou > 0:
                # a match exists, so write the data into the corresponding grid cell
                # floor the center to get the index of the cell that contains it
                x_coord = np.floor(x).astype(np.int32)
                y_coord = np.floor(y).astype(np.int32)

                # set the best anchor to 1; the other anchors stay 0
                # image coordinates and array indices are transposed: [x, y] => [y, x]
                # a 1 in detector_mask means this anchor of this cell matches the object whose
                # center falls in the cell, i.e. the cell contains a ground-truth center
                detector_mask[y_coord, x_coord, best_anchor] = 1

                # [b, h, w, 5, x-y-w-h-l]
                # store [x_center, y_center, w, h, label] at the best anchor; the other anchors stay 0
                # matching_gt_box is what gets compared with the network prediction
                matching_gt_box[y_coord, x_coord, best_anchor] = np.array([x, y, w, h, box[4]])

    # matching_gt_box: ground-truth box for each of the 5 anchor slots, [16,16,5,5], used for the loss
    # detector_mask: whether an anchor slot holds a box, [16,16,5,1]
    # gt_boxes_grid: the same labels in converted format (x-y-w-h-l on the grid) for one image
    return matching_gt_box, detector_mask, gt_boxes_grid

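1. In the .xml label files a box is recorded as [x_min, y_min, x_max, y_max]. It has to be converted to [x_center, y_center, w, h], because that is the format of the network output, and the anchors are also given as width and height. Note: from here on x_center and y_center are written simply as x and y. In the network output, x and y are not coordinates but offsets, which is why a 16x16 coordinate grid has to be built later; w and h are scale factors.

2. gt_boxes_grid is the ground truth in the converted format, with shape [max_boxes, 5], where 5 is [x, y, w, h, label]. It holds the information of one image and is extended to a batch later. It is used for the confidence loss in the loss section.

3. After the conversion we have the center [x, y] and size [w, h] of every ground-truth box. The final network output is a 16x16 grid with 5 anchors per cell. For each object, the IoU between the object and each of the 5 anchors of the cell containing its center is computed (the IoU computation itself is not repeated here). The IoU decides which of the 5 anchors matches the ground-truth box best.

4. Because of max_boxes, not all rows of gt_boxes, shape [max_boxes, 5], are valid. As explained earlier, an image with fewer than max_boxes objects is padded with zeros, so every all-zero row in gt_boxes is invalid. The test if w * h > 0 filters these rows out. A loop then picks the anchor with the largest IoU among the 5 and records its index and IoU.

5. In an array the first dimension is the row and the second the column; a[4, 3], for instance, has 4 rows and 3 columns. In image coordinates the horizontal axis is x and the vertical axis is y, so y is the row and x is the column. The assignment therefore puts y first and x second: detector_mask[y_coord, x_coord, best_anchor] = 1. The IoU computed before gives the index of the best-matching anchor, and that anchor receives the corresponding values. A 1 in detector_mask means this anchor of this cell matches the object falling in the cell well, i.e. its IoU is the largest; equivalently, the cell contains a ground-truth center. matching_gt_box stores the position and label [x, y, w, h, label] at the best-matching anchor and is the variable compared with the network prediction.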
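
The conversion and anchor matching described in the points above can be checked by hand on one box. This is a pure-Python sketch using the article's anchors and grid scale; to_grid_box, best_anchor and the pixel box are hypothetical, chosen only to make the arithmetic easy to follow.

```python
ANCHORS = [(0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434),
           (7.88282, 3.52778), (9.77052, 9.16828)]
SCALE = 512 // 16  # 32 pixels per grid cell


def to_grid_box(x_min, y_min, x_max, y_max):
    """Corner format in pixels -> center format in grid units."""
    x = (x_min + x_max) / 2 / SCALE
    y = (y_min + y_max) / 2 / SCALE
    w = (x_max - x_min) / SCALE
    h = (y_max - y_min) / SCALE
    return x, y, w, h


def best_anchor(w, h):
    """Index and IoU of the anchor whose shape overlaps (w, h) the most."""
    best_j, best_iou = 0, 0.0
    for j, (aw, ah) in enumerate(ANCHORS):
        inter = min(w, aw) * min(h, ah)
        union = w * h + aw * ah - inter
        iou = inter / union
        if iou > best_iou:
            best_j, best_iou = j, iou
    return best_j, best_iou


# hypothetical 64x64 pixel box
x, y, w, h = to_grid_box(96, 64, 160, 128)
j, iou = best_anchor(w, h)
row, col = int(y), int(x)  # cell that contains the center: [y, x] order
print((x, y, w, h), j, round(iou, 3), (row, col))
```

The box is 2x2 grid cells, so the second anchor (about 1.87 x 2.06) wins, and the label would be written to matching_gt_box[row, col, j].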

2.1.5 Ground-truth label formatting (batch of images)


IMGSIZE = 512
GRIDSIZE = 16
ANCHORS = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]  # predefined anchors

def ground_truth_generator(dataset):
    # training-data iterator; the number of images per step is set by the batch size
    # dataset: training pipeline with the image data and the labels [x_min, y_min, x_max, y_max, label]

    for imgs, imgs_boxes in dataset:
        # imgs: [batchsize, 512, 512, 3]
        # imgs_boxes: [batchsize, 40, 5]; 40 is only an example, the real value depends on the data

        # three per-batch lists, one for each per-image variable of the function above
        batch_matching_gt_box = []
        batch_detector_mask = []
        batch_gt_boxes_grid = []
        b = imgs.shape[0]  # number of images in this batch

        for i in range(b):  # for each image
            matching_gt_box, detector_mask, gt_boxes_grid = process_true_boxes(imgs_boxes[i], ANCHORS)
            batch_matching_gt_box.append(matching_gt_box)
            batch_detector_mask.append(detector_mask)
            batch_gt_boxes_grid.append(gt_boxes_grid)
        # stack and convert to a tensor, [b, 16, 16, 5, 1]
        detector_mask = tf.cast(np.array(batch_detector_mask), dtype=tf.float32)
        # stack and convert to a tensor, [b, 16, 16, 5, 5] x_center-y_center-w-h-l
        matching_gt_box = tf.cast(np.array(batch_matching_gt_box), dtype=tf.float32)
        # stack and convert to a tensor, [b, 40, 5] x_center-y_center-w-h-l
        gt_boxes_grid = tf.cast(np.array(batch_gt_boxes_grid), dtype=tf.float32)

        # [b, 16, 16, 5]
        # split off the labels; they are used for the class loss
        matching_classes = tf.cast(matching_gt_box[..., 4], dtype=tf.int32)
        # one-hot encode the labels, [b, 16, 16, 5, num_classes: 3]
        matching_classes_oh = tf.one_hot(matching_classes, depth=3)
        # drop the background column (background is 0)
        # x_center-y_center-w-h-conf-l0-l1-l2 => x_center-y_center-w-h-conf-l1-l2
        # [b, 16, 16, 5, 2]
        matching_classes_oh = tf.cast(matching_classes_oh[..., 1:], dtype=tf.float32)

        # [b, 512, 512, 3]
        # [b, 16, 16, 5, 1]
        # [b, 16, 16, 5, 5]
        # [b, 16, 16, 5, 2]
        # [b, 40, 5]
        # yield the one-hot classes, which is what the [b, 16, 16, 5, 2] shape above and the loss expect
        yield imgs, detector_mask, matching_gt_box, matching_classes_oh, gt_boxes_grid

train_gen = ground_truth_generator(aug_train_dataset)

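1. Besides stacking the per-image label variables into batch-sized variables, a class variable has to be created. As mentioned earlier, it is used by the class loss, i.e. for classification.

2. Separating the classes into their own variable is straightforward. matching_gt_box has shape [b, 16, 16, 5, 5], and its last dimension holds the coordinates (x, y, w, h) and the class (label) of the true object, so taking the fifth value of the last dimension is enough, as the code above shows. That is not the end of it, though. The network output has shape [b, 16, 16, 5, 7]; the training set has only 2 classes, so the 7 stands for x-y-w-h-confidence-label1-label2, without background (adjust this to your own number of classes). The labels, however, have 3 classes: background, label1 and label2. Although the network output contains no background, in object detection background counts as a class by default, which is why the XML-parsing section adds 1 to every label index: 0 is reserved for background.

Because the network output contains no background, the background has to be removed from the ground-truth labels as well. This is simple: one-hot encode matching_classes into matching_classes_oh with shape [b, 16, 16, 5, 3]; the first value of its last dimension is the background class, so a slice removes it, as in the code. The final matching_classes_oh has shape [b, 16, 16, 5, 2], with last-dimension values [1, 0] for label1, [0, 1] for label2 and [0, 0] for background, which also means the anchor holds no true object.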
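
The effect of the one-hot encoding followed by the slice can be seen on three labels. This is a NumPy sketch of the same idea, not the tf.one_hot call used in the article; np.eye(3)[labels] is simply a compact way to one-hot encode integer labels.

```python
import numpy as np

labels = np.array([0, 1, 2])                   # background, label1, label2
one_hot = np.eye(3, dtype=np.float32)[labels]  # shape [3, 3], one row per label
no_background = one_hot[..., 1:]               # drop column 0 -> shape [3, 2]
print(no_background)
```

Background becomes the all-zero row, so it contributes no positive target to either class.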

2.1.6 Building the model


import tensorflow as tf
from tensorflow.keras import layers
input_image = layers.Input((IMGSIZE,IMGSIZE,3),dtype="float32")
import  tensorflow.keras.backend as K

class SpaceToDepth(layers.Layer):
 
    def __init__(self, block_size, **kwargs):
        self.block_size = block_size
        super(SpaceToDepth, self).__init__(**kwargs)
 
    def call(self, inputs):
        x = inputs
        batch, height, width, depth = K.int_shape(x)
        batch = -1
        reduced_height = height // self.block_size
        reduced_width = width // self.block_size
        y = K.reshape(x, (batch, reduced_height, self.block_size,
                             reduced_width, self.block_size, depth))
        z = K.permute_dimensions(y, (0, 1, 3, 2, 4, 5))
        t = K.reshape(z, (batch, reduced_height, reduced_width, depth * self.block_size **2))
        return t
 
    def compute_output_shape(self, input_shape):
        shape =  (input_shape[0], input_shape[1] // self.block_size, input_shape[2] // self.block_size,
                  input_shape[3] * self.block_size **2)
        return tf.TensorShape(shape)
 
# input_image = layers.Input((512,512, 3), dtype='float32')
input_image = tf.keras.Input(shape=(512, 512, 3))
# unit1
# [512, 512, 3] => [512, 512, 32]
x = layers.Conv2D(32, (3,3), strides=(1,1),padding='same', name='conv_1', use_bias=False)(input_image)
x = layers.BatchNormalization(name='norm_1')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# [512, 512, 32] => [256, 256, 32]
x = layers.MaxPooling2D(pool_size=(2,2))(x)
# unit2
# [256, 256, 32] => [256, 256, 64]
x = layers.Conv2D(64, (3,3), strides=(1,1), padding='same', name='conv_2',use_bias=False)(x)
x = layers.BatchNormalization(name='norm_2')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# [256, 256, 64] => [128, 128, 64]
x = layers.MaxPooling2D(pool_size=(2,2))(x)
# Layer 3
# [128, 128, 64] => [128, 128, 128]
x = layers.Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_3', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_3')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# Layer 4
# [128, 128, 128] => [128, 128, 64]
x = layers.Conv2D(64, (1,1), strides=(1,1), padding='same', name='conv_4', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_4')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# Layer 5
# [128, 128, 64] => [128, 128, 128]
x = layers.Conv2D(128, (3,3), strides=(1,1), padding='same', name='conv_5', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_5')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# [128, 128, 128] => [64, 64, 128]
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
# Layer 6
# [64, 64, 128] => [64, 64, 256]
x = layers.Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_6', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_6')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# Layer 7
# [64, 64, 256] => [64, 64, 128]
x = layers.Conv2D(128, (1,1), strides=(1,1), padding='same', name='conv_7', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_7')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# Layer 8
# [64, 64, 128] => [64, 64, 256]
x = layers.Conv2D(256, (3,3), strides=(1,1), padding='same', name='conv_8', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_8')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# [64, 64, 256] => [32, 32, 256]
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
# Layer 9
# [32, 32, 256] => [32, 32, 512]
x = layers.Conv2D(512, (3, 3), strides=(1, 1), padding='same', name='conv_9', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_9')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
# Layer 10
# [32, 32, 512] => [32, 32, 256]
x = layers.Conv2D(256, (1, 1), strides=(1, 1), padding='same', name='conv_10', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_10')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 11
# [32, 32, 256] => [32, 32, 512]
x = layers.Conv2D(512, (3, 3), strides=(1, 1), padding='same', name='conv_11', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_11')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 12
# [32, 32, 512] => [32, 32, 256]
x = layers.Conv2D(256, (1, 1), strides=(1, 1), padding='same', name='conv_12', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_12')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 13
# [32, 32, 256] => [32, 32, 512]
x = layers.Conv2D(512, (3, 3), strides=(1, 1), padding='same', name='conv_13', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_13')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# for skip connection: kept for the concatenation further down
skip_x = x  # [b,32,32,512]
# [32, 32, 512] => [16, 16, 512]
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
 
# Layer 14
# [16, 16, 512] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_14', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_14')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 15
# [16, 16, 1024] => [16, 16, 512]
x = layers.Conv2D(512, (1, 1), strides=(1, 1), padding='same', name='conv_15', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_15')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 16
# [16, 16, 512] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_16', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_16')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 17
# [16, 16, 1024] => [16, 16, 512]
x = layers.Conv2D(512, (1, 1), strides=(1, 1), padding='same', name='conv_17', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_17')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 18
# [16, 16, 512] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_18', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_18')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 19
# [16, 16, 1024] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_19', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_19')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 20
# [16, 16, 1024] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_20', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_20')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
 
# Layer 21
# [32, 32, 512] => [32, 32, 64]
skip_x = layers.Conv2D(64, (1, 1), strides=(1, 1), padding='same', name='conv_21', use_bias=False)(skip_x)
skip_x = layers.BatchNormalization(name='norm_21')(skip_x)
skip_x = layers.LeakyReLU(alpha=0.1)(skip_x)
 
# [32, 32, 64] => [16, 16, 64*2*2]
skip_x = SpaceToDepth(block_size=2)(skip_x)
 
# concat
# [16,16,256], [16,16,1024] => [16,16,1280]
x = layers.concatenate([skip_x, x], axis=-1)
 
# Layer 22
# [16,16,1280] => [16, 16, 1024]
x = layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv_22', use_bias=False)(x)
x = layers.BatchNormalization(name='norm_22')(x)
x = layers.LeakyReLU(alpha=0.1)(x)
x = layers.Dropout(0.5)(x)  # add dropout
# [16,16,5,7] => [16,16,35]
 
# [16, 16, 1024] => [16, 16, 35]
x = layers.Conv2D(5 * 7, (1, 1), strides=(1, 1), padding='same', name='conv_23')(x)
 
# [16, 16, 35] => [16, 16, 5, 7]
output = layers.Reshape((GRIDSIZE, GRIDSIZE, 5, 7))(x)
# create model
model = tf.keras.models.Model(input_image, output)
x=tf.random.normal((4,512,512,3))
out=model(x)

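The network is a modified Darknet-19; the input is [512, 512, 3] and the output is [16, 16, 5, 7]. Layer 21 leads into a concatenation of the outputs of layer 13 and layer 20. Layer 13 outputs [32, 32, 512] and layer 20 outputs [16, 16, 1024], so the layer-13 branch has to be brought down to a 16x16 spatial size before the two can be joined. A custom layer class (SpaceToDepth) performs this change of shape.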
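
What the custom layer does to the data is easiest to see on a tiny array. The sketch below repeats the reshape, permute, reshape sequence of SpaceToDepth in NumPy; space_to_depth is a hypothetical stand-alone function, not the Keras layer itself.

```python
import numpy as np


def space_to_depth(x, block_size):
    """Move each block_size x block_size spatial block into the channel axis."""
    batch, height, width, depth = x.shape
    rh, rw = height // block_size, width // block_size
    y = x.reshape(batch, rh, block_size, rw, block_size, depth)
    z = y.transpose(0, 1, 3, 2, 4, 5)
    return z.reshape(batch, rh, rw, depth * block_size ** 2)


# a 4x4 single-channel "image" holding the values 0..15
x = np.arange(16).reshape(1, 4, 4, 1)
out = space_to_depth(x, 2)
print(out.shape)     # spatial size halves, channels grow by 4
print(out[0, 0, 0])  # the top-left 2x2 block, flattened

# the shapes of the skip branch in the model
skip = space_to_depth(np.zeros((1, 32, 32, 64)), 2)
print(skip.shape)
```

No values are discarded: the skip branch keeps all of its information and only trades spatial resolution for channels, which is what allows it to be concatenated with the 16x16 main branch.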

2.1.7 Weight initialization


class WeightReader:   # reads the pretrained weight file sequentially
    def __init__(self, weight_file):
        self.offset = 4
        self.all_weights = np.fromfile(weight_file, dtype='float32')

    def read_bytes(self, size):
        self.offset = self.offset + size
        return self.all_weights[self.offset - size:self.offset]

    def reset(self):
        self.offset = 4
weight_reader = WeightReader('yolo.weights')


weight_reader.reset()
nb_conv = 23

for i in range(1, nb_conv + 1):
    conv_layer = model.get_layer('conv_' + str(i))
    conv_layer.trainable = True

    if i < nb_conv:
        norm_layer = model.get_layer('norm_' + str(i))
        norm_layer.trainable = True

        size = np.prod(norm_layer.get_weights()[0].shape)

        beta = weight_reader.read_bytes(size)
        gamma = weight_reader.read_bytes(size)
        mean = weight_reader.read_bytes(size)
        var = weight_reader.read_bytes(size)

        weights = norm_layer.set_weights([gamma, beta, mean, var])

    if len(conv_layer.get_weights()) > 1:
        bias = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[1].shape))
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2, 3, 1, 0])
        conv_layer.set_weights([kernel, bias])
    else:
        kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
        kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
        kernel = kernel.transpose([2, 3, 1, 0])
        conv_layer.set_weights([kernel])

layer = model.layers[-2]  # last convolutional layer
# print(layer.name)
layer.trainable = True

weights = layer.get_weights()

new_kernel = np.random.normal(size=weights[0].shape) / (GRIDSIZE * GRIDSIZE)
new_bias = np.random.normal(size=weights[1].shape) / (GRIDSIZE * GRIDSIZE)

layer.set_weights([new_kernel, new_bias])
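
The reshape and transpose applied to each kernel above can be checked on a dummy array. This sketch assumes, as the loading code implies, that the weight file stores a kernel flat in (out_channels, in_channels, height, width) order while Keras expects (height, width, in_channels, out_channels); the shape used here is only an example.

```python
import numpy as np

keras_shape = (3, 3, 32, 64)  # (height, width, in_channels, out_channels)
flat = np.arange(np.prod(keras_shape), dtype=np.float32)  # stand-in for values read from the file

# interpret the flat data in the file's order, then reorder the axes for Keras
kernel = flat.reshape(list(reversed(keras_shape)))  # (64, 32, 3, 3)
kernel = kernel.transpose([2, 3, 1, 0])             # (3, 3, 32, 64)
print(kernel.shape)

# element [a, b, c, d] of the Keras kernel comes from file position
# d * (32 * 3 * 3) + c * (3 * 3) + a * 3 + b
print(kernel[1, 2, 5, 7])
```

A plain reshape to keras_shape would produce the right shape but scramble the values, which is why the transpose is needed.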

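2.1.8 Computing the loss

The loss for object detection differs considerably from a classification loss. A detector outputs the object's coordinates, class and confidence, so training needs a loss term for each of the three. This is arguably the most important and most complicated part of the whole detector.

1) Building the grid coordinates

Computing the coordinate loss requires a coordinate system, which has to be built before training. It is 16x16, i.e. 16 along the x axis and 16 along the y axis. The code is: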


x_grid = tf.tile(tf.range(GRIDSIZE), [GRIDSIZE])
# [1,16,16,1,1]
# [b,16,16,5,2]
x_grid = tf.reshape(x_grid, (1, GRIDSIZE, GRIDSIZE, 1, 1))
x_grid = tf.cast(x_grid, tf.float32)
# [1,16_1,16_2,1,1]=>[1,16_2,16_1,1,1]
y_grid = tf.transpose(x_grid, (0, 2, 1, 3, 4))
# [1,16_2,16_1,1,1] => [1, 16, 16, 1, 2]
xy_grid = tf.concat([x_grid, y_grid], axis=-1)
# [1,16,16,1,2]=> [b,16,16,5,2]
xy_grid = tf.tile(xy_grid, [y_pred.shape[0], 1, 1, 5, 1])

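The last dimension of xy_grid stores the coordinate values, 256 pairs from [0, 0] to [15, 15]. The grid is needed because the x, y that the network predicts are not coordinates but offsets: after the activation function they still have to be added to the grid to become actual coordinates. For example, if the activated prediction at [0, 1, 1, 0, 0:2] is (0.3, 0.4), adding the grid gives a center of (1.3, 1.4), which is the absolute coordinate on the grid. In case the index [0, 1, 1, 0, 0:2] is unclear: 0 is the first image (indices start at 0); 1, 1 is the cell in row 2, column 2 of the 16x16 output grid; 0 is the first anchor in that cell; and 0:2 selects that anchor's x and y values.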
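
The example in the paragraph above can be reproduced in a few lines of pure Python. decode_center is a hypothetical helper; the raw values are chosen so that the sigmoid yields exactly the offsets 0.3 and 0.4 from the text.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_center(raw_x, raw_y, row, col):
    """Raw network output for one anchor -> absolute center on the 16x16 grid."""
    return col + sigmoid(raw_x), row + sigmoid(raw_y)


# raw values whose sigmoid is 0.3 and 0.4 (the inverse of sigmoid is the logit)
raw_x = math.log(0.3 / 0.7)
raw_y = math.log(0.4 / 0.6)
cx, cy = decode_center(raw_x, raw_y, row=1, col=1)
print(round(cx, 6), round(cy, 6))
```

Because the sigmoid output lies between 0 and 1, a decoded center can never leave the cell that predicted it.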

2) Computing the coordinate loss


def yolo_loss(detector_mask, matching_gt_boxes, matching_class_oh, gt_boxes_grid, y_pred):
    # detector_mask: whether each anchor holds an object  [b,16,16,5,1]
    # matching_gt_boxes: position of the object matched to each anchor  [b,16,16,5,5] x-y-w-h-l
    # matching_class_oh: one-hot classes for label1 and label2  [b,16,16,5,2]
    # gt_boxes_grid: [b,40,5] x-y-w-h-l
    # y_pred: [b,16,16,5,7] x-y-w-h-conf-l1-l2, the raw network output; sigmoid / exp are applied below


    # 4.1 coordinate loss
    anchors = np.array(ANCHORS, dtype=np.float32).reshape(5, 2)
    x_grid = tf.tile(tf.range(GRIDSIZE), [GRIDSIZE])
    # [1,16,16,1,1]
    # [b,16,16,5,2]
    x_grid = tf.reshape(x_grid, (1, GRIDSIZE, GRIDSIZE, 1, 1))
    x_grid = tf.cast(x_grid, tf.float32)
    # [1,16_1,16_2,1,1]=>[1,16_2,16_1,1,1]
    y_grid = tf.transpose(x_grid, (0, 2, 1, 3, 4))
    # [1,16_2,16_1,1,1] => [1, 16, 16, 1, 2]
    xy_grid = tf.concat([x_grid, y_grid], axis=-1)
    # [1,16,16,1,2]=> [b,16,16,5,2]
    xy_grid = tf.tile(xy_grid, [y_pred.shape[0], 1, 1, 5, 1])

    # [b,16,16,5,7] x-y-w-h-conf-l1-l2
    # the raw x, y are offsets, neither relative nor absolute positions
    # the sigmoid turns them into a position relative to the cell
    pred_xy = tf.sigmoid(y_pred[..., 0:2])
    # adding the grid coordinates gives the absolute position
    # [b,16,16,5,2]
    pred_xy = pred_xy + xy_grid
    # [b,16,16,5,2]
    pred_wh = tf.exp(y_pred[..., 2:4])
    # [b,16,16,5,2] * [5,2] => [b,16,16,5,2]
    # w, h are scale factors; multiplied by the anchors they give width and height
    pred_wh = pred_wh * anchors

    # number of ground-truth boxes, used for averaging
    # detector_mask only contains 0 and 1, so summing it directly also works
    n_detector_mask = tf.reduce_sum(tf.cast(detector_mask > 0., tf.float32))  # option 1
    # n_detector_mask = tf.reduce_sum(detector_mask)  # option 2
    # print("number of ground-truth boxes:", float(n_detector_mask))
    # [b,16,16,5,1] * [b,16,16,5,2]
    # only positions with an object contribute to the loss, hence the mask
    xy_loss = detector_mask * tf.square(matching_gt_boxes[..., :2] - pred_xy) / (n_detector_mask + 1e-6)
    xy_loss = tf.reduce_sum(xy_loss)
    wh_loss = detector_mask * tf.square(tf.sqrt(matching_gt_boxes[..., 2:4]) -
                                        tf.sqrt(pred_wh)) / (n_detector_mask + 1e-6)
    wh_loss = tf.reduce_sum(wh_loss)

    # 1. coordinate loss
    coord_loss = xy_loss + wh_loss

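1. Computing x, y (always the center values from here on): the raw prediction is an offset. The sigmoid() activation maps it to a relative position in the range 0 to 1, and adding the grid coordinates gives the absolute position of the prediction.

2. Computing w, h: the predicted width and height do not go through sigmoid. Instead pred_wh = exp(pred_wh), i.e. e raised to the predicted value, and the result is multiplied by the anchors to give the final w, h.

3. Counting the true objects: only anchors that hold an object contribute to the loss. The mask detector_mask computed earlier tells which anchors those are. Because the loss is averaged at the end, the number of true objects has to be computed first.

4. x, y loss: a squared-error loss. The squared error is computed for every anchor in every cell, and multiplying by detector_mask keeps only the anchors that hold an object.

5. w, h loss: same as for x, y, except that, as the original YOLO paper describes, the square root of w and h is taken before the squared error. Multiplying by the mask and summing gives the w, h loss.

6. Coordinate loss: the x, y loss and the w, h loss are added to give the final coordinate loss.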
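
The masking and averaging in steps 3 to 6 can be followed on two anchors with hand-picked numbers. This is a NumPy sketch of the same arithmetic, not the TensorFlow loss itself; coord_loss and all values are hypothetical.

```python
import numpy as np


def coord_loss(mask, true_xywh, pred_xy, pred_wh):
    """Masked squared error on x, y plus masked squared error on sqrt(w), sqrt(h)."""
    n = mask.sum()  # number of anchors that hold an object
    xy = (mask * (true_xywh[..., :2] - pred_xy) ** 2).sum() / (n + 1e-6)
    wh = (mask * (np.sqrt(true_xywh[..., 2:4]) - np.sqrt(pred_wh)) ** 2).sum() / (n + 1e-6)
    return xy + wh


# two anchors; only the first one holds an object
mask = np.array([[1.0], [0.0]])
true_xywh = np.array([[4.0, 3.0, 4.0, 1.0], [0.0, 0.0, 0.0, 0.0]])
pred_xy = np.array([[4.5, 3.0], [9.0, 9.0]])
pred_wh = np.array([[1.0, 1.0], [5.0, 5.0]])
loss = coord_loss(mask, true_xywh, pred_xy, pred_wh)
print(loss)
```

The first anchor contributes 0.5^2 = 0.25 for x, y and (sqrt(4) - sqrt(1))^2 = 1 for w, h; the second anchor's large errors are removed by the mask.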

3) Computing the class loss


    # 4.2 class loss
    # first extract the predicted class scores
    # [b,16,16,5,2]
    pred_box_class = y_pred[..., 5:]
    # true class index of the box matched to each anchor
    # [b,16,16,5,2] => [b,16,16,5]
    true_box_class = tf.argmax(matching_class_oh, -1)
    class_loss = losses.sparse_categorical_crossentropy(true_box_class, pred_box_class, from_logits=True)

    # categorical_crossentropy needs one-hot labels instead;
    # in tests the two losses gave very similar results
    # class_loss = losses.categorical_crossentropy(y_true=matching_class_oh,
    #                                              y_pred=pred_box_class,
    #                                              from_logits=True)
    # [b,16,16,5] => [b,16,16,5,1] * [b,16,16,5,1]
    # add a dimension and multiply element-wise to keep only anchors that hold an object
    class_loss = tf.expand_dims(class_loss, -1) * detector_mask
    # average, i.e. the class loss per object
    class_loss = tf.reduce_sum(class_loss) / (n_detector_mask + 1e-6)

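This is computed exactly as in image classification: the true labels are compared with the predicted labels using a cross-entropy loss. It is also why the background class was removed in the previous section: a classifier has no background class, and background cannot be trained as a class.

4) Computing the confidence loss

Confidence is the probability that a given anchor in a given grid cell contains an object. Take the second anchor of the cell in row 2, column 2, and call it Little Y. During training the network looks at Little Y and says: you only have a 30% chance of holding an object, not very credible. That 30% is Little Y's predicted confidence. So how is Little Y's true confidence computed? First, the two terms: the predicted confidence is what the network outputs, and the true confidence is the IoU between the ground-truth box and the predicted box. The true confidence only exists during training, which is also the only time a loss is computed. Computing it is simple: the ground-truth object has [x, y, w, h] and Little Y has its own [x, y, w, h], so the IoU (intersection over union) of the two boxes is Little Y's true confidence. The code is: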
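
As a rough illustration of the masked cross-entropy, here is a pure-Python sketch for two anchors. It mirrors the idea of sparse_categorical_crossentropy with from_logits=True followed by the mask and the average, but softmax_cross_entropy and masked_class_loss are hypothetical helpers, not Keras functions.

```python
import math


def softmax_cross_entropy(logits, true_index):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[true_index]


def masked_class_loss(logits_per_anchor, true_indices, mask):
    """Average the per-anchor loss over the anchors that hold an object."""
    total = sum(m * softmax_cross_entropy(l, t)
                for l, t, m in zip(logits_per_anchor, true_indices, mask))
    return total / (sum(mask) + 1e-6)


# anchor 0 holds an object of class 0 and predicts equal scores for both classes;
# anchor 1 holds nothing, so its (confident) prediction is ignored
logits = [[0.0, 0.0], [5.0, -5.0]]
loss = masked_class_loss(logits, true_indices=[0, 0], mask=[1.0, 0.0])
print(loss)  # close to ln(2), the loss of a 50/50 guess between two classes
```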


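def compute_iou(x1, y1, w1, h1, x2, y2, w2, h2):
    # (x1, y1) and (x2, y2) are the box centers
    # top-left and bottom-right corners of the first box
    xmin1 = x1 - 0.5 * w1
    xmax1 = x1 + 0.5 * w1
    ymin1 = y1 - 0.5 * h1
    ymax1 = y1 + 0.5 * h1

    # top-left and bottom-right corners of the second box
    xmin2 = x2 - 0.5 * w2
    xmax2 = x2 + 0.5 * w2
    ymin2 = y2 - 0.5 * h2
    ymax2 = y2 + 0.5 * h2

    # (xmin1, ymin1, xmax1, ymax1): ground truth; (xmin2, ymin2, xmax2, ymax2): prediction
    # intersection: smallest max minus largest min, clamped at 0 for boxes that do not overlap
    interw = tf.maximum(tf.minimum(xmax1, xmax2) - tf.maximum(xmin1, xmin2), 0.)
    interh = tf.maximum(tf.minimum(ymax1, ymax2) - tf.maximum(ymin1, ymin2), 0.)
    inter = interw * interh
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / (union + 1e-6)
    return iou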

Computing the true confidence of a single anchor is not enough; it has to be computed for all anchors.
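
Before applying the IoU to whole tensors, it helps to check the formula on single boxes where the answer is known. This is a scalar pure-Python sketch for center-format boxes; iou_center is a hypothetical helper, separate from compute_iou.

```python
def iou_center(x1, y1, w1, h1, x2, y2, w2, h2):
    """IoU of two boxes given as center (x, y) plus width and height."""
    inter_w = max(0.0, min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2))
    inter_h = max(0.0, min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


same = iou_center(1, 1, 2, 2, 1, 1, 2, 2)     # identical boxes
apart = iou_center(1, 1, 2, 2, 10, 10, 2, 2)  # no overlap
half = iou_center(1, 1, 2, 2, 2, 1, 2, 2)     # second box shifted right by half a width
print(same, apart, half)
```

For the shifted pair the overlap is 1 x 2 = 2 and the union is 4 + 4 - 2 = 6, so the IoU is 1/3. Clamping the intersection width and height at 0 is what makes the disjoint pair come out as 0 rather than a spurious positive value.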


    # 4.3 object loss
    # First get the x and y coordinates at every position
    # Ground truth: x1,y1,w1,h1
    x1, y1, w1, h1 = matching_gt_boxes[..., 0], matching_gt_boxes[..., 1], matching_gt_boxes[..., 2], matching_gt_boxes[..., 3]
    # Prediction: x2,y2,w2,h2 (pred_xy only holds x and y; width and height come from pred_wh)
    x2, y2, w2, h2 = pred_xy[..., 0], pred_xy[..., 1], pred_wh[..., 0], pred_wh[..., 1]
    # Compute the IOU
    ious = compute_iou(x1, y1, w1, h1 , x2, y2, w2, h2)
    # [b,16,16,5,1]
    # This is the IOU between each anchor box and its matched gt box
    ious = tf.expand_dims(ious, axis=-1)

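The true confidence ious needs one more dimension: the predicted confidence has 5 dimensions while the true confidence only has 4, so a dimension is appended at the end.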


# How do we get the mask for anchors without an object?
    # [b,16,16,5,1]
    pred_conf = tf.sigmoid(y_pred[...,4:5])   # predicted confidence
    # The predicted confidence goes through sigmoid() so that it stays within 0~1.
    # Compute the IOU between every anchor box and every gt box
    # [b,16,16,5,2] => [b,16,16,5, 1, 2]
    pred_xy = tf.expand_dims(pred_xy, axis=4)
    # [b,16,16,5,2] => [b,16,16,5, 1, 2]
    pred_wh = tf.expand_dims(pred_wh, axis=4)
    pred_wh_half = pred_wh / 2.
    pred_xymin = pred_xy - pred_wh_half
    pred_xymax = pred_xy + pred_wh_half

    # [b, 40, 5] => [b, 1, 1, 1, 40, 5]
    true_boxes_grid = tf.reshape(gt_boxes_grid, [gt_boxes_grid.shape[0], 1, 1, 1, gt_boxes_grid.shape[1], gt_boxes_grid.shape[2]])
    true_xy = true_boxes_grid[..., 0:2]
    true_wh = true_boxes_grid[..., 2:4]
    true_wh_half = true_wh / 2.
    true_xymin = true_xy - true_wh_half
    true_xymax = true_xy + true_wh_half
    # pred_xymin, pred_xymax, true_xymin, true_xymax
    # [b,16,16,5,1,2] vs [b,1,1,1,40,2]=> [b,16,16,5,40,2]
    intersectxymin = tf.maximum(pred_xymin, true_xymin)
    # [b,16,16,5,1,2] vs [b,1,1,1,40,2]=> [b,16,16,5,40,2]
    intersectxymax = tf.minimum(pred_xymax, true_xymax)
    # [b,16,16,5,40,2]
    intersect_wh = tf.maximum(intersectxymax - intersectxymin, 0.)
    # [b,16,16,5,40] * [b,16,16,5,40]=>[b,16,16,5,40]
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    # [b,16,16,5,1]
    pred_area = pred_wh[..., 0] * pred_wh[..., 1]
    # [b,1,1,1,40]
    true_area = true_wh[..., 0] * true_wh[..., 1]
    # [b,16,16,5,1] + [b,1,1,1,40] - [b,16,16,5,40] => [b,16,16,5,40]
    union_area = pred_area + true_area - intersect_area
    # [b,16,16,5,40]
    iou_score = intersect_area / union_area
    # [b,16,16,5]
    best_iou = tf.reduce_max(iou_score, axis=4)
    # [b,16,16,5,1]
    best_iou = tf.expand_dims(best_iou, axis=-1)

    nonobj_detection = tf.cast(best_iou < 0.6, tf.float32)
    nonobj_mask = nonobj_detection * (1 - detector_mask)
    # nonobj counter
    n_nonobj = tf.reduce_sum(tf.cast(nonobj_mask > 0., tf.float32))

    nonobj_loss = (tf.reduce_sum(nonobj_mask * tf.square(-pred_conf))
                   / (n_nonobj + 1e-6))
    obj_loss = (tf.reduce_sum(detector_mask * tf.square(ious - pred_conf))
                / (n_detector_mask + 1e-6))

    loss = coord_loss + class_loss + nonobj_loss + 5 * obj_loss

    return loss, [nonobj_loss + 5 * obj_loss, class_loss, coord_loss]

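The confidence loss is the more troublesome part because it needs the confidence loss not only of the anchors that contain an object, but also of the anchors that have no real object.

Let's go through the code in detail. pred_xy was already computed during the coordinate loss. First we insert a dimension just before the last one, which sets up the all-against-all matching; pred_wh gets the same treatment. Then we convert [x, y, w, h] => [x_min, y_min, x_max, y_max], an easy step that yields pred_xymin and pred_xymax. That completes the format conversion of the network's output coordinates.

Next come the ground-truth coordinates. The variable gt_boxes_grid that stores them has shape [b, 40, 5], which does not match pred_xymin and pred_xymax, so the computation cannot proceed as is. We reshape it to [b, 1, 1, 1, 40, 5] (pred_xymin has shape [b, 16, 16, 5, 1, 2]) and then apply the same operations as for the network output to get true_xymin and true_xymax.

Now the IOU computation begins: take the element-wise larger of pred_xymin and true_xymin, take the element-wise smaller of pred_xymax and true_xymax, subtract the former from the latter, and compare with 0, keeping only the non-negative part.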


intersect_wh = tf.maximum(intersectxymax - intersectxymin, 0.)

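Why the extra maximum()? Because every predicted anchor box is compared with every ground-truth box, there will always be pairs of boxes that do not intersect. For those pairs, intersectxymax - intersectxymin is negative along at least one axis, and maximum() against 0 filters them out, so only the real intersections remain.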
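
The broadcasting and the clamp can be tried out on a tiny example. The sketch below uses NumPy in place of TensorFlow and flattens the shapes to [P, 2] predictions against [T, 2] ground-truth boxes; the function name `pairwise_iou` and the sample boxes are assumptions for illustration, not part of the original code.

```python
import numpy as np

def pairwise_iou(pred_xy, pred_wh, true_xy, true_wh, eps=1e-6):
    """IoU of every predicted box against every ground-truth box.

    pred_xy, pred_wh: [P, 2]; true_xy, true_wh: [T, 2]; returns [P, T].
    """
    # [P, 1, 2] against [1, T, 2] broadcasts to [P, T, 2].
    pred_min = (pred_xy - pred_wh / 2.0)[:, None, :]
    pred_max = (pred_xy + pred_wh / 2.0)[:, None, :]
    true_min = (true_xy - true_wh / 2.0)[None, :, :]
    true_max = (true_xy + true_wh / 2.0)[None, :, :]
    # Negative extents mean "no overlap"; clamp them to 0.
    inter_wh = np.maximum(
        np.minimum(pred_max, true_max) - np.maximum(pred_min, true_min), 0.0)
    inter = inter_wh[..., 0] * inter_wh[..., 1]
    pred_area = (pred_wh[:, 0] * pred_wh[:, 1])[:, None]
    true_area = (true_wh[:, 0] * true_wh[:, 1])[None, :]
    return inter / (pred_area + true_area - inter + eps)

pred_xy = np.array([[2.0, 2.0], [8.0, 8.0]])
pred_wh = np.array([[2.0, 2.0], [2.0, 2.0]])
true_xy = np.array([[3.0, 2.0]])
true_wh = np.array([[2.0, 2.0]])
iou = pairwise_iou(pred_xy, pred_wh, true_xy, true_wh)  # shape [2, 1]
```

The first prediction overlaps the ground-truth box with an IOU of about 1/3, while the second is far away and, thanks to the clamp, gets exactly 0.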

Computing the IOU


# Pick the largest IOU for each anchor
best_iou = tf.reduce_max(iou_score, axis=4)

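This line picks the largest IOU for each anchor. Because every anchor is matched against all the ground-truth boxes, each anchor ends up with many IOUs. Most of them are of no use to us: the goal of the all-against-all matching is to find each anchor's best match among all the ground-truth boxes.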
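
A small NumPy illustration of this reduction, with made-up IOU scores (the numbers are purely hypothetical):

```python
import numpy as np

# Hypothetical IoU scores: 3 anchors (rows) against 2 ground-truth boxes (columns).
iou_score = np.array([[0.10, 0.72],
                      [0.00, 0.00],
                      [0.65, 0.30]])
best_iou = iou_score.max(axis=-1)    # best match per anchor, shape [3]
best_gt = iou_score.argmax(axis=-1)  # index of the ground-truth box that gave it
```

Here the first anchor matches the second box best, the third anchor matches the first box best, and the middle anchor overlaps nothing, so its best IOU stays at 0.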

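**Mask for anchors without an object:** When computing the confidence of the anchors that contain an object we used the mask detector_mask, but that is the mask of real objects: 1 where there is an object, 0 where there is none. Now we need the no-object mask nonobj_mask, which is 0 for anchors with an object and 1 for anchors without. Some readers may say: blogger, that's easy, just use nonobj_mask = 1 - detector_mask. It sounds right, since the result would be 0 for anchors with an object and 1 for those without. But that would lead me astray. It is not correct, because such a mask is built purely from the ground-truth labels we annotated and says nothing about what the network predicts. Think a bit further: we are in the training stage, computing the loss, so the mask has to take the network's predictions into account; only then can reducing the loss improve detection accuracy. The all-against-all IOU matching above produced best_iou, which can also be read as a probability. Its shape is [b, 16, 16, 5], which tells us that it holds an IOU value for every anchor in the 16x16 output grid. We compare this IOU with a threshold (set it yourself as the situation requires; I use 0.6), and every anchor below the threshold is considered to have no object. The code is as follows: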


# [b,16,16,5,1]
best_iou = tf.expand_dims(best_iou, axis=-1)
# An anchor whose best IOU is below 0.6 is treated as having no object
nonobj_detection = tf.cast(best_iou < 0.6, tf.float32)

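best_iou can be read as a probability or a confidence, but doesn't the network already predict a confidence for every anchor, namely pred_conf? Two points need to be clear:

We are in the training stage and YOLO is supervised learning; if the ground-truth labels take no part in the loss function, the loss cannot be reduced effectively and the network will not converge quickly;
The IOUs computed earlier paired each predicted box only with the ground-truth box of its own grid position. What if an anchor strays? It may have a better IOU with a ground-truth object that belongs to the anchor next door. That is exactly why the YOLO author has them do an all-against-all matching: every anchor is matched against every ground-truth box and the best match is chosen. If your IOU is still below the threshold after that, you really have no object.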

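At this point nearly all the work is done. One small operation remains: filtering out the anchors where the network predicted badly, that is, anchors that the labels do assign to a real object but whose IOU fell below the threshold: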


# Mask of the predicted boxes that have no object
nonobj_mask = nonobj_detection * (1 - detector_mask)

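Loss at the no-object positions: the finale at last. Computing the loss at no-object positions means that, in the ground-truth labels, there is no object at that position. How should its loss be computed? As mentioned earlier, the network output contains a confidence, and that is what we use. Since this is the loss for no-object positions, any object showing up there is a wrong prediction, so the smaller the confidence the better. Finally we multiply by the no-object mask computed above, sum, and take the average.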
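
The mask and the loss can be checked together on a handful of anchors. The following NumPy sketch mirrors the steps described above with flattened [num_anchors, 1] arrays; the function name `no_object_loss` and the sample values are assumptions for illustration.

```python
import numpy as np

def no_object_loss(best_iou, detector_mask, pred_conf, threshold=0.6, eps=1e-6):
    """Mean squared confidence over anchors that should predict "no object".

    All inputs share the same shape, e.g. [num_anchors, 1].
    """
    # 1 where even the best-matching ground-truth box overlaps too little.
    low_overlap = (best_iou < threshold).astype(np.float32)
    # Exclude anchors that the labels assign to an object.
    nonobj_mask = low_overlap * (1.0 - detector_mask)
    n_nonobj = nonobj_mask.sum()
    # The target confidence is 0, so the error is just pred_conf squared.
    return float((nonobj_mask * np.square(pred_conf)).sum() / (n_nonobj + eps))

best_iou = np.array([[0.9], [0.7], [0.1], [0.2]], dtype=np.float32)
detector_mask = np.array([[1.0], [0.0], [0.0], [0.0]], dtype=np.float32)
pred_conf = np.array([[0.8], [0.9], [0.5], [0.1]], dtype=np.float32)
loss = no_object_loss(best_iou, detector_mask, pred_conf)
```

Only the last two anchors count as no-object: the first is assigned to an object, and the second overlaps some ground-truth box well enough (0.7 >= 0.6) that it is not penalised. The loss is therefore (0.5^2 + 0.1^2) / 2 = 0.13.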


nonobj_loss = tf.reduce_sum(nonobj_mask * tf.square(-pred_conf)) / (n_nonobj + 1e-6)

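With that, all the loss terms have been computed and the work is basically done. Oh, one more thing: what we are after is detection accuracy, so the confidence loss of the anchors that contain an object gets a larger weight. The code is as follows: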


loss = coord_loss + class_loss + nonobj_loss + 5 * obj_loss
