PP-Structure Layout Analysis and Table Recognition: A User Guide

Layout Analysis

Layout analysis divides a document image into regions and locates the key ones, such as text, titles, tables and figures.

In the figure above, the top holds a figure region, the middle holds title and table regions, and the bottom holds text regions.

Command-line usage

paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false

Python usage

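import os
import cv2
from paddleocr import PPStructure, save_structure_res

if __name__ == '__main__':
    # Layout analysis only: table recognition and OCR are disabled
    table_engine = PPStructure(table=False, ocr=False, show_log=True)

    save_folder = './output'
    img_path = 'ppstructure/docs/table/1.png'
    img = cv2.imread(img_path)
    result = table_engine(img)
    # Save the results in a sub-folder of ./output named after the image
    save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

    for line in result:
        # Each region carries its cropped image under 'img'; remove it before printing
        img = line.pop('img')
        print(line)
        # Show the cropped region; press q to move on to the next one
        while True:
            cv2.imshow('img', img)
            key = cv2.waitKey()
            if key & 0xFF == ord('q'):
                break
        cv2.destroyAllWindows()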

Output

{'type': 'text', 'bbox': [11, 729, 407, 847], 'res': '', 'img_idx': 0}
{'type': 'text', 'bbox': [442, 754, 837, 847], 'res': '', 'img_idx': 0}
{'type': 'title', 'bbox': [443, 705, 559, 719], 'res': '', 'img_idx': 0}
{'type': 'figure', 'bbox': [10, 1, 841, 294], 'res': '', 'img_idx': 0}
{'type': 'figure_caption', 'bbox': [70, 317, 707, 357], 'res': '', 'img_idx': 0}
{'type': 'figure_caption', 'bbox': [160, 317, 797, 335], 'res': '', 'img_idx': 0}
{'type': 'table', 'bbox': [453, 359, 822, 664], 'res': '', 'img_idx': 0}
{'type': 'table', 'bbox': [12, 360, 410, 716], 'res': '', 'img_idx': 0}
{'type': 'table_caption', 'bbox': [494, 343, 785, 356], 'res': '', 'img_idx': 0}
{'type': 'table_caption', 'bbox': [69, 318, 706, 357], 'res': '', 'img_idx': 0}

As the output shows, the page is split into six region types: figure, figure caption, table, table caption, text, and title.
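
The returned `result` is a plain list of dicts, so post-processing needs no extra API. The sketch below groups the regions printed above by `type` and sorts each group top-to-bottom, then left-to-right. It assumes `bbox` is `[x1, y1, x2, y2]` in pixels (consistent with the values above) and that `img` has already been popped as in the script; `group_regions` is a hypothetical helper, not part of PaddleOCR.

```python
from collections import defaultdict


def group_regions(result):
    """Group layout regions by 'type'; each group is sorted in a naive reading order.

    Assumes every region is a dict with 'type' and 'bbox' = [x1, y1, x2, y2].
    """
    groups = defaultdict(list)
    for region in result:
        groups[region['type']].append(region['bbox'])
    for boxes in groups.values():
        # Sort by top edge first, then by left edge (ignores multi-column layouts).
        boxes.sort(key=lambda b: (b[1], b[0]))
    return dict(groups)


# The regions printed above (the 'img' key was already removed).
result = [
    {'type': 'text', 'bbox': [11, 729, 407, 847], 'res': '', 'img_idx': 0},
    {'type': 'text', 'bbox': [442, 754, 837, 847], 'res': '', 'img_idx': 0},
    {'type': 'title', 'bbox': [443, 705, 559, 719], 'res': '', 'img_idx': 0},
    {'type': 'figure', 'bbox': [10, 1, 841, 294], 'res': '', 'img_idx': 0},
    {'type': 'figure_caption', 'bbox': [70, 317, 707, 357], 'res': '', 'img_idx': 0},
    {'type': 'figure_caption', 'bbox': [160, 317, 797, 335], 'res': '', 'img_idx': 0},
    {'type': 'table', 'bbox': [453, 359, 822, 664], 'res': '', 'img_idx': 0},
    {'type': 'table', 'bbox': [12, 360, 410, 716], 'res': '', 'img_idx': 0},
    {'type': 'table_caption', 'bbox': [494, 343, 785, 356], 'res': '', 'img_idx': 0},
    {'type': 'table_caption', 'bbox': [69, 318, 706, 357], 'res': '', 'img_idx': 0},
]

groups = group_regions(result)
for region_type, boxes in sorted(groups.items()):
    print(region_type, len(boxes), boxes)
```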

Model training

Download the PaddleDetection framework code

PaddleDetection: PaddleDetection aims to provide industry and academia with rich, easy-to-use object detection models (gitee.com)

Download and extract it, enter the PaddleDetection root directory, and install the required Python libraries:

pip install -r ./requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

If pycocotools fails to install, build it from source with the following commands:

git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI
python setup.py build_ext --inplace
python setup.py build_ext install

Dataset: PubLayNet, an English-language dataset with 5 classes {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}

wget https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz
tar -xzvf publaynet.tar.gz

It is a COCO-format dataset; a typical image looks like this:

Its labels are stored in a JSON file whose content looks like this:

{"file_name": "PMC1087888_00001.jpg", "width": 612, "id": 410520, "height": 792}
{"segmentation": [[55.14, 456.69, 296.1, 456.69, 296.1, 467.82, 296.1, 467.82, 296.1, 480.15, 296.1, 480.15, 296.1, 491.28, 144.06, 491.28, 144.06, 503.04, 55.14, 503.04, 55.14, 491.92, 55.14, 480.15, 55.14, 468.46, 55.14, 456.69]], 
"area": 9380.344594193506, 
"iscrowd": 0, 
"image_id": 410520, 
"bbox": [55.14, 456.69, 240.96, 46.35], 
"category_id": 1, 
"id": 4010177}

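The first line is an entry of the annotation file's image list, where each element describes one image. The remaining lines are an entry of the annotation list, where each element describes one object (region). Only one region is shown here; the image has several other annotated regions that are omitted.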

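{
    'segmentation':             # segmentation polygon of the object
    'area':                     # area of the object region
    'iscrowd':                  # whether the annotation is a crowd region (0 = single object)
    'image_id':                 # id of the image containing the object
    'bbox':                     # bounding box [x, y, w, h]: top-left corner, width, height
    'category_id':              # object category
    'id':                       # annotation (region) id
}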

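Here category_id is 1. In PubLayNet's COCO annotations the category ids run from 1 to 5 (text, title, list, table, figure), while the 0-based mapping above is the contiguous class index used for training, so category_id 1 corresponds to class 0: this region is Text, which also fits its multi-line polygon.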
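
Because `bbox` is stored as `[x, y, w, h]`, most tools first need it converted to corner coordinates. The sketch below does that for the annotation shown above and also applies the shoelace formula to its `segmentation` polygon. The shoelace result (about 9380.51) is close to, but not identical with, the stored `area` (about 9380.34); presumably the dataset computed `area` differently (for example from a rasterized mask), so treat the polygon area as an approximation. The helper names are hypothetical.

```python
# The annotation shown above; segmentation is a flat [x1, y1, x2, y2, ...] polygon.
ann = {
    "segmentation": [[55.14, 456.69, 296.1, 456.69, 296.1, 467.82, 296.1, 467.82,
                      296.1, 480.15, 296.1, 480.15, 296.1, 491.28, 144.06, 491.28,
                      144.06, 503.04, 55.14, 503.04, 55.14, 491.92, 55.14, 480.15,
                      55.14, 468.46, 55.14, 456.69]],
    "area": 9380.344594193506,
    "iscrowd": 0,
    "image_id": 410520,
    "bbox": [55.14, 456.69, 240.96, 46.35],
    "category_id": 1,
    "id": 4010177,
}


def xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, w, h] box to [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def polygon_area(flat_xy):
    """Shoelace formula for a polygon given as [x1, y1, x2, y2, ...]."""
    xs, ys = flat_xy[0::2], flat_xy[1::2]
    n = len(xs)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n
        twice_area += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(twice_area) / 2.0


print(xywh_to_xyxy(ann["bbox"]))
print(polygon_area(ann["segmentation"][0]), ann["area"])
```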

Modify the configuration file configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml as follows:

How PP-PicoDet works

PP-PicoDet is an object detection model that outperforms the YOLO series in lightweight (mobile) detection.

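PicoDet-S achieves 30.6% mAP with 0.99M parameters and 1.08G FLOPs, running at 150 FPS on a mobile ARM CPU with an input size of 320. PicoDet-M achieves 34.3% mAP with only 2.15M parameters and 2.5G FLOPs, and PicoDet-L achieves 40.9% mAP with only 3.3M parameters and 8.74G FLOPs. The PP-PicoDet paper provides these small, medium and large models to support different deployment scenarios.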

Overall network architecture

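Start with the backbone: ESNet (Enhanced ShuffleNet), a lightweight network developed by Baidu as an improvement on ShuffleNet V2. For ShuffleNet V2, see the ShuffleNet V2 section of "Improvements and Adjustments to Deep Learning Network Models".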

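The first improvement is the SE block, which reweights channels to strengthen feature extraction; see the MobileNet V3 section of "Improvements and Adjustments to Deep Learning Network Models". The second improvement applies to the stride=2 blocks: channel shuffle is replaced by a depthwise separable convolution (a 3×3 depthwise convolution followed by a 1×1 pointwise convolution). Channel shuffle does let information flow between channels, but the exchange is weaker than what a 1×1 convolution provides; because 1×1 convolutions are comparatively slow, the replacement is made only at each downsampling step. The third improvement is the Ghost block, which reduces redundancy in the network; see the Ghost bottleneck of GhostNet in the same article.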
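
Channel shuffle itself is only a reshape and a transpose. The NumPy sketch below mirrors the Paddle `channel_shuffle` function in the code-analysis section and labels each channel with its own index so the resulting permutation is visible: with `groups=2`, the two concatenated branches end up interleaved. It is an illustration, not code from PaddleDetection.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Channel shuffle for an NCHW array."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channels must be divisible by groups"
    # (n, groups, c // groups, h, w): split the channels into groups
    x = x.reshape(n, groups, c // groups, h, w)
    # Swap the group axis and the within-group axis
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten back: channels from different groups are now interleaved
    return x.reshape(n, c, h, w)


# Channels 0-2 come from one branch, channels 3-5 from the other.
x = np.arange(6).reshape(1, 6, 1, 1)
print(channel_shuffle(x, 2).ravel())  # [0 3 1 4 2 5]
```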

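The backbone accounts for more than 60% of the network's weights, and its feature extraction is crucial, so optimizing the backbone contributes substantially to detection performance.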

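The neck is CSP-PAN. PAN is a bidirectional feature-fusion network: a top-down path (upsampling, from deep to shallow levels) followed by a bottom-up path (downsampling, from shallow to deep levels). CSP-PAN inserts a 1×1 convolution at each input level to unify the channel counts. This reduces computation: without it, the channel count grows with every concat fusion, which is very unfriendly to mobile devices. As the figure above shows, among the backbone's output levels C3 has the smallest channel count, 96, so CSP-PAN maps every level to 96 channels with 1×1 convolutions, cutting the total parameter count by 73%.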
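
A little arithmetic shows why unifying channels matters. The sketch below takes the `Channel_T` default input widths from the code-analysis section (116, 232, 464) purely as example values, maps them to 96 channels as described above, and compares the width of each concat of an upsampled deeper level with the next shallower level, with and without unification. It counts only convolution weights (no bias, BatchNorm ignored); `conv_weights` and `fused_widths` are hypothetical helpers.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weight count of a k x k Conv2D without bias."""
    return (c_in // groups) * c_out * k * k


def fused_widths(channels):
    """Width of concat(upsampled deeper level, shallower level) for each adjacent pair."""
    return [channels[i + 1] + channels[i] for i in range(len(channels) - 1)]


backbone_channels = [116, 232, 464]  # example widths (Channel_T defaults)
unified = 96

# One-off cost of the 1x1 convs that map every level to `unified` channels.
unify_cost = sum(conv_weights(c, unified, 1) for c in backbone_channels)

raw = fused_widths(backbone_channels)   # without unification
same = fused_widths([unified] * 3)      # with unification
print(unify_cost, raw, same)  # 77952 [348, 696] [192, 192]
```

Every convolution that consumes these concatenations scales linearly with their width, so a one-off 1×1 cost buys narrower inputs for all later fusion layers.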

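Each level of CSP-PAN uses a CSP structure; see the YOLOv4 part of "Introduction to the YOLO Series (2)". A typical PAN has three outputs, but CSP-PAN adds a fourth branch downsampled 64×, P6 (the orange box at the top right of the figure above), to improve the recall of large objects. The downsampling here uses depthwise separable convolutions (DP). This adds 1 point of mAP at a cost of only 0.25% in speed.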
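
The depthwise separable convolution (DP) is cheap because it splits one k×k convolution into a per-channel k×k convolution plus a 1×1 convolution. The sketch below compares weight counts for a 96-channel 5×5 layer (96 is the CSP-PAN width above, 5 is the `kernel_size` default of `CSPPAN` in the code analysis), ignoring bias and BatchNorm parameters.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weight count of a k x k Conv2D without bias."""
    return (c_in // groups) * c_out * k * k


def dp_weights(c, k):
    """Depthwise k x k conv (groups=c) followed by a 1 x 1 pointwise conv, as in DPModule."""
    return conv_weights(c, c, k, groups=c) + conv_weights(c, c, 1)


c, k = 96, 5
standard = conv_weights(c, c, k)   # 96 * 96 * 25
separable = dp_weights(c, k)       # 96 * 25 + 96 * 96
print(standard, separable, round(standard / separable, 1))  # 230400 11616 19.8
```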

SimOTA dynamic label assignment

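Only samples selected by the label assignment strategy become positives, and only the feature-map pixels assigned as positives receive box-regression targets (the rest act as background in the classification loss). Label assignment is therefore critical: choosing suitable positives matters a great deal for accuracy.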

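SimOTA is a simplification of OTA (Optimal Transport Assignment) proposed by the authors of YOLOX. It selects a different number of positives for each object, so the assignment is dynamic.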

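Compared with other assignment strategies, SimOTA can still assign positives to occluded objects, such as the person seen from behind (the hair) in the middle of the figure above.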

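1. The figure above shows an image to be detected with a green box and a yellow box. The green box is the ground truth box, i.e. the manually annotated region. The yellow box is centered on the center of the ground truth box and extends 2.5 strides up, down, left and right (the stride is the downsampling ratio of a feature map relative to the original image), so its size differs between feature maps. The gray grid divides the image with the stride of one FPN level; each cell corresponds to one feature-map pixel and the region it sees. If the image-space center of a feature-map pixel falls inside the green box or the yellow box, that pixel is a positive candidate. Hence the gray cells in the four corners outside the cross formed by the overlapping yellow and green boxes are not candidates.
2. Compute the IoU between every prediction produced in the candidate region and the current ground truth box. In anchor-based YOLO versions each gray cell produces three anchor predictions; anchor-free detectors such as YOLOX and PP-PicoDet produce one prediction per cell.
3. Sort the IoUs in descending order and sum the top 10. Since an IoU never exceeds 1, the sum lies between 0 and 10; this value, truncated to an integer and at least 1, is dynamic_k.
4. Compute the cost between each candidate prediction and the current ground truth box to obtain the cost matrix $c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg}$, where $\lambda$ is a balancing coefficient and $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification and regression losses between ground truth box $g_i$ and prediction $p_j$. The matrix describes the cost between each ground truth box and each prediction, and lower is better; through it the network adaptively finds the positives of each ground truth box. The cost reflects three factors: (1) the higher the overlap between a ground truth box and a cell's prediction, the more that cell is already trying to fit the box, so the lower its cost; (2) the higher the classification accuracy of the cell's prediction for the ground truth box, the lower the cost; (3) whether the cell lies within a certain radius of the ground truth center: if it does, the cell is trying to fit the box, and predictions outside that region receive a large extra cost.
5. Sort the costs in ascending order. The dynamic_k predictions with the smallest cost become the final positives of the current ground truth box; the remaining predictions are negatives. dynamic_k differs from one ground truth box to another.
6. Use the final positives and negatives to compute the classification and regression losses.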
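
The steps above can be sketched in NumPy as follows. This is a simplified, single-level illustration, not the YOLOX or PaddleDetection code. The regression loss `-log(IoU)`, λ = 3.0 and the 1e5 penalty for predictions outside the box-and-center region are assumptions modeled on the YOLOX reference implementation. The sketch also adds a rule not listed in the steps: as in YOLOX, a prediction claimed by several ground truth boxes keeps only the one with the lowest cost. In practice, costs and IoUs are computed only over the candidates found in step 1.

```python
import numpy as np


def candidate_masks(centers, gt_boxes, stride, radius=2.5):
    """Step 1. centers: (P, 2) cell centers (x, y); gt_boxes: (G, 4) [x1, y1, x2, y2].

    Returns (in_box, in_center), each a (G, P) boolean array.
    """
    cx, cy = centers[None, :, 0], centers[None, :, 1]
    x1, y1, x2, y2 = (gt_boxes[:, i, None] for i in range(4))
    in_box = (cx > x1) & (cx < x2) & (cy > y1) & (cy < y2)
    r = radius * stride
    in_center = (np.abs(cx - (x1 + x2) / 2) < r) & (np.abs(cy - (y1 + y2) / 2) < r)
    return in_box, in_center


def dynamic_ks(ious, n_top=10):
    """Step 3: sum of each row's top-n IoUs, truncated to int, at least 1."""
    n = min(n_top, ious.shape[1])
    top = -np.sort(-ious, axis=1)[:, :n]
    return np.maximum(top.sum(axis=1).astype(int), 1)


def simota_cost(cls_loss, ious, in_box, in_center, lam=3.0):
    """Step 4: c_ij = L_cls + lam * L_reg, plus a large penalty outside box-and-center."""
    reg_loss = -np.log(ious + 1e-8)
    return cls_loss + lam * reg_loss + 1e5 * ~(in_box & in_center)


def simota_assign(cost, ious, n_top=10):
    """Steps 3 and 5: return a (G, P) 0/1 matching matrix."""
    matching = np.zeros(cost.shape, dtype=int)
    for g, k in enumerate(dynamic_ks(ious, n_top)):
        # The k lowest-cost predictions become positives of gt g.
        matching[g, np.argsort(cost[g])[:k]] = 1
    # A prediction matched to several gts keeps only its lowest-cost gt.
    shared = np.where(matching.sum(axis=0) > 1)[0]
    if shared.size:
        best = np.argmin(cost[:, shared], axis=0)
        matching[:, shared] = 0
        matching[best, shared] = 1
    return matching


# One 16x8 gt box and four cells of a stride-8 level.
centers = np.array([[4.0, 4.0], [12.0, 4.0], [20.0, 4.0], [100.0, 100.0]])
gt_boxes = np.array([[0.0, 0.0, 16.0, 8.0]])
in_box, in_center = candidate_masks(centers, gt_boxes, stride=8)
print(in_box | in_center)

# Two gts and four candidate predictions with made-up IoUs and costs.
ious = np.array([[0.9, 0.8, 0.4, 0.0], [0.0, 0.7, 0.6, 0.3]])
cost = np.array([[0.2, 0.5, 1.0, 5.0], [5.0, 0.3, 0.6, 0.9]])
print(dynamic_ks(ious))
print(simota_assign(cost, ious))
```

In the toy example gt 0 gets dynamic_k = 2 and gt 1 gets 1; both pick prediction 1, which is finally kept by gt 1 because its cost there (0.3) is lower.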

PP-PicoDet modifies the original loss used by SimOTA. The modification is applied only during training, so it adds no cost at inference, and it improves mAP by 1%.

Other optimizations

Code analysis

The full ESNet code is in ppdet/modeling/backbones/esnet.py

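import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle import ParamAttr
from paddle.nn import Conv2D, MaxPool2D, AdaptiveAvgPool2D, BatchNorm2D
from paddle.nn.initializer import KaimingNormal
from paddle.regularizer import L2Decay
from numbers import Integral
from ppdet.core.workspace import register, serializable
from ppdet.modeling.shape_spec import ShapeSpec
# make_divisible is a small helper defined elsewhere in esnet.py (not shown here)


class SEModule(nn.Layer):
    def __init__(self, channel, reduction=4):
        '''
        SE module
        channel: number of input channels
        reduction: channel reduction ratio
        '''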
        super(SEModule, self).__init__()
        # Global average pooling: spatial size becomes 1x1, channel count unchanged
        self.avg_pool = AdaptiveAvgPool2D(1)
        # 1x1 convolution reduces the channel count
        self.conv1 = Conv2D(
            in_channels=channel,
            out_channels=channel // reduction,
            kernel_size=1,
            stride=1,
            padding=0,
            weight_attr=ParamAttr(),
            bias_attr=ParamAttr())
        # 1x1 convolution restores the channel count
        self.conv2 = Conv2D(
            in_channels=channel // reduction,
            out_channels=channel,
            kernel_size=1,
            stride=1,
            padding=0,
            weight_attr=ParamAttr(),
            bias_attr=ParamAttr())

    def forward(self, inputs):
        outputs = self.avg_pool(inputs)
        outputs = self.conv1(outputs)
        outputs = F.relu(outputs)
        outputs = self.conv2(outputs)
        # Per-channel weights in [0, 1]
        outputs = F.hardsigmoid(outputs)
        # Rescale each channel of the input feature map by its weight
        return paddle.multiply(x=inputs, y=outputs)

def channel_shuffle(x, groups):
    '''
    Channel shuffle
    '''
    batch_size, num_channels, height, width = x.shape[0:4]
    assert num_channels % groups == 0, 'num_channels should be divisible by groups'
    # n: number of channels per group
    channels_per_group = num_channels // groups
    # Reshape the channels into a (g, n) matrix
    x = paddle.reshape(
        x=x, shape=[batch_size, groups, channels_per_group, height, width])
    # Transpose to (n, g)
    x = paddle.transpose(x=x, perm=[0, 2, 1, 3, 4])
    # Flatten back so channels from different groups are interleaved
    x = paddle.reshape(x=x, shape=[batch_size, num_channels, height, width])
    return x

class ConvBNLayer(nn.Layer):
    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size,
                 stride,
                 padding,
                 groups=1,
                 act=None):
        '''
        Conv2D + BatchNorm2D + optional activation
        '''
        super(ConvBNLayer, self).__init__()
        self._conv = Conv2D(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            groups=groups,
            weight_attr=ParamAttr(initializer=KaimingNormal()),
            bias_attr=False)

        self._batch_norm = BatchNorm2D(
            out_channels,
            weight_attr=ParamAttr(regularizer=L2Decay(0.0)),
            bias_attr=ParamAttr(regularizer=L2Decay(0.0)))
        if act == "hard_swish":
            act = 'hardswish'
        self.act = act

    def forward(self, inputs):
        y = self._conv(inputs)
        y = self._batch_norm(y)
        if self.act:
            y = getattr(F, self.act)(y)
        return y

class InvertedResidual(nn.Layer):
    def __init__(self,
                 in_channels,
                 mid_channels,
                 out_channels,
                 stride,
                 act="relu"):
        '''
        ES block with stride=1
        '''
        super(InvertedResidual, self).__init__()
        # 1x1 pointwise convolution
        self._conv_pw = ConvBNLayer(
            in_channels=in_channels // 2,
            out_channels=mid_channels // 2,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act=act)
        # 3x3 depthwise convolution
        self._conv_dw = ConvBNLayer(
            in_channels=mid_channels // 2,
            out_channels=mid_channels // 2,
            kernel_size=3,
            stride=stride,
            padding=1,
            groups=mid_channels // 2,
            act=None)
        # SE block
        self._se = SEModule(mid_channels)
        # 1x1 pointwise convolution
        self._conv_linear = ConvBNLayer(
            in_channels=mid_channels,
            out_channels=out_channels // 2,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act=act)

    def forward(self, inputs):
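        # Split the feature map into two halves along the channel axis
        x1, x2 = paddle.split(
            inputs,
            num_or_sections=[inputs.shape[1] // 2, inputs.shape[1] // 2],
            axis=1)
        # Processed branch (x1 passes through unchanged)
        x2 = self._conv_pw(x2)
        x3 = self._conv_dw(x2)
        # Concatenate the 1x1 output with its 3x3 depthwise output (Ghost-style)
        x3 = paddle.concat([x2, x3], axis=1)
        # SE on the concatenated features
        x3 = self._se(x3)
        x3 = self._conv_linear(x3)
        # Concatenate the untouched half with the processed branch
        out = paddle.concat([x1, x3], axis=1)
        return channel_shuffle(out, 2)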

class InvertedResidualDS(nn.Layer):
    def __init__(self,
                 in_channels,
                 mid_channels,
                 out_channels,
                 stride,
                 act="relu"):
        '''
        ES block with stride=2 (downsampling)
        '''
        super(InvertedResidualDS, self).__init__()
        # Right branch
        # 3x3 depthwise convolution
        self._conv_dw_1 = ConvBNLayer(
            in_channels=in_channels,
            out_channels=in_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            groups=in_channels,
            act=None)
        # 1x1 pointwise convolution
        self._conv_linear_1 = ConvBNLayer(
            in_channels=in_channels,
            out_channels=out_channels // 2,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act=act)
        # Left branch
        # 1x1 pointwise convolution
        self._conv_pw_2 = ConvBNLayer(
            in_channels=in_channels,
            out_channels=mid_channels // 2,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act=act)
        # 3x3 depthwise convolution
        self._conv_dw_2 = ConvBNLayer(
            in_channels=mid_channels // 2,
            out_channels=mid_channels // 2,
            kernel_size=3,
            stride=stride,
            padding=1,
            groups=mid_channels // 2,
            act=None)
        # se block
        self._se = SEModule(mid_channels // 2)
        # 1x1 pointwise convolution
        self._conv_linear_2 = ConvBNLayer(
            in_channels=mid_channels // 2,
            out_channels=out_channels // 2,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act=act)
        # 3x3 depthwise convolution
        self._conv_dw_mv1 = ConvBNLayer(
            in_channels=out_channels,
            out_channels=out_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            groups=out_channels,
            act="hard_swish")
        # 1x1 pointwise convolution
        self._conv_pw_mv1 = ConvBNLayer(
            in_channels=out_channels,
            out_channels=out_channels,
            kernel_size=1,
            stride=1,
            padding=0,
            groups=1,
            act="hard_swish")

    def forward(self, inputs):
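        # Right branch
        x1 = self._conv_dw_1(inputs)
        x1 = self._conv_linear_1(x1)
        # Left branch
        x2 = self._conv_pw_2(inputs)
        x2 = self._conv_dw_2(x2)
        x2 = self._se(x2)
        x2 = self._conv_linear_2(x2)
        # Concatenate the two branches
        out = paddle.concat([x1, x2], axis=1)
        # Depthwise 3x3 + pointwise 1x1 convs fuse channel information
        # (used here instead of channel shuffle)
        out = self._conv_dw_mv1(out)
        out = self._conv_pw_mv1(out)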

        return out

@register
@serializable
class ESNet(nn.Layer):
    def __init__(self,
                 scale=1.0,
                 act="hard_swish",
                 feature_maps=[4, 11, 14],
                 channel_ratio=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]):
        '''
        ESNet Backbone
        '''
        super(ESNet, self).__init__()
        self.scale = scale
        if isinstance(feature_maps, Integral):
            feature_maps = [feature_maps]
        self.feature_maps = feature_maps
        # Number of ES blocks in the C3, C4 and C5 stages
        stage_repeats = [3, 7, 3]
        # Output channels of each stage
        stage_out_channels = [
            -1, 24, make_divisible(128 * scale), make_divisible(256 * scale),
            make_divisible(512 * scale), 1024
        ]

        self._out_channels = []
        self._feature_idx = 0
        # Stem convolution
        self._conv1 = ConvBNLayer(
            in_channels=3,
            out_channels=stage_out_channels[1],
            kernel_size=3,
            stride=2,
            padding=1,
            act=act)
        # 3x3 max pooling
        self._max_pool = MaxPool2D(kernel_size=3, stride=2, padding=1)
        self._feature_idx += 1

        # 2. bottleneck sequences
        self._block_list = []
        arch_idx = 0
        # Iterate over the stages
        for stage_id, num_repeat in enumerate(stage_repeats):
            # Iterate over the blocks within a stage
            for i in range(num_repeat):
                channels_scales = channel_ratio[arch_idx]
                mid_c = make_divisible(
                    int(stage_out_channels[stage_id + 2] * channels_scales),
                    divisor=8)
                if i == 0:
                    # The first block of each stage downsamples (stride=2)
                    block = self.add_sublayer(
                        name=str(stage_id + 2) + '_' + str(i + 1),
                        sublayer=InvertedResidualDS(
                            in_channels=stage_out_channels[stage_id + 1],
                            mid_channels=mid_c,
                            out_channels=stage_out_channels[stage_id + 2],
                            stride=2,
                            act=act))
                else:
                    # The remaining blocks keep the resolution (stride=1)
                    block = self.add_sublayer(
                        name=str(stage_id + 2) + '_' + str(i + 1),
                        sublayer=InvertedResidual(
                            in_channels=stage_out_channels[stage_id + 2],
                            mid_channels=mid_c,
                            out_channels=stage_out_channels[stage_id + 2],
                            stride=1,
                            act=act))
                # List of ES blocks
                self._block_list.append(block)
                arch_idx += 1
                self._feature_idx += 1
                self._update_out_channels(stage_out_channels[stage_id + 2],
                                          self._feature_idx, self.feature_maps)

    def _update_out_channels(self, channel, feature_idx, feature_maps):
        if feature_idx in feature_maps:
            self._out_channels.append(channel)

    def forward(self, inputs):
        y = self._conv1(inputs['image'])
        y = self._max_pool(y)
        outs = []
        for i, inv in enumerate(self._block_list):
            # Pass the feature map through each ES block
            y = inv(y)
            if i + 2 in self.feature_maps:
                outs.append(y)

        return outs

    @property
    def out_shape(self):
        return [ShapeSpec(channels=c) for c in self._out_channels]
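
The interplay between `feature_maps=[4, 11, 14]` and the `i + 2` test in `forward` is easy to misread. The plain-Python trace below (a hypothetical helper, not part of PaddleDetection) walks the same `stage_repeats` and shows that these indices select the last block of each stage, at overall strides 8, 16 and 32: the stem conv and max pool give stride 4, and the first block of each stage halves the resolution. These strides match the default `spatial_scales` of `CSPPAN`.

```python
def trace_esnet(stage_repeats=(3, 7, 3), feature_maps=(4, 11, 14)):
    """Return (block_index, stride) for every block output kept by ESNet.forward."""
    stride = 4  # stem: stride-2 conv followed by a stride-2 max pool
    kept = []
    i = 0       # index into self._block_list
    for num_repeat in stage_repeats:
        for j in range(num_repeat):
            if j == 0:
                stride *= 2  # InvertedResidualDS uses stride=2
            if i + 2 in feature_maps:  # same test as in forward()
                kept.append((i, stride))
            i += 1
    return kept


print(trace_esnet())  # [(2, 8), (9, 16), (12, 32)]
```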

The full CSP-PAN code is in ppdet/modeling/necks/csp_pan.py

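import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle import ParamAttr
from ppdet.core.workspace import register, serializable
from ppdet.modeling.shape_spec import ShapeSpec


class ConvBNLayer(nn.Layer):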
    def __init__(self,
                 in_channel=96,
                 out_channel=96,
                 kernel_size=3,
                 stride=1,
                 groups=1,
                 act='leaky_relu'):
        '''
        Conv2D + BatchNorm2D + optional activation
        '''
        super(ConvBNLayer, self).__init__()
        initializer = nn.initializer.KaimingUniform()
        self.conv = nn.Conv2D(
            in_channels=in_channel,
            out_channels=out_channel,
            kernel_size=kernel_size,
            groups=groups,
            padding=(kernel_size - 1) // 2,
            stride=stride,
            weight_attr=ParamAttr(initializer=initializer),
            bias_attr=False)
        self.bn = nn.BatchNorm2D(out_channel)
        if act == "hard_swish":
            act = 'hardswish'
        self.act = act

    def forward(self, x):
        x = self.bn(self.conv(x))
        if self.act:
            x = getattr(F, self.act)(x)
        return x

class DPModule(nn.Layer):

    def __init__(self,
                 in_channel=96,
                 out_channel=96,
                 kernel_size=3,
                 stride=1,
                 act='leaky_relu',
                 use_act_in_out=True):
        '''
        Depthwise separable convolution (DP): k x k depthwise conv + 1x1 pointwise conv
        '''
        super(DPModule, self).__init__()
        initializer = nn.initializer.KaimingUniform()
        self.use_act_in_out = use_act_in_out
        self.dwconv = nn.Conv2D(
            in_channels=in_channel,
            out_channels=out_channel,
            kernel_size=kernel_size,
            groups=out_channel,
            padding=(kernel_size - 1) // 2,
            stride=stride,
            weight_attr=ParamAttr(initializer=initializer),
            bias_attr=False)
        self.bn1 = nn.BatchNorm2D(out_channel)
        self.pwconv = nn.Conv2D(
            in_channels=out_channel,
            out_channels=out_channel,
            kernel_size=1,
            groups=1,
            padding=0,
            weight_attr=ParamAttr(initializer=initializer),
            bias_attr=False)
        self.bn2 = nn.BatchNorm2D(out_channel)
        if act == "hard_swish":
            act = 'hardswish'
        self.act = act

    def forward(self, x):
        x = self.bn1(self.dwconv(x))
        if self.act:
            x = getattr(F, self.act)(x)
        x = self.bn2(self.pwconv(x))
        if self.use_act_in_out and self.act:
            x = getattr(F, self.act)(x)
        return x

class DarknetBottleneck(nn.Layer):

    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=3,
                 expansion=0.5,
                 add_identity=True,
                 use_depthwise=False,
                 act="leaky_relu"):
        '''
        DarkNet block: two convolutions plus an optional residual connection
        '''
        super(DarknetBottleneck, self).__init__()
        hidden_channels = int(out_channels * expansion)
        conv_func = DPModule if use_depthwise else ConvBNLayer
        self.conv1 = ConvBNLayer(
            in_channel=in_channels,
            out_channel=hidden_channels,
            kernel_size=1,
            act=act)
        self.conv2 = conv_func(
            in_channel=hidden_channels,
            out_channel=out_channels,
            kernel_size=kernel_size,
            stride=1,
            act=act)
        self.add_identity = (
            add_identity and in_channels == out_channels)

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.conv2(out)

        if self.add_identity:
            return out + identity
        else:
            return out

class CSPLayer(nn.Layer):

    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=3,
                 expand_ratio=0.5,
                 num_blocks=1,
                 add_identity=True,
                 use_depthwise=False,
                 act="leaky_relu"):
        '''
        CSPDarkNet
        '''
        super().__init__()
        mid_channels = int(out_channels * expand_ratio)
        # Right (main) branch
        self.main_conv = ConvBNLayer(in_channels, mid_channels, 1, act=act)
        # Left (shortcut) branch
        self.short_conv = ConvBNLayer(in_channels, mid_channels, 1, act=act)
        # Final 1x1 convolution on the concatenated branches
        self.final_conv = ConvBNLayer(
            2 * mid_channels, out_channels, 1, act=act)
        # Stack of DarkNet blocks
        self.blocks = nn.Sequential(* [
            DarknetBottleneck(
                mid_channels,
                mid_channels,
                kernel_size,
                1.0,
                add_identity,
                use_depthwise,
                act=act) for _ in range(num_blocks)
        ])

    def forward(self, x):
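        # Left branch: 1x1 conv to mid_channels (half of out_channels by default)
        x_short = self.short_conv(x)
        # Right branch: 1x1 conv to mid_channels
        x_main = self.main_conv(x)
        # The right branch then passes through the stacked DarkNet blocks
        x_main = self.blocks(x_main)
        # Concatenate both branches
        x_final = paddle.concat((x_main, x_short), axis=1)
        # 1x1 convolution produces the output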
        return self.final_conv(x_final)

class Channel_T(nn.Layer):
    def __init__(self,
                 in_channels=[116, 232, 464],
                 out_channels=96,
                 act="leaky_relu"):
        '''
        Unify the channel count of every input level with a 1x1 convolution
        '''
        super(Channel_T, self).__init__()
        self.convs = nn.LayerList()
        for i in range(len(in_channels)):
            self.convs.append(
                ConvBNLayer(
                    in_channels[i], out_channels, 1, act=act))

    def forward(self, x):
        outs = [self.convs[i](x[i]) for i in range(len(x))]
        return outs

@register
@serializable
class CSPPAN(nn.Layer):

    def __init__(self,
                 in_channels,
                 out_channels,
                 kernel_size=5,
                 num_features=3,
                 num_csp_blocks=1,
                 use_depthwise=True,
                 act='hard_swish',
                 spatial_scales=[0.125, 0.0625, 0.03125]):
        '''
        CSP-PAN Neck
        '''
        super(CSPPAN, self).__init__()
        self.conv_t = Channel_T(in_channels, out_channels, act=act)
        in_channels = [out_channels] * len(spatial_scales)
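        # Channel counts of the input levels (all unified to out_channels)
        self.in_channels = in_channels
        # Channel count of every output level
        self.out_channels = out_channels
        # Spatial scales (1 / stride); copied so that appending the P6 scale
        # below does not mutate the shared default list
        self.spatial_scales = list(spatial_scales)
        # Number of output levels
        self.num_features = num_features
        # Convolution type: depthwise separable (DP) or ordinary conv
        conv_func = DPModule if use_depthwise else ConvBNLayer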

        if self.num_features == 4:
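            # First stride-2 conv for the extra P6 level (applied to the deepest input)
            self.first_top_conv = conv_func(
                in_channels[0], in_channels[0], kernel_size, stride=2, act=act)
            # Second stride-2 conv for the P6 level (applied to the deepest output)
            self.second_top_conv = conv_func(
                in_channels[0], in_channels[0], kernel_size, stride=2, act=act)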
            self.spatial_scales.append(self.spatial_scales[-1] / 2)

        # Nearest-neighbor x2 upsampling
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        # Top-down (upsampling path) blocks
        self.top_down_blocks = nn.LayerList()
        for idx in range(len(in_channels) - 1, 0, -1):
            # Each top-down fusion is a CSPLayer
            self.top_down_blocks.append(
                CSPLayer(
                    in_channels[idx - 1] * 2,
                    in_channels[idx - 1],
                    kernel_size=kernel_size,
                    num_blocks=num_csp_blocks,
                    add_identity=False,
                    use_depthwise=use_depthwise,
                    act=act))

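        # Stride-2 downsampling convs
        self.downsamples = nn.LayerList()
        # Bottom-up (downsampling path) blocks
        self.bottom_up_blocks = nn.LayerList()
        # Each bottom-up step: a stride-2 (depthwise separable) conv followed by a CSPLayer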
        for idx in range(len(in_channels) - 1):
            self.downsamples.append(
                conv_func(
                    in_channels[idx],
                    in_channels[idx],
                    kernel_size=kernel_size,
                    stride=2,
                    act=act))
            self.bottom_up_blocks.append(
                CSPLayer(
                    in_channels[idx] * 2,
                    in_channels[idx + 1],
                    kernel_size=kernel_size,
                    num_blocks=num_csp_blocks,
                    add_identity=False,
                    use_depthwise=use_depthwise,
                    act=act))

    def forward(self, inputs):
        assert len(inputs) == len(self.in_channels)
        # Unify the input channel counts
        inputs = self.conv_t(inputs)

        # Top-down path
        inner_outs = [inputs[-1]]
        for idx in range(len(self.in_channels) - 1, 0, -1):
            # Deeper feature
            feat_heigh = inner_outs[0]
            # Shallower feature (one level below feat_heigh)
            feat_low = inputs[idx - 1]
            # Upsample the deeper feature
            upsample_feat = self.upsample(feat_heigh)
            # Concatenate the upsampled deep feature with the shallow one and fuse with a CSPLayer
            inner_out = self.top_down_blocks[len(self.in_channels) - 1 - idx](
                paddle.concat([upsample_feat, feat_low], 1))
            inner_outs.insert(0, inner_out)

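        # Bottom-up path
        outs = [inner_outs[0]]
        for idx in range(len(self.in_channels) - 1):
            # Shallower feature
            feat_low = outs[-1]
            # Deeper feature (one level above feat_low)
            feat_height = inner_outs[idx + 1]
            # Downsample the shallow feature with a stride-2 (depthwise separable) conv
            downsample_feat = self.downsamples[idx](feat_low)
            # Concatenate the downsampled shallow feature with the deep one and fuse with a CSPLayer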
            out = self.bottom_up_blocks[idx](paddle.concat(
                [downsample_feat, feat_height], 1))
            outs.append(out)

        top_features = None
        # P6 feature: sum of two stride-2 convs
        if self.num_features == 4:
            top_features = self.first_top_conv(inputs[-1])
            top_features = top_features + self.second_top_conv(outs[-1])
            outs.append(top_features)

        return tuple(outs)

    @property
    def out_shape(self):
        return [
            ShapeSpec(
                channels=self.out_channels, stride=1. / s)
            for s in self.spatial_scales
        ]

    @classmethod
    def from_config(cls, cfg, input_shape):
        return {'in_channels': [i.channels for i in input_shape], }
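
A quick way to check that the top-down and bottom-up paths line up is to trace the spatial sizes. The sketch below assumes a square input whose side is a multiple of 32 and uses the padding rule `(kernel_size - 1) // 2` from `ConvBNLayer`/`DPModule`; the helper names are hypothetical. With `num_features=4`, the extra stride-2 convolutions produce P6 at stride 64, matching the `spatial_scales[-1] / 2` entry appended in `__init__`.

```python
def conv_out(size, k, stride):
    """Output size of a Conv2D with padding (k - 1) // 2."""
    pad = (k - 1) // 2
    return (size + 2 * pad - k) // stride + 1


def trace_csppan(input_size, strides=(8, 16, 32), num_features=4, k=5):
    sizes = [input_size // s for s in strides]  # C3, C4, C5 from the backbone
    for shallow, deep in zip(sizes[:-1], sizes[1:]):
        # Top-down: nearest x2 upsampling of the deeper map must match
        assert deep * 2 == shallow
        # Bottom-up: the stride-2 downsampling conv must match
        assert conv_out(shallow, k, 2) == deep
    if num_features == 4:
        # P6: first_top_conv / second_top_conv are stride-2 convs on the deepest level
        sizes.append(conv_out(sizes[-1], k, 2))
    return sizes


print(trace_csppan(320))  # [40, 20, 10, 5]
print(trace_csppan(416))  # [52, 26, 13, 7]
```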

Table recognition

Table recognition recognizes the tables in an image: it must recognize not only the text inside a table but also the coordinates of each cell.

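The process has two steps. First, a text detection algorithm detects single lines of text and produces detection boxes, which are fed to a text recognition algorithm to obtain the text. Second, a table structure prediction algorithm predicts the cell coordinates; these are aggregated with the text detection boxes and then with the recognition results, finally producing the table as an Excel file.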
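
The aggregation step can be illustrated with a toy matcher. The sketch below assigns each recognized text line to the cell it overlaps most (by IoU) and renders the result as an HTML table, which could then be converted to a spreadsheet. The coordinates and text are made up, and the IoU-only rule is a simplification; PaddleOCR's own matching logic may use different criteria. `fill_cells` and `to_html` are hypothetical helpers.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fill_cells(cells, ocr_results):
    """cells: [(row, col, box)]; ocr_results: [(box, text)]. Returns {(row, col): text}."""
    filled = {(r, c): [] for r, c, _ in cells}
    for box, text in ocr_results:
        # Put each text line into the cell it overlaps most.
        best = max(cells, key=lambda cell: iou(cell[2], box))
        if iou(best[2], box) > 0:
            filled[(best[0], best[1])].append(text)
    return {key: " ".join(texts) for key, texts in filled.items()}


def to_html(filled):
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    rows = []
    for r in range(n_rows):
        tds = "".join(f"<td>{filled.get((r, c), '')}</td>" for c in range(n_cols))
        rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(rows) + "</table>"


# Cell boxes from structure prediction and text lines from OCR (made-up values).
cells = [(0, 0, [0, 0, 50, 20]), (0, 1, [50, 0, 100, 20]),
         (1, 0, [0, 20, 50, 40]), (1, 1, [50, 20, 100, 40])]
ocr_results = [([2, 2, 40, 18], "Name"), ([55, 2, 95, 18], "Score"),
               ([2, 22, 30, 38], "Tom"), ([55, 22, 70, 38], "90")]

filled = fill_cells(cells, ocr_results)
print(to_html(filled))
```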

Command-line usage

Enter the ppstructure directory of PaddleOCR and download the inference models:

mkdir inference
cd inference
wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar
tar -xzvf ch_PP-OCRv3_det_infer.tar
wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar
tar -xzvf ch_PP-OCRv3_rec_infer.tar
wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar
tar -xzvf ch_ppstructure_mobile_v2.0_SLANet_infer.tar

Add the following run arguments for ppstructure/table/predict_table.py:

--det_model_dir=inference/ch_PP-OCRv3_det_infer --rec_model_dir=inference/ch_PP-OCRv3_rec_infer --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt --image_dir=docs/table/table.jpg --output=../output/table

Then set the working directory to ppstructure. The table image used here is shown below:

The result, the generated Excel file, is shown below:

Model training

We skip the text detection model; for training the text recognition model, see the model training section of the PaddleOCR user guide.

Here we focus on training the table structure prediction model.

Dataset (PubTabNet): https://dax-cdn.cdn.appdomain.cloud/dax-pubtabnet/2.0.0/pubtabnet.tar.gz

After downloading and extracting, it looks roughly like this:

Dataset generation

Download the dataset generation tool: GitHub - WenmuZhou/TableGeneration: generates table images by rendering them in a browser
