Detectron Model Zoo and Baselines

2022-09-02 22:10:47

Introduction

This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.

Common Settings and Notes

  • All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
  • All baselines were trained using 8 GPU data parallel sync SGD with a minibatch size of either 8 or 16 images (see the im/gpu column).
  • For training, only horizontal flipping data augmentation was used.
  • For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
  • All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
  • All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
  • Inference times are often expressed as "X + Y", in which X is the time taken in reasonably well-optimized GPU code and Y is the time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
  • Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
  • The model id column is provided for ease of reference.
  • To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash.
  • All models and results below are on the COCO dataset.
  • Baseline models and results for the Cityscapes dataset are coming soon!
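The integrity check described above can be sketched in a few lines of Python. This is a minimal sketch, not part of Detectron: the helper names are hypothetical, and it assumes the `.md5sum` file contains the hex hash as its first whitespace-separated token (optionally followed by a file name, as the `md5sum` tool writes it).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5sum_text):
    """Extract the hash from the text of a downloaded .md5sum file.

    Assumption: the hash is the first whitespace-separated token.
    """
    return md5sum_text.split()[0].lower()


def is_intact(path, md5sum_text):
    """True if the local file's MD5 matches the published hash."""
    return md5_of_file(path) == expected_md5(md5sum_text)
```

Reading in chunks keeps memory use flat even for large model files.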

Training Schedules

We use three training schedules, indicated by the lr schd column in the tables below.

  • 1x: For minibatch size 16, this schedule starts at a LR of 0.02, which is multiplied by 0.1 after 60k and again after 80k iterations, and terminates at 90k iterations. This schedule results in 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
  • 2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
  • s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, but also extends the duration of the first learning rate. With a minibatch size of 16, it multiplies the LR by 0.1 at 100k and 120k iterations and ends after 130k iterations.
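The arithmetic behind these schedules can be checked with a short sketch. The iteration counts and the 0.02 starting LR come from the list above; the assumption that 2x and s1x share the same starting LR, the choice to apply each drop at exactly the listed iteration, and all names are illustrative only.

```python
DATASET_SIZE = 118_287  # images in coco_2017_train, as stated above

# schedule name -> (base LR, LR drop iterations, final iteration), minibatch 16
# Assumption: 2x and s1x start from the same LR of 0.02 as 1x.
SCHEDULES = {
    "1x": (0.02, (60_000, 80_000), 90_000),
    "2x": (0.02, (120_000, 160_000), 180_000),
    "s1x": (0.02, (100_000, 120_000), 130_000),
}


def epochs(schedule, minibatch_size=16):
    """Number of passes over the training set for a schedule."""
    _, _, max_iter = SCHEDULES[schedule]
    return max_iter * minibatch_size / DATASET_SIZE


def lr_at(schedule, iteration, gamma=0.1):
    """Step-schedule LR (warm-up ignored): multiply by gamma at each drop."""
    base_lr, steps, _ = SCHEDULES[schedule]
    drops = sum(1 for s in steps if iteration >= s)
    return base_lr * gamma ** drops


print(round(epochs("1x"), 2))  # 12.17, matching the figure quoted for 1x
```

The same function gives roughly 24.35 epochs for 2x and 17.58 for s1x, which is where the "roughly 1.44x" stretch factor (130k / 90k) shows up.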

All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
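As a rough illustration of that adjustment, the sketch below applies a linear scaling rule: the LR scales with the minibatch size and the iteration counts scale inversely, so the number of epochs stays fixed. The function names are hypothetical, and the warm-up starting factor of 1/3 is an assumed placeholder, since the text only says the warm-up is linear and lasts 500 iterations.

```python
def scale_schedule(base_lr, steps, max_iter, ref_batch=16, new_batch=8):
    """Rescale a schedule defined at `ref_batch` images to `new_batch` images.

    LR is multiplied by new_batch / ref_batch; iteration counts are divided
    by the same ratio, which keeps the number of epochs unchanged.
    """
    k = new_batch / ref_batch
    return (base_lr * k,
            tuple(round(s / k) for s in steps),
            round(max_iter / k))


def warmup_factor(iteration, warmup_iters=500, start_factor=1.0 / 3.0):
    """Linear ramp of the LR multiplier from `start_factor` to 1.0.

    `start_factor` is a hypothetical default, not a value from the text.
    """
    if iteration >= warmup_iters:
        return 1.0
    alpha = iteration / warmup_iters
    return start_factor * (1.0 - alpha) + alpha


# 1x schedule moved from 16 to 8 images per minibatch
lr, steps, max_iter = scale_schedule(0.02, (60_000, 80_000), 90_000)
# lr is 0.01, steps are (120000, 160000), max_iter is 180000
```

Under this reading, a 1x run at 8 images per minibatch lasts 180k iterations, which agrees with the training times reported for the im/gpu = 1 rows in the tables.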

License

All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

ImageNet Pretrained Models

The backbone models pretrained on ImageNet are available in the format used by Detectron. Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.

  • R-50.pkl: converted copy of MSRA's original ResNet-50 model
  • R-101.pkl: converted copy of MSRA's original ResNet-101 model
  • X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
  • X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
  • X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)

Log Files

Training and inference logs are available for most models in the model zoo.

Proposal, Box, and Mask Detection Baselines

RPN Proposal Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | RPN | 1x | 2 | 4.3 | 0.187 | 4.7 | 0.113 | - | - | - | 51.6 | 35998355 | model / props: 1, 2, 3 |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.416 | 10.4 | 0.080 | - | - | - | 57.2 | 35998814 | model / props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.503 | 12.6 | 0.108 | - | - | - | 58.2 | 35998887 | model / props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.395 | 34.9 | 0.292 | - | - | - | 59.4 | 35998956 | model / props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.102 | 27.6 | 0.222 | - | - | - | 59.5 | 36760102 | model / props: 1, 2, 3 |

Notes:

  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.
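To make the "prop. AR" metric concrete, here is a simplified sketch of average recall for one image. It assumes the COCO convention of averaging recall over IoU thresholds 0.50 to 0.95 in steps of 0.05, scores each ground-truth box by its best-overlapping proposal (no one-to-one matching), and expects proposals sorted by descending score. It is an illustration, not the evaluation code used to produce the table.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2), continuous coordinates."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_recall(gt_boxes, proposals, max_proposals=1000):
    """Recall averaged over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    top = proposals[:max_proposals]  # assumed sorted by descending score
    # Best overlap achieved for each ground-truth box
    best = [max((iou(g, p) for p in top), default=0.0) for g in gt_boxes]
    if not best:
        return 0.0
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)
```

A dataset-level figure would pool the ground-truth boxes of all images before averaging rather than averaging per-image values.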

Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Fast | 1x | 1 | 6.0 | 0.456 | 22.8 | 0.241 + 0.003 | 34.4 | - | - | - | 36224013 | model / boxes |
| R-50-C4 | Fast | 2x | 1 | 6.0 | 0.453 | 45.3 | 0.241 + 0.003 | 35.6 | - | - | - | 36224046 | model / boxes |
| R-50-FPN | Fast | 1x | 2 | 6.0 | 0.285 | 7.1 | 0.076 + 0.004 | 36.4 | - | - | - | 36225147 | model / boxes |
| R-50-FPN | Fast | 2x | 2 | 6.0 | 0.287 | 14.4 | 0.077 + 0.004 | 36.8 | - | - | - | 36225249 | model / boxes |
| R-101-FPN | Fast | 1x | 2 | 7.7 | 0.448 | 11.2 | 0.102 + 0.003 | 38.5 | - | - | - | 36228880 | model / boxes |
| R-101-FPN | Fast | 2x | 2 | 7.7 | 0.449 | 22.5 | 0.103 + 0.004 | 39.0 | - | - | - | 36228933 | model / boxes |
| X-101-64x4d-FPN | Fast | 1x | 1 | 6.3 | 0.994 | 49.7 | 0.292 + 0.003 | 40.4 | - | - | - | 36226250 | model / boxes |
| X-101-64x4d-FPN | Fast | 2x | 1 | 6.3 | 0.980 | 98.0 | 0.291 + 0.003 | 39.8 | - | - | - | 36226326 | model / boxes |
| X-101-32x8d-FPN | Fast | 1x | 1 | 6.4 | 0.721 | 36.1 | 0.217 + 0.003 | 40.6 | - | - | - | 37119777 | model / boxes |
| X-101-32x8d-FPN | Fast | 2x | 1 | 6.4 | 0.720 | 72.0 | 0.217 + 0.003 | 39.7 | - | - | - | 37121469 | model / boxes |
| R-50-C4 | Mask | 1x | 1 | 6.4 | 0.466 | 23.3 | 0.252 + 0.020 | 35.5 | 31.3 | - | - | 36224121 | model / boxes / masks |
| R-50-C4 | Mask | 2x | 1 | 6.4 | 0.464 | 46.4 | 0.253 + 0.019 | 36.9 | 32.5 | - | - | 36224151 | model / boxes / masks |
| R-50-FPN | Mask | 1x | 2 | 7.9 | 0.377 | 9.4 | 0.082 + 0.019 | 37.3 | 33.7 | - | - | 36225401 | model / boxes / masks |
| R-50-FPN | Mask | 2x | 2 | 7.9 | 0.377 | 18.9 | 0.083 + 0.018 | 37.7 | 34.0 | - | - | 36225732 | model / boxes / masks |
| R-101-FPN | Mask | 1x | 2 | 9.6 | 0.539 | 13.5 | 0.111 + 0.018 | 39.4 | 35.6 | - | - | 36229407 | model / boxes / masks |
| R-101-FPN | Mask | 2x | 2 | 9.6 | 0.537 | 26.9 | 0.109 + 0.016 | 40.0 | 35.9 | - | - | 36229740 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.3 | 1.036 | 51.8 | 0.292 + 0.016 | 41.3 | 37.0 | - | - | 36226382 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.3 | 1.035 | 103.5 | 0.292 + 0.014 | 41.1 | 36.6 | - | - | 36672114 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.4 | 0.766 | 38.3 | 0.223 + 0.017 | 41.3 | 37.0 | - | - | 37121516 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.4 | 0.765 | 76.5 | 0.222 + 0.014 | 40.7 | 36.3 | - | - | 37121596 | model / boxes / masks |

Notes:

  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.
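The "train time total (hr)" column appears to be consistent with seconds per iteration multiplied by the number of iterations implied by the schedule and minibatch size. This is an observation from the numbers rather than something the source states, and the helper below is a hypothetical sketch under that assumption (with iteration counts scaled inversely to the minibatch size).

```python
# Iteration counts per schedule at minibatch size 16 (from Training Schedules)
ITERS_AT_16 = {"1x": 90_000, "2x": 180_000, "s1x": 130_000}


def total_train_hours(sec_per_iter, schedule, im_per_gpu, num_gpus=8):
    """Estimate total training hours from the per-iteration time.

    Assumption: halving the minibatch size doubles the iteration count.
    """
    minibatch = im_per_gpu * num_gpus
    iters = ITERS_AT_16[schedule] * 16 / minibatch
    return sec_per_iter * iters / 3600.0


# R-50-C4 Fast, 1x, 1 im/gpu: 0.456 s/iter over 180k iterations
print(round(total_train_hours(0.456, "1x", 1), 1))  # 22.8
```

For the R-50-FPN Fast 1x row (2 im/gpu, 0.285 s/iter) the same formula gives 7.125 hours, in line with the 7.1 listed.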

End-to-End Faster & Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-C4 | Faster | 1x | 1 | 6.3 | 0.566 | 28.3 | 0.167 + 0.003 | 34.8 | - | - | - | 35857197 | model / boxes |
| R-50-C4 | Faster | 2x | 1 | 6.3 | 0.569 | 56.9 | 0.174 + 0.003 | 36.5 | - | - | - | 35857281 | model / boxes |
| R-50-FPN | Faster | 1x | 2 | 7.2 | 0.544 | 13.6 | 0.093 + 0.004 | 36.7 | - | - | - | 35857345 | model / boxes |
| R-50-FPN | Faster | 2x | 2 | 7.2 | 0.546 | 27.3 | 0.092 + 0.004 | 37.9 | - | - | - | 35857389 | model / boxes |
| R-101-FPN | Faster | 1x | 2 | 8.9 | 0.647 | 16.2 | 0.120 + 0.004 | 39.4 | - | - | - | 35857890 | model / boxes |
| R-101-FPN | Faster | 2x | 2 | 8.9 | 0.647 | 32.4 | 0.119 + 0.004 | 39.8 | - | - | - | 35857952 | model / boxes |
| X-101-64x4d-FPN | Faster | 1x | 1 | 6.9 | 1.057 | 52.9 | 0.305 + 0.003 | 41.5 | - | - | - | 35858015 | model / boxes |
| X-101-64x4d-FPN | Faster | 2x | 1 | 6.9 | 1.055 | 105.5 | 0.304 + 0.003 | 40.8 | - | - | - | 35858198 | model / boxes |
| X-101-32x8d-FPN | Faster | 1x | 1 | 7.0 | 0.799 | 40.0 | 0.233 + 0.004 | 41.3 | - | - | - | 36761737 | model / boxes |
| X-101-32x8d-FPN | Faster | 2x | 1 | 7.0 | 0.800 | 80.0 | 0.233 + 0.003 | 40.6 | - | - | - | 36761786 | model / boxes |
| R-50-C4 | Mask | 1x | 1 | 6.6 | 0.620 | 31.0 | 0.181 + 0.018 | 35.8 | 31.4 | - | - | 35858791 | model / boxes / masks |
| R-50-C4 | Mask | 2x | 1 | 6.6 | 0.620 | 62.0 | 0.182 + 0.017 | 37.8 | 32.8 | - | - | 35858828 | model / boxes / masks |
| R-50-FPN | Mask | 1x | 2 | 8.6 | 0.889 | 22.2 | 0.099 + 0.019 | 37.7 | 33.9 | - | - | 35858933 | model / boxes / masks |
| R-50-FPN | Mask | 2x | 2 | 8.6 | 0.897 | 44.9 | 0.099 + 0.018 | 38.6 | 34.5 | - | - | 35859007 | model / boxes / masks |
| R-101-FPN | Mask | 1x | 2 | 10.2 | 1.008 | 25.2 | 0.126 + 0.018 | 40.0 | 35.9 | - | - | 35861795 | model / boxes / masks |
| R-101-FPN | Mask | 2x | 2 | 10.2 | 0.993 | 49.7 | 0.126 + 0.017 | 40.9 | 36.4 | - | - | 35861858 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 1x | 1 | 7.6 | 1.217 | 60.9 | 0.309 + 0.018 | 42.4 | 37.5 | - | - | 36494496 | model / boxes / masks |
| X-101-64x4d-FPN | Mask | 2x | 1 | 7.6 | 1.210 | 121.0 | 0.309 + 0.015 | 42.2 | 37.2 | - | - | 35859745 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 1x | 1 | 7.7 | 0.961 | 48.1 | 0.239 + 0.019 | 42.1 | 37.3 | - | - | 36761843 | model / boxes / masks |
| X-101-32x8d-FPN | Mask | 2x | 1 | 7.7 | 0.975 | 97.5 | 0.240 + 0.016 | 41.7 | 36.9 | - | - | 36762092 | model / boxes / masks |

Notes:

  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.
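For comparing rows, it can help to turn an inference-time cell into a single latency or throughput number. The sketch below is a hypothetical helper: it accepts a cell holding one number (GPU time only) or two numbers (GPU time followed by CPU time), and it assumes the two parts run one after the other so that their sum is the per-image latency.

```python
import re


def parse_inference_time(cell):
    """Split an inference-time cell into (gpu_seconds, cpu_seconds).

    A cell with a single number is treated as GPU time with no CPU part.
    """
    parts = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", cell)]
    gpu = parts[0]
    cpu = parts[1] if len(parts) > 1 else 0.0
    return gpu, cpu


def images_per_second(cell):
    """Throughput implied by a cell, assuming GPU and CPU parts are sequential."""
    gpu, cpu = parse_inference_time(cell)
    return 1.0 / (gpu + cpu)


# End-to-end R-50-FPN Mask R-CNN (1x): about 8.5 images per second
print(round(images_per_second("0.099 + 0.019"), 1))  # 8.5
```

Keeping the two parts separate is useful because only the CPU part is described as unoptimized.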

RetinaNet Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RetinaNet | 1x | 2 | 6.8 | 0.483 | 12.1 | 0.125 | 35.7 | - | - | - | 36768636 | model / boxes |
| R-50-FPN | RetinaNet | 2x | 2 | 6.8 | 0.482 | 24.1 | 0.127 | 35.7 | - | - | - | 36768677 | model / boxes |
| R-101-FPN | RetinaNet | 1x | 2 | 8.7 | 0.666 | 16.7 | 0.156 | 37.7 | - | - | - | 36768744 | model / boxes |
| R-101-FPN | RetinaNet | 2x | 2 | 8.7 | 0.666 | 33.3 | 0.154 | 37.8 | - | - | - | 36768840 | model / boxes |
| X-101-64x4d-FPN | RetinaNet | 1x | 2 | 12.6 | 1.613 | 40.3 | 0.341 | 39.8 | - | - | - | 36768875 | model / boxes |
| X-101-64x4d-FPN | RetinaNet | 2x | 2 | 12.6 | 1.625 | 81.3 | 0.339 | 39.2 | - | - | - | 36768907 | model / boxes |
| X-101-32x8d-FPN | RetinaNet | 1x | 2 | 12.7 | 1.343 | 33.6 | 0.277 | 39.5 | - | - | - | 36769563 | model / boxes |
| X-101-32x8d-FPN | RetinaNet | 2x | 2 | 12.7 | 1.340 | 67.0 | 0.276 | 38.6 | - | - | - | 36769641 | model / boxes |

Notes: none

Mask R-CNN with Bells & Whistles

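| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X-152-32x8d-FPN-IN5k | Mask | s1x | 1 | 9.6 | 1.188 | 85.8 | 12.100 + 0.046 | 48.1 | 41.5 | - | - | 37129812 | model / boxes / masks |
| [above without test-time aug.] |  |  |  |  |  |  | 0.325 + 0.018 | 45.2 | 39.7 | - | - |  |  |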

Notes:

  • A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
  • The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
  • Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
  • Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale)
  • Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
  • Like the other results, this is a single model result (it is not an ensemble of models)
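To give a sense of what the Row 1 test-time augmentation involves, the sketch below enumerates the forward passes for one image and shows how a box predicted on a horizontally flipped image maps back to the original frame. The names are hypothetical, the flip uses a continuous-coordinate convention (an assumption; pixel-index conventions differ by one), and the way per-pass results are merged is not described in the text and is not modeled here.

```python
TEST_SCALES = (400, 500, 600, 700, 900, 1000, 1100, 1200)  # from the notes above


def tta_passes(scales=TEST_SCALES, flip=True):
    """Enumerate the (scale, flipped) forward passes for one image."""
    flips = (False, True) if flip else (False,)
    return [(s, f) for s in scales for f in flips]


def unflip_box(box, image_width):
    """Map a box (x1, y1, x2, y2) from a horizontally flipped image
    back to the coordinate frame of the original image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


print(len(tta_passes()))  # 16 passes: 8 scales, each with and without flipping
```

Running 16 passes, several of them at scales well above the training range, is one reason the augmented row is far slower than the single-pass row.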

Keypoint Detection Baselines

Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)

Our keypoint detection baselines differ from our box and mask baselines in a few details:

  • Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
  • Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set).
  • Metrics are reported for the person class only (still run on the entire coco_2014_minival dataset).
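The training-set filtering described above can be approximated on a COCO-format annotation dictionary as follows. This is a sketch under stated assumptions: "a person with keypoint annotations" is read as a person annotation whose `num_keypoints` field is greater than zero, the function names are hypothetical, and the actual filtering logic in Detectron may differ.

```python
PERSON_CATEGORY_ID = 1  # COCO category id for "person"


def images_with_person_keypoints(annotations):
    """Ids of images with at least one person annotation that has keypoints."""
    keep = set()
    for ann in annotations:
        is_person = ann.get("category_id") == PERSON_CATEGORY_ID
        if is_person and ann.get("num_keypoints", 0) > 0:
            keep.add(ann["image_id"])
    return keep


def filter_dataset(coco):
    """Return a copy of a COCO-style dict restricted to the kept images."""
    keep = images_with_person_keypoints(coco["annotations"])
    return {
        **coco,
        "images": [im for im in coco["images"] if im["id"] in keep],
        "annotations": [a for a in coco["annotations"] if a["image_id"] in keep],
    }
```

Evaluation is unaffected by this filter, since metrics are still computed over the full coco_2014_minival set.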

Person-Specific RPN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | RPN | 1x | 2 | 6.4 | 0.391 | 9.8 | 0.082 | - | - | - | 64.0 | 35998996 | model / props: 1, 2, 3 |
| R-101-FPN | RPN | 1x | 2 | 8.1 | 0.504 | 12.6 | 0.109 | - | - | - | 65.2 | 35999521 | model / props: 1, 2, 3 |
| X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.394 | 34.9 | 0.289 | - | - | - | 65.9 | 35999553 | model / props: 1, 2, 3 |
| X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.104 | 27.6 | 0.224 | - | - | - | 66.2 | 36760438 | model / props: 1, 2, 3 |

Notes:

  • Metrics are for the person category only.
  • Inference time only includes RPN proposal generation.
  • "prop. AR" is proposal average recall at 1000 proposals per image.
  • Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.

Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 7.7 | 0.533 | 13.3 | 0.081 + 0.087 | 52.7 | - | 64.1 | - | 37651787 | model / boxes / kps |
| R-50-FPN | Kps | s1x | 2 | 7.7 | 0.533 | 19.2 | 0.080 + 0.085 | 53.4 | - | 65.5 | - | 37651887 | model / boxes / kps |
| R-101-FPN | Kps | 1x | 2 | 9.4 | 0.668 | 16.7 | 0.109 + 0.080 | 53.5 | - | 65.0 | - | 37651996 | model / boxes / kps |
| R-101-FPN | Kps | s1x | 2 | 9.4 | 0.668 | 24.1 | 0.108 + 0.076 | 54.6 | - | 66.0 | - | 37652016 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 12.8 | 1.477 | 36.9 | 0.288 + 0.077 | 55.8 | - | 66.7 | - | 37731079 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 12.9 | 1.478 | 53.4 | 0.286 + 0.075 | 56.3 | - | 67.1 | - | 37731142 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 12.9 | 1.215 | 30.4 | 0.219 + 0.084 | 55.4 | - | 66.2 | - | 37730253 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 12.9 | 1.214 | 43.8 | 0.218 + 0.071 | 55.9 | - | 67.0 | - | 37731010 | model / boxes / kps |

Notes:

  • Metrics are for the person category only.
  • Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
  • Inference time excludes proposal generation.

End-to-End Keypoint-Only Mask R-CNN Baselines

| backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R-50-FPN | Kps | 1x | 2 | 9.0 | 0.832 | 20.8 | 0.097 + 0.092 | 53.6 | - | 64.2 | - | 37697547 | model / boxes / kps |
| R-50-FPN | Kps | s1x | 2 | 9.0 | 0.828 | 29.9 | 0.096 + 0.089 | 54.3 | - | 65.4 | - | 37697714 | model / boxes / kps |
| R-101-FPN | Kps | 1x | 2 | 10.6 | 0.923 | 23.1 | 0.124 + 0.084 | 54.5 | - | 64.8 | - | 37697946 | model / boxes / kps |
| R-101-FPN | Kps | s1x | 2 | 10.6 | 0.921 | 33.3 | 0.123 + 0.083 | 55.3 | - | 65.8 | - | 37698009 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | 1x | 2 | 14.1 | 1.655 | 41.4 | 0.302 + 0.079 | 56.3 | - | 66.0 | - | 37732355 | model / boxes / kps |
| X-101-64x4d-FPN | Kps | s1x | 2 | 14.1 | 1.731 | 62.5 | 0.322 + 0.074 | 56.9 | - | 66.8 | - | 37732415 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | 1x | 2 | 14.2 | 1.410 | 35.3 | 0.235 + 0.080 | 56.0 | - | 66.0 | - | 37792158 | model / boxes / kps |
| X-101-32x8d-FPN | Kps | s1x | 2 | 14.2 | 1.408 | 50.8 | 0.236 + 0.075 | 56.9 | - | 67.0 | - | 37732318 | model / boxes / kps |

Notes:

  • Metrics are for the person category only.
  • For these models, RPN and the detector are trained jointly and end-to-end.
  • Inference time is fully image-to-detections, including proposal generation.
