Introduction
This file documents a large collection of baselines trained with Detectron, primarily in late December 2017. We refer to these results as the 12_2017_baselines. All configurations for these baselines are located in the configs/12_2017_baselines directory. The tables below provide results and useful statistics about training and inference. Links to the trained models as well as their output are provided. Unless noted differently below (see "Notes" under each table), the following common settings are used for all training and inference runs.
Common Settings and Notes
- All baselines were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators (with 16GB GPU memory, CUDA 8.0, and cuDNN 6.0.21).
- All baselines were trained using 8 GPU data parallel sync SGD with a minibatch size of either 8 or 16 images (see the im/gpu column).
- For training, only horizontal flipping data augmentation was used.
- For inference, no test-time augmentations (e.g., multiple scales, flipping) were used.
- All models were trained on the union of coco_2014_train and coco_2014_valminusminival, which is exactly equivalent to the recently defined coco_2017_train dataset.
- All models were tested on the coco_2014_minival dataset, which is exactly equivalent to the recently defined coco_2017_val dataset.
- Inference times are often expressed as "X Y", in which X is time taken in reasonably well-optimized GPU code and Y is time taken in unoptimized CPU code. (The CPU code time could be reduced substantially with additional engineering.)
- Inference results for boxes, masks, and keypoints ("kps") are provided in the COCO json format.
- The model id column is provided for ease of reference.
- To check downloaded file integrity: for any download URL on this page, simply append .md5sum to the URL to download the file's md5 hash.
- All models and results below are on the COCO dataset.
- Baseline models and results for the Cityscapes dataset are coming soon!
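The integrity check described above could be scripted roughly as follows. This is a minimal sketch: the helper names `md5_of_file` and `matches_md5sum` are illustrative, and it assumes that the text served at the `.md5sum` URL begins with the hex digest (the usual `md5sum` output layout), which this page does not state.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5sum(path, md5sum_text):
    """Compare a local file against the text fetched from `<url>.md5sum`.

    Assumption: the first whitespace-separated token of the text is the digest.
    """
    tokens = md5sum_text.split()
    return bool(tokens) and tokens[0].lower() == md5_of_file(path)
```

Fetching the two URLs is left out on purpose; any HTTP client can supply the downloaded file and the `.md5sum` text.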
Training Schedules
We use three training schedules, indicated by the lr schd column in the tables below.
- 1x: For minibatch size 16, this schedule starts at a LR of 0.02, which is multiplied by 0.1 after 60k iterations and again after 80k iterations; training terminates at 90k iterations. This schedule results in 12.17 epochs over the 118,287 images in coco_2014_train union coco_2014_valminusminival (or equivalently, coco_2017_train).
- 2x: Twice as long as the 1x schedule with the LR change points scaled proportionally.
- s1x ("stretched 1x"): This schedule scales the 1x schedule by roughly 1.44x, but also extends the duration of the first learning rate. With a minibatch size of 16, it multiplies the LR by 0.1 at 100k and 120k iterations and ends after 130k iterations.
All training schedules also use a 500 iteration linear learning rate warm up. When changing the minibatch size between 8 and 16 images, we adjust the number of SGD iterations and the base learning rate according to the principles outlined in our paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
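The arithmetic behind these schedules can be sketched as follows. This is an illustration, not Detectron's solver code: the function names are hypothetical, the 2x entries are derived by doubling the 1x values, the rescaling follows the linear scaling rule (LR proportional to minibatch size, iteration counts inversely proportional), and the warm-up starting fraction `warmup_start` is an assumed parameter that this page does not specify.

```python
COCO_TRAIN_IMAGES = 118287  # image count quoted above for coco_2017_train

# (base LR, LR decay steps, final iteration) for a minibatch of 16 images.
SCHEDULES_AT_16 = {
    "1x": (0.02, (60000, 80000), 90000),
    "2x": (0.02, (120000, 160000), 180000),
    "s1x": (0.02, (100000, 120000), 130000),
}


def scaled_schedule(name, minibatch_size):
    """Rescale a schedule defined for 16 images to another minibatch size.

    LR scales with the minibatch size and iteration counts scale inversely,
    so the number of epochs is unchanged.
    """
    base_lr, steps, max_iter = SCHEDULES_AT_16[name]
    k = minibatch_size / 16
    return base_lr * k, tuple(round(s / k) for s in steps), round(max_iter / k)


def learning_rate(it, base_lr, steps, warmup_iters=500, warmup_start=1 / 3, gamma=0.1):
    """LR at iteration `it`: linear warm-up, then multiply by gamma at each step.

    warmup_start is an assumed starting fraction of base_lr, not from the text.
    """
    lr = base_lr * gamma ** sum(it >= s for s in steps)
    if it < warmup_iters:
        alpha = it / warmup_iters
        lr *= warmup_start * (1 - alpha) + alpha
    return lr


def epochs(max_iter, minibatch_size, num_images=COCO_TRAIN_IMAGES):
    """Number of passes over the training set for a given schedule length."""
    return max_iter * minibatch_size / num_images
```

Under these assumptions, `scaled_schedule("1x", 8)` gives a base LR of 0.01 with steps at 120k and 160k and 180k total iterations, and `epochs(90000, 16)` evaluates to about 12.17, matching the figure quoted above.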
License
All models available for download through this document are licensed under the Creative Commons Attribution-ShareAlike 3.0 license.
ImageNet Pretrained Models
The backbone models pretrained on ImageNet are available in the format used by Detectron. Unless otherwise noted, these models are trained on the standard ImageNet-1k dataset.
- R-50.pkl: converted copy of MSRA's original ResNet-50 model
- R-101.pkl: converted copy of MSRA's original ResNet-101 model
- X-101-64x4d.pkl: converted copy of FB's original ResNeXt-101-64x4d model trained with Torch7
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB
- X-152-32x8d-IN5k.pkl: ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see our ResNeXt paper for details on ImageNet-5k)
Log Files
Training and inference logs are available for most models in the model zoo.
Proposal, Box, and Mask Detection Baselines
RPN Proposal Baselines
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-C4 | RPN | 1x | 2 | 4.3 | 0.187 | 4.7 | 0.113 | - | - | - | 51.6 | 35998355 | model | props: 1, 2, 3 |
R-50-FPN | RPN | 1x | 2 | 6.4 | 0.416 | 10.4 | 0.080 | - | - | - | 57.2 | 35998814 | model | props: 1, 2, 3 |
R-101-FPN | RPN | 1x | 2 | 8.1 | 0.503 | 12.6 | 0.108 | - | - | - | 58.2 | 35998887 | model | props: 1, 2, 3 |
X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.395 | 34.9 | 0.292 | - | - | - | 59.4 | 35998956 | model | props: 1, 2, 3 |
X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.102 | 27.6 | 0.222 | - | - | - | 59.5 | 36760102 | model | props: 1, 2, 3 |
Notes:
- Inference time only includes RPN proposal generation.
- "prop. AR" is proposal average recall at 1000 proposals per image.
- Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival.
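For readers unfamiliar with the "prop. AR" metric, the following is a simplified sketch of average recall at a fixed proposal budget for a single image. It assumes the COCO-style convention of averaging recall over IoU thresholds from 0.50 to 0.95 in steps of 0.05 and boxes in (x1, y1, x2, y2) form; neither detail is stated on this page, the function names are illustrative, and this is not the evaluation code used to produce the table.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(gt_boxes, proposals, limit=1000):
    """Average recall over IoU thresholds 0.50:0.05:0.95 for the top `limit` proposals.

    `proposals` is assumed to be sorted by descending score. A ground-truth
    box counts as recalled at a threshold if any kept proposal overlaps it
    by at least that IoU (no one-to-one matching is enforced).
    """
    if not gt_boxes:
        return 0.0
    top = proposals[:limit]
    best = [max((iou(gt, p) for p in top), default=0.0) for gt in gt_boxes]
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(thresholds)
```

A dataset-level figure would pool the best overlaps of all ground-truth boxes across images before thresholding rather than averaging per-image values.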
Fast & Mask R-CNN Baselines Using Precomputed RPN Proposals
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-C4 | Fast | 1x | 1 | 6.0 | 0.456 | 22.8 | 0.241 0.003 | 34.4 | - | - | - | 36224013 | model | boxes |
R-50-C4 | Fast | 2x | 1 | 6.0 | 0.453 | 45.3 | 0.241 0.003 | 35.6 | - | - | - | 36224046 | model | boxes |
R-50-FPN | Fast | 1x | 2 | 6.0 | 0.285 | 7.1 | 0.076 0.004 | 36.4 | - | - | - | 36225147 | model | boxes |
R-50-FPN | Fast | 2x | 2 | 6.0 | 0.287 | 14.4 | 0.077 0.004 | 36.8 | - | - | - | 36225249 | model | boxes |
R-101-FPN | Fast | 1x | 2 | 7.7 | 0.448 | 11.2 | 0.102 0.003 | 38.5 | - | - | - | 36228880 | model | boxes |
R-101-FPN | Fast | 2x | 2 | 7.7 | 0.449 | 22.5 | 0.103 0.004 | 39.0 | - | - | - | 36228933 | model | boxes |
X-101-64x4d-FPN | Fast | 1x | 1 | 6.3 | 0.994 | 49.7 | 0.292 0.003 | 40.4 | - | - | - | 36226250 | model | boxes |
X-101-64x4d-FPN | Fast | 2x | 1 | 6.3 | 0.980 | 98.0 | 0.291 0.003 | 39.8 | - | - | - | 36226326 | model | boxes |
X-101-32x8d-FPN | Fast | 1x | 1 | 6.4 | 0.721 | 36.1 | 0.217 0.003 | 40.6 | - | - | - | 37119777 | model | boxes |
X-101-32x8d-FPN | Fast | 2x | 1 | 6.4 | 0.720 | 72.0 | 0.217 0.003 | 39.7 | - | - | - | 37121469 | model | boxes |
R-50-C4 | Mask | 1x | 1 | 6.4 | 0.466 | 23.3 | 0.252 0.020 | 35.5 | 31.3 | - | - | 36224121 | model | boxes | masks |
R-50-C4 | Mask | 2x | 1 | 6.4 | 0.464 | 46.4 | 0.253 0.019 | 36.9 | 32.5 | - | - | 36224151 | model | boxes | masks |
R-50-FPN | Mask | 1x | 2 | 7.9 | 0.377 | 9.4 | 0.082 0.019 | 37.3 | 33.7 | - | - | 36225401 | model | boxes | masks |
R-50-FPN | Mask | 2x | 2 | 7.9 | 0.377 | 18.9 | 0.083 0.018 | 37.7 | 34.0 | - | - | 36225732 | model | boxes | masks |
R-101-FPN | Mask | 1x | 2 | 9.6 | 0.539 | 13.5 | 0.111 0.018 | 39.4 | 35.6 | - | - | 36229407 | model | boxes | masks |
R-101-FPN | Mask | 2x | 2 | 9.6 | 0.537 | 26.9 | 0.109 0.016 | 40.0 | 35.9 | - | - | 36229740 | model | boxes | masks |
X-101-64x4d-FPN | Mask | 1x | 1 | 7.3 | 1.036 | 51.8 | 0.292 0.016 | 41.3 | 37.0 | - | - | 36226382 | model | boxes | masks |
X-101-64x4d-FPN | Mask | 2x | 1 | 7.3 | 1.035 | 103.5 | 0.292 0.014 | 41.1 | 36.6 | - | - | 36672114 | model | boxes | masks |
X-101-32x8d-FPN | Mask | 1x | 1 | 7.4 | 0.766 | 38.3 | 0.223 0.017 | 41.3 | 37.0 | - | - | 37121516 | model | boxes | masks |
X-101-32x8d-FPN | Mask | 2x | 1 | 7.4 | 0.765 | 76.5 | 0.222 0.014 | 40.7 | 36.3 | - | - | 37121596 | model | boxes | masks |
Notes:
- Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
- Inference time excludes proposal generation.
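The "train time total (hr)" column in the table above appears consistent with multiplying the per-iteration time by the schedule length for the row's minibatch size (8 GPUs times im/gpu). The sketch below reproduces two entries under that reading. The iteration counts for a minibatch of 8 are inferred from the scaling described under Training Schedules rather than stated per row, and the names are illustrative.

```python
# Schedule lengths for a minibatch of 16 images (from Training Schedules).
ITERS_AT_16 = {"1x": 90000, "2x": 180000, "s1x": 130000}
NUM_GPUS = 8


def total_train_hours(sec_per_iter, lr_schd, im_per_gpu):
    """Estimate total training hours from s/iter, schedule, and im/gpu."""
    minibatch = NUM_GPUS * im_per_gpu
    # Halving the minibatch doubles the iteration count (assumed scaling).
    iterations = ITERS_AT_16[lr_schd] * 16 // minibatch
    return sec_per_iter * iterations / 3600.0


# R-50-FPN Fast R-CNN, 1x, 2 im/gpu, 0.285 s/iter
print(round(total_train_hours(0.285, "1x", 2), 1))  # 7.1
# R-50-C4 Fast R-CNN, 1x, 1 im/gpu, 0.456 s/iter
print(round(total_train_hours(0.456, "1x", 1), 1))  # 22.8
```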
End-to-End Faster & Mask R-CNN Baselines
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-C4 | Faster | 1x | 1 | 6.3 | 0.566 | 28.3 | 0.167 0.003 | 34.8 | - | - | - | 35857197 | model | boxes |
R-50-C4 | Faster | 2x | 1 | 6.3 | 0.569 | 56.9 | 0.174 0.003 | 36.5 | - | - | - | 35857281 | model | boxes |
R-50-FPN | Faster | 1x | 2 | 7.2 | 0.544 | 13.6 | 0.093 0.004 | 36.7 | - | - | - | 35857345 | model | boxes |
R-50-FPN | Faster | 2x | 2 | 7.2 | 0.546 | 27.3 | 0.092 0.004 | 37.9 | - | - | - | 35857389 | model | boxes |
R-101-FPN | Faster | 1x | 2 | 8.9 | 0.647 | 16.2 | 0.120 0.004 | 39.4 | - | - | - | 35857890 | model | boxes |
R-101-FPN | Faster | 2x | 2 | 8.9 | 0.647 | 32.4 | 0.119 0.004 | 39.8 | - | - | - | 35857952 | model | boxes |
X-101-64x4d-FPN | Faster | 1x | 1 | 6.9 | 1.057 | 52.9 | 0.305 0.003 | 41.5 | - | - | - | 35858015 | model | boxes |
X-101-64x4d-FPN | Faster | 2x | 1 | 6.9 | 1.055 | 105.5 | 0.304 0.003 | 40.8 | - | - | - | 35858198 | model | boxes |
X-101-32x8d-FPN | Faster | 1x | 1 | 7.0 | 0.799 | 40.0 | 0.233 0.004 | 41.3 | - | - | - | 36761737 | model | boxes |
X-101-32x8d-FPN | Faster | 2x | 1 | 7.0 | 0.800 | 80.0 | 0.233 0.003 | 40.6 | - | - | - | 36761786 | model | boxes |
R-50-C4 | Mask | 1x | 1 | 6.6 | 0.620 | 31.0 | 0.181 0.018 | 35.8 | 31.4 | - | - | 35858791 | model | boxes | masks |
R-50-C4 | Mask | 2x | 1 | 6.6 | 0.620 | 62.0 | 0.182 0.017 | 37.8 | 32.8 | - | - | 35858828 | model | boxes | masks |
R-50-FPN | Mask | 1x | 2 | 8.6 | 0.889 | 22.2 | 0.099 0.019 | 37.7 | 33.9 | - | - | 35858933 | model | boxes | masks |
R-50-FPN | Mask | 2x | 2 | 8.6 | 0.897 | 44.9 | 0.099 0.018 | 38.6 | 34.5 | - | - | 35859007 | model | boxes | masks |
R-101-FPN | Mask | 1x | 2 | 10.2 | 1.008 | 25.2 | 0.126 0.018 | 40.0 | 35.9 | - | - | 35861795 | model | boxes | masks |
R-101-FPN | Mask | 2x | 2 | 10.2 | 0.993 | 49.7 | 0.126 0.017 | 40.9 | 36.4 | - | - | 35861858 | model | boxes | masks |
X-101-64x4d-FPN | Mask | 1x | 1 | 7.6 | 1.217 | 60.9 | 0.309 0.018 | 42.4 | 37.5 | - | - | 36494496 | model | boxes | masks |
X-101-64x4d-FPN | Mask | 2x | 1 | 7.6 | 1.210 | 121.0 | 0.309 0.015 | 42.2 | 37.2 | - | - | 35859745 | model | boxes | masks |
X-101-32x8d-FPN | Mask | 1x | 1 | 7.7 | 0.961 | 48.1 | 0.239 0.019 | 42.1 | 37.3 | - | - | 36761843 | model | boxes | masks |
X-101-32x8d-FPN | Mask | 2x | 1 | 7.7 | 0.975 | 97.5 | 0.240 0.016 | 41.7 | 36.9 | - | - | 36762092 | model | boxes | masks |
Notes:
- For these models, RPN and the detector are trained jointly and end-to-end.
- Inference time is fully image-to-detections, including proposal generation.
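When post-processing these tables, the two-part inference time entries can be split into their GPU and CPU components and summed for an end-to-end estimate. A minimal sketch, assuming each cell holds one or two whitespace-separated numbers as shown on this page (the function names are illustrative):

```python
def parse_inference_time(cell):
    """Split an inference time cell into (gpu_seconds, cpu_seconds).

    A cell with a single number is returned with cpu_seconds set to None,
    because the page does not say how such a value divides between GPU and CPU.
    """
    parts = [float(p) for p in cell.split()]
    if len(parts) == 1:
        return parts[0], None
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError("expected one or two numbers, got %r" % cell)


def total_seconds(cell):
    """Sum of the reported components for one image."""
    gpu, cpu = parse_inference_time(cell)
    return gpu + (cpu or 0.0)
```

For the R-50-C4 Faster R-CNN 1x row, `total_seconds("0.167 0.003")` comes to roughly 0.170 s per image.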
RetinaNet Baselines
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | RetinaNet | 1x | 2 | 6.8 | 0.483 | 12.1 | 0.125 | 35.7 | - | - | - | 36768636 | model | boxes |
R-50-FPN | RetinaNet | 2x | 2 | 6.8 | 0.482 | 24.1 | 0.127 | 35.7 | - | - | - | 36768677 | model | boxes |
R-101-FPN | RetinaNet | 1x | 2 | 8.7 | 0.666 | 16.7 | 0.156 | 37.7 | - | - | - | 36768744 | model | boxes |
R-101-FPN | RetinaNet | 2x | 2 | 8.7 | 0.666 | 33.3 | 0.154 | 37.8 | - | - | - | 36768840 | model | boxes |
X-101-64x4d-FPN | RetinaNet | 1x | 2 | 12.6 | 1.613 | 40.3 | 0.341 | 39.8 | - | - | - | 36768875 | model | boxes |
X-101-64x4d-FPN | RetinaNet | 2x | 2 | 12.6 | 1.625 | 81.3 | 0.339 | 39.2 | - | - | - | 36768907 | model | boxes |
X-101-32x8d-FPN | RetinaNet | 1x | 2 | 12.7 | 1.343 | 33.6 | 0.277 | 39.5 | - | - | - | 36769563 | model | boxes |
X-101-32x8d-FPN | RetinaNet | 2x | 2 | 12.7 | 1.340 | 67.0 | 0.276 | 38.6 | - | - | - | 36769641 | model | boxes |
Notes: none
Mask R-CNN with Bells & Whistles
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X-152-32x8d-FPN-IN5k | Mask | s1x | 1 | 9.6 | 1.188 | 85.8 | 12.100 0.046 | 48.1 | 41.5 | - | - | 37129812 | model | boxes | masks |
[above without test-time aug.] | | | | | | | 0.325 0.018 | 45.2 | 39.7 | - | - | | |
Notes:
- A deeper backbone architecture is used: ResNeXt-152-32x8d-FPN
- The backbone ResNeXt-152-32x8d model was trained on ImageNet-5k (not the usual ImageNet-1k)
- Training uses multi-scale jitter over scales {640, 672, 704, 736, 768, 800}
- Row 1: test-time augmentations are multi-scale testing over {400, 500, 600, 700, 900, 1000, 1100, 1200} and horizontal flipping (on each scale)
- Row 2: same model as row 1, but without any test-time augmentation (i.e., same as the common baseline configuration)
- Like the other results, this is a single model result (it is not an ensemble of models)
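The multi-scale training jitter in the notes above amounts to drawing one target scale per image and resizing so that the shorter side matches it. A minimal sketch follows; the cap on the longer side (`max_size=1333`) and the function names are assumptions made for illustration and are not given on this page.

```python
import random

# Training scales listed in the notes above (target length of the shorter side).
TRAIN_SCALES = (640, 672, 704, 736, 768, 800)


def resize_factor(height, width, target_scale, max_size=1333):
    """Scale factor making the shorter side `target_scale`, capping the longer side.

    max_size is an assumed cap, not a value taken from this page.
    """
    short_side, long_side = min(height, width), max(height, width)
    factor = target_scale / short_side
    if long_side * factor > max_size:
        factor = max_size / long_side
    return factor


def sample_train_scale(rng=random):
    """Draw one training scale uniformly at random."""
    return rng.choice(TRAIN_SCALES)
```

Multi-scale testing would instead loop over every listed test scale (and its horizontal flip) and merge the resulting detections, which is why row 1 is much slower at inference than row 2.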
Keypoint Detection Baselines
Common Settings for Keypoint Detection Baselines (That Differ from Boxes and Masks)
Our keypoint detection baselines differ from our box and mask baselines in a few details:
- Due to less training data for the keypoint detection task compared with boxes and masks, we enable multi-scale jitter during training for all keypoint detection models. (Testing is still without any test-time augmentations by default.)
- Models are trained only on images from coco_2014_train union coco_2014_valminusminival that contain at least one person with keypoint annotations (all other images are discarded from the training set).
- Metrics are reported for the person class only (still run on the entire coco_2014_minival dataset).
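The training-set filter described above can be sketched over COCO-style annotation records. This assumes the COCO keypoint annotation layout, in which each annotation carries `image_id`, `category_id`, and `num_keypoints`, and that the person category has id 1. The function name is illustrative, and this is not the filtering code used for these baselines.

```python
PERSON_CATEGORY_ID = 1  # person category id in COCO


def images_with_person_keypoints(annotations):
    """Return ids of images with at least one person that has labeled keypoints.

    `annotations` is an iterable of dicts in the COCO annotation layout.
    """
    keep = set()
    for ann in annotations:
        is_person = ann.get("category_id") == PERSON_CATEGORY_ID
        if is_person and ann.get("num_keypoints", 0) > 0:
            keep.add(ann["image_id"])
    return keep
```

Images whose ids are not in the returned set would be dropped from the training list; evaluation still runs on every image.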
Person-Specific RPN Baselines
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | RPN | 1x | 2 | 6.4 | 0.391 | 9.8 | 0.082 | - | - | - | 64.0 | 35998996 | model | props: 1, 2, 3 |
R-101-FPN | RPN | 1x | 2 | 8.1 | 0.504 | 12.6 | 0.109 | - | - | - | 65.2 | 35999521 | model | props: 1, 2, 3 |
X-101-64x4d-FPN | RPN | 1x | 2 | 11.5 | 1.394 | 34.9 | 0.289 | - | - | - | 65.9 | 35999553 | model | props: 1, 2, 3 |
X-101-32x8d-FPN | RPN | 1x | 2 | 11.6 | 1.104 | 27.6 | 0.224 | - | - | - | 66.2 | 36760438 | model | props: 1, 2, 3 |
Notes:
- Metrics are for the person category only.
- Inference time only includes RPN proposal generation.
- "prop. AR" is proposal average recall at 1000 proposals per image.
- Proposal download links ("props"): "1" is coco_2014_train; "2" is coco_2014_valminusminival; and "3" is coco_2014_minival. These include all images, not just the ones with valid keypoint annotations.
Keypoint-Only Mask R-CNN Baselines Using Precomputed RPN Proposals
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | Kps | 1x | 2 | 7.7 | 0.533 | 13.3 | 0.081 0.087 | 52.7 | - | 64.1 | - | 37651787 | model | boxes | kps |
R-50-FPN | Kps | s1x | 2 | 7.7 | 0.533 | 19.2 | 0.080 0.085 | 53.4 | - | 65.5 | - | 37651887 | model | boxes | kps |
R-101-FPN | Kps | 1x | 2 | 9.4 | 0.668 | 16.7 | 0.109 0.080 | 53.5 | - | 65.0 | - | 37651996 | model | boxes | kps |
R-101-FPN | Kps | s1x | 2 | 9.4 | 0.668 | 24.1 | 0.108 0.076 | 54.6 | - | 66.0 | - | 37652016 | model | boxes | kps |
X-101-64x4d-FPN | Kps | 1x | 2 | 12.8 | 1.477 | 36.9 | 0.288 0.077 | 55.8 | - | 66.7 | - | 37731079 | model | boxes | kps |
X-101-64x4d-FPN | Kps | s1x | 2 | 12.9 | 1.478 | 53.4 | 0.286 0.075 | 56.3 | - | 67.1 | - | 37731142 | model | boxes | kps |
X-101-32x8d-FPN | Kps | 1x | 2 | 12.9 | 1.215 | 30.4 | 0.219 0.084 | 55.4 | - | 66.2 | - | 37730253 | model | boxes | kps |
X-101-32x8d-FPN | Kps | s1x | 2 | 12.9 | 1.214 | 43.8 | 0.218 0.071 | 55.9 | - | 67.0 | - | 37731010 | model | boxes | kps |
Notes:
- Metrics are for the person category only.
- Each row uses precomputed RPN proposals from the corresponding table row above that uses the same backbone.
- Inference time excludes proposal generation.
End-to-End Keypoint-Only Mask R-CNN Baselines
backbone | type | lr schd | im/gpu | train mem (GB) | train time (s/iter) | train time total (hr) | inference time (s/im) | box AP | mask AP | kp AP | prop. AR | model id | download links |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | Kps | 1x | 2 | 9.0 | 0.832 | 20.8 | 0.097 0.092 | 53.6 | - | 64.2 | - | 37697547 | model | boxes | kps |
R-50-FPN | Kps | s1x | 2 | 9.0 | 0.828 | 29.9 | 0.096 0.089 | 54.3 | - | 65.4 | - | 37697714 | model | boxes | kps |
R-101-FPN | Kps | 1x | 2 | 10.6 | 0.923 | 23.1 | 0.124 0.084 | 54.5 | - | 64.8 | - | 37697946 | model | boxes | kps |
R-101-FPN | Kps | s1x | 2 | 10.6 | 0.921 | 33.3 | 0.123 0.083 | 55.3 | - | 65.8 | - | 37698009 | model | boxes | kps |
X-101-64x4d-FPN | Kps | 1x | 2 | 14.1 | 1.655 | 41.4 | 0.302 0.079 | 56.3 | - | 66.0 | - | 37732355 | model | boxes | kps |
X-101-64x4d-FPN | Kps | s1x | 2 | 14.1 | 1.731 | 62.5 | 0.322 0.074 | 56.9 | - | 66.8 | - | 37732415 | model | boxes | kps |
X-101-32x8d-FPN | Kps | 1x | 2 | 14.2 | 1.410 | 35.3 | 0.235 0.080 | 56.0 | - | 66.0 | - | 37792158 | model | boxes | kps |
X-101-32x8d-FPN | Kps | s1x | 2 | 14.2 | 1.408 | 50.8 | 0.236 0.075 | 56.9 | - | 67.0 | - | 37732318 | model | boxes | kps |
Notes:
- Metrics are for the person category only.
- For these models, RPN and the detector are trained jointly and end-to-end.
- Inference time is fully image-to-detections, including proposal generation.