Releases: pytorch/vision
TorchVision 0.16 - Transforms speedups, CutMix/MixUp, and MPS support!
Highlights
[BETA] Transforms and augmentations
Major speedups
The new transforms in torchvision.transforms.v2 support image classification, segmentation, detection, and video tasks. They are now 10%-40% faster than before! This is mostly achieved thanks to 2X-4X improvements made to v2.Resize(), which now supports native uint8 tensors for bilinear and bicubic modes. Output results are also now closer to PIL's! Check out our performance recommendations to learn more.
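To illustrate the fast path on the user side, here is a minimal sketch (the image tensor and sizes are made up; real images typically come from read_image()):

```python
import torch
from torchvision.transforms import v2

# A made-up uint8 image tensor in (C, H, W) layout.
img = torch.randint(0, 256, (3, 512, 512), dtype=torch.uint8)

# Bilinear (default) resize on a native uint8 tensor, the path that received the 2X-4X speedups.
out = v2.Resize((224, 224), antialias=True)(img)
print(out.dtype, out.shape)  # torch.uint8 torch.Size([3, 224, 224])
```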
Additionally, torchvision now ships with libjpeg-turbo instead of libjpeg, which should significantly speed up the JPEG decoding utilities (read_image, decode_jpeg) and avoid compatibility issues with PIL.
CutMix and MixUp
Long-awaited support for the CutMix and MixUp augmentations is now here! Check our tutorial to learn how to use them.
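As a minimal sketch of what usage looks like (the batch shape and number of classes are made up; the tutorial covers the full recipe, including integration with a DataLoader collate_fn):

```python
import torch
from torchvision.transforms import v2

NUM_CLASSES = 10  # assumption for this sketch

cutmix = v2.CutMix(num_classes=NUM_CLASSES)
mixup = v2.MixUp(num_classes=NUM_CLASSES)
cutmix_or_mixup = v2.RandomChoice([cutmix, mixup])

# These transforms operate on whole batches, typically right after the DataLoader.
images = torch.rand(4, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (4,))

images, labels = cutmix_or_mixup(images, labels)
print(labels.shape)  # labels become soft targets of shape (4, NUM_CLASSES)
```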
Towards stable V2 transforms
In the previous release, 0.15, we BETA-released a new set of transforms in torchvision.transforms.v2 with native support for tasks like segmentation, detection, and videos. We have now stabilized the design decisions of these transforms and made further improvements in terms of speed, usability, support for new transforms, etc.
We're keeping the torchvision.transforms.v2 and torchvision.tv_tensors namespaces as BETA until 0.17 out of precaution, but we do not expect disruptive API changes in the future.
Whether you’re new to Torchvision transforms, or you’re already experienced with them, we encourage you to start with Getting started with transforms v2 in order to learn more about what can be done with the new v2 transforms.
Browse our main docs for general information and performance tips. The available transforms and functionals are listed in the API reference. Additional information and tutorials can also be found in our example gallery, e.g. Transforms v2: End-to-end object detection/segmentation example or How to write your own v2 transforms.
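To give a flavor of the joint image / bounding-box support, here is a minimal sketch with toy data (shapes and box coordinates are made up; see the getting-started guide for real examples):

```python
import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

img = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)
boxes = tv_tensors.BoundingBoxes(
    [[10, 10, 120, 160]], format="XYXY", canvas_size=(480, 640)
)

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),
])

# The same pipeline transforms the image and its boxes consistently.
out_img, out_boxes = transforms(img, boxes)
```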
[BETA] MPS support
The nms and roi-align kernels (roi_align, roi_pool, ps_roi_align, ps_roi_pool) now support MPS. Thanks to Li-Huai (Allan) Lin for this contribution!
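For example, nms can now run directly on MPS tensors. A minimal sketch with made-up boxes (requires a Mac with an MPS-enabled PyTorch build):

```python
import torch
from torchvision.ops import nms

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Made-up boxes in (x1, y1, x2, y2) format with matching scores.
boxes = torch.tensor(
    [[0.0, 0.0, 10.0, 10.0], [1.0, 1.0, 11.0, 11.0], [50.0, 50.0, 60.0, 60.0]],
    device=device,
)
scores = torch.tensor([0.9, 0.8, 0.7], device=device)

keep = nms(boxes, scores, iou_threshold=0.5)  # indices of the boxes to keep
print(keep)
```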
Detailed Changes
Deprecations / Breaking changes
All changes below happened in the transforms.v2 and datapoints namespaces, which were BETA and protected with a warning. We do not expect other disruptive changes to these APIs moving forward!
[transforms.v2] to_grayscale() is not deprecated anymore (#7707)
[transforms.v2] Renaming: torchvision.datapoints.Datapoint -> torchvision.tv_tensors.TVTensor (#7904, #7894)
[transforms.v2] Renaming: BoundingBox -> BoundingBoxes (#7778)
[transforms.v2] Renaming: BoundingBoxes.spatial_size -> BoundingBoxes.canvas_size (#7734)
[transforms.v2] All public methods on TVTensor classes (previously: Datapoint classes) were removed
[transforms.v2] transforms.v2.utils is now private (#7863)
[transforms.v2] Remove the wrap_like class method and add the tv_tensors.wrap() function (#7832); see the sketch below
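As referenced above, here is a minimal sketch of the renamed classes and of the tv_tensors.wrap() replacement for wrap_like (box values are made up):

```python
import torch
from torchvision import tv_tensors

# New names: BoundingBoxes (plural) with a canvas_size argument instead of spatial_size.
boxes = tv_tensors.BoundingBoxes(
    [[10, 10, 50, 50]], format="XYXY", canvas_size=(480, 640)
)

# Plain tensor ops on TVTensors usually return a pure Tensor; tv_tensors.wrap()
# re-wraps the result, replacing the removed wrap_like() class method.
shifted = tv_tensors.wrap(boxes + 5, like=boxes)
print(type(shifted))  # <class 'torchvision.tv_tensors.BoundingBoxes'>
```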
New Features
[transforms.v2] Add support for MixUp and CutMix (#7731, #7784)
[transforms.v2] Add PermuteChannels transform (#7624)
[transforms.v2] Add ToPureTensor transform (#7823)
[ops] Add MPS kernels for nms and roi ops (#7643)
Improvements
[io] Added support for CMYK images in decode_jpeg (#7741)
[io] Package torchvision with libjpeg-turbo instead of libjpeg (#7672, #7840)
[models] Downloaded weights are now sha256-validated (#7219)
[transforms.v2] Massive Resize speed-up by adding native uint8 support for bilinear and bicubic modes (#7557, #7668)
[transforms.v2] Enforce pickleability for v2 transforms and wrapped datasets (#7860)
[transforms.v2] Allow catch-all "others" key in fill dicts (#7779)
[transforms.v2] Allow passthrough for Resize (#7521)
[transforms.v2] Add scale option to ToDtype and remove ConvertDtype (#7759, #7862); see the sketch after this list
[transforms.v2] Improve UX for Compose (#7758)
[transforms.v2] Allow users to choose whether to return TVTensor subclasses or pure Tensors (#7825)
[transforms.v2] Remove import-time warning for v2 namespaces (#7853, #7897)
[transforms.v2] Speed up hsv2rgb (#7754)
[models] Add filter parameters to list_models() (#7718)
[models] Assert RAFT input resolution is 128 x 128 or higher (#7339)
[ops] Replaced gpuAtomicAdd by fastAtomicAdd (#7596)
[utils] Add GPU support for draw_segmentation_masks (#7684)
[ops] Add deterministic, pure-Python roi_align implementation (#7587)
[tv_tensors] Make TVTensors deepcopyable (#7701)
[datasets] Only return small set of targets by default from dataset wrapper (#7488)
[references] Added support for v2 transforms and tensors / tv_tensors backends (#7732, #7511, #7869, #7665, #7629, #7743, #7724, #7742)
[doc] A lot of documentation improvements (#7503, #7843, #7845, #7836, #7830, #7826, #7484, #7795, #7480, #7772, #7847, #7695, #7655, #7906, #7889, #7883, #7881, #7867, #7755, #7870, #7849, #7854, #7858, #7621, #7857, #7864, #7487, #7859, #7877, #7536, #7886, #7679, #7793, #7514, #7789, #7688, #7576, #7600, #7580, #7567, #7459, #7516, #7851, #7730, #7565, #7777)
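As referenced in the ToDtype item above, here is a minimal sketch of the scale=True replacement for the removed ConvertDtype (the input tensor is made up):

```python
import torch
from torchvision.transforms import v2

img = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)

# ToDtype(..., scale=True) converts to float32 and rescales values into [0, 1],
# covering the use case previously served by ConvertDtype / ConvertImageDtype.
out = v2.ToDtype(torch.float32, scale=True)(img)
print(out.dtype, out.min().item(), out.max().item())
```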
Bug Fixes
[datasets] Fix split=None in MovingMNIST (#7449)
[io] Fix heap buffer overflow in decode_png (#7691)
[io] Fix blurry screen in video decoder (#7552)
[models] Fix weight download URLs for some models (#7898)
[models] Fix ShuffleNet ONNX export (#7686)
[models] Fix detection models with pytorch 2.0 (#7592, #7448)
[ops] Fix segfault in DeformConv2d when mask is None (#7632)
[transforms.v2] Stricter SanitizeBoundingBoxes labels_getter heuristic (#7880)
[transforms.v2] Make sure RandomPhotometricDistort transforms all images the same (#7442)
[transforms.v2] Fix v2.Lambda’s transformed types (#7566)
[transforms.v2] Don't call round() on float images for Resize (#7669)
[transforms.v2] Let SanitizeBoundingBoxes preserve output type (#7446)
[transforms.v2] Fixed int type support for sigma in GaussianBlur (#7887)
[transforms.v2] Fixed issue with jitted AutoAugment transforms (#7839)
[transforms] Fix Resize pass-through logic (#7519)
[utils] Fix color in draw_segmentation_masks (#7520)
Others
[tests] Various test improvements / fixes (#7693, #7816, #7477, #7783, #7716, #7355, #7879, #7874, #7882, #7447, #7856, #7892, #7902, #7884, #7562, #7713, #7708, #7712, #7703, #7641, #7855, #7842, #7717, #7905, #7553, #7678, #7908, #7812, #7646, #7841, #7768, #7828, #7820, #7550, #7546, #7833, #7583, #7810, #7625, #7651)
[CI] Various CI improvements (#7485, #7417, #7526, #7834, #7622, #7611, #7872, #7628, #7499, #7616, #7475, #7639, #7498, #7467, #7466, #7441, #7524, #7648, #7640, #7551, #7479, #7634, #7645, #7578, #7572, #7571, #7591, #7470, #7574, #7569, #7435, #7635, #7590, #7589, #7582, #7656, #7900, #7815, #7555, #7694, #7558, #7533, #7547, #7505, #7502, #7540, #7573)
[Code Quality] Various code quality improvements (#7559, #7673, #7677, #7771, #7770, #7710, #7709, #7687, #7454, #7464, #7527, #7462, #7662, #7593, #7797, #7805, #7786, #7831, #7829, #7846, #7806, #7814, #7606, #7613, #7608, #7597, #7792, #7781, #7685, #7702, #7500, #7804, #7747, #7835, #7726, #7796)
Contributors
We're grateful for our community, which helps us improve torchvision by submitting issues and PRs, and providing feedback and suggestions. The following persons have contributed patches for this release:
Adam J. Stewart, Aditya Oke, Andrey Talman, Camilo De La Torre, Christoph Reich, Danylo Baibak, David Chiu, David Garcia, Dennis M. Pöpperl, Dhuige, Duc Mguyen, Edward Z. Yang, Eric Sauser, Fansure Grin, Huy Do, Illia Vysochyn, Johannes, Kai Wana, Kobrin Eli, kurtamohler, Li-Huai (Allan) Lin, Liron Ilouz, Masahiro Hiramori, Mateusz Guzek, Max Chuprov, Minh-Long Luu (刘明龙), Minliang Lin, mpearce25, Nicolas Granger, Nicolas Hug, Nikita Shulga, Omkar Salpekar, Paul Mulders, Philip Meier, ptrblck, puhuk, Radek Bartoň, Richard Barnes, Riza Velioglu, Sahil Goyal, Shu, Sim Sun, SvenDS9, Tommaso Bianconcini, Vadim Zubov, vfdev-5
TorchVision 0.15.2 Release
This is a minor release, which is compatible with PyTorch 2.0.1 and contains some minor bug fixes.
Highlights
Bug Fixes
TorchVision 0.15 - New transforms API!
Highlights
[BETA] New transforms API
TorchVision is extending its Transforms API! Here is what’s new:
- You can use them not only for Image Classification but also for Object Detection, Instance & Semantic Segmentation and Video Classification.
- You can use new functional transforms for transforming Videos, Bounding Boxes and Segmentation Masks.
The API is completely backward compatible with the previous one, and remains the same to assist the migration and adoption. We are now releasing this new API as Beta in the torchvision.transforms.v2 namespace, and we would love to get early feedback from you to improve its functionality. Please reach out to us if you have any questions or suggestions.
```python
import torchvision.transforms.v2 as transforms

# Exactly the same interface as V1:
trans = transforms.Compose([
    transforms.ColorJitter(contrast=0.5),
    transforms.RandomRotation(30),
    transforms.CenterCrop(480),
])
imgs, bboxes, masks, labels = trans(imgs, bboxes, masks, labels)
```
You can read more about these new transforms in our docs, and you can also check out our examples:
Note that this API is still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in #6753, and you can also check out #7319 to learn more about the APIs that we suspect might involve future changes.
[BETA] New Video Swin Transformer
We added a Video SwinTransformer model based on the Video Swin Transformer paper.
```python
import torch
from torchvision.models.video import swin3d_t

video = torch.rand(1, 3, 32, 800, 600)

# or swin3d_b, swin3d_s
model = swin3d_t(weights="DEFAULT")
model.eval()

with torch.inference_mode():
    prediction = model(video)
print(prediction)
```
The model has the following accuracies on the Kinetics-400 dataset:
Model | Acc@1 | Acc@5 |
---|---|---|
swin3d_t | 77.7 | 93.5 |
swin3d_s | 79.5 | 94.1 |
swin3d_b | 79.4 | 94.4 |
We would like to thank oke-aditya for this contribution.
Detailed Changes (PRs)
BC-breaking changes
[models] Fixed a bug inside ops.MLP when backpropagating with dropout>0 by implicitly setting the inplace argument of nn.Dropout to False (#7209)
[models, transforms] remove functionality scheduled for 0.15 after deprecation (#7176)
We removed deprecated functionalities according to the deprecation cycle: gen_bar_updater, model_urls/quant_model_urls in models.
Deprecations
[transforms] Change default of antialias parameter from None to 'warn' (#7160)
For all transforms / functionals that perform interpolation, we changed the current default of the antialias parameter from None to "warn", a value that behaves exactly like None but raises a warning prompting users to explicitly set antialias to either True, False, or None. In v0.17.0 we plan to remove "warn" and set the default to True.
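In practice, silencing the warning simply means passing the parameter explicitly, e.g. (a minimal sketch):

```python
from torchvision import transforms

# Opt in to antialiasing explicitly to keep PIL-like output
# and avoid the "warn" default introduced in this release.
resize = transforms.Resize((224, 224), antialias=True)
```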
[transforms] Deprecate functional_pil and functional_tensor and make them private (#7269)
Since v0.15.0, torchvision.transforms.functional_pil and torchvision.transforms.functional_tensor have become private and will be removed in v0.17.0. Please use torchvision.transforms.functional or torchvision.transforms.v2.functional instead.
[transforms] Undeprecate PIL int constants for interpolation (#7241)
We restored the support for integer interpolation mode (Pillow constants) which was deprecated since v0.13.0 (as PIL un-deprecated those as well).
New Features
[transforms] New transforms API (see highlight)
[models] Add Video SwinTransformer (see highlight) (#6521)
Improvements
[transforms] introduce nearest-exact interpolation (#6754)
[transforms] add sequence fill support for ElasticTransform (#7141)
[transforms] perform out of bounds check for single values and two tuples in ColorJitter (#7133)
[datasets] Fixes use download of SBU dataset (#7046) (#7051)
[hub] Add video models to torchhub (#7083)
[hub] Expose maxvit and swin_v2 models to torchhub (#7078)
[io] suppress warning in VideoReader (#6976, #6971)
[io] Set pytorch vision decoder probesize for getting stream info based on the value from decode setting (#6900) (#6950)
[io] improve warning message for missing image extension (#7150)
[io] Read video from memory newapi (#6771)
[models] Allow dropout overwrites on EfficientNet (#7031)
[models] Don't use named args in MHA calls to allow applying pytorch forward hooks to VIT (#6956)
[onnx] Support exporting RoiAlign align=True to ONNX with opset 16 (#6685)
[ops] Handle invalid reduction values (#6675)
[datasets] Add MovingMNIST dataset (#7042)
Add torchvision maintainers guide (#7109)
[Documentation] Various doc improvements (#7041, #6947, #6690, #7142, #7156, #7025, #7048, #7074, #6936, #6694, #7161, #7164, #6912, #6854, #6926, #7065, #6813)
[CI] Various CI improvements (#6864, #6863, #6855, #6856, #6803, #6893, #6865, #6804, #6866, #6742, #7273, #6999, #6713, #6972, #6954, #6968, #6987, #7004, #7010, #7014, #6915, #6797, #6759, #7060, #6857, #7212, #7199, #7186, #7183, #7178, #7163, #7181, #6789, #7110, #7088, #6955, #6788, #6970)
[tests] Various tests improvements (#7020, #6939, #6658, #7216, #6996, #7363, #7379, #7218, #7286, #6901, #7059, #7202, #6708, #7013, #7206, #7204, #7233)
Bug Fixes
[datasets] fix MNIST byte flipping (#7081)
[models] properly support deepcopying and serialization of model weights (#7107)
[models] Use inplace=None as default in ops.MLP (#7209)
[models] Fix dropout issue in swin transformers (#7224)
[reference scripts] Fix quantized classif reference - missing args (#7072)
[models, tests] [FBcode->GH] Fix GRACE_HOPPER file internal discovery (#6719)
[transforms] Replace getbands() with get_image_num_channels() (#6941)
[transforms] Switch view() with reshape() on equalize (#6772)
[transforms] add sequence fill support for ElasticTransform (#7141)
[transforms] make RandomErasing scriptable for integer value (#7134)
[video] fix bug in output format for pyav (#6672)
[video, datasets] [bugfix] Fix the output format for VideoClips.subset (#6700)
[onnx] Fix dtype for NonMaxSuppression (#7056)
Code Quality
[datasets] Remove unused import (#7245)
[models] Fix error message typo (#6682)
[models] make weights deepcopyable (#6883)
[models] Fix missing f-string prefix in error message (#6684)
[onnx] [ONNX] Rephrase ONNX RoiAlign warning for aligned=True (#6704)
[onnx] [ONNX] misc improvements (#7249)
[ops] Raise kernel launch errors instead of just print error message in cuda ops (#7080)
[ops, tests] Remove torch.jit.fuser("fuser2") in test (#7069)
[tests] replace assert torch.allclose with torch.testing.assert_allclose (#6895)
[transforms] Remove old TODO about using _log_api_usage_once() (#7277)
[transforms] Fixed repr for ElasticTransform (#6758)
[transforms] Use is False for some antialias checks (#7234)
[datasets, models] Various type-hints improvements (#6844, #6929, #6843, #7087, #6735, #6845, #6846)
[all] switch to C++17 following the core library (#7116)
Prototype
Most of these PRs (not all) relate to the transforms V2 work (#7122, #7120, #7113, #7270, #7037, #6665, #6944, #6919, #7033, #7138, #6718, #6068, #7194, #6997, #6647, #7279, #7232, #7225, #6663, #7235, #7236, #7275, #6791, #6786, #7203, #7009, #7278, #7238, #7230, #7118, #7119, #6876, #7190, #6995, #6879, #6904, #6921, #6905, #6977, #6714, #6924, #6984, #6631, #7276, #6757, #7227, #7197, #7170, #7228, #7246, #7255, #7254, #7253, #7248, #7256, #7257, #7252, #6724, #7215, #7260, #7261, #7244, #7271, #7231, #6738, #7268, #7258, #6933, #6891, #6890, #7012, #6896, #6881, #6880, #6877, #7045, #6858, #6830, #6935, #6938, #6914, #6907, #6897, #6903, #6859, #6835, #6837, #6807, #6776, #6784, #6795, #7135, #6930, #7153, #6762, #6681, #7139, #6831, #6826, #6821, #6819, #6820, #6805, #6811, #6783, #6978, #6667, #6741, #6763, #6774, #6748, #6749, #6722, #6756, #6712, #6733, #6736, #6874, #6767, #6902, #6847, #6851, #6777, #6770, #6800, #6812, #6702, #7223, #6906, #7226, #6860, #6934, #6726, #6730, #7196, #7211, #7229, #7177, #6923, #6949, #6913, #6775, #7091, #7136, #7154, #6833, #6824, #6785, #6710, #6653, #6751, #6503, #7266, #6729, #6989, #7002, #6892, #6888, #6894, #6988, #6940, #6942, #6945, #6983, #6773, #6832, #6834, #6828, #6801, #7084)
Contributors
We're grateful for our community, which helps us improve torchvision by submitting issues and PRs, and providing feedback and suggestions. The following persons have contributed patches for this release:
Aditya Gandhamal, Aditya Oke, Aidyn-A, Akira Noda, Andrey Talman, Bowen Bao, Bruno Korbar, Chen Liu, cyy, David Berard, deepsghimire, Erjia Guan, F-G Fernandez, Jithun Nair, Joao Gomes, John Detloff, Justin Chu, Karan Desai, lezcano, mpearce25, Nghia, Nicolas Hug, Nikita Shulga, nps1ngh, Omkar Salpekar, Philip Meier, Robert Perrotta, RoiEX, Samantha Andow, Sergii Dymchenko, shunsuke yokokawa, Sim Sun, Toni Blaslov, toni057, Vasilis Vryniotis, vfdev-5, Vladislav Sovrasov, vsuryamurthy, Yosua Michael Maranatha, Yuxin Wu
TorchVision 0.14.1 Release
This is a minor release, which is compatible with PyTorch 1.13.1. There are no new features added.
TorchVision 0.14, including new model registration API, new models, weights, augmentations, and more
Highlights
[BETA] New Model Registration API
Following up on the multi-weight support API that was released in the previous version, we have added a new model registration API to help users retrieve models and weights. There are now 4 new methods under the torchvision.models module: get_model, get_model_weights, get_weight, and list_models. Here are examples of how we can use them:
```python
import torchvision
from torchvision.models import get_model, get_model_weights, list_models

max_params = 5000000

tiny_models = []
for model_name in list_models(module=torchvision.models):
    weights_enum = get_model_weights(model_name)
    if len([w for w in weights_enum if w.meta["num_params"] <= max_params]) > 0:
        tiny_models.append(model_name)

print(tiny_models)
# ['mnasnet0_5', 'mnasnet0_75', 'mnasnet1_0', 'mobilenet_v2', ...]

model = get_model(tiny_models[0], weights="DEFAULT")
print(sum(x.numel() for x in model.state_dict().values()))
# 2239188
```
As of now, this API is still beta and there might be changes in the future in order to improve its usability based on your feedback.
New Architecture and Model Variants
Classification Models
We’ve added the Swin Transformer V2 architecture along with pre-trained weights for its tiny/small/base variants. In addition, we have added support for the MaxViT transformer. Here is an example on how to use the models:
```python
import torch
from torchvision.models import *

image = torch.rand(1, 3, 224, 224)
model = swin_v2_t(weights="DEFAULT").eval()
# model = maxvit_t(weights="DEFAULT").eval()
prediction = model(image)
```
Here is the table showing the accuracy of the models tested on ImageNet1K dataset.
Model | Acc@1 | Acc@1 change over V1 | Acc@5 | Acc@5 change over V1 |
---|---|---|---|---|
swin_v2_t | 82.072 | +0.598 | 96.132 | +0.356 |
swin_v2_s | 83.712 | +0.516 | 96.816 | +0.456 |
swin_v2_b | 84.112 | +0.530 | 96.864 | +0.224 |
maxvit_t | 83.700 | - | 96.722 | - |
We would like to thank Ren Pang and Teodor Poncu for contributing the 2 models to torchvision.
[BETA] Video Classification Model
We added two new video classification models, MViT and S3D. MViT is a state-of-the-art video classification transformer model which has 80.757% accuracy on the Kinetics400 dataset, while S3D is a relatively small model with good accuracy for its size. These models can be used as follows:
```python
import torch
from torchvision.models.video import *

video = torch.rand(3, 32, 800, 600)
model = mvit_v2_s(weights="DEFAULT")
# model = s3d(weights="DEFAULT")
model.eval()
prediction = model(video)
```
Here is the table showing the accuracy of the new video classification models tested on the Kinetics400 dataset.
Model | Acc@1 | Acc@5 |
---|---|---|
mvit_v1_b | 81.474 | 95.776 |
mvit_v2_s | 83.196 | 96.36 |
s3d | 83.582 | 96.64 |
We would like to thank Haoqi Fan, Yanghao Li, Christoph Feichtenhofer and Wan-Yen Lo for their work on PyTorchVideo and their support during the development of the MViT model. We would like to thank Sophia Zhi for her contribution implementing the S3D model in torchvision.
New Primitives & Augmentations
In this release we’ve added the SimpleCopyPaste augmentation in our reference scripts and we up-streamed the PolynomialLR scheduler to PyTorch Core. We would like to thank Lezwon Castelino and Federico Pozzi for their contributions. We are continuing our efforts to modernize TorchVision by adding more SoTA primitives, Augmentations and architectures with the help of our community. If you are interested in contributing, have a look at the following issue.
Upcoming Prototype APIs
We are currently working on extending our existing Transforms and Functional API to provide native support for Video, Object Detection, Semantic and Instance Segmentation. This will enable us to offer better support for the existing Computer Vision tasks and to make SoTA augmentations such as MixUp, CutMix, Large Scale Jitter and SimpleCopyPaste importable from the TorchVision binary. The API is still under development and thus was not included in the release, but you can read more about it on our blogpost and provide your feedback on the dedicated Github issue.
Backward Incompatible Changes
We’ve removed some APIs that have been deprecated since version 0.12 (or before). Here is the list of things that we removed and their replacement:
- The Kinetics400 class has been removed. Users must now use the newer Kinetics class, which is a direct replacement.
- The classes _DeprecatedConvBNAct, ConvBNReLU, and ConvBNActivation were removed from torchvision.models.mobilenetv2 and are replaced with the more generic Conv2dNormActivation class (see the sketch after this list).
- torchvision.models.mobilenetv3.SqueezeExcitation has been removed in favor of torchvision.ops.SqueezeExcitation.
- The class methods convert_to_roi_format, infer_scale, and setup_scales have been removed from torchvision.ops.MultiScaleRoIAlign.
- We have removed the resample and fillcolor parameters from the Transforms API. They have been replaced with interpolation and fill respectively.
- We've removed the range parameter from torchvision.utils.make_grid, as it was replaced by the value_range parameter to avoid shadowing the Python built-in method.
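For the removed ConvBNReLU / ConvBNActivation blocks, a minimal migration sketch (the channel sizes here are made up for illustration):

```python
import torch
from torchvision.ops import Conv2dNormActivation

# Rough equivalent of the removed ConvBNReLU building block.
block = Conv2dNormActivation(
    in_channels=16,
    out_channels=32,
    kernel_size=3,
    norm_layer=torch.nn.BatchNorm2d,
    activation_layer=torch.nn.ReLU,
)
out = block(torch.rand(1, 16, 56, 56))
print(out.shape)  # torch.Size([1, 32, 56, 56])
```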
Detailed Changes (PRs)
Deprecations
[models] Remove cpp model in v0.14 due to deprecation (#6632)
[utils, ops, transforms, models, datasets] Remove deprecated APIs for 0.14 (#6258)
New Features
[datasets] Add various Stereo Matching datasets (#6345, #6346, #6311, #6347, #6349, #6348, #6350, #6351)
[models] Add the S3D architecture to TorchVision (#6412, #6537)
[models] add crestereo implementation (#6310, #6629)
[models] MaxVit model (#6342)
[models] Make get_model_builder public (#6560)
[models] Add registration mechanism for models (#6333, #6369)
[models] Add MViT architecture in TorchVision for both V1 and V2 (#6198, #6373)
[models] Add SwinV2 model variant (#6246, #6266)
[reference scripts] Add stereo matching reference scripts (#6549, #6554, #6605)
[transforms] Added elastic transform in torchvision.transforms (#4938)
[build] Add M1 binary builds (#5948, #6135, #6140, #6110, #6132, #6324, #6122, #6409)
Improvements
[build] Various torchvision binary build improvements (#6396, #6201, #6230, #6199)
[build] Install NVJPEG on Windows for 11.6 and 11.7 CUDA (#6578)
[models] Change weights return type to Mapping in models api (#6097)
[models] Vectorize box decoding and encoding in FCOS (#6203, #6278)
[ci] Add CUDA 11.7 builds (#6425)
[ci] Various CI improvements (#6590, #6290, #6170, #6218)
[documentation] Various documentations improvements (#6276, #6163, #6450, #6294, #6572, #6176, #6340, #6314, #6427, #6536, #6215, #6150)
[documentation] Add new .. betastatus:: directive and document Beta APIs (#6115)
[hub] Expose on Hub the public methods of the registration API (#6364)
[io, documentation] DOC: add limitation of decode_jpeg in the function docstring (#6637)
[models] Make the assert message more verbose in vision transformer (#6583)
[ops] Generalize ConvNormActivation function to accept tuple for some parameters (#6251)
[reference scripts] Update the datase...
Minor release
TorchVision 0.13, including new Multi-weights API, new pre-trained weights, and more
Highlights
Models
Multi-weight support API
TorchVision v0.13 offers a new Multi-weight support API for loading different weights to the existing model builder methods:
```python
from torchvision.models import *

# Old weights with accuracy 76.130%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)

# New weights with accuracy 80.858%
resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

# Best available weights (currently alias for IMAGENET1K_V2)
# Note that these weights may change across versions
resnet50(weights=ResNet50_Weights.DEFAULT)

# Strings are also supported
resnet50(weights="IMAGENET1K_V2")

# No weights - random initialization
resnet50(weights=None)
```
The new API bundles along with the weights important details such as the preprocessing transforms and meta-data such as labels. Here is how to make the most out of it:
```python
from torchvision.io import read_image
from torchvision.models import resnet50, ResNet50_Weights

img = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg")

# Step 1: Initialize model with the best available weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# Step 2: Initialize the inference transforms
preprocess = weights.transforms()

# Step 3: Apply inference preprocessing transforms
batch = preprocess(img).unsqueeze(0)

# Step 4: Use the model and print the predicted category
prediction = model(batch).squeeze(0).softmax(0)
class_id = prediction.argmax().item()
score = prediction[class_id].item()
category_name = weights.meta["categories"][class_id]
print(f"{category_name}: {100 * score:.1f}%")
```
You can read more about the new API in the docs. To provide your feedback, please use this dedicated Github issue.
New architectures and model variants
Classification
The Swin Transformer and EfficientNetV2 are two popular classification models which are often used for downstream vision tasks. This release includes 6 pre-trained weights for their classification variants. Here is how to use the new models:
```python
import torch
from torchvision.models import *

image = torch.rand(1, 3, 224, 224)
model = swin_t(weights="DEFAULT").eval()
prediction = model(image)

image = torch.rand(1, 3, 384, 384)
model = efficientnet_v2_s(weights="DEFAULT").eval()
prediction = model(image)
```
In addition to the above, we also provide new variants for existing architectures such as ShuffleNetV2, ResNeXt and MNASNet. The accuracies of all the new pre-trained models obtained on ImageNet-1K are seen below:
Model | Acc@1 | Acc@5 |
---|---|---|
swin_t | 81.474 | 95.776 |
swin_s | 83.196 | 96.36 |
swin_b | 83.582 | 96.64 |
efficientnet_v2_s | 84.228 | 96.878 |
efficientnet_v2_m | 85.112 | 97.156 |
efficientnet_v2_l | 85.808 | 97.788 |
resnext101_64x4d | 83.246 | 96.454 |
resnext101_64x4d (quantized) | 82.898 | 96.326 |
shufflenet_v2_x1_5 | 72.996 | 91.086 |
shufflenet_v2_x1_5 (quantized) | 72.052 | 90.700 |
shufflenet_v2_x2_0 | 76.230 | 93.006 |
shufflenet_v2_x2_0 (quantized) | 75.354 | 92.488 |
mnasnet0_75 | 71.180 | 90.496 |
mnasnet1_3 | 76.506 | 93.522 |
We would like to thank Hu Ye for contributing to TorchVision the Swin Transformer implementation.
[BETA] Object Detection and Instance Segmentation
We have introduced 3 new model variants for RetinaNet, FasterRCNN and MaskRCNN that include several post-paper architectural optimizations and improved training recipes. All models can be used similarly:
```python
import torch
from torchvision.models.detection import *

images = [torch.rand(3, 800, 600)]
model = retinanet_resnet50_fpn_v2(weights="DEFAULT")
# model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT")
# model = maskrcnn_resnet50_fpn_v2(weights="DEFAULT")
model.eval()
prediction = model(images)
```
Below we present the metrics of the new variants on COCO val2017. In parentheses we denote the improvement over the old variants:
Model | Box mAP | Mask mAP |
---|---|---|
retinanet_resnet50_fpn_v2 | 41.5 (+5.1) | - |
fasterrcnn_resnet50_fpn_v2 | 46.7 (+9.7) | - |
maskrcnn_resnet50_fpn_v2 | 47.4 (+9.5) | 41.8 (+7.2) |
We would like to thank Ross Girshick, Piotr Dollar, Vaibhav Aggarwal, Francisco Massa and Hu Ye for their past research and contributions to this work.
New pre-trained weights
SWAG weights
The ViT and RegNet model variants offer new pre-trained SWAG (Supervised Weakly from hashtAGs) weights. One of the biggest of these models achieves a whopping 88.6% accuracy on ImageNet-1K. We currently offer two versions of the weights: 1) fine-tuned end-to-end weights on ImageNet-1K (highest accuracy) and 2) frozen trunk weights with a linear classifier fit on ImageNet-1K (great for transfer learning). Below we see the detailed accuracies of each model variant:
Model Weights | Acc@1 | Acc@5 |
---|---|---|
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.012 | 98.054 |
RegNet_Y_16GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 83.976 | 97.244 |
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 86.838 | 98.362 |
RegNet_Y_32GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 84.622 | 97.48 |
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.228 | 98.682 |
RegNet_Y_128GF_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 86.068 | 97.844 |
ViT_B_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 85.304 | 97.65 |
ViT_B_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 81.886 | 96.18 |
ViT_L_16_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.064 | 98.512 |
ViT_L_16_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.146 | 97.422 |
ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1 | 88.552 | 98.694 |
ViT_H_14_Weights.IMAGENET1K_SWAG_LINEAR_V1 | 85.708 | 97.73 |
The weights can be loaded normally as follows:
```python
from torchvision.models import *

model1 = vit_h_14(weights="IMAGENET1K_SWAG_E2E_V1")
model2 = vit_h_14(weights="IMAGENET1K_SWAG_LINEAR_V1")
```
The SWAG weights are released under the Attribution-NonCommercial 4.0 International license. We would like to thank Laura Gustafson, Mannat Singh and Aaron Adcock for their work and support in making the weights available to TorchVision.
Model Refresh
The release of the Multi-weight support API enabled us to refresh the most popular models and offer more accurate weights. We improved on average each model by ~3 points. The new recipe used was learned on top of ResNet50 and its details were covered on a previous blogpost.
Model | Old weights | New weights |
---|---|---|
efficientnet_b1 | 78.642 | 79.838 |
mobilenet_v2 | 71.878 | 72.154 |
mobilenet_v3_large | 74.042 | 75.274 |
regnet_y_400mf | 74.046 | 75.804 |
regnet_y_800mf | 76.42 | 78.828 |
regnet_y_1_6gf | 77.95 | 80.876 |
regnet_y_3_2gf | 78.948 | 81.982 |
regnet_y_8gf | 80.032 | 82.828 |
regnet_y_16gf | 80.424 | 82.886 |
regnet_y_32gf | 80.878 | 83.368 |
regnet_x_400mf | 72.834 | 74.864 |
regnet_x_800mf | 75.212 | 77.522 |
regnet_x_1_6gf | 77.04 | 79.668 |
regnet_x_3_2gf | 78.364 | 81.196 |
regnet_x_8gf | 79.344 | 81.682 |
regnet_x_16gf | 80.058 | 82.716 |
regnet_x_32gf | 80.622 | 83.014 |
resnet50 | 76.13 | 80.858 |
resnet50 (quantized) | 75.92 | 80.282 |
resnet101 | 77.374 | 81.886 |
resnet152 | 78.312 | 82.284 |
resnext50_32x4d | 77.618 | 81.198 |
resnext101_32x8d | 79.312 | 82.834 |
resnext101_32x8d (quantized) | 78.986 | 82.574 |
wide_resnet50_2 | 78.468 | 81.602 |
wide_resnet101_2 | 78.848 | 82.51 |
We would like to thank Piotr Dollar, Mannat Singh and Hugo Touvron for their past research and contributions to this work.
Ops and Transforms
New Augmentations, Layers and Losses
This release brings a bunch of new primitives which can be used to produce SOTA models. Some highlights include the addition of AugMix data-augmentation method, the DropBlock layer, the cIoU/dIoU loss and many more. We would like to thank Aditya Oke, Abhijit Deo, Yassine Alouini and Hu Ye for contributing to the project and for helping us maintain TorchVision relevant and fresh.
Documentation
We completely revamped our models documentation to make them easier to browse, and added various key information such as supported image sizes, or image pre-processing steps of pre-trained weights. We now have a main model page with various summary tables of available weights, and each model has a dedicated page. Each model builder is also documented in their own page, with more details about the available weights, including accuracy, minimal image size, lin...
TorchVision 0.12, including new Models, Datasets, GPU Video Decoding, and more
Highlights
New Models
Four new model families have been released in the latest version along with pre-trained weights for their variants: FCOS, RAFT, Vision Transformer (ViT) and ConvNeXt.
Object Detection
FCOS is a popular, fully convolutional, anchor-free model for object detection. In this release we include a community-contributed model implementation as well as pre-trained weights. The model was trained on COCO train2017 and can be used as follows:
```python
import torch
from torchvision import models

x = [torch.rand(3, 224, 224)]
fcos = models.detection.fcos_resnet50_fpn(pretrained=True).eval()
predictions = fcos(x)
```
The box AP of the pre-trained model on COCO val2017 is 39.2 (see #4961 for more details).
We would like to thank Hu Ye and Zhiqiang Wang for contributing to the model implementation and initial training. This was the first community-contributed model in a long while, and given its success, we decided to use the learnings from this process and create new model contribution guidelines.
Optical Flow support and RAFT model
Torchvision now supports optical flow! Optical flow models try to predict movement in a video: given two consecutive frames, the model predicts where each pixel of the first frame ends up in the second frame. Check out our new tutorial on Optical Flow!
We implemented a torchscript-compatible RAFT model with pre-trained weights (both normal and “small” versions), and added support for training and evaluating optical flow models. Our training scripts support distributed training across processes and nodes, leading to much faster training time than the original implementation. We also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.
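A minimal sketch of running the pre-trained RAFT model on a pair of frames (the frames below are random tensors; real usage should follow the preprocessing shown in the tutorial):

```python
import torch
from torchvision.models.optical_flow import raft_large

# Two made-up consecutive frames, scaled to roughly [-1, 1] as RAFT expects.
img1 = torch.rand(1, 3, 224, 224) * 2 - 1
img2 = torch.rand(1, 3, 224, 224) * 2 - 1

model = raft_large(pretrained=True).eval()
with torch.inference_mode():
    flow_predictions = model(img1, img2)  # list of flow estimates, coarse to fine
flow = flow_predictions[-1]  # final (N, 2, H, W) flow field
print(flow.shape)
```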
Image Classification
Vision Transformer (ViT) and ConvNeXt are two popular architectures which can be used as image classifiers or as backbones for downstream vision tasks. In this release we include 8 pre-trained weights for their classification variants. The models were trained on ImageNet and can be used as follows:
```python
import torch
from torchvision import models

x = torch.rand(1, 3, 224, 224)
vit = models.vit_b_16(pretrained=True).eval()
convnext = models.convnext_tiny(pretrained=True).eval()
predictions1 = vit(x)
predictions2 = convnext(x)
```
The accuracies of the pre-trained models obtained on ImageNet val are seen below:
Model | Acc@1 | Acc@5 |
---|---|---|
vit_b_16 | 81.072 | 95.318 |
vit_b_32 | 75.912 | 92.466 |
vit_l_16 | 79.662 | 94.638 |
vit_l_32 | 76.972 | 93.07 |
convnext_tiny | 82.52 | 96.146 |
convnext_small | 83.616 | 96.65 |
convnext_base | 84.062 | 96.87 |
convnext_large | 84.414 | 96.976 |
The above models have been trained using an adjusted version of our new training recipe, which allows us to offer models with accuracies significantly higher than the ones reported in the original papers.
GPU Video Decoding
In this release, we add support for GPU video decoding in the video reading API. To use hardware-accelerated decoding, we just need to pass a cuda device to the video reading API as shown below:
```python
import torchvision

reader = torchvision.io.VideoReader(file_name, device='cuda:0')
for frame in reader:
    print(frame)
```
We also support seeking to any frame or a keyframe in the video before reading, as shown below:
reader.seek(seek_time)
New Datasets
We have implemented 14 new classification datasets: CLEVR, GTSRB, FER2013, SUN397, Country211, Flowers102, FGVC-Aircraft, OxfordIIITPet, DTD, Food 101, Rendered SST2, Stanford cars, PCAM, and EuroSAT.
As part of our work on Optical Flow support (see above for more details), we also added 5 new optical flow datasets: Flying Chairs, Flying Things, Sintel, Kitti, and HD1K.
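The new classification datasets follow the usual torchvision.datasets pattern; for instance (a minimal sketch, where the root path is a placeholder):

```python
from torchvision import datasets

# Downloads the data on first use into the given root directory.
train_set = datasets.GTSRB(root="data", split="train", download=True)
img, label = train_set[0]
```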
Documentation
New documentation layout
We have updated our documentation pages to be more compact and easier to browse. Each function / class is now documented in a separate page, clearing up some space in the per-module pages, and easing the discovery of the proposed APIs. Compare e.g. our previous docs vs the new ones. Please let us know if you have any feedback!
Model contribution guidelines
New model contribution guidelines have been published following the success of the FCOS model which was contributed by the community. These guidelines aim to be an overview of the model contribution process for anyone who would like to suggest, implement and train a new model.
Upcoming Prototype APIs
We are currently working on a prototype API which adds Multi-weight support on all of our model builder methods. This will enable us to offer multiple pre-trained weights, associated with their meta-data and inference transforms. The API is still under review and thus was not included in the release but you can read more about it on our blogpost and provide your feedback on the dedicated Github issue.
Changes in our deprecation policy
Up until now, torchvision would almost never remove deprecated APIs. In order to be more aligned and consistent with pytorch core, we are updating our deprecation policy. We are now following a 2-release deprecation cycle: deprecated APIs will raise a warning for 2 versions, and will be removed after that. To reflect these changes and to smooth the transition, we have decided to:
- Remove all APIs that had been deprecated before or on v0.8, released 1.5 years ago.
- Update the removal timeline of all other deprecated APIs to v0.14, to reflect the new 2-cycle policy starting now in v0.12.
Backward-incompatible changes
[models.quantization] Removed the Quantized shufflenet_v2_x1_5 and shufflenet_v2_x2_0 model builders which had no associated weights, rendering them useless. Additionally we added pre-trained weights for the shufflenet_v2_x0_5 quantized variant. (#4854)
[ops] Change to stable sort in nms implementations - this change can lead to different behavior in rare cases therefore it has been flagged as backwards-incompatible (#4767)
[transforms] Changed the center and the parametrization of shear X/Y in Auto Augment transforms to align with the original papers (#5285) (#5384)
Deprecations
Note: in order to be more aligned with pytorch core, we are updating our deprecation policy. Please read more above in the “Highlights” section.
[ops] The ops.poolers.MultiScaleRoIAlign public methods setup_scales, convert_to_roi_format, and infer_scale have been deprecated and will be removed in 0.14 (#4951) (#4810)
New Features
[datasets] New optical flow datasets added: FlyingChairs, Kitti, Sintel, FlyingThings3D, and HD1K (#4860) (#4845) (#4858) (#4890) (#5004) (#4889) (#4888) (#4870)
[datasets] New classification datasets support for FLAVA: CLEVR, GTSRB, FER2013, SUN397, Country211, Flowers102, fvgc_aircraft, OxfordIIITPet, DTD, Food 101, Rendered SST2, Stanford cars, PCAM, and EuroSAT (#5120) (#5130) (#5117) (#5132) (#5138) (#5177) (#5178) (#5116) (#5115) (#5119) (#5220) (#5166) (#5203) (#5114) (#5164) (#5280)
[models] Add VisionTransformer model (#5173) (#5210) (#5172) (#5085) (#5226) (#5025) (#5086) (#5159)
[models] Add ConvNeXt model (#5330) (#5253)
[models] Add RAFT models and support for optical flow model training (#5022) (#5070) (#5174) (#5381) (#5078) (#5076) (#5081) (#5079) (#5026) (#5027) (#5082) (#5060) (#4868) (#4657) (#4732)
[models] Add FCOS model (#4961) (#5267)
[utils] Add utility to convert optical flow to an image (#5134) (#5308)
[utils] Add utility to draw keypoints (#4216)
[video] Add video GPU decoder (#5019) (#5191) (#5215) (#5256) (#4474) (#3179) (#4878) (#5328) (#5327) (#5183) (#4947) (#5192)
Improvements
[datasets] Migrate mnist dataset from np.frombuffer (#4598)
[io, tests] Switch from np.frombuffer to torch.frombuffer (#4578)
[models] Update ResNet-50 accuracy with Repeated Augmentation (#5201)
[models] Add regnet_y_128gf factory function, and several regnet model weights (#5176) (#4530)
[models] Adding min_size to classification and video models (#5223)
[models] Remove in-place mutation in DefaultBoxGenerator (#5279)
[models] Added Dropout parameter to Models Constructors (#4580)
[models] Allow to use custom norm_layer (#4621)
[models] Add In...
Minor release
This is a minor release compatible with PyTorch 1.10.2, containing a minor bug fix.
Highlights
Bug Fixes
- [CI] Skip jpeg comparison tests with PIL (#5232)
Minor bugfix release
This minor release bumps the pinned PyTorch version to v1.10.1 and contains some minor bug fixes.
Highlights
Bug Fixes
- [CI] Fix clang_format issue (#5061)
- [CI, MOBILE] Fix binary_libtorchvision_ops_android job (#5062)
- [CI] Add numpy as explicit dependency to build_cmake.sh (#5065)
- [MODELS] Amend the weights only if quantize=True. (#5066)
- [TRANSFORMS] Fix augmentation space to be uint8 compatible (#5067)
- [DATASETS] Fix WIDERFace download links (#5068)
- [BUILD, WINDOWS] Workaround for loading bundled DLLs (#5094)