Overview
Stacking more neural network layers (increasing depth) is a very effective way to strengthen representations and improve feature learning, but deeper networks run into a degradation problem, which ResNet was designed to solve. This post walks through the ResNet paper: how the idea is implemented and how it outperforms the previous state of the art on classification, object detection, and other tasks.
Paper: Deep Residual Learning for Image Recognition


1 motivation
Summarizing prior experience, we tend to conclude that stacking more layers (increasing depth) is a very effective way to strengthen representations and improve feature learning. Why do deeper networks represent better? Deep learning is hard to explain rigorously, but a rough intuition is that different layers extract features at different levels of abstraction, and deeper layers extract more abstract features. A deep network can therefore integrate low-, mid-, and high-level features, which strengthens its representational power. Fine, then let's just make the network deeper! Unfortunately, things are not that simple.

Optimization problem: we cannot help asking, "Is learning better networks as easy as stacking more layers?" First, deep networks are hard to optimize, for instance because of exploding/vanishing gradients. This obstacle, however, has largely been addressed by normalized initialization and batch normalization.

Degradation problem: fine, so let's just build a deeper network! A new problem appears: the deeper network does converge, but its accuracy degrades; accuracy saturates and then drops rapidly as depth keeps growing. Why? With more layers and more parameters, the network should have more fitting capacity. It must be overfitting, then. Yet it does not seem to be overfitting either, because the training error of the deeper plain network also rises. The paper's reading is that deep plain networks are simply hard to optimize, and its remedy is to have the stacked layers fit a residual mapping instead of the original, unreferenced mapping:

For example, if an identity mapping happened to be optimal, it would be easier for a stack of nonlinear layers to push the residual toward zero (and thereby realize the identity mapping) than to fit the identity mapping directly.

Deep residual nets are easier to optimize than their plain counterparts and reach lower training error.
Deep residual nets can gain better representations from considerably increased depth, which improves classification accuracy.
2 solution
What is ResNet trying to do? In the paper's own words: learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. If that does not quite click yet, keep reading.
2.1 Residual Learning
Premise: if we assume that several stacked nonlinear layers can asymptotically approximate a complicated function, then they can equally well approximate its residual function. Let H(x) denote the desired underlying mapping of a few stacked layers; instead of asking those layers to fit H(x) directly, ResNet has them fit the residual F(x) := H(x) - x, so the original mapping is recast as F(x) + x.
2.2 Identity Mapping by Shortcuts
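As a brief recap of the paper's formulation for this section: a building block is defined as

y = F(x, {W_i}) + x

when the input x and the output of the stacked layers F(x, {W_i}) have the same dimensions; when the dimensions differ, a linear projection W_s is applied on the shortcut so that the element-wise addition still works:

y = F(x, {W_i}) + W_s x

The identity shortcut introduces neither extra parameters nor extra computation, and F can consist of two or more layers, e.g. F = W_2 σ(W_1 x) with σ denoting ReLU (biases omitted for simplicity).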

2.3 Network Architecture

for the same output feature map size, the layers have the same number of filters
if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer
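A quick back-of-the-envelope check of the second rule (my own arithmetic, not spelled out in the paper's text): the multiply-add cost of a 3x3 conv layer scales roughly as

cost ∝ H · W · C_in · C_out · 3²

so halving the feature map while doubling the filters gives

(H/2) · (W/2) · (2C) · (2C) · 3² = H · W · C² · 3²

i.e. the per-layer time complexity is indeed preserved.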
3 dataset and experiments
3.1 ImageNet Classification
3.1.1 Comparison with plain networks
This experiment is the core one: it is meant to show that the residual network can solve the degradation problem brought by increased depth almost perfectly!
First, overfitting is ruled out, because the training error rises as well.
Second, vanishing gradients are ruled out: the networks use batch normalization, and the authors also verified experimentally that the gradients remain healthy.
In fact, the 34-layer plain network still reaches fairly competitive accuracy, which shows that it does work to a certain degree.
The authors conjecture: "We conjecture that the deep plain nets may have exponentially low convergence rates." That is, adding layers may slow convergence down, possibly by an exponential factor. Below is the quantitative comparison between the residual networks and the plain networks:

ResNet, in contrast, genuinely benefits from the added layers: both training error and validation error drop, demonstrating that network depth really can improve performance. The degradation problem is solved to a large extent.
Compared with the plain 34-layer network, the 34-layer ResNet lowers the top-1 error rate by 3.5%. ResNet achieves this lower error rate without adding any parameters, so the network is more efficient.
Comparing the 18-layer plain and residual networks, the error rates are close, but the ResNet converges faster.
3.1.2 Identity vs. Projection Shortcuts
A projection shortcut is a shortcut that contains learnable parameters, which can be used to match dimensions so that the element-wise addition is still possible: y = F(x, {W_i}) + W_s x.
3.1.3 Deeper Bottleneck Architectures.
To explore deeper networks while keeping the training time affordable, the authors redesigned the building block into a bottleneck version.
The goal here is to probe how deep the architecture can really go, not to chase the lowest possible error rate, hence the leaner bottleneck building block (a rough sketch follows below).
50 layers: each 2-layer block of the 34-layer network is replaced by a 3-layer bottleneck block.
101/152 layers: more 3-layer bottleneck blocks are stacked.
The detailed per-layer configurations are listed in the architecture table of the paper.
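To make the bottleneck idea concrete, here is a rough PyTorch sketch of such a 3-layer block (my own illustrative code in the style of common ResNet implementations, not the paper's reference code): a 1x1 conv reduces the channel count, a 3x3 conv works on the reduced representation, and a second 1x1 conv restores a 4x wider output before the shortcut addition.

import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 -> 1x1 expand; the paper's ImageNet ResNets use expansion = 4.
    expansion = 4

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)
        self.downsample = downsample   # projection shortcut when input/output shapes differ

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity                # residual addition
        return F.relu(out)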


3.2 CIFAR-10 Experiments and Analysis


The residual functions (i.e., F(x) = H(x) - x) might be generally closer to zero than the non-residual functions.
When there are more layers, an individual layer of ResNets tends to modify the signal less. (In other words, the later layers gradually approach an identity mapping: the residual left to fit keeps shrinking as the signal gets closer to the target.)
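One rough way to check this on a trained model (a hypothetical sketch of my own, in the spirit of the paper's layer-response analysis rather than its exact procedure): hook the outputs of the BatchNorm layers, which sit after each 3x3 conv and before the nonlinearity, and compare their standard deviations.

import torch
import torch.nn as nn

def layer_response_stds(model, x):
    # Collect the std of every BatchNorm2d output for one forward pass.
    stds, hooks = [], []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(
                lambda mod, inp, out, s=stds: s.append(out.std().item())))
    model.eval()
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return stds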
4 code review
ResNet is very easy to implement and there are countless implementations online; here we simply pick one (a CIFAR-10 version) for reference. First, the 2-layer BasicBlock (the imports and a minimal LambdaLayer helper are added here so that the snippet is self-contained):

import torch.nn as nn
import torch.nn.functional as F


class LambdaLayer(nn.Module):
    # Minimal wrapper that turns a lambda into an nn.Module (used for the option-A shortcut below).
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd

    def forward(self, x):
        return self.lambd(x)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1, option='A'):
        super(BasicBlock, self).__init__()
        # Two 3x3 convolutions; the first may downsample via its stride.
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != planes:
            if option == 'A':
                # For CIFAR-10 the ResNet paper uses option A:
                # subsample spatially and zero-pad the extra channels (parameter-free).
                self.shortcut = LambdaLayer(lambda x:
                    F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, planes // 4, planes // 4), "constant", 0))
            elif option == 'B':
                # Option B: projection shortcut (1x1 conv + BN) to match dimensions.
                self.shortcut = nn.Sequential(
                    nn.Conv2d(in_planes, self.expansion * planes, kernel_size=1, stride=stride, bias=False),
                    nn.BatchNorm2d(self.expansion * planes)
                )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)        # residual addition
        out = F.relu(out)
        return out
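A quick sanity check of the block (hypothetical usage, assuming the definitions above): with stride 2 the block halves the spatial size, and the option-A shortcut subsamples and zero-pads the input channels so the element-wise addition still lines up.

import torch

block = BasicBlock(in_planes=16, planes=32, stride=2, option='A')
x = torch.randn(1, 16, 32, 32)       # a dummy feature map
y = block(x)
print(y.shape)                       # torch.Size([1, 32, 16, 16])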
The ResNet skeleton. Reading the forward function:
Head: a single conv layer.
Body: layer1, layer2, layer3, each built from n BasicBlocks; since every BasicBlock contains 2 conv layers, the body contributes 6n layers.
Tail: a single fully connected layer.
The total depth is therefore 6n + 2 (e.g., n = 3 gives ResNet-20).
def _weights_init(m):
    # Weight-initialization helper referenced below. The original post does not show it;
    # a minimal version (Kaiming init for conv/linear layers) is assumed here.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)


class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):
        super(ResNet, self).__init__()
        self.in_planes = 16

        # Head: a single 3x3 conv on the 3-channel input.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(16)
        # Body: three stages; spatial size halves and width doubles at each new stage.
        self.layer1 = self._make_layer(block, 16, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 32, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 64, num_blocks[2], stride=2)
        # Tail: a single fully connected classifier.
        self.linear = nn.Linear(64, num_classes)

        self.apply(_weights_init)

    def _make_layer(self, block, planes, num_blocks, stride):
        # Only the first block of a stage downsamples; the remaining blocks use stride 1.
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = F.avg_pool2d(out, out.size()[3])   # global average pooling
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out
Finally, like stacking building blocks, ResNets of different depths are obtained simply by setting the number of BasicBlocks in layer1, layer2, layer3:

def resnet20():
    return ResNet(BasicBlock, [3, 3, 3])

def resnet32():
    return ResNet(BasicBlock, [5, 5, 5])

def resnet44():
    return ResNet(BasicBlock, [7, 7, 7])

def resnet56():
    return ResNet(BasicBlock, [9, 9, 9])

def resnet110():
    return ResNet(BasicBlock, [18, 18, 18])

def resnet1202():
    return ResNet(BasicBlock, [200, 200, 200])
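As a final sanity check (hypothetical usage, assuming the code above): build resnet20 and push a CIFAR-10-sized batch through it; the 6n + 2 counting from earlier gives 20 layers for n = 3.

import torch

model = resnet20()
x = torch.randn(2, 3, 32, 32)                        # a dummy CIFAR-10 batch
logits = model(x)
print(logits.shape)                                  # torch.Size([2, 10])
print(sum(p.numel() for p in model.parameters()))    # roughly 0.27M parameters, as reported in the paper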
5 conclusion
The core of ResNet is residual learning plus identity shortcut mapping. The implementation is extremely simple, yet the results are extremely strong: it beats the previous SOTA by large margins on classification, object detection, and other tasks. Such a general innovation is very hard to come by, which is presumably why ResNet is so highly regarded. My personal takeaway: it is no longer just "talk is cheap, show me the code", but rather "code is also relatively cheap, show me your sense and thinking"!