Origin of the Problem
Here is how it came about:
Originally, while presenting this paper, we just redrew the figure from the article as shown above, so readers could better understand the model structure.
But a senior student in our paper-reading group raised the following question:
Below is what the original ControlNet paper says:
Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert the entire dataset of $512 \times 512$ images into smaller $64 \times 64$ "latent images" for stabilized training.
This requires ControlNets to convert image-based conditions to $64 \times 64$ feature space to match the convolution size.
We use a tiny network $\mathcal{E}(\cdot)$ of four convolution layers with $4 \times 4$ kernels and $2 \times 2$ strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}} = \mathcal{E}(c_{\mathrm{i}})$$
Roughly, this says: to stay consistent with Stable Diffusion, the control condition is also compressed into the latent space, via a small encoder network that downsamples four times, from $512 \times 512$ down to $64 \times 64$.
However, after four stride-2 convolutions, the result should actually be $32 \times 32$, not $64 \times 64$. The paper's description is inconsistent here.
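A quick arithmetic check makes the discrepancy concrete (a minimal sketch, assuming each stride-2 convolution with a size-preserving kernel/padding combination halves the spatial size):

```python
# Each stride-2 conv halves the spatial size, so n halvings divide by 2**n.
size = 512
for _ in range(4):      # four halvings, as the paper's wording implies
    size //= 2
print(size)             # 32, not the 64 that Stable Diffusion's latents need

print(512 // 2 ** 3)    # 64: three halvings are what actually reach 64x64
```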
With no other option, we went to read the paper's source code:
ControlNet/cldm.py at main · lllyasviel/ControlNet (github.com)
The Code
This part is implemented as follows:
```python
self.input_hint_block = TimestepEmbedSequential(
    conv_nd(dims, hint_channels, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 16, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 16, 32, 3, padding=1, stride=2),
    nn.SiLU(),
    conv_nd(dims, 32, 32, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 32, 96, 3, padding=1, stride=2),
    nn.SiLU(),
    conv_nd(dims, 96, 96, 3, padding=1),
    nn.SiLU(),
    conv_nd(dims, 96, 256, 3, padding=1, stride=2),
    nn.SiLU(),
    zero_module(conv_nd(dims, 256, model_channels, 3, padding=1)),
)
```
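To double-check, we can trace the spatial size through every convolution in `input_hint_block` using the standard output-size formula, without needing PyTorch at all (a sketch; the `(kernel, padding, stride)` triples below are transcribed from the layer definitions above):

```python
# (kernel, padding, stride) for each conv in input_hint_block, in order
layers = [
    (3, 1, 1),  # hint_channels -> 16
    (3, 1, 1),  # 16 -> 16
    (3, 1, 2),  # 16 -> 32   (downsamples)
    (3, 1, 1),  # 32 -> 32
    (3, 1, 2),  # 32 -> 96   (downsamples)
    (3, 1, 1),  # 96 -> 96
    (3, 1, 2),  # 96 -> 256  (downsamples)
    (3, 1, 1),  # 256 -> model_channels (the zero conv)
]

size = 512
for kernel, padding, stride in layers:
    size = (size - kernel + 2 * padding) // stride + 1
print(size)  # 64: only the three stride-2 layers shrink the feature map
```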
The implementation of `conv_nd` is straightforward: the author wraps 1D, 2D, and 3D convolutions behind a single helper.
The code is as follows:
```python
def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")
```
According to the formula $W_t = (W_{t-1} - \text{kernel} + 2 \times \text{padding}) / \text{stride} + 1$:
- `conv_nd(dims, x, y, 3, padding=1)`: $W_t = (W_{t-1} - 3 + 2)/1 + 1 = W_{t-1}$. This convolution only changes the number of channels; it has no effect on the spatial size of the tensor.
- `conv_nd(dims, x, y, 3, padding=1, stride=2)`: $W_t = (W_{t-1} - 3 + 2)/2 + 1$. This is the convolution that changes the size of the intermediate tensor, halving it.
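The two cases can be checked directly with a small helper (a sketch; `conv_out` is a hypothetical name, and the floor division matches how PyTorch rounds conv output sizes):

```python
def conv_out(w, kernel=3, padding=1, stride=1):
    """Spatial output size of a conv: (w - kernel + 2*padding) // stride + 1."""
    return (w - kernel + 2 * padding) // stride + 1

print(conv_out(512))            # 512: a 3x3 conv with padding=1 preserves size
print(conv_out(512, stride=2))  # 256: the stride=2 variant halves it
```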
So the layers that actually perform the compression are the `conv_nd(dims, x, y, 3, padding=1, stride=2)` ones, and the compression happens three times:

- $512 \times 512 \rightarrow 256 \times 256$
- $256 \times 256 \rightarrow 128 \times 128$
- $128 \times 128 \rightarrow 64 \times 64$
So, going by the code, the network in our figure should be drawn accordingly.
Following the paper's phrasing, the passage should read:
We use a tiny network $\mathcal{E}(\cdot)$ with three convolution layers of $3 \times 3$ kernels and $2 \times 2$ strides (activated by SiLU, channels are 32, 96, 256, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions $c_{\mathrm{i}}$ into feature maps with

$$c_{\mathrm{f}} = \mathcal{E}(c_{\mathrm{i}})$$
The final figure looks like this: