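These are my study notes on the PyTorch tutorial Dynamic Quantization. They show how to use Dynamic Quantization to speed up an LSTM model, covering the code steps and a performance comparison.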

This post was first published on CSDN.

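This post is my study notes on the PyTorch tutorial Dynamic Quantization (PyTorch Tutorials 1.11.0+cu102 documentation). In fact, because I am unfamiliar with this area, the notes are essentially a translation plus a little of my own understanding.
This post shows how to use Dynamic Quantization to speed up an LSTM. Code: all-notes-in-one/dynamicquantization.ipynb at main · PolarisRisingWar/all-notes-in-one (my environment is Python 3.8 + PyTorch 1.8.2 + CUDA 11.1 (cudatoolkit), installed from the pytorch-lts channel; as far as I know, most other PyTorch versions can run this code as well).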

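1. Introduction

Designing a model involves trade-offs: for example, you can adjust the number of layers or the number of RNN parameters to trade accuracy against performance (such as model size and/or model latency or throughput). Related tutorial: (beta) Dynamic Quantization on an LSTM Word Language Model (PyTorch Tutorials 1.11.0+cu102 documentation).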

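3. Code Steps

Set Up: define a simple LSTM, import the modules, and create some random input tensors.
Do the Quantization: instantiate a floating-point model and create a quantized version of it.
Look at Model Size: this step shows that the model gets smaller.
Look at Latency: run both models and compare their running time (latency).
Look at Accuracy: run both models and compare their outputs.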

3.1 Set Up

# import the modules used here in this recipe
import torch
import torch.quantization
import torch.nn as nn
import copy
import os
import time

# define a very, very simple LSTM for demonstration purposes
# in this case, we are wrapping nn.LSTM, one layer, no pre or post processing
# inspired by
# https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html, by Robert Guthrie
# and https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
class lstm_for_demonstration(nn.Module):
    """Elementary Long Short Term Memory style model which simply wraps nn.LSTM.

    Not to be used for anything other than demonstration.
    """

    def __init__(self, in_dim, out_dim, depth):
        super(lstm_for_demonstration, self).__init__()
        self.lstm = nn.LSTM(in_dim, out_dim, depth)

    def forward(self, inputs, hidden):
        out, hidden = self.lstm(inputs, hidden)
        return out, hidden


torch.manual_seed(29592)  # set the seed for reproducibility

#shape parameters
model_dimension=8
sequence_length=20
batch_size=1
lstm_depth=1

# random data for input
inputs = torch.randn(sequence_length,batch_size,model_dimension)
# hidden is actually a tuple of the initial hidden state and the initial cell state
hidden = (torch.randn(lstm_depth, batch_size, model_dimension),
          torch.randn(lstm_depth, batch_size, model_dimension))

3.2 Do the Quantization

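In this part we use the torch.quantization.quantize_dynamic() function.
It takes the model, the set of submodule types to quantize (wherever they occur in the model), and the target datatype, and returns a quantized version of the original model as a new nn.Module.

# here is our floating point instance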
float_lstm = lstm_for_demonstration(model_dimension, model_dimension,lstm_depth)

# this is the call that does the work
quantized_lstm = torch.quantization.quantize_dynamic(
    float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

# show the changes that were made
print('Here is the floating point version of this module:')
print(float_lstm)
print('')
print('and now the quantized version:')
print(quantized_lstm)

Output:

Here is the floating point version of this module:
lstm_for_demonstration(
  (lstm): LSTM(8, 8)
)

and now the quantized version:
lstm_for_demonstration(
  (lstm): DynamicQuantizedLSTM(8, 8)
)
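The call above passes {nn.LSTM, nn.Linear} even though the demo model contains only an LSTM. As a minimal sketch of how the type set selects what gets converted, the example below builds a hypothetical model (TinyTagger, my own name, not from the tutorial) with both layer types and quantizes only the Linear layer. It assumes a PyTorch build with a quantized CPU backend available.

```python
import torch
import torch.nn as nn


class TinyTagger(nn.Module):
    """Hypothetical model with one LSTM followed by one Linear layer."""

    def __init__(self, in_dim=8, hidden_dim=8, out_dim=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, inputs):
        out, _ = self.lstm(inputs)
        return self.fc(out)


float_model = TinyTagger()

# Quantize only the Linear layer; the LSTM stays in floating point
# because nn.LSTM is not in the set of types passed in.
linear_only = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)
print(linear_only)

# quantize_dynamic works on a copy by default (inplace=False),
# so the original model still holds an ordinary nn.Linear.
print(type(float_model.fc) is nn.Linear)
```

Printing linear_only should show the LSTM unchanged and the Linear layer replaced by its dynamically quantized counterpart; the exact class name in the printout may differ between PyTorch versions.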

3.3 Look at Model Size

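Now that we have quantized the model, the FP32 model parameters have been replaced with INT8 values (plus some recorded scale factors), which means roughly 75% less data to store. With the default values used in this post the reduction is smaller than 75%, but if you increase the model size (for example, by setting the model dimension to 80), the quantized model converges toward 25% of the original size (4x smaller), because the stored size is then dominated by the parameter values.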
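To make "INT8 values plus scale factors" more concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization. This is my own simplified illustration and not PyTorch's actual kernel, which may use a different scheme (for example an affine mapping with a zero point); the helper names quantize_symmetric and dequantize are hypothetical.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor.

    Simplified illustration only; not PyTorch's real implementation.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale factor."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]
q_weights, scale = quantize_symmetric(weights)
restored = dequantize(q_weights, scale)

print("int8 values:", q_weights)
print("scale:", scale)
print("max round-trip error:",
      max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as one byte instead of four, and the single float scale is all that is needed to map the integers back. The round-trip error stays within half a quantization step, which is why the outputs later in this post change only slightly.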

# save the model to a temporary file, measure its size, then delete the file
def print_size_of_model(model, label=""):
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print("model: ", label, '\t', 'Size (KB):', size / 1e3)
    os.remove('temp.p')
    return size

# compare the sizes
f=print_size_of_model(float_lstm,"fp32")
q=print_size_of_model(quantized_lstm,"int8")
print("{0:.2f} times smaller".format(f/q))

Output:

model:  fp32  	 Size (KB): 4.051
model:  int8  	 Size (KB): 2.963
1.37 times smaller

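As stated earlier in this section, with the default settings the quantized model is still well above 25% of the original size: 2.963 KB / 4.051 KB ≈ 73%.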
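A rough back-of-the-envelope sketch of why the ratio approaches 4x as the model grows: the saved file contains the parameters plus some fixed serialization overhead, and the overhead matters less as the parameter count rises. The overhead_bytes value below is a made-up assumption for illustration, and treating every parameter as exactly one byte after quantization is a simplification (a real quantized checkpoint may keep some values in floating point). The parameter count follows the standard nn.LSTM layout of four gates, each with input weights, hidden weights, and two bias vectors.

```python
def lstm_param_count(input_size, hidden_size, num_layers=1):
    """Number of parameters in a unidirectional nn.LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, hidden weights and two bias vectors
        total += 4 * hidden_size * (layer_input + hidden_size + 2)
    return total


def estimated_size_ratio(num_params, overhead_bytes=1500):
    """fp32 size / int8 size, assuming a fixed (hypothetical) file overhead."""
    fp32_bytes = 4 * num_params + overhead_bytes
    int8_bytes = 1 * num_params + overhead_bytes
    return fp32_bytes / int8_bytes


for dim in (8, 80, 800):
    n = lstm_param_count(dim, dim)
    print("dim={:4d}  params={:8d}  estimated ratio={:.2f}".format(
        dim, n, estimated_size_ratio(n)))
```

The estimate will not reproduce the measured 1.37x exactly, since the real overhead differs between the two files, but it shows the same trend: a small ratio at dimension 8 and a ratio close to 4 at dimension 80 and above.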

3.4 Look at Latency

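The quantized model should also run faster, for at least two reasons:

Less time is spent moving parameter data.
INT8 operations are faster.

You should see a speedup even on this simple model (that is what the tutorial says; in my experiment the original model was actually faster...), and complex models usually gain more. That said, many factors affect latency.

Original model: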

print("Floating point FP32")
%timeit float_lstm.forward(inputs, hidden)

Output:

Floating point FP32
830 µs ± 9.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Quantized model:

print("Quantized INT8")
%timeit quantized_lstm.forward(inputs,hidden)

Output:

Quantized INT8
913 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
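The %timeit lines above only work in IPython or Jupyter. As a sketch of how the same comparison could be run from a plain Python script, the helper below uses the standard-library timeit module. The name time_call is hypothetical, and unlike %timeit (which reports mean ± standard deviation) it reports the best run, so the numbers are not directly comparable with the outputs above.

```python
import timeit


def time_call(fn, number=1000, repeat=7):
    """Return the best average time per call, in microseconds."""
    # timeit.repeat runs `fn` `number` times, `repeat` times over,
    # and returns the total seconds of each run.
    totals = timeit.repeat(fn, number=number, repeat=repeat)
    return min(totals) / number * 1e6


# Stand-in workload so the sketch runs on its own; replace it with e.g.
# lambda: float_lstm(inputs, hidden) to time the models from this post.
def workload():
    return sum(range(1000))


print("{:.2f} µs per call".format(time_call(workload, number=200, repeat=3)))
```

To compare the two models, one would call time_call(lambda: float_lstm(inputs, hidden)) and time_call(lambda: quantized_lstm(inputs, hidden)) after running the Set Up and quantization code.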

3.5 Look at Accuracy

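Because the model is randomly initialized rather than trained, we will not rigorously measure the change in accuracy (it would be meaningless). We can, however, take a quick look and check that the quantized network produces roughly the same outputs as the original.
For more analysis, see the advanced tutorial listed at the end of this post.

Compute the mean absolute value of each output and compare; the difference turns out to be small: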

# run the float model
out1, hidden1 = float_lstm(inputs, hidden)
mag1 = torch.mean(abs(out1)).item()
print('mean absolute value of output tensor values in the FP32 model is {0:.5f} '.format(mag1))

# run the quantized model
out2, hidden2 = quantized_lstm(inputs, hidden)
mag2 = torch.mean(abs(out2)).item()
print('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2))

# compare them
mag3 = torch.mean(abs(out1-out2)).item()
print('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3,mag3/mag1*100))

Output:

mean absolute value of output tensor values in the FP32 model is 0.12887 
mean absolute value of output tensor values in the INT8 model is 0.12912
mean absolute value of the difference between the output tensors is 0.00156 or 1.21 percent
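The mean can hide a few large deviations, so it may also be worth looking at the worst-case element. The sketch below wraps the comparison in a small helper (compare_outputs is my own hypothetical name, not part of the tutorial) that reports the mean difference, the maximum difference, and the mean difference as a percentage of the reference magnitude.

```python
import torch


def compare_outputs(reference, candidate):
    """Summarize how far `candidate` is from `reference` (hypothetical helper)."""
    diff = (reference - candidate).abs()
    mean_ref = reference.abs().mean().item()
    return {
        "mean_abs_diff": diff.mean().item(),
        "max_abs_diff": diff.max().item(),
        "mean_diff_percent": diff.mean().item() / mean_ref * 100,
    }


# Stand-in tensors so the sketch runs on its own; in this post you would
# pass out1 (FP32 output) and out2 (INT8 output) instead.
reference = torch.tensor([1.0, -2.0, 3.0])
candidate = reference + 0.1
print(compare_outputs(reference, candidate))
```

Calling compare_outputs(out1, out2) after the code above would give the same mean difference and percentage as the printed output, plus the largest single-element gap.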

6. Other References Provided in the Tutorial

(beta) Dynamic Quantization on an LSTM Word Language Model Tutorial: I plan to write a study-notes post on this one.
Quantization API Documentation
(beta) Dynamic Quantization on BERT: I plan to write a study-notes post on this one as well.
Introduction to Quantization on PyTorch | PyTorch

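7. Other References I Found Myself

Both of these are well written. Since I don't know the field well, I can't really tell which is better; the first seems more precise and in-depth, while the second reads more like a popular introduction. If I need to understand model quantization more deeply in the future, I will read more material and learn more.

Footnotes

I plan to write posts about LSTM models later, especially their use in PyTorch, including the official PyTorch tutorials referenced in the code comments below. I am leaving a placeholder here and will fill it in as further reading once those posts are written.
What is Latency in Machine Learning (ML)?
PLASTER: A Framework for Deep Learning Performance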

