Solving a Binary Classification Problem in Python Deep Learning with Keras

This article shows how to build a Keras model for a binary classification problem in Python deep learning, using the IMDB dataset for the machine learning experiments.

WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

This is part of a continuing series of notes on the highlights of the book Deep Learning with Python, shared purely as study notes.

This is the second installment: solving a binary classification problem with Keras, using the IMDB dataset built into Keras.

Solving a Binary Classification Problem in Python Deep Learning with Keras

  • The last layer of a binary classifier uses sigmoid as its activation function
  • The loss is binary_crossentropy (binary cross-entropy)

Environment: Python 3.9.13 + Keras 2.12.0 + TensorFlow 2.12.0

In [1]:

import pandas as pd
import numpy as np

import tensorflow as tf
from keras.datasets import imdb  # built-in dataset

from keras import models
from keras import layers
from keras import optimizers  # optimizers
from tensorflow.keras.utils import to_categorical  # one-hot encoding

# from tensorflow.keras import optimizers
# Modification 1:
# from tensorflow.python.keras.optimizers import rmsprop_v2

Importing the IMDB data

The IMDB dataset comes from the Internet Movie Database (IMDB) and is a classic benchmark for sentiment analysis. The version built into Keras contains 50,000 highly polarized movie reviews: 25,000 for training and 25,000 for testing, and each split is balanced with 50% positive and 50% negative reviews.

The reviews are already preprocessed: each one is a sequence of integers, where every integer stands for a specific word in a dictionary. That makes the dataset ideal for experimenting with text binary classification, which is exactly the task here: predict from the text of a review whether it is positive or negative.

The IMDB dataset ships with the Keras library:

In [2]:

from keras.datasets import imdb

In [3]:

# keep only the 10,000 most frequently occurring words in the training data
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [4]:

train_data[:2]

Out[4]:

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95])],
      dtype=object)

Both label arrays hold binary 0/1 labels: 0 stands for negative (neg) and 1 for positive (pos).

In [5]:

train_labels[:3]  

Out[5]:

array([1, 0, 0], dtype=int64)

In [6]:

test_labels[:3]

Out[6]:

array([0, 1, 1], dtype=int64)
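As a quick sanity check (an addition to the original notebook), np.bincount counts how many of each label the two splits contain; both should come out balanced at 12,500 negative and 12,500 positive:

np.bincount(train_labels), np.bincount(test_labels)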

Keeping only the top 10,000 words means that no word index exceeds 9,999:

In [7]:

max([max(sequence) for sequence in train_data])

Out[7]:

9999

Converting between words and indices:

In [8]:

word_index = imdb.get_word_index()

reverse_word_index = dict((value, key) for (key, value) in word_index.items())  # invert the word -> index mapping
reverse_word_index

# Result (truncated):

{34701: 'fawn',
 52006: 'tsukino',
 52007: 'nunnery',
 16816: 'sonja',
 63951: 'vani',
 1408: 'woods',
 16115: 'spiders',
 2345: 'hanging',
 2289: 'woody',
 52008: 'trawling',
 52009: "hold's",
 11307: 'comically',
 40830: 'localized'
 .......
 }

Decoding a review back into English words:

In [9]:

decoded_review = ' '.join([reverse_word_index.get(i - 3, "?") for i in train_data[0]])  # offset of 3: indices 0-2 are reserved
decoded_review

Out[9]:

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

Encoding the integer sequences of the training set

Encode the integer sequences as a binary matrix:

In [10]:

import numpy as np

def vectorize_sequences(seq, dim=10000):
    """
    seq: input sequences (lists of word indices)
    dim: dimensionality of the encoding, 10,000 here
    """
    results = np.zeros((len(seq), dim))  # all-zero matrix of shape len(seq) * dim
    for i, s in enumerate(seq):
        results[i, s] = 1.  # set the positions listed in s to 1; words that never appear stay 0
    return results

X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)

In [11]:

X_train[0]

Out[11]:

array([0., 1., 1., ..., 0., 0., 0.])
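A quick check (not in the original notebook) that the multi-hot encoding behaved as intended: every index that appears in the first review should now be a 1 in the first row, and the row sum equals the number of distinct word indices in that review:

assert all(X_train[0][j] == 1.0 for j in train_data[0])
int(X_train[0].sum())  # number of distinct word indices in the first review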

Vectorizing the labels

In [12]:

y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")

With the training and test sets both processed, the data can now be fed into the neural network:

Building the network

In [13]:

from keras import models
from keras import layers

In [14]:

X_train.shape

Out[14]:

(25000, 10000)

Why do deep learning models need activation functions?

  1. Activation functions are a core component of neural networks: they introduce non-linearity, which is what allows a network to learn non-linear patterns.
  2. Without them, a stack of Dense layers can only represent linear transformations, which severely limits what the network can model. Activation functions let neurons respond to their inputs non-linearly, so the network can fit complex data distributions.
  3. They also shape how gradients flow: in backpropagation, the derivative of the activation enters the chain rule, which affects how well the network can learn and optimize.
  4. Finally, some activation functions bound the output range: sigmoid squashes values into (0, 1), which is why the last layer of this model can be read as a probability. (ReLU, by contrast, maps negative inputs to 0 and leaves positive ones unbounded.) Choosing activations carefully also helps mitigate vanishing or exploding gradients.

In short, activation functions introduce non-linearity, shape gradient flow, and control output ranges, all of which matter for the performance of a neural network.
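As a quick numerical illustration of the two activations used below (a self-contained sketch, not part of the original notebook):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # negatives clipped to 0; range [0, +inf)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashed into (0, 1), readable as a probability

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(sigmoid(x))  # approximately [0.119 0.5 0.881]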

In [15]:

model = models.Sequential()

model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

Compiling the network (compile)

Configure the optimizer and the loss, then compile the network.

In [16]:

# Option 1: string identifiers

model.compile(optimizer='rmsprop',  # optimizer
              loss='binary_crossentropy',  # binary cross-entropy
              metrics=['accuracy']   # evaluation metric
             )

In [17]:

# Option 2: object instances (modified from the book)

model.compile(
    # original code in the book:
    # optimizer= optimizers.RMSprop(lr=0.001),
    optimizer= tf.keras.optimizers.RMSprop(learning_rate=0.001),   # use the tf.keras path; lr has been renamed learning_rate
    loss='binary_crossentropy',  # binary cross-entropy
    metrics=['accuracy']   # full metric name
)

Training the model (fit)

In [18]:

# Hold out a validation set; the rest becomes the actual training set

x_val = X_train[:10000]  # first 10,000 samples: validation set
partial_x_train = X_train[10000:]  # everything after: actual training set

y_val = y_train[:10000]
partial_y_train = y_train[10000:]  # actual training labels

In [19]:

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                   )
Epoch 1/20
30/30 [==============================] - 1s 23ms/step - loss: 0.5108 - accuracy: 0.7746 - val_loss: 0.3802 - val_accuracy: 0.8653
Epoch 2/20
30/30 [==============================] - 0s 13ms/step - loss: 0.3102 - accuracy: 0.8959 - val_loss: 0.3066 - val_accuracy: 0.8850
Epoch 3/20
30/30 [==============================] - 0s 14ms/step - loss: 0.2343 - accuracy: 0.9200 - val_loss: 0.2997 - val_accuracy: 0.8815
Epoch 4/20
30/30 [==============================] - 0s 14ms/step - loss: 0.1912 - accuracy: 0.9371 - val_loss: 0.2921 - val_accuracy: 0.8828
......
Epoch 19/20
30/30 [==============================] - 0s 13ms/step - loss: 0.0164 - accuracy: 0.9975 - val_loss: 0.5394 - val_accuracy: 0.8723
Epoch 20/20
30/30 [==============================] - 0s 11ms/step - loss: 0.0165 - accuracy: 0.9971 - val_loss: 0.5563 - val_accuracy: 0.8714

The History object

In [20]:

his_dict = history.history  # a plain dictionary

In [21]:

his_dict.keys()

Out[21]:

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
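Each key maps to one value per epoch, so the dictionary can be queried directly. For example (an addition to the original notebook), the epoch with the lowest validation loss, which is one way to justify the epochs=4 choice made below:

best_epoch = int(np.argmin(his_dict["val_loss"])) + 1  # +1 because epochs are 1-indexed
best_epoch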

Model overview (summary)

In [22]:

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 16)                160016    
                                                                 
 dense_1 (Dense)             (None, 16)                272       
                                                                 
 dense_2 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
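The parameter counts follow from Dense layer arithmetic: each layer holds inputs × units weights plus one bias per unit, which reproduces the numbers in the summary:

# Dense parameters = inputs * units + units (one bias per unit)
print(10000 * 16 + 16)  # dense:   160016
print(16 * 16 + 16)     # dense_1: 272
print(16 * 1 + 1)       # dense_2: 17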

Evaluating the model

In [23]:

model.evaluate(X_test, y_test)
782/782 [==============================] - 1s 880us/step - loss: 0.6017 - accuracy: 0.8582

Out[23]:

[0.601686954498291, 0.8581600189208984]
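evaluate() returns the values in the order given by model.metrics_names, so the first number is the loss and the second the accuracy:

model.metrics_names  # ['loss', 'accuracy']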

Visualizing the metrics

In [24]:

import matplotlib.pyplot as plt

loss = his_dict["loss"]
val_loss = his_dict["val_loss"]
acc = his_dict["accuracy"]
val_acc = his_dict["val_accuracy"]

In [25]:

epochs = range(1, len(loss) + 1)  # x-axis values

In [26]:

# 1. Loss

plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.title("Training and Validation Loss")
plt.show()

[Figure: Training and Validation Loss]

# 2. Accuracy

plt.clf()  # clear the previous figure
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.xlabel("Epochs")
plt.ylabel("Acc")
plt.legend()

plt.title("Training and Validation Acc")
plt.show()

[Figure: Training and Validation Acc]

Retraining

As training progresses, the loss keeps shrinking and the accuracy keeps climbing on the training set, but the validation set tells a different story.

In other words, the model performs very well on the training data but much worse on the validation data: it is overfitting.

Retrain a fresh model for only 4 epochs (epochs=4).

Training for a fixed number of epochs

In [28]:

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)))  # input dim 10000 (the original text misprints 1000)
model.add(layers.Dense(16, activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))

# compile the model (the book uses metrics=["acc"]; "accuracy" is the full name
# and only changes the key under which the metric is stored in history)
model.compile(optimizer='rmsprop',  # optimizer
              loss='binary_crossentropy',  # binary cross-entropy
              metrics=['accuracy']   # evaluation metric
             )

# train on the full training set this time; note that x_val was sliced from
# X_train above, so the validation data overlaps the training data here
history = model.fit(X_train,
                    y_train,
                    epochs=4,
                    batch_size=512,
                    validation_data=(x_val, y_val)
                   )
Epoch 1/4
49/49 [==============================] - 1s 16ms/step - loss: 0.4807 - accuracy: 0.8072 - val_loss: 0.3134 - val_accuracy: 0.9003
Epoch 2/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2772 - accuracy: 0.9021 - val_loss: 0.2212 - val_accuracy: 0.9259
Epoch 3/4
49/49 [==============================] - 1s 11ms/step - loss: 0.2173 - accuracy: 0.9200 - val_loss: 0.1930 - val_accuracy: 0.9283
Epoch 4/4
49/49 [==============================] - 0s 10ms/step - loss: 0.1840 - accuracy: 0.9324 - val_loss: 0.1472 - val_accuracy: 0.9544

Predictions and visualization

Predictions from the final model:

In [29]:

results = model.predict(X_test)
results
782/782 [==============================] - 1s 790us/step

Out[29]:

array([[0.19428788],
       [0.9998849 ],
       [0.8095433 ],
       ...,
       [0.1104579 ],
       [0.07548532],
       [0.65479356]], dtype=float32)

The network is very confident about some samples, with probabilities like 0.998 (predicting 1) or 0.1 (predicting 0); about others it is ambivalent.
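To quantify this (an addition to the original notebook), count how many predicted probabilities fall inside an ambiguous band such as 0.4 to 0.6:

ambiguous = int(((results > 0.4) & (results < 0.6)).sum())
ambiguous, len(results)  # ambiguous predictions vs. total test samples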

In [30]:

results.flatten()  # flatten the 2-D predictions into 1-D

Out[30]:

array([0.19428788, 0.9998849 , 0.8095433 , ..., 0.1104579 , 0.07548532,
       0.65479356], dtype=float32)

Use np.round to turn the probabilities directly into 0/1 classes (thresholding at 0.5):

In [31]:

y_predict = np.round(results.flatten())
y_predict

Out[31]:

array([0., 1., 1., ..., 0., 0., 1.], dtype=float32)

In [32]:

y_test

Out[32]:

array([0., 1., 1., ..., 0., 0., 0.], dtype=float32)
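Before turning to sklearn, the agreement between predictions and true labels can be computed directly:

(y_predict == y_test).mean()  # fraction of test reviews classified correctly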

In [33]:

from sklearn.metrics import classification_report, confusion_matrix

In [34]:

confusion_matrix(y_predict, y_test)  # confusion matrix; sklearn's convention is (y_true, y_pred), so rows here are indexed by the predicted label

Out[34]:

array([[11169,  1512],
       [ 1331, 10988]], dtype=int64)
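Because the arguments were passed as (y_predict, y_test), the rows of this matrix are indexed by the predicted label and the columns by the true one. Precision and recall for the positive class can be read off it directly (a sketch added for illustration):

cm = confusion_matrix(y_predict, y_test)
tp = cm[1, 1]                   # predicted positive and actually positive
precision_1 = tp / cm[1].sum()  # correct fraction of everything predicted positive
recall_1 = tp / cm[:, 1].sum()  # caught fraction of everything actually positive
precision_1, recall_1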

In [35]:

print(classification_report(y_predict, y_test))  # arguments swapped relative to (y_true, y_pred) here as well
              precision    recall  f1-score   support

         0.0       0.89      0.88      0.89     12681
         1.0       0.88      0.89      0.89     12319

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000

In [36]:

import seaborn as sns

sns.heatmap(confusion_matrix(y_predict, y_test),  # confusion matrix
            annot=True,  # write the count in each cell
            # cmap=plt.cm.Blues,
            fmt='.0f'  # integer format
           )

plt.show()

[Figure: confusion-matrix heatmap]
