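This article walks through the data processing for a lung cancer risk prediction system, covering the data source, the field definitions, and the processing steps. Working through the data gives a clearer picture of how effective and practical such a system is.

I. Lung Cancer Risk Prediction

Data Processing for a Lung Cancer Prediction System

1. Background

An effective cancer prediction system lets people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data was collected from an online lung cancer prediction website.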

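2. Data description

Number of fields: 16
Number of instances: 284 (per the dataset description; the CSV loaded in the next section contains 309 rows)
Field information:
1. Gender: M (male), F (female)
2. Age: age of the patient
3. Smoking: YES=2, NO=1
4. Yellow fingers: YES=2, NO=1
5. Anxiety: YES=2, NO=1
6. Peer pressure: YES=2, NO=1
7. Chronic disease: YES=2, NO=1
8. Fatigue: YES=2, NO=1
9. Allergy: YES=2, NO=1
10. Wheezing: YES=2, NO=1
11. Alcohol consumption: YES=2, NO=1
12. Coughing: YES=2, NO=1
13. Shortness of breath: YES=2, NO=1
14. Swallowing difficulty: YES=2, NO=1
15. Chest pain: YES=2, NO=1
16. Lung cancer: YES, NO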

3. Data source

www.kaggle.com/datasets/na…

II. Data Processing

1. Reading the data

import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()

	GENDER	AGE	SMOKING	YELLOW_FINGERS	ANXIETY	PEER_PRESSURE	CHRONIC DISEASE	FATIGUE	ALLERGY	WHEEZING	ALCOHOL CONSUMING	COUGHING	SHORTNESS OF BREATH	SWALLOWING DIFFICULTY	CHEST PAIN	LUNG_CANCER
0	M	69	1	2	2	1	1	2	1	2	2	2	2	2	2	YES
1	M	74	2	1	1	1	2	2	2	1	1	1	2	2	2	YES
2	F	59	1	1	1	2	1	2	1	2	1	2	2	1	2	NO
3	M	63	2	2	2	1	1	1	1	1	2	1	1	2	2	NO
4	F	63	1	2	1	1	1	1	1	2	1	2	2	1	1	NO

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB

df.isnull().sum()

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

There are no missing values.

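2. Encoding the data

# Map the two string columns to integers and assign the result back
# (calling replace(..., inplace=True) on a column accessor is unreliable in recent pandas)
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})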
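One advantage of `Series.map` for this step is that any label missing from the mapping becomes NaN, which makes unexpected values easy to catch. The sketch below illustrates that check on three made-up rows; the `encode_binary` helper and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def encode_binary(series, mapping):
    """Map string labels to integers and fail loudly on unexpected labels."""
    encoded = series.map(mapping)
    if encoded.isna().any():
        unknown = sorted(set(series[encoded.isna()]))
        raise ValueError(f"unmapped labels: {unknown}")
    return encoded.astype(int)

# Hypothetical toy rows that mimic the two string columns of the dataset
toy = pd.DataFrame({"GENDER": ["M", "F", "M"], "LUNG_CANCER": ["YES", "NO", "YES"]})
toy["GENDER"] = encode_binary(toy["GENDER"], {"M": 1, "F": 0})
toy["LUNG_CANCER"] = encode_binary(toy["LUNG_CANCER"], {"YES": 1, "NO": 0})
print(toy)
```

Failing on an unknown label is a design choice for this sketch; silently leaving NaN in the column would only surface later, during model training.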

import matplotlib.pyplot as plt
%matplotlib inline

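3. Inspecting the distributions

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 16))

# One bar chart of value counts per column, laid out on a 4x4 grid
for i, column in enumerate(df.columns):
    row, col = divmod(i, 4)
    df[column].value_counts().plot(ax=axes[row][col], kind='bar', title=f"{column} distribution")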

The plots show that lung cancer cases far outnumber non-cases in this data, while the other fields are fairly balanced.

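4. Smoking vs. lung cancer

smoke_yes = df.loc[df.SMOKING == 2, ["SMOKING", "LUNG_CANCER"]]
smoke_no = df.loc[df.SMOKING == 1, ["SMOKING", "LUNG_CANCER"]]

# value_counts() sorts by frequency, so pin the order to [1, 0] (YES, NO)
# to keep the slices aligned with the labels (LUNG_CANCER is encoded 1 = YES, 0 = NO)
share_yes = smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0)
share_no = smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
ax1.pie(share_yes, labels=["YES", "NO"], colors=["yellow", "green"], autopct='%1.1f%%', shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")

ax2.pie(share_no, labels=["YES", "NO"], colors=["red", "green"], autopct='%1.1f%%', shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")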
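The same comparison can also be read as a table rather than two pie charts. A minimal sketch using `pandas.crosstab`; the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows: SMOKING uses the dataset's coding (2 = yes, 1 = no),
# LUNG_CANCER is already encoded (1 = yes, 0 = no)
toy = pd.DataFrame({
    "SMOKING":     [2, 2, 2, 2, 1, 1, 1, 1],
    "LUNG_CANCER": [1, 1, 1, 0, 1, 1, 0, 0],
})

# normalize="index" turns each row into shares that sum to 1,
# i.e. the cancer rate among smokers and among non-smokers
rates = pd.crosstab(toy["SMOKING"], toy["LUNG_CANCER"], normalize="index")
print(rates)
```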

5. Allergy, coughing, alcohol, swallowing difficulty, wheezing, and chest pain vs. lung cancer

import seaborn as sns
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
# Pass the data with x=: recent seaborn releases no longer accept it positionally
# The key "ALLERGY " is kept exactly as written by the author, trailing space included
sns.countplot(x=df.LUNG_CANCER, hue=df["ALLERGY "], ax=ax1, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df.COUGHING, ax=ax2, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df["ALCOHOL CONSUMING"], ax=ax3, palette=['green', 'black'])

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
sns.countplot(x=df.LUNG_CANCER, hue=df["SWALLOWING DIFFICULTY"], ax=ax1, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df.WHEEZING, ax=ax2, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df["CHEST PAIN"], ax=ax3, palette=['green', 'black'])

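6. Plotting the correlation heatmap

import seaborn as sns
plt.figure(figsize=(16,10))
# Correlations range from -1 to 1, so the color scale covers the full range
sns.heatmap(df.corr(), annot=True, cmap='viridis', vmin=-1, vmax=1)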

The heatmap shows that gender, age, and smoking have little linear correlation with lung cancer in this dataset.
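To read the target's row of the heatmap as a ranked list, the correlations with the target can be extracted and sorted by absolute value. A minimal sketch on hypothetical toy data; the column names follow the dataset, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE":         [60, 65, 70, 55, 62, 68],
    "SMOKING":     [1, 2, 1, 2, 1, 2],
    "ALLERGY":     [2, 2, 2, 1, 1, 1],
    "LUNG_CANCER": [1, 1, 1, 0, 0, 1],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["LUNG_CANCER"].drop("LUNG_CANCER")

# Order by absolute value so strong negative correlations are not buried
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```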

7. Building X and y

# Build the feature matrix X and the target y
X = df.drop(columns=["LUNG_CANCER"])
y = df["LUNG_CANCER"]

y.value_counts()

1    270
0     39
Name: LUNG_CANCER, dtype: int64
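These two counts imply a strong class imbalance. The short calculation below uses only the counts printed above and the `test_size=0.25` passed to `train_test_split` in the model section; it assumes the oversampler raises the minority class to the majority count, which is what balancing a binary problem amounts to.

```python
import math

positives, negatives = 270, 39          # counts printed by y.value_counts()
total = positives + negatives

imbalance_ratio = positives / negatives  # majority samples per minority sample
minority_share = negatives / total

# Oversampling the minority class up to the majority count
synthetic_needed = positives - negatives
balanced_total = 2 * positives

# scikit-learn rounds the test set size up: ceil(test_size * n_samples)
test_rows = math.ceil(0.25 * balanced_total)

print(f"ratio {imbalance_ratio:.2f}:1, minority share {minority_share:.1%}")
print(f"synthetic rows needed: {synthetic_needed}, balanced total: {balanced_total}")
print(f"test rows at test_size=0.25: {test_rows}")
```

The resulting 135 test rows match the total support shown in the classification report.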

sns.countplot(x=y)

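8. Balancing the classes

Restart the kernel after installing; otherwise the import that follows raises an error.

from IPython.display import clear_output
# The PyPI distribution is named imbalanced-learn; it is imported as imblearn
!pip install imbalanced-learn --user
!pip uninstall scipy -y
!pip install scipy --user

clear_output()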

from imblearn.over_sampling import SMOTE

help(SMOTE)

Help on class SMOTE in module imblearn.over_sampling._smote.base:

class SMOTE(BaseSMOTE)
 |  SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
 |  
 |  Class to perform over-sampling using SMOTE.
 |  
 |  This object is an implementation of SMOTE - Synthetic Minority
 |  Over-sampling Technique as presented in [1]_.
 |  
 |  Read more in the :ref:`User Guide <smote_adasyn>`.
 |  
 |  Parameters
 |  ----------
 |  sampling_strategy : float, str, dict or callable, default='auto'
 |      Sampling information to resample the data set.
 |  
 |      - When ``float``, it corresponds to the desired ratio of the number of
 |        samples in the minority class over the number of samples in the
 |        majority class after resampling. Therefore, the ratio is expressed as
 |        :math:`alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
 |        number of samples in the minority class after resampling and
 |        :math:`N_{M}` is the number of samples in the majority class.
 |  
 |          .. warning::
 |             ``float`` is only available for **binary** classification. An
 |             error is raised for multi-class classification.
 |  
 |      - When ``str``, specify the class targeted by the resampling. The
 |        number of samples in the different classes will be equalized.
 |        Possible choices are:
 |  
 |          ``'minority'``: resample only the minority class;
 |  
 |          ``'not minority'``: resample all classes but the minority class;
 |  
 |          ``'not majority'``: resample all classes but the majority class;
 |  
 |          ``'all'``: resample all classes;
 |  
 |          ``'auto'``: equivalent to ``'not majority'``.
 |  
 |      - When ``dict``, the keys correspond to the targeted classes. The
 |        values correspond to the desired number of samples for each targeted
 |        class.
 |  
 |      - When callable, function taking ``y`` and returns a ``dict``. The keys
 |        correspond to the targeted classes. The values correspond to the
 |        desired number of samples for each class.
 |  ...

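sampling_strategy accepts the following string values:

'minority': resample only the minority class
'not minority': resample all classes except the minority class
'not majority': resample all classes except the majority class
'all': resample all classes
'auto': equivalent to 'not majority'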
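For intuition about what the oversampler produces: SMOTE builds each synthetic row by picking a minority sample, choosing one of its k nearest minority neighbours, and interpolating between the two. The sketch below is a simplified NumPy illustration of that idea, not imbalanced-learn's implementation; the function name `smote_like` and the toy points are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)             # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base
    gap = rng.random((n_new, 1))                                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
synthetic = smote_like(minority, n_new=5)
print(synthetic)
```

Because most features in this dataset are binary codes (1 or 2), interpolated values can fall between the codes. imbalanced-learn also provides the `SMOTEN` and `SMOTENC` variants for categorical features.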

from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)

sns.countplot(x=y)

III. Model Training and Evaluation

1. Splitting the dataset

from sklearn.model_selection import train_test_split,cross_val_score

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=2023)
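One caveat about ordering: in the steps above the data is oversampled before it is split, so synthetic rows derived from what later becomes test data can end up in the training set. A common alternative is to split first and resample only the training portion. The sketch below shows that ordering on synthetic data. Plain random duplication stands in for SMOTE so the example needs only NumPy and scikit-learn, and all names and numbers in it are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2023)
X = rng.normal(size=(200, 4))
y = np.array([1] * 160 + [0] * 40)     # imbalanced toy labels

# 1) Split first; stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2023
)

# 2) Oversample the minority class in the training part only
#    (random duplication here; SMOTE would be applied at this same point)
minority_idx = np.flatnonzero(y_train == 0)
n_extra = (y_train == 1).sum() - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
X_train_bal = np.vstack([X_train, X_train[extra]])
y_train_bal = np.concatenate([y_train, y_train[extra]])

print("train counts:", np.bincount(y_train_bal))
print("test counts: ", np.bincount(y_test))   # test set keeps the original imbalance
```

With this ordering the test set contains only real rows at their natural class ratio, so the reported scores reflect how the model would behave on unseen data.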

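2. Standardizing the data

The StandardScaler class is imported and a scaler object is created. Its fit method estimates the parameters μ (sample mean) and σ (standard deviation) of each feature from the training data. The transform method then uses those estimates to standardize both the training and the test data and returns the standardized array.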
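The fit/transform split described above can be reproduced by hand. A minimal NumPy sketch on made-up numbers: the mean and standard deviation are estimated from the training rows only and then reused for the test rows. It uses the population standard deviation (`ddof=0`), which is the estimator `StandardScaler` documents.

```python
import numpy as np

X_train = np.array([[60.0, 1.0], [70.0, 2.0], [50.0, 1.0], [80.0, 2.0]])
X_test = np.array([[65.0, 2.0]])

# "fit": estimate per-feature mean and standard deviation from the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0, ddof=0)

# "transform": apply z = (x - mu) / sigma to both sets with the same parameters
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

Reusing the training statistics for the test rows is what keeps information about the test set out of the preprocessing step.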

from sklearn.preprocessing import StandardScaler

help(StandardScaler)

Help on class StandardScaler in module sklearn.preprocessing._data:

class StandardScaler(sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  StandardScaler(*, copy=True, with_mean=True, with_std=True)
 |  
 |  Standardize features by removing the mean and scaling to unit variance.
 |  
 |  The standard score of a sample `x` is calculated as:
 |  
 |      z = (x - u) / s
 |  
 |  where `u` is the mean of the training samples or zero if `with_mean=False`,
 |  and `s` is the standard deviation of the training samples or one if
 |  `with_std=False`.
 |  
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using
 |  :meth:`transform`.
 |  ...

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

print(X_train[0])

[-0.7710306   1.41036889  1.08508956  1.25031642  1.39864376  1.39096463 -0.72288062  0.93078432 -0.70710678  1.36833491 -0.73479518  1.39096463  0.88551735  1.53202723 -0.72288062]

3. Training a random forest

from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_prdrf=rf.predict(X_test)

4. Evaluating the model

from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(y_test,y_prdrf))
cvs_rf=round(cross_val_score(rf,X,y,scoring="accuracy",cv=10).mean(),2)
print("Cross validation score for Random Forest Classifier model is:",cvs_rf)

              precision    recall  f1-score   support

           0       0.95      0.99      0.97        79
           1       0.98      0.93      0.95        56

    accuracy                           0.96       135
   macro avg       0.97      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135

Cross validation score for Random Forest Classifier model is: 0.96
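The columns of the report follow directly from confusion-matrix counts. The sketch below recomputes them in plain Python. The four counts are hypothetical: they were chosen to be consistent with the supports (79 and 56) and the rounded scores printed above, not read from the author's run.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts with class 1 (lung cancer) as the positive class
tp, fn = 52, 4    # actual 1: predicted 1 / predicted 0
tn, fp = 78, 1    # actual 0: predicted 0 / predicted 1

p1, r1, f1_1 = prf(tp, fp, fn)       # class 1
p0, r0, f1_0 = prf(tn, fn, fp)       # class 0: the roles of the counts swap
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision {p0:.2f} recall {r0:.2f} f1 {f1_0:.2f}")
print(f"class 1: precision {p1:.2f} recall {r1:.2f} f1 {f1_1:.2f}")
print(f"accuracy {accuracy:.2f}")
```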

5. Plotting the confusion matrix

sns.heatmap(confusion_matrix(y_test,y_prdrf),annot=True,cmap='viridis')
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.title("Confusion matrix- Random Forest Classifier")

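Both the test-set report (0.96 accuracy) and the 10-fold cross-validation score (0.96) indicate that the model is quite accurate on this data.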
