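I. Lung Cancer Risk Prediction
1. Background
An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data was collected from an online lung cancer prediction website.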
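2. Data description
Total number of fields: 16
Number of instances: 284 (as stated by the source; the CSV loaded below contains 309 rows)
Field information:
1. Gender: M (male), F (female)
2. Age: age of the patient
3. Smoking: YES=2, NO=1
4. Yellow fingers: YES=2, NO=1
5. Anxiety: YES=2, NO=1
6. Peer pressure: YES=2, NO=1
7. Chronic disease: YES=2, NO=1
8. Fatigue: YES=2, NO=1
9. Allergy: YES=2, NO=1
10. Wheezing: YES=2, NO=1
11. Alcohol consumption: YES=2, NO=1
12. Coughing: YES=2, NO=1
13. Shortness of breath: YES=2, NO=1
14. Swallowing difficulty: YES=2, NO=1
15. Chest pain: YES=2, NO=1
16. Lung cancer: YES, NO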
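The coding scheme above can be checked mechanically once the file is loaded. The following is a minimal sketch on a small made-up frame; the rows and the helper name `check_coding` are illustrative assumptions, not part of the dataset or of any library. It verifies that symptom columns contain only 1 or 2 and then recodes them to the more conventional 0/1.

```python
import pandas as pd

# Toy rows that follow the documented coding (values invented for illustration).
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "AGE": [69, 59, 63],
    "SMOKING": [1, 1, 2],
    "COUGHING": [2, 2, 1],
    "LUNG_CANCER": ["YES", "NO", "NO"],
})

def check_coding(frame, symptom_cols):
    """Return the symptom columns that contain values other than 1 or 2."""
    return [c for c in symptom_cols if not frame[c].isin([1, 2]).all()]

symptoms = ["SMOKING", "COUGHING"]
bad_columns = check_coding(toy, symptoms)

# Recode NO=1 / YES=2 to 0 / 1 on a copy, leaving the original untouched.
recoded = toy.copy()
recoded[symptoms] = recoded[symptoms] - 1

print(bad_columns)
print(recoded[symptoms].to_dict("list"))
```

Recoding to 0/1 is optional for tree models such as the random forest used later, but it can make correlation tables and plots easier to read.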
3. Data source
II. Data Processing
1. Reading the data
import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()
| | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES |
1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES |
2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO |
3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO |
4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GENDER 309 non-null object
1 AGE 309 non-null int64
2 SMOKING 309 non-null int64
3 YELLOW_FINGERS 309 non-null int64
4 ANXIETY 309 non-null int64
5 PEER_PRESSURE 309 non-null int64
6 CHRONIC DISEASE 309 non-null int64
7 FATIGUE 309 non-null int64
8 ALLERGY 309 non-null int64
9 WHEEZING 309 non-null int64
10 ALCOHOL CONSUMING 309 non-null int64
11 COUGHING 309 non-null int64
12 SHORTNESS OF BREATH 309 non-null int64
13 SWALLOWING DIFFICULTY 309 non-null int64
14 CHEST PAIN 309 non-null int64
15 LUNG_CANCER 309 non-null object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB
df.isnull().sum()
GENDER 0
AGE 0
SMOKING 0
YELLOW_FINGERS 0
ANXIETY 0
PEER_PRESSURE 0
CHRONIC DISEASE 0
FATIGUE 0
ALLERGY 0
WHEEZING 0
ALCOHOL CONSUMING 0
COUGHING 0
SHORTNESS OF BREATH 0
SWALLOWING DIFFICULTY 0
CHEST PAIN 0
LUNG_CANCER 0
dtype: int64
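The dataset contains no null values.
2. Encoding categorical values
# Assign the result back instead of calling replace(inplace=True) on an attribute,
# which newer pandas versions may not apply to the original frame.
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})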
import matplotlib.pyplot as plt
%matplotlib inline
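3. Inspecting the data distribution
figure, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 16))
for i, column in enumerate(df.columns):
    row, col = divmod(i, 4)  # position of this column in the 4x4 grid
    # Bar chart of how often each value occurs in the column
    df[column].value_counts().plot(ax=axes[row][col], kind='bar', title=f"{column} value counts")
The plots show that samples with lung cancer far outnumber those without, while the other features are fairly balanced.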
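To put a number on the imbalance, the short sketch below uses the class counts that `y.value_counts()` reports later in this notebook (270 positive, 39 negative). The variable names are only illustrative.

```python
# Class counts as reported by y.value_counts() in this notebook.
counts = {"YES": 270, "NO": 39}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

With roughly seven positives per negative, a model that always predicts "YES" would already reach about 87% accuracy, which is why the data is rebalanced before training.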
4. Smoking and lung cancer
smoke_yes=df.loc[df.SMOKING==2,["SMOKING","LUNG_CANCER"]]
smoke_no=df.loc[df.SMOKING==1,["SMOKING","LUNG_CANCER"]]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(16,8))
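# value_counts() sorts by frequency, so reindex to [1, 0] to keep the slices aligned with ["YES", "NO"].
ax1.pie(smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0),labels=["YES","NO"],colors=["yellow","green"],autopct='%1.1f%%',shadow=True,)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0),labels=["YES","NO"],colors=["red","green"],autopct='%1.1f%%',shadow=True,)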
ax2.set_title("Lung Cancer & Smoking_NO")
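The same comparison can be read off as numbers rather than pie slices. This is a minimal sketch with `pandas.crosstab` on invented toy data that reuses the notebook's coding (SMOKING 2 = yes, 1 = no; LUNG_CANCER 1 = yes, 0 = no). The values are made up and say nothing about the real dataset.

```python
import pandas as pd

# Invented toy data using the notebook's coding.
toy = pd.DataFrame({
    "SMOKING":     [2, 2, 2, 2, 1, 1, 1, 1],
    "LUNG_CANCER": [1, 1, 1, 0, 1, 1, 0, 0],
})

# Row-normalised contingency table: each row (smoking group) sums to 1.
rates = pd.crosstab(toy["SMOKING"], toy["LUNG_CANCER"], normalize="index")
print(rates)
```

On the real frame, `pd.crosstab(df["SMOKING"], df["LUNG_CANCER"], normalize="index")` would give the fractions that the two pie charts display.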
5. Allergy, coughing, alcohol, swallowing difficulty, wheezing, and chest pain vs. lung cancer
import seaborn as sns
# Pass x= explicitly: recent seaborn versions treat the first positional argument as `data`.
# Note the trailing space in the "ALLERGY " column name.
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(x=df.LUNG_CANCER,hue=df["ALLERGY "],ax=ax1,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df.COUGHING,ax=ax2,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df["ALCOHOL CONSUMING"],ax=ax3,palette=['green', 'black'])
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(x=df.LUNG_CANCER,hue=df["SWALLOWING DIFFICULTY"],ax=ax1,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df.WHEEZING,ax=ax2,palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER,hue=df["CHEST PAIN"],ax=ax3,palette=['green', 'black'])
6. Correlation heatmap
import seaborn as sns
plt.figure(figsize=(16,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=0, vmax=1)
The heatmap suggests that gender, age, and smoking are only weakly correlated with lung cancer in this dataset.
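A heatmap is easy to misread when there are 16 columns, so it can help to list the correlations with the target directly. The sketch below does this on a small invented frame (column values are made up for illustration). For two binary columns, the Pearson coefficient that `DataFrame.corr()` computes by default is the same as the phi coefficient.

```python
import pandas as pd

# Invented toy data; ALLERGY is deliberately aligned with the target here.
toy = pd.DataFrame({
    "AGE":         [69, 74, 59, 63, 63, 75],
    "ALLERGY":     [2, 2, 1, 1, 1, 2],
    "LUNG_CANCER": [1, 1, 0, 0, 0, 1],
})

# Correlation of every feature with the target, strongest (by absolute value) first.
target_corr = (
    toy.corr()["LUNG_CANCER"]
    .drop("LUNG_CANCER")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Applied to the real frame, the same expression with `df` in place of `toy` gives a ranked list that can be compared against the heatmap.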
7. Building X and y
# Build the feature matrix X and the target vector y
X=df.drop(columns=["LUNG_CANCER"])
y=df["LUNG_CANCER"]
y.value_counts()
1 270
0 39
Name: LUNG_CANCER, dtype: int64
sns.countplot(x=y)
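8. Balancing the data
After installing the packages, restart the kernel so that they take effect; otherwise the import raises an error. The installation steps are: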
from IPython.display import clear_output
!pip install imblearn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()
from imblearn.over_sampling import SMOTE
help(SMOTE)
Help on class SMOTE in module imblearn.over_sampling._smote.base:
class SMOTE(BaseSMOTE)
| SMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
|
| Class to perform over-sampling using SMOTE.
|
| This object is an implementation of SMOTE - Synthetic Minority
| Over-sampling Technique as presented in [1]_.
|
| Read more in the :ref:`User Guide <smote_adasyn>`.
|
| Parameters
| ----------
| sampling_strategy : float, str, dict or callable, default='auto'
| Sampling information to resample the data set.
|
| - When ``float``, it corresponds to the desired ratio of the number of
| samples in the minority class over the number of samples in the
| majority class after resampling. Therefore, the ratio is expressed as
| :math:`alpha_{os} = N_{rm} / N_{M}` where :math:`N_{rm}` is the
| number of samples in the minority class after resampling and
| :math:`N_{M}` is the number of samples in the majority class.
|
| .. warning::
| ``float`` is only available for **binary** classification. An
| error is raised for multi-class classification.
|
| - When ``str``, specify the class targeted by the resampling. The
| number of samples in the different classes will be equalized.
| Possible choices are:
|
| ``'minority'``: resample only the minority class;
|
| ``'not minority'``: resample all classes but the minority class;
|
| ``'not majority'``: resample all classes but the majority class;
|
| ``'all'``: resample all classes;
|
| ``'auto'``: equivalent to ``'not majority'``.
|
| - When ``dict``, the keys correspond to the targeted classes. The
| values correspond to the desired number of samples for each targeted
| class.
|
| - When callable, function taking ``y`` and returns a ``dict``. The keys
| correspond to the targeted classes. The values correspond to the
| desired number of samples for each class.
|
| random_state : int, RandomState instance, default=None
| Control the randomization of the algorithm.
|
| - If int, ``random_state`` is the seed used by the random number
| generator;
| - If ``RandomState`` instance, random_state is the random number
| generator;
| - If ``None``, the random number generator is the ``RandomState``
| instance used by ``np.random``.
|
| k_neighbors : int or object, default=5
| The nearest neighbors used to define the neighborhood of samples to use
| to generate the synthetic samples. You can pass:
|
| - an `int` corresponding to the number of neighbors to use. A
| `~sklearn.neighbors.NearestNeighbors` instance will be fitted in this
| case.
| - an instance of a compatible nearest neighbors algorithm that should
| implement both methods `kneighbors` and `kneighbors_graph`. For
| instance, it could correspond to a
| :class:`~sklearn.neighbors.NearestNeighbors` but could be extended to
| any compatible class.
|
| n_jobs : int, default=None
| Number of CPU cores used during the cross-validation loop.
| ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
| ``-1`` means using all processors. See
| `Glossary <https://scikit-learn.org/stable/glossary.html#term-n-jobs>`_
| for more details.
|
| .. deprecated:: 0.10
| `n_jobs` has been deprecated in 0.10 and will be removed in 0.12.
| It was previously used to set `n_jobs` of nearest neighbors
| algorithm. From now on, you can pass an estimator where `n_jobs` is
| already set instead.
|
| Attributes
| ----------
| sampling_strategy_ : dict
| Dictionary containing the information to sample the dataset. The keys
| corresponds to the class labels from which to sample and the values
| are the number of samples to sample.
|
| nn_k_ : estimator object
| Validated k-nearest neighbours created from the `k_neighbors` parameter.
|
| n_features_in_ : int
| Number of features in the input dataset.
|
| .. versionadded:: 0.9
|
| feature_names_in_ : ndarray of shape (`n_features_in_`,)
| Names of features seen during `fit`. Defined only when `X` has feature
| names that are all strings.
|
| .. versionadded:: 0.10
|
| See Also
| --------
| SMOTENC : Over-sample using SMOTE for continuous and categorical features.
|
| SMOTEN : Over-sample using the SMOTE variant specifically for categorical
| features only.
|
| BorderlineSMOTE : Over-sample using the borderline-SMOTE variant.
|
| SVMSMOTE : Over-sample using the SVM-SMOTE variant.
|
| ADASYN : Over-sample using ADASYN.
|
| KMeansSMOTE : Over-sample applying a clustering before to oversample using
| SMOTE.
|
| Notes
| -----
| See the original papers: [1]_ for more details.
|
| Supports multi-class resampling. A one-vs.-rest scheme is used as
| originally proposed in [1]_.
|
| References
| ----------
| .. [1] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer, "SMOTE:
| synthetic minority over-sampling technique," Journal of artificial
| intelligence research, 321-357, 2002.
|
| Examples
| --------
| >>> from collections import Counter
| >>> from sklearn.datasets import make_classification
| >>> from imblearn.over_sampling import SMOTE
| >>> X, y = make_classification(n_classes=2, class_sep=2,
| ... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
| ... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
| >>> print('Original dataset shape %s' % Counter(y))
| Original dataset shape Counter({1: 900, 0: 100})
| >>> sm = SMOTE(random_state=42)
| >>> X_res, y_res = sm.fit_resample(X, y)
| >>> print('Resampled dataset shape %s' % Counter(y_res))
| Resampled dataset shape Counter({0: 900, 1: 900})
|
| Method resolution order:
| SMOTE
| BaseSMOTE
| imblearn.over_sampling.base.BaseOverSampler
| imblearn.base.BaseSampler
| imblearn.base.SamplerMixin
| sklearn.base.BaseEstimator
| sklearn.base._OneToOneFeatureMixin
| imblearn.base._ParamsValidationMixin
| builtins.object
|
| Methods defined here:
|
| __init__(self, *, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None)
| Initialize self. See help(type(self)) for accurate signature.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Data and other attributes inherited from BaseSMOTE:
|
| __annotations__ = {'_parameter_constraints': <class 'dict'>}
|
| ----------------------------------------------------------------------
| Methods inherited from imblearn.base.BaseSampler:
|
| fit(self, X, y)
| Check inputs and statistics of the sampler.
|
| You should use ``fit_resample`` in all cases.
|
| Parameters
| ----------
| X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
| Data array.
|
| y : array-like of shape (n_samples,)
| Target array.
|
| Returns
| -------
| self : object
| Return the instance itself.
|
| fit_resample(self, X, y)
| Resample the dataset.
|
| Parameters
| ----------
| X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)
| Matrix containing the data which have to be sampled.
|
| y : array-like of shape (n_samples,)
| Corresponding label for each sample in X.
|
| Returns
| -------
| X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)
| The array containing the resampled data.
|
| y_resampled : array-like of shape (n_samples_new,)
| The corresponding label of `X_resampled`.
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.BaseEstimator:
|
| __getstate__(self)
|
| __repr__(self, N_CHAR_MAX=700)
| Return repr(self).
|
| __setstate__(self, state)
|
| get_params(self, deep=True)
| Get parameters for this estimator.
|
| Parameters
| ----------
| deep : bool, default=True
| If True, will return the parameters for this estimator and
| contained subobjects that are estimators.
|
| Returns
| -------
| params : dict
| Parameter names mapped to their values.
|
| set_params(self, **params)
| Set the parameters of this estimator.
|
| The method works on simple estimators as well as on nested objects
| (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
| parameters of the form ``<component>__<parameter>`` so that it's
| possible to update each component of a nested object.
|
| Parameters
| ----------
| **params : dict
| Estimator parameters.
|
| Returns
| -------
| self : estimator instance
| Estimator instance.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from sklearn.base.BaseEstimator:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base._OneToOneFeatureMixin:
|
| get_feature_names_out(self, input_features=None)
| Get output feature names for transformation.
|
| Parameters
| ----------
| input_features : array-like of str or None, default=None
| Input features.
|
| - If `input_features` is `None`, then `feature_names_in_` is
| used as feature names in. If `feature_names_in_` is not defined,
| then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
| - If `input_features` is an array-like, then `input_features` must
| match `feature_names_in_` if `feature_names_in_` is defined.
|
| Returns
| -------
| feature_names_out : ndarray of str objects
| Same as input features.
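When given as a string, sampling_strategy accepts the following values:
- 'minority': resample only the minority class
- 'not minority': resample all classes except the minority class
- 'not majority': resample all classes except the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'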
from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
sns.countplot(x=y)
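To give an intuition for what SMOTE does when it creates new minority samples, here is a simplified NumPy sketch of the core interpolation idea. It is not the imblearn implementation; the function name `smote_like_sample` and the four toy points are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-D minority-class points (corners of a unit square).
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # position along the segment, in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because new points are interpolated, features coded as 1/2 can take values in between after resampling. The help text above lists SMOTEN and SMOTENC as the variants intended for categorical features, which may be worth trying on this dataset.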
III. Model Training and Evaluation
1. Splitting the dataset
from sklearn.model_selection import train_test_split,cross_val_score
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=2023)
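In this notebook the data is oversampled before it is split, so synthetic samples derived from a point can land in the test set next to that point, which may make test scores look better than they are. One common alternative is to split first and resample only the training part. The sketch below illustrates that order on random toy data; it uses plain random oversampling via `sklearn.utils.resample` in place of SMOTE, and all names (`X_toy`, `X_tr_bal`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Random toy data with an 85/15 class imbalance (invented for illustration).
rng = np.random.default_rng(2023)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 85 + [0] * 15)

# 1) Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2023, stratify=y_toy
)

# 2) Oversample the minority class inside the training split only.
minority_mask = y_tr == 0
n_majority = int((~minority_mask).sum())
X_min_up, y_min_up = resample(
    X_tr[minority_mask], y_tr[minority_mask],
    replace=True, n_samples=n_majority, random_state=2023,
)
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority_mask], y_min_up])

print(np.bincount(y_tr_bal), np.bincount(y_te))
```

The test split keeps its original, imbalanced class ratio, so metrics computed on it reflect how the model would behave on unmodified data.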
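2. Standardizing the data
- The return value is the standardized data.
- We load the StandardScaler class and create a StandardScaler object, scaler. Its fit method estimates the parameters μ (sample mean) and σ (standard deviation) of each feature from the training data. Calling transform then standardizes both the training and the test data with the estimated μ and σ.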
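The fit/transform steps described above amount to a few lines of arithmetic. The following NumPy sketch reproduces them by hand on invented numbers: μ and σ are computed from the training rows only and then reused for the test row. The arrays are made up for illustration.

```python
import numpy as np

# Invented data: three training rows and one test row, two features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# "fit": estimate per-feature mean and standard deviation from the training data.
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # ddof=0, the biased estimator that StandardScaler uses

# "transform": apply z = (x - mu) / sigma to both splits with the same parameters.
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

print(train_std)
print(test_std)
```

Reusing the training statistics for the test data is the reason the notebook calls `fit_transform` on `X_train` but only `transform` on `X_test`.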
from sklearn.preprocessing import StandardScaler
help(StandardScaler)
Help on class StandardScaler in module sklearn.preprocessing._data:
class StandardScaler(sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
| StandardScaler(*, copy=True, with_mean=True, with_std=True)
|
| Standardize features by removing the mean and scaling to unit variance.
|
| The standard score of a sample `x` is calculated as:
|
| z = (x - u) / s
|
| where `u` is the mean of the training samples or zero if `with_mean=False`,
| and `s` is the standard deviation of the training samples or one if
| `with_std=False`.
|
| Centering and scaling happen independently on each feature by computing
| the relevant statistics on the samples in the training set. Mean and
| standard deviation are then stored to be used on later data using
| :meth:`transform`.
|
| Standardization of a dataset is a common requirement for many
| machine learning estimators: they might behave badly if the
| individual features do not more or less look like standard normally
| distributed data (e.g. Gaussian with 0 mean and unit variance).
|
| For instance many elements used in the objective function of
| a learning algorithm (such as the RBF kernel of Support Vector
| Machines or the L1 and L2 regularizers of linear models) assume that
| all features are centered around 0 and have variance in the same
| order. If a feature has a variance that is orders of magnitude larger
| that others, it might dominate the objective function and make the
| estimator unable to learn from other features correctly as expected.
|
| This scaler can also be applied to sparse CSR or CSC matrices by passing
| `with_mean=False` to avoid breaking the sparsity structure of the data.
|
| Read more in the :ref:`User Guide <preprocessing_scaler>`.
|
| Parameters
| ----------
| copy : bool, default=True
| If False, try to avoid a copy and do inplace scaling instead.
| This is not guaranteed to always work inplace; e.g. if the data is
| not a NumPy array or scipy.sparse CSR matrix, a copy may still be
| returned.
|
| with_mean : bool, default=True
| If True, center the data before scaling.
| This does not work (and will raise an exception) when attempted on
| sparse matrices, because centering them entails building a dense
| matrix which in common use cases is likely to be too large to fit in
| memory.
|
| with_std : bool, default=True
| If True, scale the data to unit variance (or equivalently,
| unit standard deviation).
|
| Attributes
| ----------
| scale_ : ndarray of shape (n_features,) or None
| Per feature relative scaling of the data to achieve zero mean and unit
| variance. Generally this is calculated using `np.sqrt(var_)`. If a
| variance is zero, we can't achieve unit variance, and the data is left
| as-is, giving a scaling factor of 1. `scale_` is equal to `None`
| when `with_std=False`.
|
| .. versionadded:: 0.17
| *scale_*
|
| mean_ : ndarray of shape (n_features,) or None
| The mean value for each feature in the training set.
| Equal to ``None`` when ``with_mean=False``.
|
| var_ : ndarray of shape (n_features,) or None
| The variance for each feature in the training set. Used to compute
| `scale_`. Equal to ``None`` when ``with_std=False``.
|
| n_features_in_ : int
| Number of features seen during :term:`fit`.
|
| .. versionadded:: 0.24
|
| feature_names_in_ : ndarray of shape (`n_features_in_`,)
| Names of features seen during :term:`fit`. Defined only when `X`
| has feature names that are all strings.
|
| .. versionadded:: 1.0
|
| n_samples_seen_ : int or ndarray of shape (n_features,)
| The number of samples processed by the estimator for each feature.
| If there are no missing samples, the ``n_samples_seen`` will be an
| integer, otherwise it will be an array of dtype int. If
| `sample_weights` are used it will be a float (if no missing data)
| or an array of dtype float that sums the weights seen so far.
| Will be reset on new calls to fit, but increments across
| ``partial_fit`` calls.
|
| See Also
| --------
| scale : Equivalent function without the estimator API.
|
| :class:`~sklearn.decomposition.PCA` : Further removes the linear
| correlation across features with 'whiten=True'.
|
| Notes
| -----
| NaNs are treated as missing values: disregarded in fit, and maintained in
| transform.
|
| We use a biased estimator for the standard deviation, equivalent to
| `numpy.std(x, ddof=0)`. Note that the choice of `ddof` is unlikely to
| affect model performance.
|
| For a comparison of the different scalers, transformers, and normalizers,
| see :ref:`examples/preprocessing/plot_all_scaling.py
| <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
|
| Examples
| --------
| >>> from sklearn.preprocessing import StandardScaler
| >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
| >>> scaler = StandardScaler()
| >>> print(scaler.fit(data))
| StandardScaler()
| >>> print(scaler.mean_)
| [0.5 0.5]
| >>> print(scaler.transform(data))
| [[-1. -1.]
| [-1. -1.]
| [ 1. 1.]
| [ 1. 1.]]
| >>> print(scaler.transform([[2, 2]]))
| [[3. 3.]]
|
| Method resolution order:
| StandardScaler
| sklearn.base._OneToOneFeatureMixin
| sklearn.base.TransformerMixin
| sklearn.base.BaseEstimator
| builtins.object
|
| Methods defined here:
|
| __init__(self, *, copy=True, with_mean=True, with_std=True)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self, X, y=None, sample_weight=None)
| Compute the mean and std to be used for later scaling.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The data used to compute the mean and standard deviation
| used for later scaling along the features axis.
|
| y : None
| Ignored.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Individual weights for each sample.
|
| .. versionadded:: 0.24
| parameter *sample_weight* support to StandardScaler.
|
| Returns
| -------
| self : object
| Fitted scaler.
|
| inverse_transform(self, X, copy=None)
| Scale back the data to the original representation.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The data used to scale along the features axis.
| copy : bool, default=None
| Copy the input X or not.
|
| Returns
| -------
| X_tr : {ndarray, sparse matrix} of shape (n_samples, n_features)
| Transformed array.
|
| partial_fit(self, X, y=None, sample_weight=None)
| Online computation of mean and std on X for later scaling.
|
| All of X is processed as a single batch. This is intended for cases
| when :meth:`fit` is not feasible due to very large number of
| `n_samples` or because X is read from a continuous stream.
|
| The algorithm for incremental mean and std is given in Equation 1.5a,b
| in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. "Algorithms
| for computing the sample variance: Analysis and recommendations."
| The American Statistician 37.3 (1983): 242-247:
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| The data used to compute the mean and standard deviation
| used for later scaling along the features axis.
|
| y : None
| Ignored.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Individual weights for each sample.
|
| .. versionadded:: 0.24
| parameter *sample_weight* support to StandardScaler.
|
| Returns
| -------
| self : object
| Fitted scaler.
|
| transform(self, X, copy=None)
| Perform standardization by centering and scaling.
|
| Parameters
| ----------
| X : {array-like, sparse matrix of shape (n_samples, n_features)
| The data used to scale along the features axis.
| copy : bool, default=None
| Copy the input X or not.
|
| Returns
| -------
| X_tr : {ndarray, sparse matrix} of shape (n_samples, n_features)
| Transformed array.
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base._OneToOneFeatureMixin:
|
| get_feature_names_out(self, input_features=None)
| Get output feature names for transformation.
|
| Parameters
| ----------
| input_features : array-like of str or None, default=None
| Input features.
|
| - If `input_features` is `None`, then `feature_names_in_` is
| used as feature names in. If `feature_names_in_` is not defined,
| then names are generated: `[x0, x1, ..., x(n_features_in_)]`.
| - If `input_features` is an array-like, then `input_features` must
| match `feature_names_in_` if `feature_names_in_` is defined.
|
| Returns
| -------
| feature_names_out : ndarray of str objects
| Same as input features.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from sklearn.base._OneToOneFeatureMixin:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.TransformerMixin:
|
| fit_transform(self, X, y=None, **fit_params)
| Fit to data, then transform it.
|
| Fits transformer to `X` and `y` with optional parameters `fit_params`
| and returns a transformed version of `X`.
|
| Parameters
| ----------
| X : array-like of shape (n_samples, n_features)
| Input samples.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
| Target values (None for unsupervised transformations).
|
| **fit_params : dict
| Additional fit parameters.
|
| Returns
| -------
| X_new : ndarray array of shape (n_samples, n_features_new)
| Transformed array.
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.BaseEstimator:
|
| __getstate__(self)
|
| __repr__(self, N_CHAR_MAX=700)
| Return repr(self).
|
| __setstate__(self, state)
|
| get_params(self, deep=True)
| Get parameters for this estimator.
|
| Parameters
| ----------
| deep : bool, default=True
| If True, will return the parameters for this estimator and
| contained subobjects that are estimators.
|
| Returns
| -------
| params : dict
| Parameter names mapped to their values.
|
| set_params(self, **params)
| Set the parameters of this estimator.
|
| The method works on simple estimators as well as on nested objects
| (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
| parameters of the form ``<component>__<parameter>`` so that it's
| possible to update each component of a nested object.
|
| Parameters
| ----------
| **params : dict
| Estimator parameters.
|
| Returns
| -------
| self : estimator instance
| Estimator instance.
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
print(X_train[0])
[-0.7710306 1.41036889 1.08508956 1.25031642 1.39864376 1.39096463 -0.72288062 0.93078432 -0.70710678 1.36833491 -0.73479518 1.39096463 0.88551735 1.53202723 -0.72288062]
3. Training a random forest
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_prdrf=rf.predict(X_test)
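`RandomForestClassifier()` is created here without a `random_state`, so the fitted trees, and therefore the scores, can differ slightly between runs. The sketch below, on synthetic data from `make_classification`, shows how fixing the seed makes a run repeatable and how the fitted forest exposes `feature_importances_`. The dataset and parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 carry signal.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

# A fixed random_state makes the fitted forest reproducible.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances: non-negative and normalised to sum to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]   # feature indices, most important first
print(ranking, importances.round(3))
```

On the lung cancer data, `rf.feature_importances_` paired with the column names of the feature frame would indicate which symptoms the forest relies on most.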
4. Evaluating the model
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_prdrf))
cvs_rf=round(cross_val_score(rf,X,y,scoring="accuracy",cv=10).mean(),2)
print("Cross validation score for Random Forest Classifier model is:",cvs_rf)
              precision    recall  f1-score   support
           0       0.95      0.99      0.97        79
           1       0.98      0.93      0.95        56
    accuracy                           0.96       135
   macro avg       0.97      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135
Cross validation score for Random Forest Classifier model is: 0.96
5. Plotting the confusion matrix
sns.heatmap(confusion_matrix(y_test,y_prdrf),annot=True,cmap='viridis')
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.title("Confusion matrix- Random Forest Classifier")
The confusion matrix shows that the model is quite accurate on the test set.
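The numbers in the classification report can be derived from a confusion matrix by hand. The plotted matrix is not reproduced in the text, so the counts below are an assumption: one matrix that is consistent with the supports and rounded scores in the report above. The helper `per_class_metrics` is likewise only illustrative.

```python
# Rows are true classes, columns are predicted classes (scikit-learn's convention).
# ASSUMPTION: these counts are reconstructed to match the rounded report, not read from the plot.
cm = [[78, 1],
      [4, 52]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp   # predicted k, actually another class
    fn = sum(cm[k]) - tp                              # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

Under this assumed matrix, 5 of the 135 test samples are misclassified, and 4 of those are actual cancer cases predicted as negative, which is the more costly kind of error in a screening setting.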