
Getting Started with Kaggle Competitions - House Prices - Advanced Regression Techniques (Part 3)

article 2026/5/12 22:38:50

This article continues from the previous part.

1. Data preprocessing pipelines

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# define transformers for numerical and categorical columns
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

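This code defines two preprocessing pipelines, one for numerical data and one for categorical data. A step-by-step explanation follows.

1. Goal

Preprocess the numerical and categorical features so that they can be used for modeling. This is a key step in building a machine learning pipeline.

Numerical features: handle missing values, then standardize.
Categorical features: handle missing values, then convert to a numeric format (here, one-hot encoding).

2. Preprocessing numerical features

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

Pipeline:
- Defines a sequence of processing steps that are applied to the numerical features in order.
Step breakdown:
1. ('imputer', SimpleImputer(strategy='mean')):
  - SimpleImputer fills in missing values.
  - strategy='mean' fills the missing entries of each column with that column's mean.
2. ('scaler', StandardScaler()):
  - StandardScaler standardizes the data.
  - Each feature is shifted and scaled to mean 0 and standard deviation 1, so that features on different scales become comparable. This does not change the shape of the distribution: a skewed feature stays skewed.

3. Preprocessing categorical features

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

Step breakdown:
1. ('imputer', SimpleImputer(strategy='constant', fill_value='missing')):
  - SimpleImputer fills in missing values.
  - strategy='constant' fills with a fixed value.
  - fill_value='missing' replaces missing entries with the string 'missing', which marks them explicitly as their own category.
2. ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)):
  - OneHotEncoder converts categorical features into one-hot columns.
  - handle_unknown='ignore': a category that was not seen during fit is encoded as all zeros at transform time instead of raising an error.
  - sparse_output=False: return a dense array rather than a sparse matrix. This parameter name requires scikit-learn 1.2 or later; older versions call it sparse.

4. Summary

Numerical pipeline:
- Fill missing values with the column mean.
- Standardize to mean 0 and standard deviation 1.
Categorical pipeline:
- Fill missing values with the constant 'missing'.
- One-hot encode the categories.

Use cases

Combined with ColumnTransformer, the two pipelines are applied to the numerical and the categorical columns of the dataset respectively.
They serve as the preprocessing stage in front of a model and keep the transformation consistent between training and prediction.
They reduce duplicated code and automate the preprocessing workflow.

This modular design keeps preprocessing clear, flexible, and easy to extend.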

2. Separating categorical and numerical columns in the DataFrame

# update categorical and numerical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

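This code separates the categorical columns from the numerical columns of the DataFrame so that each group can be processed differently. Details follow.

Code breakdown

1. Extracting the categorical columns

categorical_columns = df.select_dtypes(include=['object', 'category']).columns

df.select_dtypes(include=...):
- A pandas method that selects the columns with the given data types.
- include=['object', 'category'] selects columns whose dtype is object or category.
The data types:
- object: usually holds string data (for example, text).
- category: categorical data, that is, a finite set of discrete values such as 'low', 'medium', 'high'.
.columns:
- Returns the names of the selected columns as an Index object.
Result:
- categorical_columns is a pandas Index (not a Python list) holding the names of all categorical columns.

2. Extracting the numerical columns

numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

include=['int64', 'float64']:
- Selects columns whose dtype is int64 or float64.
- int64: integer data.
- float64: floating-point data.
- Columns stored with other numeric dtypes (for example int32 or float32) are not matched by this filter.
.columns:
- Returns the names of the selected columns.
Result:
- numerical_columns is a pandas Index holding the names of the selected numerical columns.

Use cases

Distinguishing column types:
- Categorical columns usually need encoding (one-hot or label encoding).
- Numerical columns may need scaling, standardization, or missing-value imputation.
Feeding the preprocessing pipelines:
- categorical_columns and numerical_columns are passed to the categorical_transformer and numerical_transformer defined earlier.
Flexible data handling:
- Selecting by dtype automates preprocessing, avoids hard-coding column names, and makes the code easier to reuse.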
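A short sketch of how the dtype filter behaves. The DataFrame below is hypothetical; it only shows that `.columns` yields an `Index` and that a filter of `['int64', 'float64']` skips other numeric dtypes, whereas `include='number'` (an alternative not used in the original code) catches them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one int32 column, one float64 column, one text column.
toy = pd.DataFrame({
    "rooms": np.array([3, 4], dtype="int32"),
    "price": np.array([1.5, 2.5], dtype="float64"),
    "zone": ["A", "B"],
})

strict = toy.select_dtypes(include=["int64", "float64"]).columns
broad = toy.select_dtypes(include="number").columns

print(type(strict).__name__)  # Index
print(list(strict))           # ['price']
print(list(broad))            # ['rooms', 'price']

# Index.drop removes a label, as done for 'SalePrice' in the article.
print(list(broad.drop("price")))  # ['rooms']
```

For the competition data the strict filter is likely sufficient, since pandas reads CSV integers and floats as int64 and float64 by default; the broader filter matters only if columns have been downcast.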

numerical_columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual','OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1','BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF','LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd','Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF','OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea','MiscVal', 'MoSold', 'YrSold', 'SalePrice', 'PropertyAge'],dtype='object')

# remove target variable from numerical columns
numerical_columns = numerical_columns.drop('SalePrice')

numerical_columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual','OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1','BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF','LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd','Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF','OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea','MiscVal', 'MoSold', 'YrSold', 'PropertyAge'],dtype='object')

3. Combining everything into one preprocessor and pipeline

# combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_columns),
    ('cat', categorical_transformer, categorical_columns)
], remainder='passthrough')

# create a pipeline with the preprocessor
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

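This code takes the numerical and categorical processing defined above and combines it into a single preprocessor and pipeline, so that the whole dataset is processed in one consistent pass. Details follow.

Code breakdown

1. `ColumnTransformer`

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_columns),
    ('cat', categorical_transformer, categorical_columns)
], remainder='passthrough')

ColumnTransformer:
- Applies different transformers to different subsets of columns.
Parameters:
- transformers:
  - A list of (name, transformer, columns) tuples, one per column group:
    - ('num', numerical_transformer, numerical_columns):
      - Applies numerical_transformer to numerical_columns.
    - ('cat', categorical_transformer, categorical_columns):
      - Applies categorical_transformer to categorical_columns.
- remainder='passthrough':
  - Columns not listed in transformers (if any) are kept unchanged and appended to the output. The default, remainder='drop', would discard them.
Result:
- preprocessor is a composite transformer that applies the matching preprocessing to the numerical and the categorical columns.

2. `Pipeline`

pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

Pipeline:
- Chains a series of processing steps.
- Steps run in order, and the output of each step is the input of the next.
Parameters:
- steps:
  - A list defining the steps of the pipeline.
  - ('preprocessor', preprocessor):
    - The only step here is the preprocessor defined above.
Result:
- pipeline is a complete preprocessing pipeline that handles both numerical and categorical columns of the input.

Workflow

The numerical columns (numerical_columns) go through:
- Missing-value imputation with SimpleImputer(strategy='mean').
- Standardization with StandardScaler().
The categorical columns (categorical_columns) go through:
- Missing-value imputation with SimpleImputer(strategy='constant', fill_value='missing').
- One-hot encoding with OneHotEncoder().
Any other columns (if present) are kept as they are, because remainder='passthrough'.

Use cases

This is the preprocessing stage of a modeling workflow; the pipeline can be extended with a model (regression or classification) as a later step.
It improves reusability and readability and can be applied to other datasets.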

4. Preprocessing the dataset

import numpy as np

# apply the pipeline to your dataset
X = df.drop('SalePrice', axis=1)
y = np.log(df['SalePrice'])  # log-transform the dependent variable
X_preprocessed = pipeline.fit_transform(X)

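This code applies the preprocessing pipeline defined above to the dataset so that it can be used for modeling. A step-by-step walkthrough follows.

Code breakdown

1. Defining features and target

X = df.drop('SalePrice', axis=1)
y = np.log(df['SalePrice'])  # log-transform the dependent variable

X:
- df.drop('SalePrice', axis=1) removes the target SalePrice and keeps only the features.
- Purpose:
  - Prepare the input feature matrix X.
y:
- np.log(df['SalePrice']) takes the natural logarithm of the target SalePrice.
- Purpose:
  - Reduce the right skew of house prices so that the target is closer to a normal distribution.
  - Models such as linear regression usually fit a less skewed target more easily, and errors on cheap and expensive houses are weighted more evenly.

2. Applying the preprocessing pipeline

X_preprocessed = pipeline.fit_transform(X)

pipeline.fit_transform(X):
- Applies the pipeline to X:
  1. Numerical columns:
    - Fill missing values with the mean.
    - Standardize to mean 0 and standard deviation 1.
  2. Categorical columns:
    - Fill missing values with 'missing'.
    - One-hot encode (each category becomes a binary column).
  3. Other columns:
    - Kept unchanged if present (because remainder='passthrough').
- Result:
  - Returns the preprocessed feature matrix X_preprocessed, which can be fed directly to a model.
What fit_transform does:
- fit:
  - Learns the statistics needed for the transformation (means, standard deviations, category sets, and so on).
- transform:
  - Applies the transformation using those learned statistics.

Note that the pipeline is fitted on the full X before any train/test split, so the imputation means, scaling statistics, and category sets also reflect rows that later end up in the test set. The effect is usually small, but fitting the preprocessing on the training rows only gives a cleaner estimate of test error.

Result

X_preprocessed is the preprocessed feature matrix, containing the standardized numerical columns and the one-hot encoded categorical columns.
y is the log-transformed target.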
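A small numeric sketch of what the log transform implies for evaluation. Because the models are trained on log prices, the RMSE they report is an error in log space (the competition's own evaluation description also scores the RMSE between the logarithms of predicted and actual prices), and predictions must be passed through `np.exp` to get prices back. The three prices and predictions below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted sale prices.
prices = np.array([100000.0, 200000.0, 400000.0])
predicted = np.array([110000.0, 180000.0, 400000.0])

y_true_log = np.log(prices)
y_pred_log = np.log(predicted)

# RMSE in log space: depends on price ratios, not absolute differences.
rmse_log = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))
print(rmse_log)

# np.exp undoes np.log, recovering prices from log-scale values.
recovered = np.exp(y_true_log)
print(np.allclose(recovered, prices))  # True
```

A log-space RMSE of roughly 0.14, as reported later in this article, therefore corresponds to typical relative errors on the order of 14 to 15 percent rather than to an error of 0.14 currency units.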

5. Defining the hyperparameter search for three models

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# define the models
models = {
    'LinearRegression': LinearRegression(),
    'RandomForest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

# define the hyperparameter grids for each model
param_grids = {
    'LinearRegression': {},
    'RandomForest': {
        'n_estimators': [100, 200, 500],
        'max_depth': [None, 10, 30],
        'min_samples_split': [2, 5, 10],
    },
    'XGBoost': {
        'n_estimators': [100, 200, 500],
        'learning_rate': [0.01, 0.1, 0.3],
        'max_depth': [3, 6, 10],
    }
}

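This code splits the data 80/20 into training and test sets and defines param_grids, a dictionary of hyperparameter grids with the candidate values for each model. A step-by-step walkthrough follows.

Code walkthrough

1. The `param_grids` variable

param_grids is a dictionary whose keys are model names and whose values are the corresponding hyperparameter grids (also dictionaries).
Purpose:
- It is used for hyperparameter search (grid search or random search) to improve model performance.

2. The hyperparameter grids

1) `LinearRegression`

'LinearRegression': {}

Explanation:
- No hyperparameters are tuned for linear regression, so the grid is the empty dictionary {}.
- Plain linear regression has few tunable parameters; regularization requires a variant such as ridge regression.

2) `RandomForest`

'RandomForest': {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 30],
    'min_samples_split': [2, 5, 10],
}

Explanation:
- The grid for the random forest model:
  1. n_estimators:
    - Number of trees in the forest.
    - Candidates: [100, 200, 500]. More trees tend to give more stable predictions at the cost of computation time.
  2. max_depth:
    - Maximum depth of each tree.
    - Candidates: [None, 10, 30]. None means the depth is unlimited.
  3. min_samples_split:
    - Minimum number of samples required to split an internal node.
    - Candidates: [2, 5, 10]. Larger values restrict tree growth and so limit overfitting.

3) `XGBoost`

'XGBoost': {
    'n_estimators': [100, 200, 500],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 6, 10],
}

Explanation:
- The grid for the XGBoost model:
  1. n_estimators:
    - Number of boosting rounds (trees).
    - Candidates: [100, 200, 500].
  2. learning_rate:
    - Shrinks the contribution of each new tree.
    - Candidates: [0.01, 0.1, 0.3]. Lower values make training more stable but need more rounds.
  3. max_depth:
    - Maximum depth of each tree.
    - Candidates: [3, 6, 10]. Deeper trees capture more complex relationships but may overfit.

Result

The code defines a hyperparameter search space for each of the three models (LinearRegression, RandomForest, XGBoost).
These grids are used in the next section to search for the best-performing combination.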
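To see how large each search is before running it, the grids can be expanded with scikit-learn's `ParameterGrid`. This is a minimal sketch that only counts combinations and trains nothing; the grids are copied from the code above, and `n_splits = 3` matches the 3-fold cross-validation used in the next section.

```python
from sklearn.model_selection import ParameterGrid

param_grids = {
    "LinearRegression": {},
    "RandomForest": {
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 10, 30],
        "min_samples_split": [2, 5, 10],
    },
    "XGBoost": {
        "n_estimators": [100, 200, 500],
        "learning_rate": [0.01, 0.1, 0.3],
        "max_depth": [3, 6, 10],
    },
}

n_splits = 3  # folds used by the grid search
counts = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}

for name, n_candidates in counts.items():
    # Each candidate is fitted once per fold.
    print(f"{name}: {n_candidates} candidates, {n_candidates * n_splits} fits")
```

An empty grid still counts as one candidate (the model with its default settings), and 3 x 3 x 3 values give 27 candidates, which agrees with the "1 candidates" and "27 candidates, totalling 81 fits" lines in the training log.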

6. Finding the best hyperparameters for each model with grid search

# 3-fold cross-validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# train and tune the models
grids = {}
for model_name, model in models.items():
    print(f'Training and tuning {model_name}...')
    grids[model_name] = GridSearchCV(estimator=model,
                                     param_grid=param_grids[model_name],
                                     cv=cv,
                                     scoring='neg_mean_squared_error',
                                     n_jobs=-1,
                                     verbose=2)
    grids[model_name].fit(X_train, y_train)
    best_params = grids[model_name].best_params_
    best_score = np.sqrt(-1 * grids[model_name].best_score_)
    print(f'Best parameters for {model_name}: {best_params}')
    print(f'Best RMSE for {model_name}: {best_score}\n')

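This code trains and tunes several models with 3-fold cross-validation. Grid search (GridSearchCV) finds the best hyperparameter combination for each model and reports its performance as RMSE. A step-by-step walkthrough follows.

Code walkthrough

1. Creating the 3-fold cross-validation object

cv = KFold(n_splits=3, shuffle=True, random_state=42)

KFold:
- Implements K-fold cross-validation. With n_splits=3 the data are split into 3 parts; each part serves once as the validation set while the other 2 form the training set.
Parameters:
- n_splits=3: 3-fold cross-validation.
- shuffle=True: shuffle the rows before splitting, which reduces bias from the original row order.
- random_state=42: fix the random seed so the split is reproducible.

2. Initializing an empty dictionary for the grid search objects

grids = {}

Purpose:
- Stores the GridSearchCV object, and therefore the search results, for each model.

3. Looping over the models

for model_name, model in models.items():
    ...

models.items():
- models is a dictionary whose keys are model names (such as 'LinearRegression', 'RandomForest') and whose values are the model objects (LinearRegression(), RandomForestRegressor(), and so on).
Purpose:
- Iterate over the dictionary and run the following steps for each model.

4. Printing the current model name

print(f'Training and tuning {model_name}...')

Purpose:
- Shows which model is currently being trained and tuned, to track progress.

5. Initializing the `GridSearchCV` object

grids[model_name] = GridSearchCV(estimator=model,
                                 param_grid=param_grids[model_name],
                                 cv=cv,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1,
                                 verbose=2)

GridSearchCV:
- Searches exhaustively over the given hyperparameter grid.
Parameters:
1. estimator=model: the model instance, for example LinearRegression() or RandomForestRegressor().
2. param_grid=param_grids[model_name]:
  - The hyperparameter grid.
  - Taken from the param_grids dictionary defined earlier, for the current model.
3. cv=cv: the cross-validation splitter, here the 3-fold KFold.
4. scoring='neg_mean_squared_error':
  - Uses the negative mean squared error as the score. scikit-learn always treats larger scores as better, so the MSE is negated.
5. n_jobs=-1: run the fits in parallel on all available CPU cores.
6. verbose=2: print progress; larger values print more detail.

6. Running the grid search

grids[model_name].fit(X_train, y_train)

Purpose:
- Trains, validates, and tunes the current model with GridSearchCV. By default (refit=True) the best combination is then refitted on the whole training set, which is what predict uses later.
- Data:
  - X_train: training features.
  - y_train: training target (the log of SalePrice).

7. Extracting the best hyperparameters and the best score

best_params = grids[model_name].best_params_
best_score = np.sqrt(-1 * grids[model_name].best_score_)

best_params:
- The best hyperparameter combination found by GridSearchCV.
best_score:
- best_score_ is the best mean negative MSE across folds (a negative number). Negating it and taking the square root gives an RMSE.

8. Printing the results

print(f'Best parameters for {model_name}: {best_params}')
print(f'Best RMSE for {model_name}: {best_score}\n')

Purpose:
- Prints the best hyperparameters and the corresponding RMSE of the current model for comparison.

Summary

Core functionality:
- Tune each model with 3-fold cross-validation.
- Evaluate with RMSE and print the best parameters and best RMSE per model.
Logic:
- Define the cross-validation scheme.
- Loop over all models and run a grid search for each.
- Keep the fitted searches for later comparison.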
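One detail of the score conversion is worth a quick check: `np.sqrt(-best_score_)` is the square root of the mean fold MSE, which is not the same number as the mean of the per-fold RMSEs (what `scoring='neg_root_mean_squared_error'` would average). The sketch below compares both on synthetic data from `make_regression`; the data and variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem, unrelated to the competition data.
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)
cv = KFold(n_splits=3, shuffle=True, random_state=42)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                          scoring="neg_mean_squared_error")
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv,
                           scoring="neg_root_mean_squared_error")

# Conversion used in the article: sqrt of the averaged MSE.
rmse_from_mean_mse = np.sqrt(-neg_mse.mean())
# Alternative: average of the per-fold RMSE values.
mean_fold_rmse = -neg_rmse.mean()

print(rmse_from_mean_mse, mean_fold_rmse)
```

The first value is never smaller than the second (the square root is concave), and the two are usually close. Either is a reasonable summary as long as the same one is used for every model being compared.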

Training and tuning LinearRegression...
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters for LinearRegression: {}
Best RMSE for LinearRegression: 5394655149.930396

Training and tuning RandomForest...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for RandomForest: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 500}
Best RMSE for RandomForest: 0.15332719681534118

Training and tuning XGBoost...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best RMSE for XGBoost: 0.13797005834667236

The LinearRegression RMSE of about 5.4e9 on a log-scale target is not a meaningful error. It most likely reflects numerical instability of unregularized least squares on the many collinear one-hot columns, where a few rows receive extreme predictions.

7. Regression with a multi-layer perceptron regressor

from sklearn.neural_network import MLPRegressor

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# create a MLPRegressor instance
mlp = MLPRegressor(random_state=42, max_iter=10000, n_iter_no_change=3,
                   learning_rate_init=0.001, verbose=True)

# define the parameter grid for tuning
# (25,) is a single hidden layer of 25 units; the original wrote (25), which is the int 25
param_grid = {
    'hidden_layer_sizes': [(10,), (10, 10), (10, 10, 10), (25,)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
}

# create the GridSearchCV object
grid_search_mlp = GridSearchCV(mlp, param_grid, scoring='neg_mean_squared_error',
                               cv=3, n_jobs=-1, verbose=1)

# fit the model on the training data
grid_search_mlp.fit(X_train_scaled, y_train)

# print the best parameters found during the search
print("Best parameters found: ", grid_search_mlp.best_params_)

# evaluate the model
best_score = np.sqrt(-1 * grid_search_mlp.best_score_)
print("Best score: ", best_score)

Best parameters found:  {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': (10, 10), 'learning_rate': 'constant', 'solver': 'adam'}
Best score:  0.22778267781613568

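This code fits an MLPRegressor (multi-layer perceptron regressor) and tunes its hyperparameters with GridSearchCV. A step-by-step explanation follows.

1. Imports and data preparation

from sklearn.neural_network import MLPRegressor

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

MLPRegressor:
- A neural network model for regression, suited to nonlinear and complex relationships.
X_train_scaled and X_test_scaled:
- Copies of the training and test data. No further scaling happens here; the names reflect that the numerical columns were already standardized by the preprocessing pipeline, which matters because neural networks are sensitive to feature scale.

2. Creating the MLPRegressor instance

mlp = MLPRegressor(random_state=42, max_iter=10000, n_iter_no_change=3,
                   learning_rate_init=0.001, verbose=True)

MLPRegressor parameters:
- random_state=42: fix the random seed for reproducible results.
- max_iter=10000: upper limit on training iterations (epochs for the adam solver).
- n_iter_no_change=3: stop when the loss has not improved by at least tol for 3 consecutive epochs. With the default early_stopping=False this is the training loss; a held-out validation score is used only when early_stopping=True.
- learning_rate_init=0.001: initial learning rate for the weight updates.
- verbose=True: print the training log, which helps to watch convergence.

3. Defining the parameter grid

param_grid = {
    'hidden_layer_sizes': [(10,), (10, 10), (10, 10, 10), (25)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
}

param_grid is the hyperparameter grid used by the grid search.
- hidden_layer_sizes:
  - The number of hidden layers and the number of neurons in each.
  - (10,) means 1 layer with 10 neurons; (10, 10) means 2 layers with 10 neurons each.
  - (25) is the plain integer 25, not a tuple; a single hidden layer of 25 neurons is more clearly written as (25,).
- activation:
  - Activation function applied to the hidden units.
  - relu: rectified linear unit, the usual default in deep learning.
  - tanh: hyperbolic tangent, with outputs in (-1, 1).
- solver:
  - The algorithm that optimizes the weights.
  - adam: an efficient stochastic gradient-based optimizer and the default choice.
- alpha:
  - Strength of the L2 penalty, which helps prevent overfitting.
  - Larger values mean stronger regularization.
- learning_rate:
  - Learning rate schedule:
    - constant: fixed learning rate.
    - invscaling: the learning rate decreases gradually as training proceeds.
    - adaptive: the learning rate is reduced when training stops improving.
  - According to the scikit-learn documentation this parameter is only used when solver='sgd'. With solver='adam', as here, the three values give identical models, so this entry triples the search time without changing the result.

4. Creating the GridSearchCV object

grid_search_mlp = GridSearchCV(mlp, param_grid, scoring='neg_mean_squared_error',
                               cv=3, n_jobs=-1, verbose=1)

GridSearchCV parameters:
- estimator=mlp: the model to tune, that is, the MLPRegressor.
- param_grid=param_grid: the grid defined above.
- scoring='neg_mean_squared_error':
  - Uses the negative mean squared error (MSE) as the score.
  - The smaller the MSE, the better the model.
- cv=3: 3-fold cross-validation (for a regressor an integer means KFold without shuffling).
- n_jobs=-1: use all available CPU cores in parallel to speed up the search.
- verbose=1: print progress of the grid search.

5. Training and tuning

grid_search_mlp.fit(X_train_scaled, y_train)

Purpose:
- Runs 3-fold cross-validation on the training set and searches for the best hyperparameter combination.

6. Printing the best hyperparameters and best score

print("Best parameters found: ", grid_search_mlp.best_params_)
best_score = np.sqrt(-1 * grid_search_mlp.best_score_)
print("Best score: ", best_score)

best_params_:

The best hyperparameter combination, for example:

Best parameters found: {'activation': 'relu', 'alpha': 0.001, ...}

best_score_:
- The best negative MSE, converted here to RMSE (root mean squared error).
- RMSE is a common error measure for regression; smaller is better.

Summary

Purpose:
- Fit a regression model with a multi-layer perceptron (MLPRegressor).
- Tune it with GridSearchCV to find the best hidden-layer structure, activation function, regularization strength, and so on.
Output:
- The best parameter combination (best_params_).
- The cross-validated RMSE of the best model (best_score).
In practice:
- Neural networks are a powerful tool for nonlinear regression with complex relationships. On this dataset, however, the cross-validated RMSE of about 0.228 is clearly worse than that of the tree-based models above.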
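Two points about the grid can be checked directly. First, the entry written as `(25)` in the original grid is just the integer 25, whereas `(25,)` is a one-element tuple. Second, scikit-learn documents `learning_rate` as used only when `solver='sgd'`, so with `adam` the schedule should make no difference. The sketch below tests this on small synthetic data; the network size and `max_iter` are arbitrary assumptions chosen to keep it fast.

```python
import warnings

import numpy as np
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Parentheses alone do not make a tuple; the trailing comma does.
print(type((25)).__name__, type((25,)).__name__)  # int tuple

X_demo, y_demo = make_regression(n_samples=80, n_features=4,
                                 noise=5.0, random_state=0)


def fit_predict(schedule):
    """Fit a small adam-trained MLP with the given learning_rate schedule."""
    mlp = MLPRegressor(hidden_layer_sizes=(5,), solver="adam",
                       learning_rate=schedule, max_iter=50, random_state=42)
    with warnings.catch_warnings():
        # 50 epochs will not converge; that is fine for this comparison.
        warnings.simplefilter("ignore", ConvergenceWarning)
        mlp.fit(X_demo, y_demo)
    return mlp.predict(X_demo)


pred_constant = fit_predict("constant")
pred_adaptive = fit_predict("adaptive")
same = np.allclose(pred_constant, pred_adaptive)
print(same)
```

If the predictions match, as the documentation implies, dropping `learning_rate` from the grid cuts the number of MLP candidates to a third with no loss.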

8. Dimensionality reduction with Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

pca = PCA()
X_pca_pre = pca.fit_transform(X_preprocessed)

# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
cumulative_explained_variance

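This code applies Principal Component Analysis (PCA) to the data and computes the cumulative share of variance explained by the principal components. A step-by-step explanation follows.

1. Importing PCA

from sklearn.decomposition import PCA

PCA:
- An unsupervised method for dimensionality reduction and feature extraction.
- It maps the original features to a new set of features (the principal components) through a linear transformation.
- The components are ordered by how much of the variance in the data they explain; the first few usually retain most of the information.

2. Creating the PCA object

pca = PCA()

PCA():
- By default PCA keeps all components, that is, min(n_samples, n_features) of them.
- The n_components parameter sets how many to keep. For example, PCA(n_components=2) keeps only the first 2 components.

3. Transforming the data

X_pca_pre = pca.fit_transform(X_preprocessed)

fit_transform:
- Computes the principal directions of the data (fit) and projects the data onto them (transform).
- Input:
  - X_preprocessed, the preprocessed data.
  - The data should be standardized first, because PCA is sensitive to feature scale.
- Output:
  - X_pca_pre, the transformed data; each column holds the values of one principal component.

4. Computing the cumulative explained variance

cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)

explained_variance_ratio_:
- The fraction of the total variance explained by each component.
- For example, [0.5, 0.3, 0.1, 0.1] means:
  - The 1st component explains 50% of the variance.
  - The 2nd component explains 30% of the variance.
  - The 3rd and 4th components explain 10% each.
np.cumsum:
- Computes the running total of those fractions.
- For the example above the cumulative values are [0.5, 0.8, 0.9, 1.0], meaning:
  - The first component explains 50% of the total variance.
  - The first 2 components explain 80% of the total variance.
  - The first 3 components explain 90% of the total variance.

5. Displaying the cumulative explained variance

cumulative_explained_variance

The output is an array with the cumulative explained variance ratio after each component.
Example output:
```
array([0.5, 0.75, 0.9, 1.0])
```
- The first component explains 50% of the variance.
- The first 2 components together explain 75% of the variance.
- The first 3 components together explain 90% of the variance.
- All components together explain 100% of the variance.

Summary

This code:

Decomposes the data with PCA and extracts the principal components.
Computes the cumulative explained variance, which helps decide how many components to keep.
Lays the groundwork for the dimensionality reduction and modeling that follow.

The actual output for this dataset is: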
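The relationship between `explained_variance_ratio_` and its cumulative sum can be tried on a small synthetic matrix. This is a sketch under assumed data: six independent Gaussian columns with deliberately different scales, so that the leading components dominate.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 6 columns with decreasing spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca = PCA()  # keep all 6 components
X_demo_pca = pca.fit_transform(X_demo)

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print(X_demo_pca.shape)        # (200, 6)
print(np.round(ratios, 3))     # sorted from largest to smallest
print(np.round(cumulative, 3)) # non-decreasing, ends at 1.0
```

Because the ratios are sorted in decreasing order, the cumulative curve rises quickly at first and then flattens, which is the pattern visible in the long array printed for the competition data.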

array([0.17243282, 0.24388356, 0.30683835, 0.34948682, 0.38080478,0.40728548, 0.43229814, 0.4553012 , 0.47784164, 0.49974638,0.52099392, 0.54160949, 0.56158708, 0.58125935, 0.60053254,0.61902405, 0.63641873, 0.65333506, 0.66977935, 0.68565874,0.70081592, 0.71479989, 0.72797553, 0.74013759, 0.75202852,0.76347094, 0.77276728, 0.78133766, 0.78941649, 0.7971427 ,0.80366548, 0.80983832, 0.8158552 , 0.82171262, 0.82712973,0.83247664, 0.8377077 , 0.84289461, 0.8479119 , 0.85243286,0.85687566, 0.86114926, 0.86530002, 0.86933408, 0.87314688,0.87683267, 0.88034602, 0.88383026, 0.88714117, 0.89039485,0.89353886, 0.896602  , 0.89953279, 0.90237039, 0.90520265,0.90797676, 0.91053189, 0.91306058, 0.91547626, 0.91779434,0.92005817, 0.92227619, 0.92438524, 0.92642266, 0.9283981 ,0.9303439 , 0.93223214, 0.93402124, 0.93575219, 0.93747069,0.93914422, 0.94078272, 0.94234897, 0.94387349, 0.94538287,0.94682297, 0.94823643, 0.94959758, 0.95093592, 0.95224158,0.95350946, 0.95472786, 0.95589638, 0.95704969, 0.95817143,0.95926274, 0.9603086 , 0.96132907, 0.96233372, 0.96330337,0.96425211, 0.96518196, 0.96608827, 0.96696542, 0.96781578,0.96861418, 0.96940699, 0.97018496, 0.97094682, 0.97167701,0.97240163, 0.97309352, 0.97378175, 0.97444138, 0.97508567,0.97571937, 0.97634927, 0.97696691, 0.97756939, 0.97815862,0.97873212, 0.97929725, 0.97984866, 0.98039201, 0.98092683,0.98145566, 0.98196764, 0.9824595 , 0.98293804, 0.98341039,0.98387423, 0.98433478, 0.98478139, 0.98521562, 0.98564182,0.98605927, 0.98646022, 0.98685517, 0.98723933, 0.98761082,0.98797424, 0.9883316 , 0.98868368, 0.98902194, 0.98935108,0.98967198, 0.98998727, 0.99029192, 0.99058997, 0.99087116,0.99115127, 0.99141999, 0.99168104, 0.99193879, 0.99218769,0.99242641, 0.9926581 , 0.99287949, 0.99309632, 0.99331018,0.99351827, 0.99372393, 0.99392369, 0.99411309, 0.99430108,0.99448488, 0.99466067, 0.99482919, 0.99499738, 0.99516156,0.99531813, 0.99547122, 0.99562218, 0.99576942, 0.99591437,0.99605398, 0.9961902 , 0.99632357, 
0.99645277, 0.99657876,0.99670183, 0.99682237, 0.9969369 , 0.99704846, 0.9971582 ,0.99726432, 0.99736721, 0.99746831, 0.99756467, 0.99765843,0.99775056, 0.99784042, 0.99792818, 0.99801267, 0.99809567,0.99817047, 0.99824177, 0.99831226, 0.99837856, 0.99844265,0.99850534, 0.99856619, 0.99862442, 0.99868095, 0.99873669,0.99879126, 0.99884481, 0.99889584, 0.99894505, 0.99899145,0.99903706, 0.99908194, 0.99912437, 0.99916554, 0.99920626,0.9992444 , 0.99928079, 0.99931607, 0.9993509 , 0.9993834 ,0.99941427, 0.99944282, 0.99947029, 0.99949603, 0.9995212 ,0.99954584, 0.99957006, 0.9995939 , 0.9996169 , 0.99963857,0.9996596 , 0.99967979, 0.99969977, 0.9997183 , 0.9997365 ,0.99975451, 0.99977103, 0.99978734, 0.99980207, 0.99981655,0.9998307 , 0.9998434 , 0.99985601, 0.99986807, 0.99987946,0.99989058, 0.9999016 , 0.99991223, 0.9999227 , 0.99993297,0.99994215, 0.99995048, 0.99995846, 0.99996558, 0.99997209,0.99997818, 0.99998391, 0.99998814, 0.99999217, 0.999996  ,0.9999989 , 0.99999962, 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ,1.        , 1.        , 1.        , 1.        , 1.        ])

9. Choosing the number of components with a cumulative explained variance threshold (95%)

# choose the number of components based on the explained variance threshold
n_components = np.argmax(cumulative_explained_variance >= 0.95) + 1

pca = PCA(n_components=n_components)
pipeline_pca = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', pca)
])

X_pca = pipeline_pca.fit_transform(X)

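This code chooses the number of principal components from a cumulative explained variance threshold (95%) and then adds PCA to a pipeline, so that dimensionality reduction runs directly after preprocessing. A step-by-step explanation follows.

1. Choosing the number of components from the explained variance

n_components = np.argmax(cumulative_explained_variance >= 0.95) + 1

Purpose:
- Determine automatically how many components are needed for the cumulative explained variance to reach at least 95%.
Steps:
1. cumulative_explained_variance >= 0.95:
  - Compares each cumulative value with 0.95 (95%).
  - The result is a boolean array such as [False, False, True, True, ...].
2. np.argmax():
  - Returns the index of the first True, that is, the position of the first component at which 95% is reached.
3. +1:
  - Converts the index into a count, because indices start at 0 while component counts start at 1.

Result:

n_components holds the number of components to keep. For example:

cumulative_explained_variance = [0.4, 0.7, 0.95, 1.0]
n_components = np.argmax([False, False, True, True]) + 1  # n_components = 3

With the array printed in the previous section, the first value at or above 0.95 is the 79th (0.95093592), so n_components is 79 here.

2. Creating the PCA instance

pca = PCA(n_components=n_components)

Parameters:
- n_components=n_components: the number of components to keep, chosen so that the cumulative explained variance is at least 95%.

3. Creating a pipeline that includes PCA

pipeline_pca = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', pca)
])

Pipeline:
- Chains several processing steps so that they can be run together.
- Steps:
  1. ('preprocessor', preprocessor):
    - Preprocesses the raw data (standardizing numerical columns, encoding categorical columns, and so on).
    - preprocessor is the ColumnTransformer defined earlier.
  2. ('pca', pca):
    - Applies PCA to reduce the number of features.
Advantage:
- The pipeline guarantees that PCA runs right after preprocessing, which simplifies the code and avoids repeated work.

4. Processing the data with the pipeline

X_pca = pipeline_pca.fit_transform(X)

fit_transform(X):
- Runs all steps of the pipeline:
  1. preprocessor.fit_transform(X):
    - Preprocesses the data (missing-value imputation, standardization, categorical encoding, and so on).
  2. pca.fit_transform(preprocessed_data):
    - Runs PCA on the preprocessed data and projects it onto the retained components.
Result:
- X_pca is the reduced data; its number of columns drops from the number of preprocessed features to n_components.

Summary

This code:

Determines dynamically how many components to keep so that the cumulative explained variance reaches 95%.
Builds a pipeline with preprocessing and PCA for one-step processing and dimensionality reduction.
Transforms the raw data with that pipeline and returns the reduced feature matrix X_pca.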
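The `np.argmax` selection can be checked on the small example array from the text, and compared with scikit-learn's built-in shortcut: passing a float between 0 and 1 as `n_components` (with `svd_solver='full'`) asks PCA to pick the number of components itself. The synthetic matrix below is an assumption for illustration, and the shortcut is an alternative, not what the article's code does.

```python
import numpy as np
from sklearn.decomposition import PCA

# The worked example from the text.
cumulative_example = np.array([0.4, 0.7, 0.95, 1.0])
n_example = int(np.argmax(cumulative_example >= 0.95) + 1)
print(n_example)  # 3

# Synthetic data with clearly separated component variances.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

# Manual route: fit all components, then threshold the cumulative curve.
full = PCA().fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95) + 1)

# Built-in route: let PCA choose the count for a 95% variance target.
auto = PCA(n_components=0.95, svd_solver="full").fit(X_demo)

print(n_manual, auto.n_components_)
```

The two routes should agree except in the edge case where a cumulative value equals the threshold exactly. The built-in form has the practical advantage that the count is re-derived whenever the pipeline is refitted, for instance inside cross-validation.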

10. Repeating the grid search for each model on the PCA-reduced data

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# define the models
models = {
    'LinearRegression': LinearRegression(),
    'RandomForest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42)
}

# define the hyperparameter grids for each model
param_grids = {
    'LinearRegression': {},
    'RandomForest': {
        'n_estimators': [100, 200, 500],
        'max_depth': [None, 10, 30],
        'min_samples_split': [2, 5, 10],
    },
    'XGBoost': {
        'n_estimators': [100, 200, 500],
        'learning_rate': [0.01, 0.1, 0.3],
        'max_depth': [3, 6, 10],
    }
}

# 3-fold cross-validation
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# train and tune the models
grids_pca = {}
for model_name, model in models.items():
    print(f'Training and tuning {model_name}...')
    grids_pca[model_name] = GridSearchCV(estimator=model,
                                         param_grid=param_grids[model_name],
                                         cv=cv,
                                         scoring='neg_mean_squared_error',
                                         n_jobs=-1,
                                         verbose=2)
    grids_pca[model_name].fit(X_train_pca, y_train_pca)
    best_params = grids_pca[model_name].best_params_
    best_score = np.sqrt(-1 * grids_pca[model_name].best_score_)
    print(f'Best parameters for {model_name}: {best_params}')
    print(f'Best RMSE for {model_name}: {best_score}\n')

Training and tuning LinearRegression...
Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best parameters for LinearRegression: {}
Best RMSE for LinearRegression: 0.16377046612265703

Training and tuning RandomForest...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for RandomForest: {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 500}
Best RMSE for RandomForest: 0.15205832992542345

Training and tuning XGBoost...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best parameters for XGBoost: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
Best RMSE for XGBoost: 0.13770460992099054

11. Repeating the multi-layer perceptron regression on the PCA-reduced data

X_train_scaled_pca = X_train_pca.copy()
X_test_scaled_pca = X_test_pca.copy()

# create a MLPRegressor instance
mlp = MLPRegressor(random_state=42, max_iter=10000, n_iter_no_change=3,
                   learning_rate_init=0.001, verbose=True)

# define the parameter grid for tuning
# (25,) is a single hidden layer of 25 units; the original wrote (25), which is the int 25
param_grid = {
    'hidden_layer_sizes': [(10,), (10, 10), (10, 10, 10), (25,)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.001, 0.01, .1, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
}

# create the GridSearchCV object
grid_search_mlp_pca = GridSearchCV(mlp, param_grid, scoring='neg_mean_squared_error',
                                   cv=3, n_jobs=-1, verbose=1)

# fit the model on the training data
# (y_train equals y_train_pca: both splits use the same y and random_state)
grid_search_mlp_pca.fit(X_train_scaled_pca, y_train)

# print the best parameters found during the search
print("Best parameters found: ", grid_search_mlp_pca.best_params_)

# evaluate the model
best_score = np.sqrt(-1 * grid_search_mlp_pca.best_score_)
print("Best score: ", best_score)

Best parameters found:  {'activation': 'tanh', 'alpha': 1, 'hidden_layer_sizes': (10, 10), 'learning_rate': 'constant', 'solver': 'adam'}
Best score:  0.20141574193541442

12. Testing on the held-out test data (models trained without and with PCA reduction)

from sklearn.metrics import mean_squared_error

grids.keys()

dict_keys(['LinearRegression', 'RandomForest', 'XGBoost'])

# test RMSE of the models trained without PCA
for i in grids.keys():
    print(i + ': ' + str(np.sqrt(mean_squared_error(y_test, grids[i].predict(X_test)))))

LinearRegression: 1240663156.677283
RandomForest: 0.1468371013683763
XGBoost: 0.13519734581188902

grids_pca.keys()

dict_keys(['LinearRegression', 'RandomForest', 'XGBoost'])

# test RMSE of the models trained on the PCA-reduced data
for i in grids_pca.keys():
    print(i + ': ' + str(np.sqrt(mean_squared_error(y_test, grids_pca[i].predict(X_test_pca)))))

LinearRegression: 0.1426615311522792
RandomForest: 0.15286112373105223
XGBoost: 0.14149572462432053

# test RMSE of the MLP trained without PCA
print(str(np.sqrt(mean_squared_error(y_test, grid_search_mlp.predict(X_test_scaled)))))

0.1668391027224058

# test RMSE of the MLP trained on the PCA-reduced data
print(str(np.sqrt(mean_squared_error(y_test, grid_search_mlp_pca.predict(X_test_scaled_pca)))))

0.1638504108133665

To be continued in the next part.
