Lesson 10.1 Hyperparameter Optimization, the Theoretical Limits of Exhaustive Grid Search, and Randomized Grid Search with RandomizedSearchCV

2023-09-03 09:35


import re
import time

import numpy as np
import pandas as pd
import sklearn
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

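I. Hyperparameter Optimization and the Theoretical Limits of Exhaustive Grid Search

1. Hyperparameter optimization (HPO)

2. Theoretical limits and drawbacks of grid search

3. Building a benchmark: exhaustive grid search results on a random forest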

from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import cross_validate, KFold, GridSearchCV

data = pd.read_csv(r"D:\Pythonwork\2021ML\PART 2 Ensembles\datasets\House Price\train_encode.csv", index_col=0)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

X.shape
# (1460, 80)
X.head()
y.describe()  # RMSE

# Parameter space
# Note: np.arange(0, 5, 10) yields only [0], so min_impurity_decrease has a single candidate.
# Note: max_features="auto" has been deprecated since scikit-learn 1.1; newer releases may reject it.
param_grid_simple = {"criterion": ["squared_error", "poisson"],
                     "n_estimators": [*range(20, 100, 5)],
                     "max_depth": [*range(10, 25, 2)],
                     "max_features": ["log2", "sqrt", 16, 32, 64, "auto"],
                     "min_impurity_decrease": [*np.arange(0, 5, 10)]}

# Size of the parameter space
2 * len([*range(20, 100, 5)]) * len([*range(10, 25, 2)]) * len(["log2", "sqrt", 16, 32, 64, "auto"]) * len([*np.arange(0, 5, 10)])
# 1536

# The same count, computed with a loop
no_option = 1
for i in param_grid_simple:
    no_option *= len(param_grid_simple[i])
no_option
# 1536

# Model, cross-validation, grid search
reg = RFR(random_state=1412, verbose=True, n_jobs=-1)
cv = KFold(n_splits=5, shuffle=True, random_state=1412)
search = GridSearchCV(estimator=reg,
                      param_grid=param_grid_simple,
                      scoring="neg_mean_squared_error",
                      verbose=True,
                      cv=cv,
                      n_jobs=-1)

# ===== TIME WARNING: 7 mins =====
start = time.time()
search.fit(X, y)
print(time.time() - start)
# Fitting 5 folds for each of 1536 candidates, totalling 7680 fits
# 381.6039867401123

381.6039 / 60
# 6.3600650000000005

search.best_estimator_
# RandomForestRegressor(max_depth=23, max_features=16, min_impurity_decrease=0,
#                       n_estimators=85, n_jobs=-1, random_state=1412,
#                       verbose=True)

abs(search.best_score_)**0.5
# 29179.698261599166

# Rebuild the model with the best parameters and check the result
ad_reg = RFR(n_estimators=85, max_depth=23, max_features=16, random_state=1412)
cv = KFold(n_splits=5, shuffle=True, random_state=1412)
result_post_adjusted = cross_validate(ad_reg, X, y, cv=cv, scoring="neg_mean_squared_error",
                                      return_train_score=True,
                                      verbose=True,
                                      n_jobs=-1)

def RMSE(cvresult, key):
    return (abs(cvresult[key])**0.5).mean()

RMSE(result_post_adjusted, "train_score")
# 11000.81099038192
RMSE(result_post_adjusted, "test_score")
# 28572.070208366855
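| HPO method | Default parameters | Grid search |
|---|---|---|
| Search space / full space | - | 1536/1536 |
| Run time (minutes) | - | 6.36 |
| Best search score (RMSE) | 30571.266 | 29179.698 |
| Best rebuilt score (RMSE) | - | 28572.070 |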
# Wrap the steps above in functions for later use

# Evaluation metric: RMSE
def RMSE(cvresult, key):
    return (abs(cvresult[key])**0.5).mean()

# Compute the size of a parameter space
def count_space(param):
    no_option = 1
    for i in param:  # iterate over the argument, not the global param_grid_simple
        no_option *= len(param[i])
    print(no_option)

# Rebuild the model on the best parameters and validate the result
def rebuild_on_best_param(ad_reg):
    cv = KFold(n_splits=5, shuffle=True, random_state=1412)
    result_post_adjusted = cross_validate(ad_reg, X, y, cv=cv, scoring="neg_mean_squared_error",
                                          return_train_score=True,
                                          verbose=True,
                                          n_jobs=-1)
    print("Train RMSE: {:.3f}".format(RMSE(result_post_adjusted, "train_score")))
    print("Test RMSE: {:.3f}".format(RMSE(result_post_adjusted, "test_score")))
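The size of an exhaustive grid is the product of the number of candidate values per parameter, which is why the grid above has 2 × 16 × 8 × 6 × 1 = 1536 candidates and 7680 fits under 5-fold cross-validation. The following is a minimal sketch of that arithmetic using only the standard library; `toy_grid`, `grid_size`, and `enumerate_grid` are hypothetical names introduced for illustration and are not part of the original lesson.

```python
from itertools import product
from math import prod

# Hypothetical toy grid, much smaller than the one used in the lesson.
toy_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 4, 6, 8],
    "max_features": ["log2", "sqrt"],
}

def grid_size(param_grid):
    """Number of candidates: product of the number of values per parameter."""
    return prod(len(values) for values in param_grid.values())

def enumerate_grid(param_grid):
    """Yield every parameter combination as a dict, one per grid point."""
    names = list(param_grid)
    for combo in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, combo))

n_candidates = grid_size(toy_grid)
n_fits = n_candidates * 5  # each candidate is fitted once per fold in 5-fold CV
all_candidates = list(enumerate_grid(toy_grid))
print(n_candidates, n_fits)
```

Adding one more parameter with k candidate values multiplies both numbers by k, so the cost of an exhaustive search grows multiplicatively with every parameter that is added.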

II. Randomized Grid Search: RandomizedSearchCV

1. Basic principle

fig, [ax1, ax2] = plt.subplots(1, 2, dpi=300)
n_e_list = [*range(50, 350, 50)]
m_d_list = [*range(2, 7)]
comb = pd.DataFrame([(n_estimators, max_depth) for n_estimators in n_e_list for max_depth in m_d_list])

# Left panel: grid search evaluates every point on the grid
ax1.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax1.set_xticks([*range(50, 350, 50)])
ax1.set_yticks([*range(2, 7)])
ax1.set_xlabel("n_estimators")
ax1.set_ylabel("max_depth")
ax1.set_title("GridSearch")

# Right panel: random search evaluates only a sampled subset (red points)
ax2.scatter(comb.iloc[:, 0], comb.iloc[:, 1])
ax2.scatter([50, 250, 200, 200, 300, 100, 150, 150], [4, 2, 6, 3, 2, 3, 2, 5], c="red", s=20, linewidths=5)
ax2.set_xticks([*range(50, 350, 50)])
ax2.set_yticks([*range(2, 7)])
ax2.set_xlabel("n_estimators")
ax2.set_ylabel("max_depth")
ax2.set_title("RandomSearch");
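The figure code above marks eight hand-picked points to illustrate the idea. As a rough sketch of how such a subset can be drawn programmatically, scikit-learn's `ParameterSampler` samples parameter settings from a space, and when every parameter is given as a list it samples without replacement. The grid below reuses the `n_estimators` and `max_depth` ranges from the plot; the variable names and the choice of `n_iter=8` are assumptions made for this illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same two-parameter grid as in the plot above (6 x 5 = 30 points).
grid = {
    "n_estimators": list(range(50, 350, 50)),
    "max_depth": list(range(2, 7)),
}

# Grid search would visit every point.
full = list(ParameterGrid(grid))

# Random search visits only n_iter of them; with list-valued parameters
# the points are drawn without replacement.
sampled = list(ParameterSampler(grid, n_iter=8, random_state=1412))

print(len(full), len(sampled))
for params in sampled:
    print(params)
```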


2. Implementing randomized grid search

from sklearn.model_selection import RandomizedSearchCV
class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)
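| Name | Description |
|---|---|
| estimator | The estimator to be tuned |
| param_distributions | The full parameter space; a dict or a list of dicts |
| n_iter | Number of iterations; the more iterations, the larger the sampled sub-space |
| scoring | Evaluation metric; several metrics can be reported at once |
| n_jobs | Number of jobs run in parallel |
| refit | Which metric and best parameters to use when refitting on the full dataset |
| cv | Number of cross-validation folds |
| verbose | Verbosity of the log output |
| pre_dispatch | Number of jobs dispatched during parallel execution |
| random_state | Random seed |
| error_score | Value returned when a fit raises an error; with 'raise' the error is raised and the search stops, otherwise a warning is shown and the search continues |
| return_train_score | Whether to report training-set scores in the cross-validation results |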
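To make the parameters in the table concrete, here is a small self-contained sketch of a `RandomizedSearchCV` run. It deliberately uses synthetic data from `make_regression` and a tiny hypothetical parameter space so that it finishes in seconds; the data, the space, and all `*_demo` names are assumptions for illustration and are unrelated to the house price dataset used in the lesson.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data (assumption: stands in for a real dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Tiny hypothetical space: 3 * 3 * 2 = 18 candidates in total.
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [2, 4, 6],
    "max_features": ["sqrt", "log2"],
}

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # evaluate 5 of the 18 candidates
    scoring="neg_mean_squared_error",  # scores are negated MSE values
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    refit=True,                        # refit the best candidate on all data
    random_state=0,                    # makes the sampled candidates reproducible
)
search_demo.fit(X_demo, y_demo)

best_rmse = abs(search_demo.best_score_) ** 0.5
print(search_demo.best_params_)
print(best_rmse)
```

Because `refit=True`, `search_demo.best_estimator_` is already trained on the full data and can be used for prediction directly.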
X.shape
# (1460, 80)
X.head()
y.describe()

3. The same full parameter space

param_grid_simple = {"criterion": ["squared_error", "poisson"],
                     "n_estimators": [*range(20, 100, 5)],
                     "max_depth": [*range(10, 25, 2)],
                     "max_features": ["log2", "sqrt", 16, 32, 64, "auto"],
                     "min_impurity_decrease": [*np.arange(0, 5, 10)]}

reg = RFR(random_state=1412, verbose=True, n_jobs=-1)
cv = KFold(n_splits=5, shuffle=True, random_state=1412)

count_space(param_grid_simple)
# 1536

search = RandomizedSearchCV(estimator=reg,
                            param_distributions=param_grid_simple,
                            n_iter=800,  # the sub-space is roughly half of the full space
                            scoring="neg_mean_squared_error",
                            verbose=True,
                            cv=cv,
                            random_state=1412,
                            n_jobs=-1)

# ===== TIME WARNING: 5~10 min =====
start = time.time()
search.fit(X, y)
print(time.time() - start)
# Fitting 5 folds for each of 800 candidates, totalling 4000 fits
# 170.16785073280334

170.1678 / 60
# 2.83613

search.best_estimator_
# RandomForestRegressor(max_depth=24, max_features=16, min_impurity_decrease=0,
#                       n_estimators=85, n_jobs=-1, random_state=1412,
#                       verbose=True)

abs(search.best_score_)**0.5
# 29251.284326350575

ad_reg = RFR(max_depth=24, max_features=16, min_impurity_decrease=0,
             n_estimators=85, n_jobs=-1, random_state=1412,
             verbose=True)
rebuild_on_best_param(ad_reg)
# Train RMSE: 11031.299
# Test RMSE: 28639.969
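| HPO method | Default parameters | Grid search | Random search |
|---|---|---|---|
| Search space / full space | - | 1536/1536 | 800/1536 |
| Run time (minutes) | - | 6.36 | 2.83 (↓) |
| Best search score (RMSE) | 30571.266 | 29179.698 | 29251.284 |
| Best rebuilt score (RMSE) | - | 28572.070 | 28639.969 (↑) |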

4. Theoretical limits of randomized grid search

from mpl_toolkits.mplot3d import axes3d

p1, p2, MSE = axes3d.get_test_data(0.05)

len(p1)  # parameter 1 takes 120 values
# 120
len(p2)  # parameter 2 also takes 120 values
# 120

MSE.shape  # loss values: 14400 points in total
# (120, 120)

plt.figure(dpi=300)
plt.scatter(p1, p2, s=0.2)
plt.xticks(fontsize=9)
plt.yticks(fontsize=9)


p1, p2, MSE = axes3d.get_test_data(0.05)

plt.figure(dpi=300)
ax = plt.axes(projection="3d")
ax.plot_wireframe(p1, p2, MSE, rstride=2, cstride=2, linewidth=0.5)
ax.view_init(2, -15)
ax.zaxis.set_tick_params(labelsize=7)
ax.xaxis.set_tick_params(labelsize=7)
ax.yaxis.set_tick_params(labelsize=7)


np.min(MSE)  # the smallest MSE attainable anywhere in the parameter space is -73.39
# -73.39620971601681

MSE.shape
# (120, 120)

n = 100
# Note: np.random.randint draws with replacement, so these indices contain duplicates
# and more than n points are left unmasked below.
unsampled = np.random.randint(0, 14400, 14400 - n)

p1, p2, MSE = axes3d.get_test_data(0.05)

MSE = MSE.ravel()
MSE[unsampled] = np.nan
MSE = MSE.reshape((120, 120))

# Surface formed jointly by the parameters and the loss
plt.figure(dpi=300)
ax = plt.axes(projection="3d")
ax.view_init(2, -15)
ax.plot_wireframe(p1, p2, MSE, rstride=2, cstride=2, linewidth=0.5)
ax.zaxis.set_tick_params(labelsize=7)
ax.xaxis.set_tick_params(labelsize=7)
ax.yaxis.set_tick_params(labelsize=7)

MSE = MSE.ravel().tolist()
MSE = [x for x in MSE if str(x) != 'nan']
print(np.min(MSE))
# -73.24243733589367
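One common way to reason about how close a random subset can get to the true minimum is to ask how likely it is that at least one sampled point falls among the best fraction of the space. If each draw is independent and uniform, that probability is 1 - (1 - p)^n for a top fraction p and n draws. The sketch below is an illustration under that assumption, not a result from the lesson: it computes the formula and checks it against a small simulation on a space of 14400 ranked candidates, the same size as the 120 × 120 surface above. The function names are hypothetical.

```python
import random

def hit_probability(top_fraction, n_draws):
    """P(at least one of n independent uniform draws lands in the best top_fraction)."""
    return 1.0 - (1.0 - top_fraction) ** n_draws

def simulate(space_size, top_fraction, n_draws, n_trials=2000, seed=0):
    """Estimate the same probability by sampling candidate ranks without replacement."""
    rng = random.Random(seed)
    n_top = int(space_size * top_fraction)
    hits = 0
    for _ in range(n_trials):
        # Candidates are identified by rank 0..space_size-1; ranks below n_top are the best ones.
        drawn = rng.sample(range(space_size), n_draws)
        if min(drawn) < n_top:
            hits += 1
    return hits / n_trials

analytic = hit_probability(0.01, 100)
estimated = simulate(space_size=14400, top_fraction=0.01, n_draws=100)
print(analytic, estimated)
```

The formula treats draws as independent, while sampling from a finite grid is without replacement, so the simulated value can be marginally higher. The practical reading is the same: the chance of landing near the optimum depends mainly on the number of draws, not on the size of the full space.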


5. A larger / denser full parameter space

param_grid_simple = {"n_estimators": [*range(80, 100, 1)],
                     "max_depth": [*range(10, 25, 1)],
                     "max_features": [*range(10, 20, 1)],
                     "min_impurity_decrease": [*np.arange(0, 5, 10)]}

count_space(param_grid_simple)
# 3000

reg = RFR(random_state=1412, verbose=True, n_jobs=-1)
cv = KFold(n_splits=5, shuffle=True, random_state=1412)

search = RandomizedSearchCV(estimator=reg,
                            param_distributions=param_grid_simple,
                            n_iter=1536,  # the same number of candidates as the exhaustive grid search
                            scoring="neg_mean_squared_error",
                            verbose=True,
                            cv=cv,
                            random_state=1412,
                            n_jobs=-1)

start = time.time()
search.fit(X, y)
end = time.time() - start
print(end / 60)
# Fitting 5 folds for each of 1536 candidates, totalling 7680 fits
# 3.8464645385742187

search.best_estimator_
# RandomForestRegressor(max_depth=22, max_features=14, min_impurity_decrease=0,
#                       n_estimators=89, n_jobs=-1, random_state=1412,
#                       verbose=True)

abs(search.best_score_)**0.5
# 29012.90569846546

rebuild_on_best_param(search.best_estimator_)
# Train RMSE: 11208.818
# Test RMSE: 28346.673
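| HPO method | Default parameters | Grid search | Random search | Random search (large space) |
|---|---|---|---|---|
| Search space / full space | - | 1536/1536 | 800/1536 | 1536/3000 |
| Run time (minutes) | - | 6.36 | 2.83 (↓) | 3.86 (↓) |
| Best search score (RMSE) | 30571.266 | 29179.698 | 29251.284 | 29012.905 (↓) |
| Best rebuilt score (RMSE) | - | 28572.070 | 28639.969 (↑) | 28346.673 (↓) |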

6. A continuous parameter space

import scipy.stats  # use scipy to build the distribution

scipy.stats.uniform(loc=1, scale=100)  # uniform distribution on [loc, loc + scale] = [1, 101]
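A frozen `scipy.stats` distribution can be placed in the parameter space in place of a list, and each candidate then draws a fresh value from it. The sketch below shows that `uniform(loc, scale)` covers the interval [loc, loc + scale] and how `ParameterSampler` mixes a list-valued parameter with a continuous one. The space shown here is a reduced, hypothetical version of the one in the lesson, and the variable names are assumptions.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) is the uniform distribution on [loc, loc + scale].
dist = stats.uniform(loc=0, scale=50)
low, high = dist.support()
samples = dist.rvs(size=5, random_state=1412)
print(low, high)
print(samples)

# Hypothetical reduced space: one discrete list, one continuous distribution.
space = {
    "max_depth": [10, 15, 20],
    "min_impurity_decrease": dist,
}

# With at least one distribution in the space, values are drawn with replacement.
draws = list(ParameterSampler(space, n_iter=4, random_state=1412))
for params in draws:
    print(params)
```

Since the continuous parameter can take any value in the interval, the full space is no longer countable, and `n_iter` is the only thing that bounds the search cost.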
param_grid_simple = {"n_estimators": [*range(80, 100, 1)],
                     "max_depth": [*range(10, 25, 1)],
                     "max_features": [*range(10, 20, 1)],
                     "min_impurity_decrease": scipy.stats.uniform(0, 50)}

# Build the regressor and the cross-validation splitter
reg = RFR(random_state=1412, verbose=True, n_jobs=12)
cv = KFold(n_splits=5, shuffle=True, random_state=1412)

# Define the randomized search
search = RandomizedSearchCV(estimator=reg,
                            param_distributions=param_grid_simple,
                            n_iter=1536,  # still 1536 sampled candidates
                            scoring="neg_mean_squared_error",
                            verbose=True,
                            cv=cv,
                            random_state=1412,
                            n_jobs=12)

# Fit the randomized search
start = time.time()
search.fit(X, y)
end = time.time() - start
print(end / 60)
# Fitting 5 folds for each of 1536 candidates, totalling 7680 fits
# 3.921058924992879

# Inspect the best estimator
search.best_estimator_
# RandomForestRegressor(max_depth=22, max_features=14,
#                       min_impurity_decrease=20.070367229896224, n_estimators=98,
#                       n_jobs=12, random_state=1412, verbose=True)

# Inspect the final evaluation metric
abs(search.best_score_)**0.5
# 29148.381610182565

rebuild_on_best_param(search.best_estimator_)
# Train RMSE: 11184.428
# Test RMSE: 28495.682
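| HPO method | Default parameters | Grid search | Random search | Random search (large space) | Random search (continuous) |
|---|---|---|---|---|---|
| Search space / full space | - | 1536/1536 | 800/1536 | 1536/3000 | 1536/infinite |
| Run time (minutes) | - | 6.36 | 2.83 (↓) | 3.86 (↓) | 3.92 |
| Best search score (RMSE) | 30571.266 | 29179.698 | 29251.284 | 29012.905 (↓) | 29148.381 |
| Best rebuilt score (RMSE) | - | 28572.070 | 28639.969 (↑) | 28346.673 (↓) | 28495.682 |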

Source: https://blog.csdn.net/weixin_45891612/article/details/129482165
