[ Model ]
1. HGB
2. XGB
4. Model
4.1 HGBoost
The dataset mixes ordinal and nominal variables.
Q. How do we tell the model about this?
A1. One-hot encoding?
Nominal features far outnumber ordinal ones in this dataset.
Attribute codes such as the H attribute (H_속성) go beyond 1,000 distinct values.
→ dimensionality explosion
+ In tree-based models the numeric magnitude of an encoded category does not influence the splits (i.e. label encoding is already sufficient; see the sketch below).
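A minimal sketch of the difference, assuming train is the competition DataFrame and using person_prefer_d_1 (one of the high-cardinality attribute-code columns checked later in this post):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['person_prefer_d_1']                 # one high-cardinality nominal column

# One-hot: one new column per distinct code -> roughly 1,000+ columns for this single feature
onehot = pd.get_dummies(train[cat_cols], columns=cat_cols)
print(onehot.shape[1])

# Label/ordinal encoding: still a single column, which is all a tree model needs
ordinal = OrdinalEncoder().fit_transform(train[cat_cols])
print(ordinal.shape[1])                          # 1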
A2. Use the categorical_features parameter of HistGradientBoostingClassifier:
https://scikit-learn.org/stable/modules/ensemble.html#categorical-support-gbdt
[SOURCE CODE]
is_categorical = np.zeros(n_features, dtype=bool)
is_categorical[categorical_features] = True
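Since the snippet above shows that the index list is simply turned into a per-feature boolean mask, a boolean array can also be passed to categorical_features directly. A small sketch under the assumption that x is the feature DataFrame used below and that the listed columns are the ones kept numeric later in the post:

import numpy as np

# columns NOT treated as categorical (illustrative assumption; these are the
# columns excluded from the index list further down)
non_categorical = ['person_attribute_a_1', 'person_attribute_b',
                   'person_prefer_e', 'contents_attribute_e']
is_categorical = np.array([col not in non_categorical for col in x.columns])
# model = HistGradientBoostingClassifier(categorical_features=is_categorical, ...)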
model = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                       scoring='loss',
                                       learning_rate=0.03,
                                       max_iter=1500, max_depth=15,
                                       categorical_features=lst,
                                       # max_bins=255,
                                       early_stopping=True,
                                       validation_fraction=0.1,
                                       n_iter_no_change=70,
                                       tol=1e-10,
                                       # l2_regularization=0.01,
                                       verbose=True)
Writing the code above after consulting the source code raises the error below:
Categorical feature at index 10 is expected to have a cardinality <= 255
cardinality: the number of distinct values a column holds (many duplicates = low cardinality)
max_bins: cannot be raised above 255, so the limit cannot be avoided by widening the bins
# index 10
x = train.iloc[:, :-1]
l = x['person_prefer_d_1']          # the column at index 10
s = set(l)
print(len(s), '/', len(l))          # unique values / total rows
# output
# 1093 / 501951
In other words, the features that carry raw attribute codes have a cardinality above the model's limit, i.e. their values are rarely duplicated.
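Rather than hitting the error one index at a time, the offending columns can be listed in one pass; a quick sketch assuming x is the feature DataFrame defined above, using the 255 limit from the error message:

# cardinality of every column; anything above 255 cannot go into categorical_features
cards = {col: x[col].nunique() for col in x.columns}
over_limit = [(i, col, cards[col]) for i, col in enumerate(x.columns) if cards[col] > 255]
for i, col, card in over_limit:
    print(i, col, card)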
lst = [i for i in range(len(dic) - 1)]   # candidate categorical feature indices (dic: column name -> index)
print(lst)
for c in ['person_attribute_a_1', 'person_attribute_b', 'person_prefer_e', 'contents_attribute_e']:
    # print(dic[c])
    lst.remove(dic[c])                   # keep these columns out of the categorical list
# features that raise the cardinality error -> treat as numeric values for now
for i in [10, 11, 12, 16, 17, 18, 25, 26, 29]:
    lst.remove(i)
→ Under the same settings, this improves on leaving categorical_features unspecified.
→ It is also less prone to overfitting when the iteration count is raised.
# enable_hist_gradient_boosting is only needed for scikit-learn < 1.0
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from datetime import datetime
import pandas as pd
import pytz

# timezone-stamped file name
KST = pytz.timezone('Asia/Seoul')
NAME = datetime.now(KST).strftime('%m-%d %H-%M')

x = train.iloc[:, :-1]
y = train.iloc[:, -1]

model = HistGradientBoostingClassifier(loss='binary_crossentropy',
                                       scoring='loss',
                                       learning_rate=0.03,
                                       warm_start=True,
                                       max_iter=10000, max_depth=15,
                                       categorical_features=lst,
                                       early_stopping=True,
                                       validation_fraction=0.1,
                                       n_iter_no_change=15,
                                       tol=1e-5,
                                       l2_regularization=0.01,
                                       verbose=False,
                                       random_state=777)
history = model.fit(x, y)
acc = model.score(x, y)
print(model)
print('Accuracy:', acc)
print()

preds = model.predict(test)
submission = pd.read_csv('sample_submission.csv')
submission['target'] = preds
submission.to_csv('{}_{}_HGB.csv'.format(NAME, round(acc * 100, 2)), index=False)
print('saved as {}_{}_HGB.csv'.format(NAME, round(acc * 100, 2)))
#### 1
HistGradientBoostingClassifier(categorical_features=[0, 1, 2, 3, 4, 5, 6, 9, 14, 15, 19, 20, 21, 22, 23, 24, 27],
                               early_stopping=True, learning_rate=0.01,
                               loss='binary_crossentropy', max_depth=15,
                               max_iter=10000, n_iter_no_change=50, tol=1e-06,
                               verbose=100)
Accuracy: 0.6763249799283197
saved as 01-22 15-56_submission.csv
#### 2
HistGradientBoostingClassifier(categorical_features=[0, 1, 2, 3, 4, 5, 6, 9, 14, 15, 19, 20, 21, 22, 23, 24, 27],
                               early_stopping=True, l2_regularization=0.01,
                               learning_rate=0.03, loss='binary_crossentropy',
                               max_depth=15, max_iter=10000,
                               n_iter_no_change=15, random_state=777, tol=1e-05,
                               verbose=False, warm_start=True)
Accuracy: 0.6449374540542802
saved as 01-22 17-44_0.6449374540542802_HGB.csv
4.2 XGBoost
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# timezone-stamped file name
KST = pytz.timezone('Asia/Seoul')
NAME = datetime.now(KST).strftime('%m-%d %H-%M')

train_x, val_x, train_y, val_y = train_test_split(x, y, test_size=0.1)

# model
# num_iterations, boosting and metric are LightGBM-style parameter names;
# XGBClassifier does not recognize them and typically only warns that they are unused.
model_2 = XGBClassifier(learning_rate=0.01, num_iterations=10000,
                        max_depth=15,
                        booster='gbtree',
                        silent=False,
                        n_estimators=400,
                        boosting='dart',
                        colsample_bytree=0.8,
                        colsample_bylevel=0.9,
                        metric=['binary'],
                        n_jobs=-1,
                        objective='binary:logistic',
                        verbose=True)
print(model_2)

history_2 = model_2.fit(train_x, train_y,
                        eval_set=[(train_x, train_y), (val_x, val_y)],
                        eval_metric='error',
                        early_stopping_rounds=20,
                        verbose=5)
acc = model_2.score(x, y)
print(model_2)
print('Accuracy:', acc)
print()

preds = model_2.predict(test)   # predict with the XGB model (model_2)
submission = pd.read_csv('sample_submission.csv')
submission['target'] = preds
submission.to_csv('{}_{}_XGB.csv'.format(NAME, round(acc * 100, 2)), index=False)
print('saved as {}_{}_XGB.csv'.format(NAME, round(acc * 100, 2)))
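The "[val] f1-score" values reported in the runs below were presumably computed on the held-out split; a minimal sketch of that check, plus reading back the early-stopping point (attribute availability depends on the xgboost version):

from sklearn.metrics import f1_score

val_preds = model_2.predict(val_x)
print('[val] f1-score:', f1_score(val_y, val_preds))

# best round found by early stopping (set after fit with early_stopping_rounds)
print('best iteration:', model_2.best_iteration)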
###1
Stopping. Best iteration:
[334] validation_0-error:0.163502 validation_1-error:0.36736
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
learning_rate=0.01, max_depth=15, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] Accuracy: 0.8161115327990183
[val] f1-score: 0.6501745333130976
saved as 01-22 19-47_81.61_XGB.csv
###2
[399] validation_0-error:0.149718 validation_1-error:0.370269
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
learning_rate=0.01, max_depth=15, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=1000, silent=False,
verbose=True)
[train+val] : 0.8279274271791469
[val] f1-score: 0.6476345840130506
saved as 01-22 20-31_64.76_XGB.csv
###3
[499] validation_0-error:0.269894 validation_1-error:0.368217
XGBClassifier(boosting='gbtree', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=2, learning_rate=0.03, max_depth=10, metric=['binary'],
n_estimators=500, n_jobs=-1, num_iterations=1000, silent=False,
verbose=True)
[train+val] : 0.720241617209648
[val] f1-score: 0.6497498483929655
saved as 01-22 22-01_72.02_64.97_XGB.csv
###4
[499] validation_0-error:0.269894 validation_1-error:0.368217
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=2, learning_rate=0.03, max_depth=10, metric=['binary'],
n_estimators=500, n_jobs=-1, num_iterations=1000, silent=False,
verbose=True)
[train+val] : 0.720241617209648
[val] f1-score: 0.6497498483929655
saved as 01-22 22-09_72.02_64.97_XGB.csv
###5
Stopping. Best iteration:
[90] validation_0-error:0.199332 validation_1-error:0.381465
XGBClassifier(boosting='dart',
colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=2, learning_rate=0.005, max_depth=20, metric=['binary'],
min_child_weight=5, n_estimators=500, n_jobs=-1,
num_iterations=10000, silent=False, verbose=True)
val ratio : 0.05
[train+val] : 0.7915613276993173
[val] f1-score: 0.6324477886977887
saved as 01-22 23-05_79.16_63.24_XGB.csv
###6-1
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=1, learning_rate=0.01, max_depth=20, metric=['binary'],
min_child_weight=3, n_estimators=500, n_jobs=-1,
num_iterations=10000, silent=False, verbose=True)
val ratio : 0.05
[train+val] : 0.8851581130429066
[val] f1-score: 0.6394219741570457
saved as 01-22 23-26_88.52_63.94_XGB.csv
###6-2
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=1, learning_rate=0.01, max_depth=20, metric=['binary'],
min_child_weight=3, n_estimators=500, n_jobs=-1,
num_iterations=10000, silent=False, verbose=True)
val ratio : 0.1
[train+val] : 0.9213469043791127
[val] f1-score: 0.6480643924875431
saved as 01-22 23-34_92.13_64.81_XGB.csv
| Model | val ratio : 0.05 | val ratio : 0.1 |
|---|---|---|
| ##5 Model | [train+val] : 0.7915613276993173, [val] f1-score: 0.6324477886977887, saved as 01-22 23-05_79.16_63.24_XGB.csv | [train+val] : 0.7882861076081131, [val] f1-score: 0.6396484188401643, saved as 01-22 23-15_78.83_63.96_XGB.csv |
| ##6 Model | [train+val] : 0.8851581130429066, [val] f1-score: 0.6394219741570457, saved as 01-22 23-26_88.52_63.94_XGB.csv, [public] 0.6364930964 | [train+val] : 0.9213469043791127, [val] f1-score: 0.6480643924875431, saved as 01-22 23-34_92.13_64.81_XGB.csv, [public] 0.637506222 |
→ Based on this, the validation split ratio is fixed at valid_split_ratio = 0.1 for now.
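A minimal sketch of pinning that split, with a random_state (an arbitrary value added here, not in the original runs) so that later experiments are compared on the same validation fold:

from sklearn.model_selection import train_test_split

# fixed 10% validation split, reproducible across experiments
train_x, val_x, train_y, val_y = train_test_split(x, y, test_size=0.1, random_state=777)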
###7 max_depth=5
Stopping. Best iteration:
[158] validation_0-error:0.403869 validation_1-error:0.404992
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=1, learning_rate=0.01, max_depth=5, metric=['binary'],
min_child_weight=3, n_estimators=400, n_jobs=-1,
num_iterations=10000, silent=False, verbose=1)
[train+val] : 0.5960183364511675
[val] f1-score: 0.6212999012686052
saved as 01-23 00-11_59.6_62.13_XGB.csv
###8 max_depth=10
Stopping. Best iteration:
[32] validation_0-error:0.376877 validation_1-error:0.39553
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
gamma=1, learning_rate=0.01, max_depth=10, metric=['binary'],
min_child_weight=3, n_estimators=400, n_jobs=-1,
num_iterations=1000, silent=False, verbose=1)
[train+val] : 0.6212578518620343
[val] f1-score: 0.6255375330064127
saved as 01-23 00-16_62.13_62.55_XGB.csv
###9
[399] validation_0-error:0.149718 validation_1-error:0.370269
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
learning_rate=0.01, max_depth=15, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] : 0.8279274271791469
[val] f1-score: 0.6476345840130506
saved as 01-23 00-21_82.79_64.76_XGB.csv
###10
Stopping. Best iteration:
[268] validation_0-error:0.112922 validation_1-error:0.363874
XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
learning_rate=0.03, max_depth=15, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] : 0.8619825441128716
[val] f1-score: 0.6511583490899367
saved as 01-23 00-45_86.2_65.12_XGB.csv
###11
Stopping. Best iteration:
[198] validation_0-error:0.017408 validation_1-error:0.365886
XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
learning_rate=0.03, max_depth=20, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] : 0.9477439032893649
[val] f1-score: 0.6460043945877182
saved as 01-23 00-55_94.77_64.6_XGB.csv
###12
Stopping. Best iteration:
[334] validation_0-error:0.293985 validation_1-error:0.370727
XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
learning_rate=0.03, max_depth=10, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] : 0.6983410731326365
[val] f1-score: 0.6472561842479386
saved as 01-23 01-02_69.83_64.73_XGB.csv
###13
[399] validation_0-error:0.284276 validation_1-error:0.369093
XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
learning_rate=0.03, max_depth=10, metric=['binary'],
n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
verbose=True)
[train+val] : 0.7072144492191469
[val] f1-score: 0.6490440187216948
saved as 01-23 01-06_70.72_64.9_XGB.csv
With ###13 fixed as the final model configuration,
+ a parameter search is run on top of it.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV, train_test_split

model = XGBClassifier()
param_grid = {'booster': ['gbtree', 'dart'],
              'silent': [True],
              # 'nthread': [0.5, 0.7],
              # 'max_depth': [5, 25],
              'min_child_weight': [1, 3, 5],
              'gamma': [0, 2, 3],
              # 'colsample_bytree': [0.5, 0.8],
              # 'colsample_bylevel': [0.5, 0.8],
              # 'n_estimators': [100, 400, 500],
              'objective': ['binary:logistic'],
              'random_state': [777],
              'verbose': [100]}

cv = KFold(n_splits=5)
grid_search = GridSearchCV(model, param_grid=param_grid, cv=cv, scoring='f1', n_jobs=-1)
grid_search.fit(x.values, y.values, verbose=True)
print('best score:', grid_search.best_score_)
print('best params', grid_search.best_params_)
#best score: 0.5026674516475902
#best params {'booster': 'gbtree', 'gamma': 2, 'min_child_weight': 5, 'objective': 'binary:logistic', 'random_state': 777, 'silent': True, 'verbose': 100}
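A short sketch of folding the search result back into a submission; best_estimator_ is available because GridSearchCV refits on the full data by default (refit=True), and the output file name here is only an illustrative choice:

# the refit estimator already carries the best parameters found above
best_model = grid_search.best_estimator_
preds = best_model.predict(test.values)

submission = pd.read_csv('sample_submission.csv')
submission['target'] = preds
submission.to_csv('{}_grid_XGB.csv'.format(NAME), index=False)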