
[DACON] Job Care Recommendation Algorithm (잡케어 추천 알고리즘) Part 2

 


[   Model   ] 

1. HGB

2. XGB


4. Model 

4.1 HGBoost

 

 


The dataset mixes ordinal and nominal variables.

Q. How do we communicate this fact to the model?


A1. One-hot encoding?

The dataset has far more nominal features than ordinal ones.

Attribute-code features such as the H_ attributes run past 1,000 distinct values.

→ the dimensionality blows up

+ In tree-based models the numeric magnitude of a label does not influence the splits, so label encoding is already sufficient (a quick sketch below).
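A minimal sketch of the dimensionality difference, using a hypothetical high-cardinality column (not one of the actual dataset columns):

import pandas as pd

# Hypothetical nominal column with ~1,200 distinct attribute codes
df = pd.DataFrame({'h_attribute': [f'code_{i % 1200}' for i in range(5000)]})

# One-hot encoding: one new column per category -> 1,200 extra dimensions
onehot = pd.get_dummies(df['h_attribute'])
print(onehot.shape)   # (5000, 1200)

# Label encoding: still a single integer column; enough for tree-based models,
# which only compare values at split points
label = df['h_attribute'].astype('category').cat.codes
print(label.shape)    # (5000,)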

 

A2. Pass the categorical column indices via the categorical_features parameter of HistGradientBoostingClassifier.

https://scikit-learn.org/stable/modules/ensemble.html#categorical-support-gbdt

 


[SOURCE CODE]
is_categorical = np.zeros(n_features, dtype=bool)
is_categorical[categorical_features] = True
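For reference, categorical_features accepts either a boolean mask (as in the source snippet above) or a list of column indices; this post uses the index-list form. A minimal sketch with made-up indices:

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier  # older sklearn also needs the experimental import

n_features = 35              # example value: number of feature columns
cat_idx = [0, 1, 2, 9, 14]   # made-up indices of categorical columns

# Form 1: list of column indices
clf_a = HistGradientBoostingClassifier(categorical_features=cat_idx)

# Form 2: boolean mask over all features, as in the scikit-learn snippet above
is_categorical = np.zeros(n_features, dtype=bool)
is_categorical[cat_idx] = True
clf_b = HistGradientBoostingClassifier(categorical_features=is_categorical)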

model= HistGradientBoostingClassifier(loss='binary_crossentropy',
                                         scoring = 'loss',
                                         learning_rate=0.03,
                                         max_iter = 1500, max_depth = 15,
                                         categorical_features = lst,
                                         #max_bins = 255,
                                         early_stopping = True,
                                         validation_fraction = 0.1,
                                         n_iter_no_change = 70,
                                         tol = 1e-10,
                                         #l2_regularization = 0.01,
                                         verbose =True)

Writing the code as shown above, following the scikit-learn source, raises the error below:

Categorical feature at index 10 is expected to have a cardinality <= 255

 

cardinality: the relative amount of duplication in a column (more duplicates = lower cardinality)

max_bins: cannot be increased above 255

 

# feature at index 10
x = train.iloc[:, :-1]
l = x['person_prefer_d_1']
s = set(l)
print(len(s), '/', len(l))   # unique values / total rows

# output
# 1093 / 501951

In other words, the features that carry attribute codes have a higher cardinality than the model allows, i.e. low duplication.
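The same check can be generalized; a small sketch (the helper name is made up) that lists every column whose cardinality exceeds the 255-bin limit:

# Hypothetical helper: columns that cannot be declared categorical
# because their cardinality exceeds max_bins (255)
def high_cardinality_columns(df, limit=255):
    return [c for c in df.columns if df[c].nunique() > limit]

print(high_cardinality_columns(x))   # e.g. includes 'person_prefer_d_1' (1093 distinct values)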

 


# start from all feature column indices (dic: column name -> index, see the sketch below)
lst = [i for i in range(len(dic) - 1)]
print(lst)

# drop these columns from the categorical list
for c in ['person_attribute_a_1', 'person_attribute_b', 'person_prefer_e', 'contents_attribute_e']:
    #print(dic[c])
    lst.remove(dic[c])

# features that raise the cardinality error -> treat them as numeric values for now
for i in [10, 11, 12, 16, 17, 18, 25, 26, 29]:
    lst.remove(i)
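For reference, `dic` above is assumed to map each column name to its positional index; a sketch of how it could be built under that assumption:

# Assumption: dic maps each column of train (features + target) to its position,
# so len(dic) - 1 equals the number of feature columns used above.
dic = {c: i for i, c in enumerate(train.columns)}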

 

Under the same settings, this performs better than leaving categorical_features unspecified.

It is also less prone to overfitting as the number of iterations is raised.


import pandas as pd
import pytz
from datetime import datetime

# on older scikit-learn versions the experimental flag must be imported first
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

#Timezone
KST = pytz.timezone('Asia/Seoul')
NAME= datetime.now(KST).strftime('%m-%d %H-%M')

x = train.iloc[:, :-1]
y = train.iloc[:, -1]

model= HistGradientBoostingClassifier(loss='binary_crossentropy',
                                         scoring = 'loss',
                                         learning_rate=0.03,
                                         warm_start = True,
                                         max_iter = 10000, max_depth = 15,
                                         categorical_features = lst,
                                         early_stopping = True,
                                         validation_fraction = 0.1,
                                         n_iter_no_change = 15,
                                         tol = 1e-5,
                                         l2_regularization = 0.01,
                                         verbose =False,
                                         random_state = 777)




history = model.fit(x,y)
acc = model.score(x,y)

print(model)
print('Accuracy:', acc)
print()

preds = model.predict(test)
submission = pd.read_csv('sample_submission.csv')
submission['target'] = preds

submission.to_csv('{}_{}_HGB.csv'.format(NAME, round(acc*100, 2)), index=False)
print('saved as {}_{}_HGB.csv'.format(NAME, round(acc*100, 2)))
#### 1

HistGradientBoostingClassifier(categorical_features=[0, 1, 2, 3, 4, 5, 6, 9, 14,
                                                     15, 19, 20, 21, 22, 23, 24,
                                                     27],
                               early_stopping=True, learning_rate=0.01,
                               loss='binary_crossentropy', max_depth=15,
                               max_iter=10000, n_iter_no_change=50, tol=1e-06,
                               verbose=100)
Accuracy: 0.6763249799283197

saved as 01-22 15-56_submission.csv


#### 2

HistGradientBoostingClassifier(categorical_features=[0, 1, 2, 3, 4, 5, 6, 9, 14,
                                                     15, 19, 20, 21, 22, 23, 24,
                                                     27],
                               early_stopping=True, l2_regularization=0.01,
                               learning_rate=0.03, loss='binary_crossentropy',
                               max_depth=15, max_iter=10000,
                               n_iter_no_change=15, random_state=777, tol=1e-05,
                               verbose=False, warm_start=True)
Accuracy: 0.6449374540542802

saved as 01-22 17-44_0.6449374540542802_HGB.csv

 

 

4.2 XGBoost

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

#Timezone
KST = pytz.timezone('Asia/Seoul')
NAME= datetime.now(KST).strftime('%m-%d %H-%M')

train_x, val_x, train_y, val_y = train_test_split(x, y, test_size=0.1)

# model
# NB: num_iterations, boosting, metric, silent, and verbose are not standard
#     XGBClassifier constructor arguments (they resemble LightGBM names); XGBoost
#     forwards them as extra keyword arguments and they are most likely ignored here.
model_2 = XGBClassifier(learning_rate=0.01, num_iterations=10000,
                        max_depth=15,
                        booster='gbtree',
                        silent=False,
                        n_estimators=400,
                        boosting='dart',
                        colsample_bytree=0.8,
                        colsample_bylevel=0.9,
                        metric=['binary'],
                        n_jobs=-1,
                        objective='binary:logistic',
                        verbose=True)
print(model_2)


history_2 = model_2.fit(train_x, train_y,
                        eval_set=[(train_x, train_y), (val_x, val_y)],
                        eval_metric='error',
                        early_stopping_rounds=20,
                        verbose=5)
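With early_stopping_rounds set, the fitted model records the best boosting round and the evaluation history; a minimal sketch of how the validation curve could be inspected via the xgboost sklearn API:

# Best round found by early stopping, and the recorded eval history
print('best iteration:', model_2.best_iteration)

results = model_2.evals_result()
# 'validation_0' = (train_x, train_y), 'validation_1' = (val_x, val_y)
print('final val error:', results['validation_1']['error'][-1])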


# accuracy over the full training data (train + val), as reported in the logs below
acc = model_2.score(x, y)

print(model_2)
print('Accuracy:', acc)
print()

preds = model_2.predict(test)
submission = pd.read_csv('sample_submission.csv')
submission['target'] = preds

submission.to_csv('{}_{}_XGB.csv'.format(NAME, round(acc*100, 2)), index=False)
print('saved as {}_{}_XGB.csv'.format(NAME, round(acc*100, 2)))
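The result logs below also report a validation f1-score, which the code above does not compute; a minimal sketch of how it could be obtained from the held-out split:

from sklearn.metrics import f1_score

# f1 on the held-out validation split (the "[val] f1-score" in the logs below)
val_preds = model_2.predict(val_x)
print('[val] f1-score:', f1_score(val_y, val_preds))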

 

###1

Stopping. Best iteration:
[334]	validation_0-error:0.163502	validation_1-error:0.36736

XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              learning_rate=0.01, max_depth=15, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)
[train+val] Accuracy: 0.8161115327990183
[val] f1-score: 0.6501745333130976 

saved as 01-22 19-47_81.61_XGB.csv

###2
[399]	validation_0-error:0.149718	validation_1-error:0.370269
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              learning_rate=0.01, max_depth=15, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=1000, silent=False,
              verbose=True)
[train+val] : 0.8279274271791469
[val] f1-score: 0.6476345840130506
saved as 01-22 20-31_64.76_XGB.csv

###3
[499]	validation_0-error:0.269894	validation_1-error:0.368217
XGBClassifier(boosting='gbtree', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=2, learning_rate=0.03, max_depth=10, metric=['binary'],
              n_estimators=500, n_jobs=-1, num_iterations=1000, silent=False,
              verbose=True)
[train+val] : 0.720241617209648
[val] f1-score: 0.6497498483929655
saved as 01-22 22-01_72.02_64.97_XGB.csv

###4
[499]	validation_0-error:0.269894	validation_1-error:0.368217
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=2, learning_rate=0.03, max_depth=10, metric=['binary'],
              n_estimators=500, n_jobs=-1, num_iterations=1000, silent=False,
              verbose=True)
[train+val] : 0.720241617209648
[val] f1-score: 0.6497498483929655
saved as 01-22 22-09_72.02_64.97_XGB.csv

###5
Stopping. Best iteration:
[90] validation_0-error:0.199332 validation_1-error:0.381465

XGBClassifier(boosting='dart', 
              colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=2, learning_rate=0.005, max_depth=20, metric=['binary'],
              min_child_weight=5, n_estimators=500, n_jobs=-1,
              num_iterations=10000, silent=False, verbose=True)

val ratio : 0.05
[train+val] : 0.7915613276993173
[val] f1-score: 0.6324477886977887
saved as 01-22 23-05_79.16_63.24_XGB.csv

###6-1
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=1, learning_rate=0.01, max_depth=20, metric=['binary'],
              min_child_weight=3, n_estimators=500, n_jobs=-1,
              num_iterations=10000, silent=False, verbose=True)

val ratio : 0.05
[train+val] : 0.8851581130429066
[val] f1-score: 0.6394219741570457
saved as 01-22 23-26_88.52_63.94_XGB.csv

###6-2

XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=1, learning_rate=0.01, max_depth=20, metric=['binary'],
              min_child_weight=3, n_estimators=500, n_jobs=-1,
              num_iterations=10000, silent=False, verbose=True)

val ratio : 0.1
[train+val] : 0.9213469043791127
[val] f1-score: 0.6480643924875431
saved as 01-22 23-34_92.13_64.81_XGB.csv

 

##5 Model
val ratio 0.05 : [train+val] 0.7915613276993173 / [val] f1-score 0.6324477886977887 / saved as 01-22 23-05_79.16_63.24_XGB.csv
val ratio 0.1  : [train+val] 0.7882861076081131 / [val] f1-score 0.6396484188401643 / saved as 01-22 23-15_78.83_63.96_XGB.csv

##6 Model
val ratio 0.05 : [train+val] 0.8851581130429066 / [val] f1-score 0.6394219741570457 / [public] 0.6364930964 / saved as 01-22 23-26_88.52_63.94_XGB.csv
val ratio 0.1  : [train+val] 0.9213469043791127 / [val] f1-score 0.6480643924875431 / [public] 0.637506222 / saved as 01-22 23-34_92.13_64.81_XGB.csv

-> Based on this, valid_split_ratio is fixed at 0.1 for now.

 

 

 

###7 max_depth=5
Stopping. Best iteration:
[158]	validation_0-error:0.403869	validation_1-error:0.404992

XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=1, learning_rate=0.01, max_depth=5, metric=['binary'],
              min_child_weight=3, n_estimators=400, n_jobs=-1,
              num_iterations=10000, silent=False, verbose=1)

[train+val] : 0.5960183364511675
[val] f1-score: 0.6212999012686052
saved as 01-23 00-11_59.6_62.13_XGB.csv

###8 max_depth=10
Stopping. Best iteration:
[32]	validation_0-error:0.376877	validation_1-error:0.39553

XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              gamma=1, learning_rate=0.01, max_depth=10, metric=['binary'],
              min_child_weight=3, n_estimators=400, n_jobs=-1,
              num_iterations=1000, silent=False, verbose=1)

[train+val] : 0.6212578518620343
[val] f1-score: 0.6255375330064127
saved as 01-23 00-16_62.13_62.55_XGB.csv

###9 
[399]	validation_0-error:0.149718	validation_1-error:0.370269
XGBClassifier(boosting='dart', colsample_bylevel=0.9, colsample_bytree=0.8,
              learning_rate=0.01, max_depth=15, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)

[train+val] : 0.8279274271791469
[val] f1-score: 0.6476345840130506
saved as 01-23 00-21_82.79_64.76_XGB.csv

###10
Stopping. Best iteration:
[268]	validation_0-error:0.112922	validation_1-error:0.363874

XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
              learning_rate=0.03, max_depth=15, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)

[train+val] : 0.8619825441128716
[val] f1-score: 0.6511583490899367
saved as 01-23 00-45_86.2_65.12_XGB.csv

###11
Stopping. Best iteration:
[198]	validation_0-error:0.017408	validation_1-error:0.365886

XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
              learning_rate=0.03, max_depth=20, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)

[train+val] : 0.9477439032893649
[val] f1-score: 0.6460043945877182
saved as 01-23 00-55_94.77_64.6_XGB.csv

###12
Stopping. Best iteration:
[334]	validation_0-error:0.293985	validation_1-error:0.370727

XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
              learning_rate=0.03, max_depth=10, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)

[train+val] : 0.6983410731326365
[val] f1-score: 0.6472561842479386
saved as 01-23 01-02_69.83_64.73_XGB.csv

###13
[399]	validation_0-error:0.284276	validation_1-error:0.369093
XGBClassifier(boosting='dart', colsample_bylevel=0.7, colsample_bytree=0.8,
              learning_rate=0.03, max_depth=10, metric=['binary'],
              n_estimators=400, n_jobs=-1, num_iterations=10000, silent=False,
              verbose=True)

[train+val] : 0.7072144492191469
[val] f1-score: 0.6490440187216948
saved as 01-23 01-06_70.72_64.9_XGB.csv

### After fixing ###13 as the final configuration,

 

+ proceed with a parameter search

from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV, train_test_split

model=XGBClassifier()
param_grid={'booster' :['gbtree','dart'],
                 'silent':[True],
                 #'nthread':[0.5,0.7],
                 #'max_depth':[5,25],
                 'min_child_weight':[1,3,5],
                 'gamma':[0,2,3],
                 #'colsample_bytree':[0.5,0.8],
                 #'colsample_bylevel':[0.5,0.8],
                 #'n_estimators':[100,400,500],
                 'objective':['binary:logistic'],
                 'random_state':[777],
                 'verbose':[100]}

cv=KFold(n_splits=5)

grid_search=GridSearchCV(model, param_grid=param_grid, cv=cv, scoring='f1', n_jobs=-1)
grid_search.fit(x.values,y.values,verbose=True)

print('best score:', grid_search.best_score_)  
print('best params', grid_search.best_params_)

#best score: 0.5026674516475902
#best params {'booster': 'gbtree', 'gamma': 2, 'min_child_weight': 5, 'objective': 'binary:logistic', 'random_state': 777, 'silent': True, 'verbose': 100}
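grid_search.best_params_ is a plain dict, so it can be unpacked straight into a new estimator; a minimal sketch of refitting with the best parameters found:

# Refit a final model with the parameters selected by the grid search
best_model = XGBClassifier(**grid_search.best_params_)
best_model.fit(train_x, train_y,
               eval_set=[(val_x, val_y)],
               eval_metric='error',
               early_stopping_rounds=20,
               verbose=5)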

 
