[n222]Random Forests

■ Key words

ㆍEnsemble

ㆍRandom Forests

- Bagging(Bootstrap Aggregating)

- Out-Of-Bag(OOB) sample

ㆍOrdinal Encoding

ㆍHow the encoding of categorical variables affects tree models and linear regression models

■ Main content

ㆍEnsemble : a method that trains several machine learning models (weak base learners) on one data set and combines their predictions by majority vote or by averaging

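ㆍRandom Forests : an ensemble method that uses decision trees as its base models

- The trees are built independently of one another; as long as each tree predicts better than random guessing, the forest generally performs better than a single decision tree

- Rows are drawn at random with replacement (bootstrap sampling), a tree is trained on each sample, and the resulting collection of trees is the forest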

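ㆍBootstrap AGGregatING (Bagging) : draw several bootstrap samples from the training data by sampling with replacement, train one model per sample, and aggregate their predictions

- Each bootstrap sample has the same size n as the original data. Because sampling is with replacement, a row can appear more than once in a sample, while about e⁻¹ (roughly 36.8%) of the rows are not drawn at all (Out-Of-Bag, OOB)

→ A decision tree considers every feature at each split ↔ each tree in a random forest is trained on a bootstrap sample and considers only a random subset of k features at each split (a common choice is k = √p or log₂p for p features)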
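
The e⁻¹ share of out-of-bag rows can be checked numerically. The following is a minimal sketch, assuming each bootstrap sample draws n rows with replacement from n rows; the function name `oob_fraction` is made up for this illustration.

```python
import math
import random


def oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the share of rows never drawn."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # indices picked at least once
    return 1 - len(drawn) / n


n = 10_000
simulated = oob_fraction(n)
theoretical = (1 - 1 / n) ** n  # probability that one given row is never picked

print(f"simulated OOB share : {simulated:.3f}")
print(f"(1 - 1/n)^n         : {theoretical:.3f}")
print(f"e^-1                : {math.exp(-1):.3f}")
```

The three numbers should agree closely for large n, since (1 - 1/n)^n approaches e⁻¹ as n grows.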

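ㆍBase model (Weak Learner) : the individual models an ensemble is built from. In a Random Forest these are decision trees.

- Regression : use the mean of the base models' predictions

- Classification : use the mode (majority vote) of the base models' predictions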
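
The two aggregation rules above can be sketched in a few lines. This is a conceptual illustration only: the helper names and the five tree outputs are hypothetical, and scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(predictions):
    """Average the numeric predictions of the base models."""
    return mean(predictions)


def aggregate_classification(predictions):
    """Return the most common class label among the base models (majority vote)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for a single sample
tree_values = [3.0, 2.5, 3.5, 3.0, 4.0]
tree_labels = ["yes", "no", "yes", "yes", "no"]

print("regression    :", aggregate_regression(tree_values))
print("classification:", aggregate_classification(tree_labels))
```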

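ㆍOrdinal Encoding

- Maps each category to an integer. The encoder itself does not know the real ranking of the categories, so it fits best when the feature's categories have a natural rank (for example in regression) and the mapping follows that rank.

- If one-hot encoding is applied to a high-cardinality feature, that feature becomes less likely to be chosen at the upper nodes of a tree

: the feature should be split on as a single column, but one-hot encoding spreads it over many 0/1 columns

- Between Ordinal Encoding and One-Hot Encoding, choose the one that reduces node impurity the most (in practice, compare validation scores)

- The encoder's parameters can be tuned like other hyper-parameters to specify which columns to encode and the order (direction) of the mapping

- Encoding exists so that the machine can read the data, not for human convenience; the encoded values are harder for people to read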
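
The difference between the two encodings can be seen on a tiny example. This is a sketch in plain Python rather than with an encoder library; the feature values and the ranking small < medium < large are assumptions made up for the illustration.

```python
# Hypothetical ordered feature; the ranking below is an assumption for this example
sizes = ["small", "large", "medium", "small", "large"]

# Ordinal encoding: a single column whose integers follow an explicit order
order = {"small": 1, "medium": 2, "large": 3}
ordinal = [order[s] for s in sizes]

# One-hot encoding: one 0/1 column per category
categories = sorted(set(sizes))
one_hot = [[int(s == c) for c in categories] for s in sizes]

print("ordinal :", ordinal)
print("columns :", categories)
print("one-hot :", one_hot)
```

With k categories, one-hot encoding produces k columns, and a split on any one of them only separates a single category from the rest. The ordinal column stays a single feature that a tree can split at any threshold, which is why the mapping order matters.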

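ㆍWhy a Random Forest overfits less than a single Decision Tree

- In bootstrap aggregating, each tree is trained on a different random sample of the rows (bootstrap=True)

- Each tree splits on randomly selected features (max_features='auto' in the scikit-learn version used for these notes; newer releases spell the classifier setting 'sqrt')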
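
The effect of these two sources of randomness can be observed by comparing a single tree with a forest. The sketch below uses synthetic data from `make_classification`, which is an assumption for illustration and not the data set used in the session; the exact scores will vary with the data and the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
    n_jobs=-1,
).fit(X_train, y_train)

for name, model in [("decision tree", tree), ("random forest", forest)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"val={model.score(X_val, y_val):.3f}")
```

A fully grown single tree typically fits the training rows perfectly, so the gap between its training and validation scores is the thing to watch when comparing it with the forest.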

■ Session note

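ㆍThe most important thing when building a decision tree is the target: impurity is measured on the target, so it changes depending on how the data is split

ㆍRandom Forests train many randomized trees and average (or vote on) their results, which is why the model is called a forest

ㆍBootstrapping : resampling with replacement to build a sample of the same size as the original / it can be used with other models as well

ㆍOne-hot encoding is generally avoided for tree models / choose one-hot or ordinal depending on the purpose of the analysis

ㆍValidation : either use the samples left out of the bootstrap (OOB) or split off a validation data set from the start; both work

ㆍfeature encoding : the model does not know whether the order of the encoded integers is meaningful; it only splits on the numbers it is given

ㆍRegression : RandomForestRegressor / Classification : RandomForestClassifier

ㆍFlipped study : the learning approach Code States aims for

ㆍHyper-parameter tuning / grid search

ㆍEvaluate model performance with the F1 score; see also Accuracy, Recall, Precision

ㆍThe result changes depending on how finely the mapping is tuned during encoding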
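
The validation and F1 notes above can be tried together. The sketch below, again on synthetic data that is only an assumption for illustration, compares the out-of-bag accuracy with scores on a held-out validation set and computes the four metrics with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, oob_score=True,
                               random_state=0, n_jobs=-1)
model.fit(X_train, y_train)

# Accuracy estimated from the rows each tree did not see during training
print(f"OOB accuracy        : {model.oob_score_:.3f}")

# Scores on a validation set that was split off before training
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)  # harmonic mean of precision and recall

print(f"validation accuracy : {accuracy:.3f}")
print(f"precision / recall  : {precision:.3f} / {recall:.3f}")
print(f"F1 score            : {f1:.3f}")
```

The OOB estimate needs no separate split, which helps when data is scarce, while a held-out set remains useful for comparing different model types on the same rows.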

■ Key functions

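ㆍRandom Forest - Ordinal Encoder

%%time
# Imports needed by the pipeline; OrdinalEncoder is assumed to come from category_encoders (see Reference)
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# ordinal encoding -> imputation of missing values -> random forest
pipe_ord = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=10, n_jobs=-1, oob_score=True)
)

# X_train, y_train, X_val, y_val are prepared beforehand (not shown in these notes)
pipe_ord.fit(X_train, y_train)
print('Validation accuracy', pipe_ord.score(X_val, y_val))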

ㆍOrdinal Encoder

# make_pipeline names each step after its lowercased class name
enc = pipe_ord.named_steps['ordinalencoder']
encoded = enc.transform(X_train)
print('Ordinal shape: ', encoded.shape)

■ Reference

ㆍRandom Forests : https://youtu.be/J4Wdy0Wc_xQ

ㆍAccuracy/Recall/Precision/F1 Score : https://eunsukimme.github.io/ml/2019/10/21/Accuracy-Recall-Precision-F1-score/

ㆍcategory_encoders (searching within scikit-learn does not turn up the detailed documentation) : https://contrib.scikit-learn.org/category_encoders/

ㆍRandom Forest : https://inuplace.tistory.com/570

ㆍMachine Learning in general : https://kkkkhd.tistory.com/m/276?category=926797

ㆍOrdinal encoder : https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
