[n221] Decision Trees

■ Key words

ㆍPipelines

ㆍDecision Tree

- Information Gain

- Gini Impurity

ㆍFeature importances

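■ Main Content

ㆍDecision Tree model : an algorithm that classifies samples by splitting on feature values

- Node : a question, or the answer at the end of a branch. Divided into root, internal, and external (leaf, terminal) nodes

- Edge : a line connecting two nodes

- Applicable to both classification and regression

- A new sample is routed to a particular terminal node and is then assigned the most frequent class in that node

- No feature scaling required

- The classification process can be inspected intuitively as a tree structure

- Forms the basis of the ensemble methods Random Forests and Gradient Boosted Trees.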

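ㆍDecision tree learning algorithm : define a cost function (Gini impurity or entropy) and choose splits that minimize it

ㆍImpurity : the degree to which several classes are mixed together. The decrease in entropy (impurity) produced by a split is called information gain.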
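
The impurity measures above can be sketched in a few lines of plain Python. The function names (`gini`, `entropy`, `information_gain`) and the toy labels are illustrative assumptions, not part of the original notes.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A perfectly mixed parent node and one candidate split of it.
parent = ['yes'] * 5 + ['no'] * 5
left = ['yes'] * 4 + ['no'] * 1
right = ['yes'] * 1 + ['no'] * 4

print('gini(parent)    :', gini(parent))
print('entropy(parent) :', entropy(parent))
print('information gain:', information_gain(parent, left, right))
```

A split that separates the classes more cleanly leaves purer children and therefore a larger information gain.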

ㆍFeature importance : the counterpart of the regression coefficients in Linear Regression. It is never negative,

and it reflects how early and how often a feature is used for splitting.
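
As a rough illustration of the point about feature importances, the sketch below fits a tree on synthetic data in which only the first feature determines the label. The data and variable names are assumptions made for this example; `feature_importances_` is the scikit-learn attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = tree.feature_importances_

# Unlike regression coefficients, importances carry no sign:
# they are non-negative and normalized to sum to 1.
for index, value in enumerate(importances):
    print(f'feature {index}: {value:.3f}')
```

Because importances have no sign, they show which features the tree relies on but not the direction of the effect.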

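ㆍUnlike linear models, decision tree models are well suited to analyzing data that is

nonlinear, non-monotonic, or shaped by feature interactions

ㆍFeature interactions : when features interact, the individual coefficients of a regression become hard to interpret,

whereas a Decision Tree Model can keep the same analytical power even when interactions are present.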
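
A small sketch of the interaction point, using an XOR pattern as an assumed toy example (the label depends only on the combination of the two features, not on either one alone). It compares a plain logistic regression with a depth-2 tree.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# XOR: the label is 1 only when exactly one of the two features is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

linear_acc = linear.score(X, y)
tree_acc = tree.score(X, y)
print('logistic regression accuracy:', linear_acc)
print('decision tree accuracy      :', tree_acc)
```

No single straight line separates the XOR classes, so the linear model cannot fit them without an explicit interaction term, while two levels of splits are enough for the tree.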

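■ Notes

ㆍdepth : look up the parameter in the official documentation

ㆍImpurity does not mean that an observation is good or bad

ㆍWhy scaling is unnecessary : splits are chosen by impurity, not by the magnitude of the feature values

ㆍimputer : fill with the mean, the most frequent value, a nearby value, and so on

- Need to think about how to handle missing values and outliers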
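
A minimal sketch of two of the filling strategies mentioned above, using scikit-learn's `SimpleImputer` on a made-up array (the data is an assumption for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 10.0],
              [3.0, np.nan],
              [5.0, 30.0]])

# strategy='mean' replaces each NaN with the column mean.
mean_filled = SimpleImputer(strategy='mean').fit_transform(X)

# strategy='most_frequent' replaces each NaN with the column mode.
mode_filled = SimpleImputer(strategy='most_frequent').fit_transform(X)

print(mean_filled)
print(mode_filled)
```

Which strategy is appropriate depends on the feature, so it is worth comparing validation scores rather than assuming one is better.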

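ㆍClassification : precision and recall (sensitivity) cannot both be maximized at once. You have to decide which of α and β (Type I and Type II error) to weight more heavily.

- Examples : cancer diagnosis, spam mail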
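
The trade-off can be sketched by moving the decision threshold on a set of predicted scores. The labels, scores, and the helper name `precision_recall` below are made up for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

strict = precision_recall(y_true, scores, threshold=0.75)   # few false positives
lenient = precision_recall(y_true, scores, threshold=0.35)  # few false negatives
print('strict  (precision, recall):', strict)
print('lenient (precision, recall):', lenient)
```

A cancer screen would usually lean toward the lenient threshold (missing a patient is costly), while a spam filter would lean toward the strict one (blocking a real mail is costly).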

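ㆍkaggle : read the overview carefully first

- leaderboard : the score table / upload a file under submit prediction to see your score

- If missing values are dropped, no score is returned

ㆍWhy the cardinality threshold was set to 30 : just try it, then compare the results

ㆍBalancing overfitting against generalization is the same kind of question as deciding which of α and β to let grow

ㆍLook up how a decision tree handles missing values

ㆍWhy a decision tree does not need scaling

- Think about the purpose of scaling => what matters is how well a split separates the information (information gain / impurity)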
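
One way to check the scaling note is to fit the same tree on raw and on rescaled features and compare predictions. The synthetic data and the per-column scale factors below are assumptions for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)
X_train, X_test = X[:200], X[200:]
y_train = y[:200]

# Rescale each column by a different positive factor (e.g. a change of units).
scale = np.array([1.0, 1000.0, 10.0, 100.0])

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_raw.fit(X_train, y_train)

tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_scaled.fit(X_train * scale, y_train)

# Splits depend on the ordering of values within a feature, which
# positive rescaling preserves, so the predictions should agree.
same = np.array_equal(tree_raw.predict(X_test),
                      tree_scaled.predict(X_test * scale))
print('identical predictions:', same)
```

The thresholds stored in the two trees differ by the scale factors, but the samples end up in the same leaves.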

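ㆍWith one hot encoding, a missing value can end up as 0 in every encoded column (this depends on the encoder and its settings).

- Topic to review : how one hot encoding handles missing values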
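
The sketch below illustrates this with `pandas.get_dummies`, which is used here as a stand-in and is not the `category_encoders` encoder from the pipelines in these notes; other encoders may treat missing values differently. The `color` column is a made-up example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red']})

# Default behavior: the row with NaN gets 0 in every dummy column.
encoded = pd.get_dummies(df, columns=['color'])
print(encoded)

# dummy_na=True adds an explicit indicator column for missing values.
with_na = pd.get_dummies(df, columns=['color'], dummy_na=True)
print(with_na)
```

An all-zero row is indistinguishable from "none of the known categories", so an explicit missing-value indicator can be worth trying.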

■ Key Functions

ㆍDecisionTreeClassifier()

- min_samples_split

- min_samples_leaf

- max_depth

ㆍtrain[target].value_counts(normalize=True) : check the class proportions of the target. The most frequent class can be used as the baseline model
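
A minimal sketch of the majority-class baseline, using made-up target series (the names `y_train` and `y_val` and their values are assumptions for illustration).

```python
import pandas as pd

y_train = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_val = pd.Series([0, 1, 0, 0, 1])

# Class proportions in the training target.
proportions = y_train.value_counts(normalize=True)
majority_class = proportions.idxmax()

# Baseline: always predict the majority class seen in training.
baseline_accuracy = (y_val == majority_class).mean()
print('majority class   :', majority_class)
print('baseline accuracy:', baseline_accuracy)
```

Any trained model should beat this number on the validation set before it is considered useful.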

ㆍtrain.select_dtypes('float').head(20).T : inspect the float-typed columns

ㆍtrain.T.duplicated() : check for duplicated columns. Also reported in the profile

ㆍtrain.describe(exclude='number').T.sort_values(by='unique') : check cardinality in the unique column

ㆍFeature Engineering

import numpy as np

def engineer(df):
    # Remove features with high cardinality.
    selected_cols = df.select_dtypes(include=['number', 'object'])
    labels = selected_cols.nunique()  # cardinality of each feature
    selected_features = labels[labels <= 30].index.tolist()  # keep features with cardinality of 30 or less
    df = df[selected_features].copy()  # copy so the new column is not written to a view

    # Create a new feature: the sum of all 'behavioral' columns.
    behaviorals = [col for col in df.columns if 'behavioral' in col]
    df['behaviorals'] = df[behaviorals].sum(axis=1)

    # Drop columns whose names contain 'employment' or 'seas'.
    dels = [col for col in df.columns if ('employment' in col or 'seas' in col)]
    df = df.drop(columns=dels)
    return df

train = engineer(train)
val = engineer(val)
test = engineer(test)

ㆍScikit-learn Pipelines

from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Encode -> impute -> scale -> fit, chained into a single estimator.
pipe = make_pipeline(
    OneHotEncoder(),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression(n_jobs=-1)
)
pipe.fit(X_train, y_train)

print('Validation accuracy', pipe.score(X_val, y_val))

y_pred = pipe.predict(X_test)

ㆍpipe.named_steps : a dictionary-like object that gives access to each step of the pipeline by name.

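ㆍInspecting the features with a plot

import pandas as pd
import matplotlib.pyplot as plt

model_lr = pipe.named_steps['logisticregression']
enc = pipe.named_steps['onehotencoder']
encoded_columns = enc.transform(X_val).columns  # column names after one hot encoding

# Pair each coefficient with its encoded column name and plot them sorted.
coefficients = pd.Series(model_lr.coef_[0], encoded_columns)
plt.figure(figsize=(10,30))
coefficients.sort_values().plot.barh();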

ㆍDecision tree

from sklearn.tree import DecisionTreeClassifier

# No scaler in this pipeline: the tree does not need scaled features.
pipe = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=1, criterion='entropy')
)
pipe.fit(X_train, y_train)

print('Training accuracy: ', pipe.score(X_train, y_train))
print('Validation accuracy: ', pipe.score(X_val, y_val))

ㆍy_val.value_counts(normalize=True) : check the class proportions of the validation target. Use the most frequent class as the baseline model.

ㆍVisualizing a Decision Tree - installing graphviz: conda install -c conda-forge python-graphviz

import graphviz
from IPython.display import display
from sklearn.tree import export_graphviz

model_dt = pipe.named_steps['decisiontreeclassifier']
enc = pipe.named_steps['onehotencoder']
encoded_columns = enc.transform(X_val).columns

dot_data = export_graphviz(
    model_dt,
    max_depth=3,  # draw only the top levels of the tree
    feature_names=encoded_columns,
    class_names=['no', 'yes'],
    filled=True,
    proportion=True,
)
display(graphviz.Source(dot_data))

ㆍParameters for reducing the complexity (overfitting) of a Decision Tree : passed as arguments to DecisionTreeClassifier()

- min_samples_split

- min_samples_leaf

- max_depth
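
To see what these three parameters do, the sketch below compares an unconstrained tree with a constrained one on noisy synthetic data. The data and the specific parameter values are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
# Noisy labels: a fully grown tree can memorize the noise in the training set.
y = ((X[:, 0] + rng.normal(scale=1.0, size=400)) > 0).astype(int)
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

full = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
constrained = DecisionTreeClassifier(
    max_depth=3,           # at most three levels of splits
    min_samples_split=20,  # a node needs at least 20 samples to be split
    min_samples_leaf=10,   # every leaf keeps at least 10 samples
    random_state=1,
).fit(X_train, y_train)

for name, model in [('unconstrained', full), ('constrained', constrained)]:
    print(name,
          '| depth:', model.get_depth(),
          '| leaves:', model.get_n_leaves(),
          '| train acc:', round(model.score(X_train, y_train), 3),
          '| val acc:', round(model.score(X_val, y_val), 3))
```

A large gap between training and validation accuracy for the unconstrained tree is the usual sign of overfitting; the constraints trade some training accuracy for a simpler tree.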

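ㆍfrom ipywidgets import interact : a widget lets you adjust a parameter value and watch how the result changes.

import matplotlib.pyplot as plt
from ipywidgets import interact
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# thurber, X_thurber, y_thurber, and show_tree are assumed to be defined
# earlier in the notebook; they are not shown in these notes.
def thurber_tree(max_depth=1):
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X_thurber, y_thurber)
    print('R2: ', tree.score(X_thurber, y_thurber))
    ax = thurber.plot('mobility', 'density', kind='scatter', title='Thurber')
    ax.step(X_thurber, tree.predict(X_thurber), where='mid')
    plt.show()
    display(show_tree(tree, colnames=['mobility']))

# Slider for max_depth from 1 to 6 in steps of 1.
interact(thurber_tree, max_depth=(1,6,1));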

■ Reference

ㆍCardinality : https://itholic.github.io/database-cardinality/

ㆍGini impurity : https://mangastorytelling.tistory.com/entry/%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%82%AC%EC%9D%B4%EC%96%B8%EC%8A%A4-%EC%8A%A4%EC%BF%A8-math-101-%EC%97%94%ED%8A%B8%EB%A1%9C%ED%94%BC

ㆍDecision trees : https://www.youtube.com/watch?v=7VeUPuFGJHk

ㆍF1 score : https://kkn1220.tistory.com/138

ㆍto_csv (saving to a csv file) : https://computer-nerd.tistory.com/49

ㆍimputer (filling in missing values) : https://m.blog.naver.com/bosongmoon/221800174155

ㆍScikit-learn Pipeline : https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

https://scikit-learn.org/stable/modules/compose.html#accessing-steps

ㆍDecision Tree model : http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

