[n234]Interpreting ML Model

■ Key words

ㆍPartial Dependence Plot (PDP)

ㆍExplaining individual predictions with SHAP (SHapley Additive exPlanations) value plots

■ Main content

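ㆍPartial Dependence Plot (PDP): shows how a specific feature affects the model's prediction of the target

- Complex models: hard to interpret, but perform well

- Simple models: easy to interpret, but perform less well

⇒ A PDP shows how the predicted target changes as a feature's value changes, even when the model itself is complex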

ㆍICE (Individual Conditional Expectation) curves:

Line plots, one per instance, showing how that instance's prediction changes as a feature's value changes

⇒ A PDP is the average of the ICE curves

Animation of how a PDP is computed (Christoph Molnar on Twitter): https://twitter.com/i/status/1066398522608635904

“Partial dependence plots show how a feature affects predictions of a #MachineLearning model on average.”
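
The statement that a PDP is the average of the ICE curves can be sketched by hand with NumPy. This is only an illustration: `toy_predict`, the three-row dataset, and the grid are made-up stand-ins for a fitted model and real data, and `ice_curves` is a hypothetical helper, not a library function.

```python
import numpy as np

def ice_curves(predict, X, feature_idx, grid):
    """Return an (n_rows, n_grid) array holding one ICE curve per row of X."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # force the feature to the grid value in every row
        curves[:, j] = predict(X_mod)  # all other features keep their observed values
    return curves

def toy_predict(X):
    """Hypothetical stand-in for model.predict: x0 interacts with x1."""
    return 2.0 * X[:, 0] + X[:, 0] * X[:, 1]

X = np.array([[0.0, -1.0], [1.0, 0.0], [2.0, 1.0]])
grid = np.array([0.0, 1.0, 2.0])

ice = ice_curves(toy_predict, X, feature_idx=0, grid=grid)
pdp_curve = ice.mean(axis=0)  # the PDP is the pointwise average of the ICE curves

print(ice)        # three lines with slopes 1, 2 and 3
print(pdp_curve)  # a single line with slope 2
```

Because of the interaction term, the individual curves have different slopes while the PDP shows only their average, which is why ICE curves are worth plotting alongside the PDP.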

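ㆍShapley value: a method from game theory for computing each feature's contribution (feature attribution) to a prediction.

Unlike other indicators, it lets us examine not only each feature's contribution but also the interactions between features.

The exact computation grows exponentially with the number of features, so the values are approximated by sampling.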
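
Both the exact calculation and the sampling approximation can be sketched in a few lines of standard-library Python. The payoff function `value`, the feature names, and the numbers are hypothetical and only illustrate the idea; this is not how the `shap` library is implemented internally.

```python
import itertools
import random

def value(coalition):
    """Hypothetical payoff of a set of features (stand-in for a model's output)."""
    base = {"age": 10.0, "income": 20.0, "city": 5.0}
    total = sum(base[f] for f in coalition)
    if "age" in coalition and "income" in coalition:
        total += 6.0  # interaction bonus earned only when both are present
    return total

def marginal_contributions(order):
    """Contribution of each feature when features join in the given order."""
    contrib, seen = {}, []
    for f in order:
        before = value(seen)
        seen.append(f)
        contrib[f] = value(seen) - before
    return contrib

def shapley_exact(features):
    """Average the marginal contributions over all n! orderings."""
    totals = dict.fromkeys(features, 0.0)
    orders = list(itertools.permutations(features))
    for order in orders:
        for f, c in marginal_contributions(order).items():
            totals[f] += c
    return {f: t / len(orders) for f, t in totals.items()}

def shapley_sampled(features, n_samples, seed=0):
    """Approximate the same average from randomly sampled orderings."""
    rng = random.Random(seed)
    totals = dict.fromkeys(features, 0.0)
    for _ in range(n_samples):
        order = list(features)
        rng.shuffle(order)
        for f, c in marginal_contributions(order).items():
            totals[f] += c
    return {f: t / n_samples for f, t in totals.items()}

features = ["age", "income", "city"]
exact = shapley_exact(features)
approx = shapley_sampled(features, n_samples=200)
print(exact)   # the interaction bonus of 6 is split evenly between age and income
print(approx)  # close to the exact values, from 200 sampled orderings
```

With 3 features there are only 6 orderings, but the count grows as n!, which is why sampling is used in practice. In both versions the contributions add up to the payoff of the full feature set.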

■ Session note

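ㆍRun the lecture note code alongside the lecture to follow along
ㆍBefore removing a feature that shows low correlation:

1. Check that the model is set up correctly

2. Check whether the feature contains meaningful data
ㆍPDPbox and SHAP values inspect how features behave after the model has been trained.

Before training the model, check in advance how the variables relate to each other and which are independent variables and which is the dependent variable.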

■ Key functions

ㆍInstalling PDPbox and SHAP (Colab)

 !pip install pdpbox # install PDPbox (pdp_isolate/pdp_plot are the 0.2.x API; newer releases may differ)
 from pdpbox.pdp import pdp_isolate, pdp_plot # import the PDP functions
 
 !pip install shap # install SHAP
 import shap # import SHAP
 shap.initjs() # run in every cell that draws an interactive SHAP plot so that it renders
 # Note: "reg:linear is now deprecated in favor of reg:squarederror" is an XGBoost objective warning, unrelated to initjs()

ㆍPDP

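import matplotlib.pyplot as plt
from pdpbox import pdp

feature = 'column_name'  # name of the feature to inspect
features = X_encoded.columns.tolist()  # assumed: every column the model was trained on
# rf: fitted model, X_encoded: encoded feature DataFrame
pdp_dist = pdp.pdp_isolate(model=rf, dataset=X_encoded, model_features=features, feature=feature)
pdp.pdp_plot(pdp_dist, feature);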
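
As a cross-check that does not depend on the PDPbox version, the same quantities can be obtained from scikit-learn's `partial_dependence`. The following is a minimal sketch on synthetic data; the data, the small random forest, and the grid resolution are assumptions chosen for illustration, not the dataset used in the lecture.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic data (hypothetical): the target depends mostly on feature 0
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 5.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# kind="both" returns the averaged curve (PDP) and the per-row curves (ICE)
result = partial_dependence(rf, X, features=[0], kind="both",
                            method="brute", grid_resolution=20)
average = result["average"]        # shape: (1, n_grid_points)
individual = result["individual"]  # shape: (1, n_rows, n_grid_points)

# Averaging the ICE curves over the rows reproduces the PDP
print(np.allclose(individual.mean(axis=1), average))
```

For plotting, recent scikit-learn versions also provide `PartialDependenceDisplay.from_estimator(rf, X, [0], kind="both")`, which draws the ICE curves and their average on one axis.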

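ㆍPDP: mapping category codes back to category names automatically

feature = 'column_name'
# encoder: assumed to be a fitted category_encoders OrdinalEncoder; find the mapping for this column
for item in encoder.mapping:
    if item['col'] == feature:
        feature_mapping = item['mapping'] # Series: category name -> integer code

feature_mapping = feature_mapping[feature_mapping.index.dropna()] # drop the NaN category
category_names = feature_mapping.index.tolist()
category_codes = feature_mapping.values.tolist()

pdp.pdp_plot(pdp_dist, feature)

# relabel the x-axis ticks from integer codes to category names
plt.xticks(category_codes, category_names)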

ㆍ2D PDP

from pdpbox.pdp import pdp_interact, pdp_interact_plot

features = ['column1', 'column2']

interaction = pdp_interact(
    model=rf, 
    dataset=X_encoded, 
    model_features=X_encoded.columns, 
    features=features
)

pdp_interact_plot(interaction, plot_type='grid', feature_names=features);

# ----------------------------------------------------------------------

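# Convert the 2D PDP to a DataFrame so it can be drawn as a Seaborn heatmap
import seaborn as sns

# named pdp_table so that the pdpbox `pdp` module is not shadowed
pdp_table = interaction.pdp.pivot_table(
    values='preds', 
    columns=features[0], 
    index=features[1]
)[::-1]

# assumes features[0] is the encoded categorical column
pdp_table = pdp_table.rename(columns=dict(zip(category_codes, category_names)))
plt.figure(figsize=(6,5))
sns.heatmap(pdp_table, annot=True, fmt='.2f', cmap='viridis')
plt.title('PDP decoded categorical');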

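ㆍSHAP value plots

shap.initjs()
explainer = shap.TreeExplainer(model)  # assumed: `model` is the fitted tree-based regressor
shap_values = explainer.shap_values(X_test.iloc[:100])
shap.force_plot(explainer.expected_value, shap_values, X_test.iloc[:100])
# shows the positive and negative contribution of every feature to each prediction

shap.initjs()
shap.summary_plot(shap_values, X_test.iloc[:100], plot_type="bar")
# bar chart of the mean absolute SHAP value per feature (magnitude only, not sign)

shap.initjs()
shap_values = explainer.shap_values(X_test.iloc[:100])
shap.summary_plot(shap_values, X_test.iloc[:100])
# scatter (beeswarm) plot showing the distribution of each feature's contributions

shap.initjs()
shap.summary_plot(shap_values, X_test.iloc[:100], plot_type="violin")
# violin plot showing the distribution of each feature's contributions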
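
What these plots summarize can be sketched with plain NumPy. The array below is a made-up SHAP output for 4 rows and 3 features, not the result of a real explainer; it only illustrates the additivity of SHAP values and the quantity the bar summary plot ranks by.

```python
import numpy as np

# Hypothetical SHAP output for 4 rows and 3 features (not from a real model)
feature_names = ["f0", "f1", "f2"]
expected_value = 10.0
shap_values = np.array([
    [ 2.0, -1.0, 0.0],
    [-2.0,  0.5, 0.5],
    [ 1.0, -0.5, 0.0],
    [-1.0,  1.0, 0.5],
])

# Additivity: base value + sum of a row's SHAP values = that row's model output
predictions = expected_value + shap_values.sum(axis=1)

# The bar summary plot ranks features by mean absolute SHAP value
mean_abs = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]

# The signed mean can be zero even for the most influential feature
signed_mean = shap_values.mean(axis=0)

print(predictions)
print(dict(zip(feature_names, mean_abs)), ranking)
print(signed_mean)
```

Here `f0` has the largest mean absolute contribution while its signed mean is zero, which is why the direction of an effect is read from the force plot or the beeswarm plot and not from the bar chart.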

■ Reference

ㆍPartial Dependence Plot(PDP) : https://youtu.be/21QAKe2PDkk

ㆍPartial Dependence Plot(PDP / Kaggle) : https://www.kaggle.com/dansbecker/partial-plots

ㆍPDP box(github) : https://github.com/SauceCat/PDPbox

ㆍICE Curve : https://eair.tistory.com/21

ㆍShapley Values : https://datanetworkanalysis.github.io/2019/12/23/shap1

ㆍSHAP Values(Kaggle) : https://www.kaggle.com/dansbecker/shap-values

ㆍSHAP Values(github) : https://github.com/slundberg/shap

