[n111]Data Preprocess & EDA

■ Key words

ㆍExploratory Data Analysis(EDA)

ㆍPre-Processing : the process of cleaning and shaping data so that it can be used.

■ Data Preprocessing

ㆍCleaning : removing noise or correcting inconsistencies to improve the accuracy, completeness, consistency, and reliability of the data

ㆍMissing Value : delete rows with missing values (Ignore the tuple) / fill in by hand (Manual Fill) / fill with a global constant such as "Unknown" / Imputation (All mean, Class mean, Inference mean, Regression, etc.)

ㆍNoisy Data : data that contains random error or variance. It can be detected and removed using descriptive statistics or visualization.

ㆍetc. : Binning / Regression / Outlier Analysis

ㆍIntegration : combining several data sources into one

ㆍTransformation(Scaling) : changing the form of the data, e.g. normalization.

ㆍReduction / Sampling : e.g. dimension reduction
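
The missing-value strategies and the normalization step listed above can be sketched roughly as follows. The toy data and the column names `group` and `score` are hypothetical, and min-max scaling is only one of several possible normalizations.

```python
import pandas as pd

# Hypothetical toy data: 'group' is a class label, 'score' has missing values.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, None, 4.0, 6.0, None],
})

# Ignore the tuple: drop every row that contains a missing value.
dropped = df.dropna()

# Global constant: fill every gap with one fixed value.
constant = df["score"].fillna(-1)

# All mean: fill gaps with the mean of the whole column.
all_mean = df["score"].fillna(df["score"].mean())

# Class mean: fill gaps with the mean of the row's own group.
group_means = df.groupby("group")["score"].transform("mean")
class_mean = df["score"].fillna(group_means)

# Min-max normalization: rescale the filled column to the range [0, 1].
normalized = (class_mean - class_mean.min()) / (class_mean.max() - class_mean.min())

print(class_mean.tolist())
print(normalized.tolist())
```

Class-mean imputation keeps each group's own level, whereas the all-mean version pulls every filled value toward the overall average.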

■ Key functions

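ㆍimport pandas as pd : load the pandas library

ㆍpd.read_csv('path', thousands = ',', index_col = 'column position or name', names = ['column names'])

- Reads values separated by ',' (comma-separated values). Depending on the file format, read_excel and similar functions are used instead.

- If thousands are separated by ',', the value is read as a string rather than an int or a float. Passing thousands=',' saves the effort of converting it to a number afterwards (with str.replace, astype, pd.to_numeric, etc.) before doing arithmetic.

- If the data has an index column, specifying it with index_col makes the output of print or .head() easier to read.

- Specifying column names with names likewise improves the readability of the table.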
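
A small sketch of how these read_csv parameters behave. The CSV text and its column names are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV text; the quoted numbers use ',' as a thousands separator.
raw = io.StringIO(
    'item,price,stock\n'
    'apple,"1,200","10,000"\n'
    'pear,"2,500","3,400"\n'
)

# Without thousands=',' the numeric columns are read as text, not numbers.
as_text = pd.read_csv(raw)

raw.seek(0)
# With thousands=',' they are parsed as integers; index_col sets the row labels.
as_number = pd.read_csv(raw, thousands=",", index_col="item")

raw.seek(0)
# names replaces the column labels; header=0 discards the file's own header row.
renamed = pd.read_csv(
    raw,
    thousands=",",
    header=0,
    names=["name", "unit_price", "quantity"],
    index_col="name",
)

print(as_text.dtypes)
print(as_number.dtypes)
print(renamed)
```

When the file already has a header row, passing names without header=0 would treat that row as data, so the two are usually given together.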

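ㆍ.head(n) : n is the number of rows to display

- Shows the first n rows from the top. Its counterpart is .tail(n), which shows the last n rows.

- .to_string() prints the whole frame, but the rows and columns may not line up neatly, so use it only when needed.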
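
A brief illustration of these three methods on a hypothetical ten-row frame:

```python
import pandas as pd

# Hypothetical frame with ten rows.
df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9
default_head = df.head()   # n defaults to 5 when omitted

# to_string() renders every row and column as one plain-text string.
text = df.to_string()
print(text)
```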

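ㆍdf.plot() : draws a line plot of the given data frame. In a notebook, ending the line with ';' suppresses the <matplotlib...> object text that is otherwise printed above the plot

ㆍdf.plot.hist() : draws a histogram of the given data frame

ㆍdf.plot.scatter() : draws a scatter plot of the given data frame (requires the x and y column names)

ㆍdf.plot.bar() : draws a bar chart of the given data frame (vertical)

ㆍdf.plot.barh() : draws a bar chart of the given data frame (horizontal)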
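
The plotting calls above might be used as in the sketch below. It assumes matplotlib is installed (pandas uses it as the default plotting backend), and the `height` and `weight` columns are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import pandas as pd

# Hypothetical measurements for four people.
df = pd.DataFrame(
    {"height": [150, 160, 170, 180], "weight": [50, 58, 66, 80]},
    index=["a", "b", "c", "d"],
)

ax_line = df.plot()                                   # one line per column
ax_hist = df.plot.hist(bins=4)                        # histogram of each column
ax_scatter = df.plot.scatter(x="height", y="weight")  # x and y must be given
ax_bar = df.plot.bar()                                # vertical bars
ax_barh = df.plot.barh()                              # horizontal bars
```

Each call returns a matplotlib Axes object, which is why a notebook prints its text form unless the line ends with ';'.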

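ㆍ.shape : check the dimensions of the data as (rows, columns)

ㆍpd.DataFrame(df.isnull().sum(), columns=['missing count']) : build a data frame that shows the number of missing values per column

ㆍdf.fillna(0, inplace=True) : replace missing values with 0

ㆍdf.CL[df['CL'] > 0].value_counts().sum() : count the values in df's 'CL' column that are greater than 0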
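
These four steps can be chained as in the sketch below. The values are made up, and the `AL` column is a hypothetical second column added only so the missing-value table has more than one row.

```python
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "CL": [0.0, 3.0, None, 5.0, 3.0],
    "AL": [1.0, None, None, 2.0, 4.0],
})

print(df.shape)  # (number of rows, number of columns)

# One row per column, holding the number of missing values in it.
missing = pd.DataFrame(df.isnull().sum(), columns=["missing count"])
print(missing)

# Replace every missing value with 0, modifying df in place.
df.fillna(0, inplace=True)

# Count the 'CL' values greater than 0, written as in the note above.
count_a = df.CL[df["CL"] > 0].value_counts().sum()

# A shorter equivalent: True counts as 1 when summing a boolean mask.
count_b = int((df["CL"] > 0).sum())
print(count_a, count_b)
```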

ㆍdf.to_csv("./df.csv", index=True) : export df to a csv file

ㆍsns.load_dataset('penguins') : load the penguins dataset from the example datasets of the Python seaborn library

ㆍpd.crosstab(index=df["n"], columns=df["m"]) : cross-tabulate the data in columns n and m
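
A sketch of crosstab followed by a CSV round trip. The small frame below is a hypothetical stand-in for a categorical dataset such as penguins; sns.load_dataset is not called here because, as far as I know, it downloads the data on first use.

```python
import io
import pandas as pd

# Hypothetical stand-in for a dataset with two categorical columns.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "island": ["Dream", "Biscoe", "Biscoe", "Biscoe", "Biscoe"],
})

# Count how often each (species, island) pair occurs.
table = pd.crosstab(index=df["species"], columns=df["island"])
print(table)

# to_csv returns the CSV text when no path is given; index=True keeps the row labels.
csv_text = table.to_csv(index=True)

# Reading it back with index_col restores the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col="species")
print(restored)
```

Pairs that never occur, such as Gentoo on Dream here, appear in the table as 0 rather than being left out.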
