[Book] 4. Unsupervised Learning and Data Preprocessing - (2)

Book Title : Introduction to Machine Learning with Python

- Machine Learning with Python Libraries (파이썬 라이브러리를 활용한 머신러닝) -

Authors : Andreas Müller, Sarah Guido

Translator : Park Haesun

Publisher : Hanbit Media

Code source

https://github.com/rickiepark/introduction_to_ml_with_python


4.5 Interactions and Polynomials

One way to enrich a feature representation is to add interaction features and polynomial features derived from the original data.
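The idea can be sketched with scikit-learn's PolynomialFeatures. This is a minimal illustration rather than the book's own example: the three-sample array and the feature names a and b are made up for the demonstration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical input features (a, b) for three samples.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# degree=2 adds the squares and the pairwise product (the interaction term).
# include_bias=False leaves out the constant column of ones.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly.shape)  # (3, 5): columns are a, b, a^2, a*b, b^2
print(X_poly[0])     # [1. 2. 1. 2. 4.]
```

Two input features become five, which is why feature counts grow quickly once interactions and higher degrees are added.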

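4.7 Automatic Feature Selection

There are many ways to create new features, so the feature count can grow well beyond the number of original features.
However, more features make the model more complex and increase the risk of overfitting.
So when working with a high-dimensional dataset, it is better to keep only the meaningful features and ignore the rest.
There are three ways to find out which features are good: univariate statistics, model-based selection, and iterative selection.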

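4.7.1 Univariate Statistics

Compute whether there is a statistically significant relationship between each individual feature and the target.
The key point is that each feature is evaluated independently, which is what makes the method univariate.
To select features by univariate analysis in scikit-learn, you pick a test, usually f_classif (the default) for classification or f_regression for regression, and a method that discards features based on the values the test computes.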
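To see what the test computes before it is wrapped in a selector, f_classif can be called directly. The sketch below uses made-up data (one column that shifts with the class, one column of pure noise), so the names and numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
y = np.array([0] * 50 + [1] * 50)

# Hypothetical data: column 0 shifts with the class, column 1 is pure noise.
informative = rng.normal(loc=y * 2.0, scale=1.0)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])

# f_classif runs a one-way ANOVA F-test on each column separately,
# without looking at any of the other columns.
f_scores, p_values = f_classif(X, y)
print("F-scores:", f_scores)
print("p-values:", p_values)
```

The informative column should get a much larger F-score and a much smaller p-value than the noise column. Selectors such as SelectPercentile rank features by these scores.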

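import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile, f_classif

cancer = load_breast_cancer()

# Seeded random number generator for reproducible noise
rng = np.random.RandomState(42)

# Data + noise: the first 30 columns are the original features, the next 50 are noise
noise = rng.normal(size = (len(cancer.data), 50))

X_w_noise = np.hstack([cancer.data, noise])

# An integer test_size is a sample count: 5 of the 569 samples are held out,
# leaving 564 for training (a 50/50 split would need test_size = 0.5)
X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state = 0, test_size = 5
)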

# Use f_classif and SelectPercentile to keep the top 50% of features
select = SelectPercentile(score_func = f_classif, percentile = 50)

# Fit on the training set, then apply the selection to it
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)

print("X_train.shape", X_train.shape)
print("X_train_selected.shape", X_train_selected.shape)

# output
# X_train.shape (564, 80)
# X_train_selected.shape (564, 40)

The number of features drops from 80 to 40.

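# get_support() returns a boolean mask: True for each selected feature
mask = select.get_support()
print(mask)

# Draw the mask as a single row: black = selected, white = dropped
plt.matshow(mask.reshape(1, -1), cmap = 'gray_r')

plt.xlabel("feature num")
plt.yticks([0])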

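Most of the selected features are original ones, but the recovery is not perfect.
Selecting a subset of features does not automatically improve performance, though. Sometimes training on the full, unselected feature set works better.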
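One way to check whether selection helps is to train the same model on both feature sets and compare test scores. The sketch below rebuilds the noisy dataset so that it runs on its own. As an assumption that differs from the code above, it uses test_size=0.5 so that the test score rests on a few hundred samples instead of five. LogisticRegression is chosen here only as a simple reference model.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

# Assumption: a 50/50 split, so the test score is based on many samples.
X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=0.5
)

# Fit the selector on the training set only, then apply it to both sets.
select = SelectPercentile(score_func=f_classif, percentile=50)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

lr = LogisticRegression(max_iter=5000)
score_all = lr.fit(X_train, y_train).score(X_test, y_test)
score_selected = lr.fit(X_train_selected, y_train).score(X_test_selected, y_test)

print("Score with all 80 features: {:.3f}".format(score_all))
print("Score with 40 selected features: {:.3f}".format(score_selected))
```

Whichever score comes out higher, the comparison has to be made on held-out data, and the selector must never see the test set during fit.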

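4.7.2 Model-Based Feature Selection

Use a machine learning model to judge the importance of each feature and keep only the most important ones.
The supervised model used for feature selection does not have to be the same as the supervised model used in the end.
Model-based feature selection is implemented in SelectFromModel.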

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

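# threshold = 'median' keeps the features whose importance is at or above
# the median importance, i.e. roughly half of them
select = SelectFromModel(
    RandomForestClassifier(n_estimators = 100, random_state = 42),
    threshold = 'median'
)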

select.fit(X_train, y_train)
X_train_l1 = select.transform(X_train)
print('X_train.shape', X_train.shape)
print('X_train_l1.shape', X_train_l1.shape)
# X_train.shape (564, 80)
# X_train_l1.shape (564, 40)


mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap = 'gray_r')
plt.xlabel('feature num')
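The point that the selection model and the final model can differ can be sketched with a pipeline: a random forest picks the features and a logistic regression is trained on them. This is an illustrative setup, not the book's code. The 50/50 split and the choice of LogisticRegression as the final model are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

# Assumption: a 50/50 split for a more stable test score.
X_train, X_test, y_train, y_test = train_test_split(
    X_w_noise, cancer.target, random_state=0, test_size=0.5
)

# Step 1 selects features with a random forest.
# Step 2 trains a different model on the selected features.
pipe = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median",
    ),
    LogisticRegression(max_iter=5000),
)
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
n_selected = pipe.named_steps["selectfrommodel"].get_support().sum()
test_score = pipe.score(X_test, y_test)
print("Selected features:", n_selected)
print("Test score: {:.3f}".format(test_score))
```

Putting the selector inside the pipeline also guarantees that it is fitted on training data only when the pipeline is used with cross-validation.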

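4.7.3 Iterative Feature Selection

Iterative feature selection builds a series of models, each using a different number of features.
Because a whole series of models has to be trained, the computational cost is high.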

Method 1

Start with no features selected and add them one at a time until some stopping condition is reached.
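Method 1 is not demonstrated in these notes. One way to try it is scikit-learn's SequentialFeatureSelector with direction="forward", assuming scikit-learn 0.24 or newer. To keep the sketch fast it uses only the first 10 columns of the cancer data and stops at 3 features. Both numbers, and the scaled logistic regression used as the estimator, are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
# Keep only the first 10 features so the sketch runs quickly.
X, y = cancer.data[:, :10], cancer.target

estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# direction="forward": start from an empty set and greedily add the feature
# that gives the best cross-validated score, until 3 features are chosen.
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="forward", cv=3
)
sfs.fit(X, y)

mask = sfs.get_support()
X_selected = sfs.transform(X)
print("Selected features:", list(cancer.feature_names[:10][mask]))
```

Each added feature requires cross-validating one candidate model per remaining feature, which shows where the high cost of iterative selection comes from.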

Method 2

Start with all features and remove them one at a time until some stopping condition is reached.
Recursive feature elimination (RFE) is one such method.

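from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Repeatedly drop the least important feature until 40 remain
select = RFE(RandomForestClassifier(n_estimators = 100, random_state = 42),
             n_features_to_select = 40)
select.fit(X_train, y_train)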

mask = select.get_support()
plt.matshow(mask.reshape(1, -1), cmap = 'gray_r')
plt.xlabel('feature num')

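# Apply the same selection to the training and test sets
X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)

# Evaluate a logistic regression trained on the RFE-selected features
score = LogisticRegression(max_iter = 5000).fit(X_train_rfe, y_train).score(X_test_rfe, y_test)
print("Test score", score)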

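Because RFE retrains the random forest every time a feature is dropped, it trains the model about 40 times to go from 80 features to 40, so it takes far longer than the model-based selection above.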
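When that cost is a problem, RFE's step parameter removes several features per round, and the fitted ranking_ attribute shows the order of elimination. The sketch below is an illustration under assumptions: step=10 is an arbitrary choice, and for brevity the selector is fitted on the full noisy dataset, not on a training split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

cancer = load_breast_cancer()
rng = np.random.RandomState(42)
noise = rng.normal(size=(len(cancer.data), 50))
X_w_noise = np.hstack([cancer.data, noise])

# step=10 removes ten features per round: 80 -> 70 -> 60 -> 50 -> 40.
select = RFE(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=40,
    step=10,
)
select.fit(X_w_noise, cancer.target)

# ranking_ is 1 for every kept feature; larger values were eliminated earlier.
print("Kept features:", select.support_.sum())
print("Original features kept:", select.support_[:30].sum())
print("Worst rank:", select.ranking_.max())
```

A larger step trades precision for speed, since features dropped in the same round are never compared against each other after a refit.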


허곰의 코딩블로그
