파이썬 머신러닝 완벽가이드

2022-06-11

Title	파이썬 머신러닝 완벽 가이드: 다양한 캐글 예제와 함께 기초 알고리즘부터 최신 기법까지 배우는
Author	권철민
Publisher	위키북스
ISBN	9791158391928
Price	38,000 KRW
Published	February 7, 2020
Pages	648

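Understanding Python-Based Machine Learning and Its Ecosystem

01 The Concept of Machine Learning

Machine learning is a general term for algorithmic techniques that learn patterns from data and predict outcomes without requiring the application itself to be modified.

Many problems that were hard to solve with conventional software code alone, because real-world conditions are so complex, are now being solved with machine learning.

Machine learning is a strong fit when it is difficult to hand-code logic that captures a consistent pattern running through the data.

It applies a range of mathematical techniques to strengthen statistical confidence and minimize prediction error, so that the model recognizes patterns in the data on its own and produces reliable predictions.

Machine learning is advancing rapidly in areas such as data mining, image recognition, speech recognition, and natural language processing, where a program that hand-encodes the characteristics of the data or business logic would be too difficult and complex to develop.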

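Types of machine learning

Supervised learning

Classification
Regression
Recommender systems
Visual/speech detection and recognition
Text analytics, NLP

Unsupervised learning

Clustering
Dimensionality reduction

Reinforcement learning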

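Book recommendation: «마스터 알고리즘» (The Master Algorithm)

It explains machine learning algorithms by grouping them into schools such as symbolism, connectionism, Bayesian statistics, and analogy-based approaches.

The data war

Both data and machine learning algorithms are important.

As machine learning becomes pervasive, the importance of data grows more than anything else.

The biggest weakness of machine learning is that it depends heavily on data.

Without good-quality data, machine learning cannot produce good results.

The ability to choose the best algorithm and tune model parameters matters, but the ability to understand the data and to process, transform, and extract it efficiently, so that the algorithm runs on the best possible data, matters more.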

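Comparing Python-based and R-based machine learning

The two leading open-source languages for writing machine learning programs are Python and R.

R is a language dedicated to statistics, created to improve on traditional statistics and mining packages such as SPSS, SAS, and MATLAB.

Python is a general-purpose development language. Its major strengths include a flexible architecture that covers both object-oriented and functional programming, and a wide range of libraries.

For a business user who is strong in statistical analysis but not comfortable with development languages, R may be the better choice. For someone starting machine learning for the first time, and especially for a developer, Python is recommended.

Where Python is stronger than R:

It is easy to learn and highly productive, so developers worldwide favor it. Leading IT companies such as Google and Facebook also use it heavily because of that productivity.
It has broad backing from the open-source community, and its very large number of libraries supports high productivity during development.
As an interpreted language it is slow, but its extensibility, flexibility, and compatibility mean it is used across servers, networking, systems, IoT, and even the desktop.
Machine learning can be combined with many other kinds of applications.
It can spread into a variety of enterprise environments, for example by extending into an enterprise architecture or integrating in real time through microservices.

Above all, leading deep learning frameworks such as TensorFlow, Keras, and PyTorch support Python with a Python-first policy.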

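02 Key Packages in the Python Machine Learning Ecosystem

Machine learning package: the representative package is scikit-learn, which holds a dominant position in data-mining-based machine learning
Matrix/linear algebra/statistics packages: NumPy is the representative package for matrices and linear algebra, and SciPy provides packages for scientific computing and statistics
Data handling: pandas is the representative data processing package
Visualization: Matplotlib is the representative visualization package
Various other third-party libraries
Jupyter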

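Installing the software for Python machine learning

Installing Anaconda

The packages can be installed one by one with pip, but that is inconvenient, so Anaconda is used to install the required packages in one go.

Go to https://www.anaconda.com/download/
Download the Anaconda installer
Python, NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook are installed
Anaconda Prompt is used when installing packages through Anaconda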
Run Anaconda Prompt as administrator and check the Python version:

```
python -V
Python 3.7.1
```

6. Launch Jupyter Notebook, which starts a local server program
7. Open http://localhost:8888 in a browser
8. Click New to create a new Jupyter notebook

```python
import numpy
import pandas
import matplotlib.pyplot
import seaborn
from sklearn.model_selection import train_test_split
```

If the import statements above run without errors, the installation completed successfully.
Install Microsoft Visual Studio Build Tools 2015 or later (needed to install packages used in Chapter 4, Classification, and Chapter 9, Recommender Systems)

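03 NumPy

Many machine learning algorithms are written on top of NumPy, so understanding it is very important.

That said, unless you are a developer building packages yourself, you do not need to know it in exhaustive detail.

NumPy ndarray overview

import numpy as np

With ndarray you can easily create multidimensional arrays and perform a variety of operations on them.

array(): converts a variety of inputs (such as a list) into an ndarray

The shape attribute of an ndarray holds its size as a tuple (rows, columns for a 2-D array), and the length of that tuple also tells you the number of dimensions.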

array1 = np.array([1,2,3])
print('array1 type:', type(array1))
print('array1 array shape:', array1.shape)

array2 = np.array([[1,2,3],
                   [2,3,4]])
print('array2 type:', type(array2))
print('array2 array shape:', array2.shape)

array3 = np.array([[1,2,3]])
print('array3 type:', type(array3))
print('array3 array shape:', array3.shape)

Output

array1 type: <class 'numpy.ndarray'>
array1 array shape: (3,)
array2 type: <class 'numpy.ndarray'>
array2 array shape: (2, 3)
array3 type: <class 'numpy.ndarray'>
array3 array shape: (1, 3)

ndarray.ndim: returns the number of dimensions of the array

print('array1: {0}-D, array2: {1}-D, array3: {2}-D'.format(array1.ndim,
                                                           array2.ndim, array3.ndim))

Output

array1: 1-D, array2: 2-D, array3: 2-D

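ndarray data types

Values an ndarray can hold

Numeric values
String values
Boolean values

Numeric types

int (8-bit, 16-bit, 32-bit, 64-bit)
unsigned int (8-bit, 16-bit, 32-bit, 64-bit)
float (16-bit, 32-bit, 64-bit, 128-bit)
complex (for larger numbers and higher precision)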

list1 = [1, 2, 3]
print(type(list1))
array1 = np.array(list1)
print(type(array1))
print(array1, array1.dtype)

Output

<class 'list'>
<class 'numpy.ndarray'>
[1 2 3] int32

list2 = [1, 2, 'test']
array2 = np.array(list2)
print(array2, array2.dtype)

list3 = [1, 2, 3.0]
array3 = np.array(list3)
print(array3, array3.dtype)

Output

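['1' '2' 'test'] <U11
[1. 2. 3.] float64

<U11 is a Unicode string dtype. An ndarray holds a single data type, so when types are mixed, every element is converted to the larger type: the integers become strings in array2 and floats in array3.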
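
The type conversion described above can be checked directly. This is a minimal sketch; the variable names `mixed_str` and `mixed_float` are illustrative and do not come from the book.

```python
import numpy as np

# Integers mixed with a string: every element is stored as a Unicode string.
mixed_str = np.array([1, 2, 'test'])
# Integers mixed with a float: every element is stored as a float.
mixed_float = np.array([1, 2, 3.0])

print(mixed_str.tolist(), mixed_str.dtype.kind)   # kind 'U' means Unicode string
print(mixed_float.tolist(), mixed_float.dtype)
```

The exact width in the string dtype (for example the 11 in `<U11`) can differ between platforms, which is why the sketch checks `dtype.kind` rather than the full dtype name.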

astype(): changes the data type of the values in an ndarray

array_int = np.array([1, 2, 3])
array_float = array_int.astype('float64')
print(array_float, array_float.dtype)

array_int1 = array_float.astype('int32')
print(array_int1, array_int1.dtype)

array_float1 = np.array([1.1, 2.1, 3.1])
array_int2 = array_float1.astype('int32')
print(array_int2, array_int2.dtype)

Output

[1. 2. 3.] float64
[1 2 3] int32
[1 2 3] int32
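
When a float array is converted to an integer type, the fractional part is discarded rather than rounded. A small sketch with made-up values (not from the book) makes this visible:

```python
import numpy as np

float_values = np.array([1.1, 2.9, -1.7])
# astype to an integer type truncates toward zero; it does not round.
int_values = float_values.astype('int32')

print(int_values.tolist())   # [1, 2, -1]
print(int_values.dtype)      # int32
```

If rounding is what you want, applying `np.round()` before `astype()` is one option.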

Creating ndarrays conveniently - arange, zeros, ones

arange(), zeros(), ones(): create an ndarray easily

They are used to generate test data or to initialize large amounts of data in bulk.

arange(): fills the array with sequential values from 0 up to (argument - 1)

sequence_array = np.arange(10)
print(sequence_array)
print(sequence_array.dtype, sequence_array.shape)

Output

[0 1 2 3 4 5 6 7 8 9]
int32 (10,)
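
Like Python's built-in range(), arange() also accepts a start value and a step. The stop value is always excluded. A short sketch (the variable names are illustrative):

```python
import numpy as np

only_stop = np.arange(10)           # 0 up to 9
start_stop = np.arange(1, 10)       # 1 up to 9
with_step = np.arange(0, 10, 2)     # every second value from 0, stop excluded

print(only_stop.tolist())    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(start_stop.tolist())   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(with_step.tolist())    # [0, 2, 4, 6, 8]
```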

zeros(): returns an ndarray of the given shape filled with 0

ones(): returns an ndarray of the given shape filled with 1

zero_array = np.zeros((3, 2), dtype='int32')
print(zero_array)
print(zero_array.dtype, zero_array.shape)

one_array = np.ones((3, 2))
print(one_array)
print(one_array.dtype, one_array.shape)

Output

[[0 0]
[0 0]
[0 0]]
int32 (3, 2)
[[1. 1.]
[1. 1.]
[1. 1.]]
float64 (3, 2)

Changing the dimensions and size of an ndarray with reshape()

Converts an ndarray to a given number of dimensions and size

array1 = np.arange(10)
print('array1:\n', array1)

array2 = array1.reshape(2, 5)
print('array2:\n', array2)

array3 = array1.reshape(5, 2)
print('array3:\n', array3)

Output

array1:
[0 1 2 3 4 5 6 7 8 9]
array2:
[[0 1 2 3 4]
[5 6 7 8 9]]
array3:
[[0 1]
[2 3]
[4 5]
[6 7]
[8 9]]

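An error is raised if the array cannot be converted to the requested size (10 elements cannot fill a 4 x 3 array).

array4 = array1.reshape(4, 3)

ValueError

In practice, reshape() is most useful with -1 as an argument: the -1 dimension is inferred from the array size and the other dimensions.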

array1 = np.arange(10)
print(array1)

array2 = array1.reshape(-1, 5)
print('array2 shape:', array2.shape)

array3 = array1.reshape(5, -1)
print('array3 shape', array3.shape)

Output

[0 1 2 3 4 5 6 7 8 9]
array2 shape: (2, 5)
array3 shape: (5, 2)

Even with -1, a shape that is not compatible cannot be produced (10 elements cannot be split into rows of 4 columns).

array4 = array1.reshape(-1, 4)

ValueError
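
The inference rule for -1 can be sketched as follows: NumPy divides the total number of elements by the product of the other dimensions, and raises ValueError when that division is not exact. The names `inferred` and `failed` are illustrative.

```python
import numpy as np

array1 = np.arange(10)

# 10 elements / 2 columns = 5 rows, so -1 is resolved to 5.
inferred = array1.reshape(-1, 2)
print(inferred.shape)   # (5, 2)

# 10 is not divisible by 4, so no row count fits and NumPy raises ValueError.
try:
    array1.reshape(-1, 4)
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)
```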

reshape(-1, 1) is a frequently used form: it always produces a 2-D array with exactly one column and as many rows as needed. tolist() converts an ndarray to a Python list (used here because the list form is easier to read when printed).

array1 = np.arange(8)
array3d = array1.reshape((2, 2, 2))
print('array3d:\n', array3d.tolist())

# Convert the 3-D ndarray to a 2-D ndarray
array5 = array3d.reshape(-1, 1)
print('array5:\n', array5.tolist())
print('array5 shape:', array5.shape)

# Convert the 1-D ndarray to a 2-D ndarray
array6 = array1.reshape(-1, 1)
print('array6:\n', array6.tolist())
print('array6 shape:', array6.shape)

Output

array3d:
 [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
array5:
 [[0], [1], [2], [3], [4], [5], [6], [7]]
array5 shape: (8, 1)
array6:
 [[0], [1], [2], [3], [4], [5], [6], [7]]
array6 shape: (8, 1)

Selecting data from a NumPy ndarray - indexing

Extracting specific data
Slicing
Fancy indexing
Boolean indexing

Extracting a single value

Extracts exactly one value

To select a single value, put its index inside [].

# Create a 1-D ndarray holding 1 through 9
array1 = np.arange(start=1, stop=10)
print('array1:',array1)
# Indexes start at 0, so array1[2] is the value at the third position
value = array1[2]
print('value:',value)
print(type(value))

Output

array1: [1 2 3 4 5 6 7 8 9]
value: 3
<class 'numpy.int32'>

print('last value:',array1[-1], ', second-to-last value:',array1[-2])

Output

last value: 9 , second-to-last value: 8

A single index can also be used to modify a value in the ndarray.

array1[0] = 9
array1[8] = 0
print('array1:',array1)

Output

array1: [9 2 3 4 5 6 7 8 0]

Extracting a single value from a multidimensional ndarray

Values are accessed with row and column positions separated by a comma (,).

# Convert a 1-D ndarray to a 2-D 3 x 3 ndarray
# Use [row, col] to extract data from the 2-D ndarray

array1d = np.arange(start=1, stop=10)
array2d = array1d.reshape(3,3)
print(array2d)

print('value at index (row=0,col=0):', array2d[0,0] )
print('value at index (row=0,col=1):', array2d[0,1] )
print('value at index (row=1,col=0):', array2d[1,0] )
print('value at index (row=2,col=2):', array2d[2,2] )

Output

[[1 2 3]
 [4 5 6]
 [7 8 9]]
value at index (row=0,col=0): 1
value at index (row=0,col=1): 2
value at index (row=1,col=0): 4
value at index (row=2,col=2): 9

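Slicing

The ':' symbol extracts a contiguous range of data.

Except when extracting a single value, data extracted by slicing, fancy indexing, or boolean indexing is an ndarray.

Writing a start index and an end index around ':' returns an ndarray of the data from the start index through (end index - 1).

Omitting the start index before ':' means 0
Omitting the end index after ':' means through the last index
Omitting both means from 0 through the last index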

array1 = np.arange(start=1, stop=10)
array3 = array1[0:3]
print(array3)
print(type(array3))

Output

[1 2 3]
<class 'numpy.ndarray'>

array1 = np.arange(start=1, stop=10)
array4 = array1[:3]
print(array4)

array5 = array1[3:]
print(array5)

array6 = array1[:]
print(array6)

Output

[1 2 3]
[4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]

# Accessing data in a 2-D ndarray by slicing

array1d = np.arange(start=1, stop=10)
array2d = array1d.reshape(3,3)
print('array2d:\n',array2d)

print('array2d[0:2, 0:2] \n', array2d[0:2, 0:2])
print('array2d[1:3, 0:3] \n', array2d[1:3, 0:3])
print('array2d[1:3, :] \n', array2d[1:3, :])
print('array2d[:, :] \n', array2d[:, :])
print('array2d[:2, 1:] \n', array2d[:2, 1:])
print('array2d[:2, 0] \n', array2d[:2, 0])

Output

array2d:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
array2d[0:2, 0:2] 
 [[1 2]
 [4 5]]
array2d[1:3, 0:3] 
 [[4 5 6]
 [7 8 9]]
array2d[1:3, :] 
 [[4 5 6]
 [7 8 9]]
array2d[:, :] 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
array2d[:2, 1:] 
 [[2 3]
 [5 6]]
array2d[:2, 0] 
 [1 4]

In a 2-D ndarray, omitting the trailing (column) index returns a 1-D ndarray for that row.

print(array2d[0])
print(array2d[1])
print('array2d[0] shape:', array2d[0].shape, 'array2d[1] shape:', array2d[1].shape )

Output

[1 2 3]
[4 5 6]
array2d[0] shape: (3,) array2d[1] shape: (3,)

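Fancy indexing

A list or ndarray of index positions is passed as the index, and the values at those positions are returned as an ndarray.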

array1d = np.arange(start=1, stop=10)
array2d = array1d.reshape(3,3)

array3 = array2d[[0,1], 2]
print('array2d[[0,1], 2] => ',array3.tolist())

array4 = array2d[[0,1], 0:2]
print('array2d[[0,1], 0:2] => ',array4.tolist())

array5 = array2d[[0,1]]
print('array2d[[0,1]] => ',array5.tolist())

Output

array2d[[0,1], 2] =>  [3, 6]
array2d[[0,1], 0:2] =>  [[1, 2], [4, 5]]
array2d[[0,1]] =>  [[1, 2, 3], [4, 5, 6]]

Boolean indexing

array1d = np.arange(start=1, stop=10)
# Apply the boolean condition array1d > 5 inside [ ]
array3 = array1d[array1d > 5]
print('array1d > 5 boolean indexing result:', array3)

Output

array1d > 5 boolean indexing result: [6 7 8 9]

array1d > 5

Output

array([False, False, False, False, False,  True,  True,  True,  True])

boolean_indexes = np.array([False, False, False, False, False,  True,  True,  True,  True])
array3 = array1d[boolean_indexes]
print('result of filtering with a boolean index:', array3)

Output

result of filtering with a boolean index: [6 7 8 9]

indexes = np.array([5,6,7,8])
array4 = array1d[ indexes ]
print('result of filtering with a positional index:',array4)

Output

result of filtering with a positional index: [6 7 8 9]
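
Boolean indexing extends naturally to combined conditions and to assignment. The following is a minimal sketch with illustrative variable names (`middle`, `clipped`):

```python
import numpy as np

array1d = np.arange(start=1, stop=10)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
middle = array1d[(array1d > 3) & (array1d < 8)]
print(middle.tolist())    # [4, 5, 6, 7]

# A boolean mask can also select the elements to overwrite.
clipped = array1d.copy()
clipped[clipped > 5] = 0
print(clipped.tolist())   # [1, 2, 3, 4, 5, 0, 0, 0, 0]
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the bitwise operators are used here.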

Sorting arrays - sort() and argsort()

Sorting an array

np.sort() returns a sorted copy and leaves the original untouched, while ndarray.sort() sorts the original in place and returns None.

org_array = np.array([ 3, 1, 9, 5])
print('original array:', org_array)
# Sort with np.sort()
sort_array1 = np.sort(org_array)
print('sorted array returned by np.sort():', sort_array1)
print('original array after np.sort():', org_array)
# Sort with ndarray.sort()
sort_array2 = org_array.sort()
print('value returned by org_array.sort():', sort_array2)
print('original array after org_array.sort():', org_array)

Output

original array: [3 1 9 5]
sorted array returned by np.sort(): [1 3 5 9]
original array after np.sort(): [3 1 9 5]
value returned by org_array.sort(): None
original array after org_array.sort(): [1 3 5 9]

sort_array1_desc = np.sort(org_array)[::-1]
print('sorted in descending order:', sort_array1_desc)

Output

sorted in descending order: [9 5 3 1]

array2d = np.array([[8, 12],
                    [7, 1 ]])

sort_array2d_axis0 = np.sort(array2d, axis=0)
print('sorted along the row direction (axis=0):\n', sort_array2d_axis0)

sort_array2d_axis1 = np.sort(array2d, axis=1)
print('sorted along the column direction (axis=1):\n', sort_array2d_axis1)

Output

sorted along the row direction (axis=0):
 [[ 7  1]
 [ 8 12]]
sorted along the column direction (axis=1):
 [[ 8 12]
 [ 1  7]]

Returning the indices of the sorted array

np.argsort() returns, as an ndarray, the original positions of the elements in sorted order.

org_array = np.array([ 3, 1, 9, 5])
sort_indices = np.argsort(org_array)
print(type(sort_indices))
print('original indices in sorted order:', sort_indices)

Output

<class 'numpy.ndarray'>
original indices in sorted order: [1 0 3 2]

org_array = np.array([ 3, 1, 9, 5])
sort_indices_desc = np.argsort(org_array)[::-1]
print('original indices in descending sorted order:', sort_indices_desc)

Output

original indices in descending sorted order: [2 3 0 1]
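
A typical use of argsort() is reordering one array by the values of another, since an ndarray carries no key-value metadata of its own. The names and scores below are made-up illustration data.

```python
import numpy as np

# Hypothetical parallel arrays: score_array[i] belongs to name_array[i].
name_array = np.array(['John', 'Mike', 'Sarah', 'Kate', 'Samuel'])
score_array = np.array([78, 95, 84, 98, 88])

# Positions of the scores in ascending order.
sort_indices_asc = np.argsort(score_array)
print(sort_indices_asc.tolist())   # [0, 2, 4, 1, 3]

# Fancy indexing with those positions reorders the names by score.
names_by_score = name_array[sort_indices_asc]
print(names_by_score.tolist())     # ['John', 'Sarah', 'Samuel', 'Mike', 'Kate']
```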

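Linear algebra operations - matrix product and transpose

Matrix product

For 2-D arrays, np.dot() computes the matrix product. The number of columns of A must equal the number of rows of B.

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

dot_product = np.dot(A, B)
print('matrix product result:\n', dot_product)

Output

matrix product result:
 [[ 58  64]
 [139 154]]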
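
Each element of the result is the sum of products of one row of A with one column of B. The sketch below recomputes the top-left element by hand and also shows the `@` operator, which gives the same result as np.dot() for 2-D arrays. The name `manual_00` is illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# Element [0, 0]: row 0 of A times column 0 of B = 1*7 + 2*9 + 3*11
manual_00 = int(sum(A[0, k] * B[k, 0] for k in range(A.shape[1])))
print(manual_00)   # 58

# For 2-D arrays, the @ operator matches np.dot.
same_result = bool(np.array_equal(A @ B, np.dot(A, B)))
print(same_result)   # True
```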

Transpose

The transpose swaps the rows and columns of a matrix.

A = np.array([[1, 2],
              [3, 4]])
transpose_mat = np.transpose(A)
print('transpose of A:\n', transpose_mat)

Output

transpose of A:
 [[1 3]
 [2 4]]

04 Data Handling - pandas

Getting started with pandas - loading a file into a DataFrame, basic API

import pandas as pd

titanic_df = pd.read_csv(r'C://Users/chkwon/Data_H.../titanic_train.csv')
titanic_df.head(3)

titanic_df = pd.read_csv('titanic_train.csv')
print('titanic variable type:',type(titanic_df))
titanic_df

Output

titanic variable type: <class 'pandas.core.frame.DataFrame'>

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen “Carrie”	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

print('DataFrame shape: ', titanic_df.shape)

Output

DataFrame shape:  (891, 12)

titanic_df.describe()

Output

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)

Output

3    491
1    216
2    184
Name: Pclass, dtype: int64

titanic_pclass = titanic_df['Pclass']
print(type(titanic_pclass))

Output

<class 'pandas.core.series.Series'>

titanic_pclass.head()

Output

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

value_counts = titanic_df['Pclass'].value_counts()
print(type(value_counts))
print(value_counts)

Output

<class 'pandas.core.series.Series'>
3    491
1    216
2    184
Name: Pclass, dtype: int64
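
value_counts() returns a Series whose index holds the distinct values and whose values hold the counts. By default missing values are left out, and `dropna=False` includes them. A small sketch on made-up data (the Series `grades` is hypothetical, not part of the Titanic data set):

```python
import pandas as pd

# Hypothetical data with one missing value.
grades = pd.Series([3, 1, 3, None, 2, 3])

counts = grades.value_counts()                        # NaN excluded by default
counts_with_nan = grades.value_counts(dropna=False)   # NaN counted as its own entry

print(counts[3.0])             # 3
print(counts.sum())            # 5
print(counts_with_nan.sum())   # 6
```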

Converting between DataFrame and list, dictionary, and NumPy ndarray

Converting a NumPy ndarray, list, or dictionary to a DataFrame

import numpy as np

col_name1=['col1']
list1 = [1, 2, 3]
array1 = np.array(list1)

print('array1 shape:', array1.shape )
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame created from a 1-D list:\n', df_list1)
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame created from a 1-D ndarray:\n', df_array1)

Output

array1 shape: (3,)
DataFrame created from a 1-D list:
    col1
0     1
1     2
2     3
DataFrame created from a 1-D ndarray:
    col1
0     1
1     2
2     3

# Three column names are needed.
col_name2=['col1', 'col2', 'col3']

# Create a 2-row x 3-column list and ndarray, then convert each to a DataFrame.
list2 = [[1, 2, 3],
         [11, 12, 13]]
array2 = np.array(list2)
print('array2 shape:', array2.shape )
df_list2 = pd.DataFrame(list2, columns=col_name2)
print('DataFrame created from a 2-D list:\n', df_list2)
df_array2 = pd.DataFrame(array2, columns=col_name2)
print('DataFrame created from a 2-D ndarray:\n', df_array2)

Output

array2 shape: (2, 3)
DataFrame created from a 2-D list:
    col1  col2  col3
0     1     2     3
1    11    12    13
DataFrame created from a 2-D ndarray:
    col1  col2  col3
0     1     2     3
1    11    12    13

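# Keys map to column names; values are lists (or ndarrays)
data_dict = {'col1':[1, 11], 'col2':[2, 22], 'col3':[3, 33]}
df_dict = pd.DataFrame(data_dict)
print('DataFrame created from a dictionary:\n', df_dict)

Output

DataFrame created from a dictionary:
    col1  col2  col3
0     1     2     3
1    11    22    33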

Converting a DataFrame to a NumPy ndarray, list, or dictionary

# Convert the DataFrame to an ndarray
array3 = df_dict.values
print('df_dict.values type:', type(array3), 'df_dict.values shape:', array3.shape)
print(array3)

Output

df_dict.values type: <class 'numpy.ndarray'> df_dict.values shape: (2, 3)
[[ 1  2  3]
 [11 22 33]]

# Convert the DataFrame to a list
list3 = df_dict.values.tolist()
print('df_dict.values.tolist() type:', type(list3))
print(list3)

# Convert the DataFrame to a dictionary
dict3 = df_dict.to_dict('list')
print('\n df_dict.to_dict() type:', type(dict3))
print(dict3)

Output

df_dict.values.tolist() type: <class 'list'>
[[1, 2, 3], [11, 22, 33]]

 df_dict.to_dict() type: <class 'dict'>
{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

Creating and modifying DataFrame column data

titanic_df['Age_0']=0
titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_0
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0

titanic_df['Age_by_10'] = titanic_df['Age']*10
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch']+1
titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_0	Age_by_10	Family_No
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	0	220.0	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	0	380.0	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0	260.0	1

titanic_df['Age_by_10'] = titanic_df['Age_by_10'] + 100
titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_0	Age_by_10	Family_No
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	0	320.0	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	0	480.0	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0	360.0	1

Deleting DataFrame data

drop() with axis=1 removes columns. With the default inplace=False, it returns a new DataFrame and leaves the original unchanged.

titanic_drop_df = titanic_df.drop('Age_0', axis=1 )
titanic_drop_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_by_10	Family_No
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	320.0	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	480.0	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	360.0	1

titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age_0	Age_by_10	Family_No
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	0	320.0	2
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	0	480.0	2
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	0	360.0	1

drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)
print('value returned by drop with inplace=True:',drop_result)
titanic_df.head(3)

Output

value returned by drop with inplace=True: None

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 15)
print('##### before axis 0 drop ####')
print(titanic_df.head(3))

titanic_df.drop([0,1,2], axis=0, inplace=True)

print('##### after axis 0 drop ####')
print(titanic_df.head(3))

Output

##### before axis 0 drop ####
   PassengerId  Survived  Pclass            Name     Sex   Age  SibSp  Parch          Ticket     Fare Cabin Embarked
0            1         0       3  Braund, Mr....    male  22.0      1      0       A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mr...  female  38.0      1      0        PC 17599  71.2833   C85        C
2            3         1       3  Heikkinen, ...  female  26.0      0      0  STON/O2. 31...   7.9250   NaN        S
##### after axis 0 drop ####
   PassengerId  Survived  Pclass            Name     Sex   Age  SibSp  Parch  Ticket     Fare Cabin Embarked
3            4         1       1  Futrelle, M...  female  35.0      1      0  113803  53.1000  C123        S
4            5         0       3  Allen, Mr. ...    male  35.0      0      0  373450   8.0500   NaN        S
5            6         0       3  Moran, Mr. ...    male   NaN      0      0  330877   8.4583   NaN        Q
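
The difference between the two drop() modes is easy to get wrong, so here is a compact sketch on a small made-up DataFrame (`df`, `dropped`, and `result` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Default (inplace=False): a new DataFrame is returned, the original is unchanged.
dropped = df.drop('b', axis=1)
print(df.columns.tolist())        # ['a', 'b']
print(dropped.columns.tolist())   # ['a']

# inplace=True: the original is modified and the call returns None.
result = df.drop(0, axis=0, inplace=True)
print(result)                     # None
print(df.index.tolist())          # [1, 2]
```

A common mistake is writing `df = df.drop(..., inplace=True)`, which replaces the DataFrame with None.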

Index object

# Reload the original file
titanic_df = pd.read_csv('titanic_train.csv')
# Extract the Index object
indexes = titanic_df.index
print(indexes)
# Convert the Index object to an array of its actual values
print('Index object array values:\n',indexes.values)

Output

RangeIndex(start=0, stop=891, step=1)
Index object array values:
 [  0   1   2 ... 888 889 890]
(output abbreviated here; the array holds the 891 integers from 0 through 890)

print(type(indexes.values))
print(indexes.values.shape)
print(indexes[:5].values)
print(indexes.values[:5])
print(indexes[6])

Output

<class 'numpy.ndarray'>
(891,)
[0 1 2 3 4]
[0 1 2 3 4]
6

indexes[0] = 5

Output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-2fe1c3d18d1a> in <module>()
----> 1 indexes[0] = 5

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations
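
The immutability shown in the traceback above applies to individual elements only. Assigning a whole new Index to the DataFrame is allowed. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})

# Changing a single element of an Index raises TypeError.
try:
    df.index[0] = 5
    raised = False
except TypeError as err:
    raised = True
    print('TypeError:', err)

# Replacing the entire Index is permitted.
df.index = ['a', 'b', 'c']
print(df.index.tolist())   # ['a', 'b', 'c']
```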

series_fair = titanic_df['Fare']
print('Fare Series max:', series_fair.max())
print('Fare Series sum:', series_fair.sum())
print('sum() of Fare Series:', sum(series_fair))
print('Fare Series + 3:\n',(series_fair + 3).head(3) )

Output

Fare Series max: 512.3292
Fare Series sum: 28693.9493
sum() of Fare Series: 28693.949299999967
Fare Series + 3:
 0    10.2500
1    74.2833
2    10.9250
Name: Fare, dtype: float64

titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)

Output

	index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	1	0	3	Braund, Mr....	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	1	2	1	1	Cumings, Mr...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	2	3	1	3	Heikkinen, ...	female	26.0	0	0	STON/O2. 31...	7.9250	NaN	S

print('#### before reset_index ###')
value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)
print('value_counts variable type:',type(value_counts))

new_value_counts = value_counts.reset_index(inplace=False)
print('#### After reset_index ###')
print(new_value_counts)
print('new_value_counts variable type:',type(new_value_counts))

Output

#### before reset_index ###
3    491
1    216
2    184
Name: Pclass, dtype: int64
value_counts variable type: <class 'pandas.core.series.Series'>
#### After reset_index ###
   index  Pclass
0      3     491
1      1     216
2      2     184
new_value_counts variable type: <class 'pandas.core.frame.DataFrame'>
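
reset_index() keeps the old index as a new column by default. When the old index is not needed, `drop=True` discards it. A short sketch on a made-up DataFrame (all names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]})

# Filtering leaves gaps in the index.
filtered = df[df['x'] > 15]
print(filtered.index.tolist())       # [1, 2, 3]

# Default: the old index is kept as an 'index' column.
kept = filtered.reset_index()
print(kept.columns.tolist())         # ['index', 'x']

# drop=True: the old index is discarded and rows are renumbered from 0.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2]
print(renumbered.columns.tolist())   # ['x']
```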

Data selection and filtering

The DataFrame [ ] operator

print('단일 컬럼 데이터 추출:\n', titanic_df[ 'Pclass' ].head(3))
print('\n여러 컬럼들의 데이터 추출:\n', titanic_df[ ['Survived', 'Pclass'] ].head(3))
print('[ ] 안에 숫자 index는 KeyError 오류 발생:\n', titanic_df[0])

Output

단일 컬럼 데이터 추출:
 0    3
1    1
2    3
Name: Pclass, dtype: int64

여러 컬럼들의 데이터 추출:
    Survived  Pclass
0         0       3
1         1       1
2         1       3

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-30-db364dee1383> in <module>()
      1 print('단일 컬럼 데이터 추출:\n', titanic_df[ 'Pclass' ].head(3))
      2 print('\n여러 컬럼들의 데이터 추출:\n', titanic_df[ ['Survived', 'Pclass'] ].head(3))
----> 3 print('[ ] 안에 숫자 index는 KeyError 오류 발생:\n', titanic_df[0])

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2690         ## get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         ## duplicate columns & possible reduce dimensionality

~\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   2484         res = cache.get(item)
   2485         if res is None:
-> 2486             values = self._data.get(item)
   2487             res = self._box_item_values(item, values)
   2488             cache[item] = res

~\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   4113 
   4114             if not isna(item):
-> 4115                 loc = self.items.get_loc(item)
   4116             else:
   4117                 indexer = np.arange(len(self.items))[isna(self.items)]

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066 
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

titanic_df[0:2]

Output

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr....	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mr...	female	38.0	1	0	PC 17599	71.2833	C85	C

titanic_df[ titanic_df['Pclass'] == 3].head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr....	male	22.0	1	0	A/5 21171	7.250	NaN	S
2	3	1	3	Heikkinen, ...	female	26.0	0	0	STON/O2. 31...	7.925	NaN	S
4	5	0	3	Allen, Mr. ...	male	35.0	0	0	373450	8.050	NaN	S

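The DataFrame ix[] operator

ix[] accepts both positions and labels. As the warning in the output shows, it is deprecated in favor of loc[] (labels) and iloc[] (positions).

print('value extracted by column position:',titanic_df.ix[0,2])
print('value extracted by column name:',titanic_df.ix[0,'Pclass'])

Output

value extracted by column position: 3
value extracted by column name: 3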

C:\Users\chkwon\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.
C:\Users\chkwon\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

data = {'Name': ['Chulmin', 'Eunkyung','Jinwoong','Soobeom'],
        'Year': [2011, 2016, 2015, 2015],
        'Gender': ['Male', 'Female', 'Male', 'Male']
       }
data_df = pd.DataFrame(data, index=['one','two','three','four'])
data_df

Output

	Name	Year	Gender
one	Chulmin	2011	Male
two	Eunkyung	2016	Female
three	Jinwoong	2015	Male
four	Soobeom	2015	Male

print("\n ix[0,0]", data_df.ix[0,0])
print("\n ix['one', 0]", data_df.ix['one',0])
print("\n ix[3, 'Name']",data_df.ix[3, 'Name'],"\n")

print("\n ix[0:2, [0,1]]\n", data_df.ix[0:2, [0,1]])
print("\n ix[0:2, [0:3]]\n", data_df.ix[0:2, 0:3])
print("\n ix[0:3, ['Name', 'Year']]\n", data_df.ix[0:3, ['Name', 'Year']], "\n")
print("\n ix[:] \n", data_df.ix[:])
print("\n ix[:, :] \n", data_df.ix[:, :])

print("\n ix[data_df.Year >= 2014] \n", data_df.ix[data_df.Year >= 2014])

Output

 ix[0,0] Chulmin

 ix['one', 0] Chulmin

 ix[3, 'Name'] Soobeom 


 ix[0:2, [0,1]]
          Name  Year
one   Chulmin  2011
two  Eunkyung  2016

 ix[0:2, [0:3]]
          Name  Year  Gender
one   Chulmin  2011    Male
two  Eunkyung  2016  Female

 ix[0:3, ['Name', 'Year']]
            Name  Year
one     Chulmin  2011
two    Eunkyung  2016
three  Jinwoong  2015 


 ix[:] 
            Name  Year  Gender
one     Chulmin  2011    Male
two    Eunkyung  2016  Female
three  Jinwoong  2015    Male
four    Soobeom  2015    Male

 ix[:, :] 
            Name  Year  Gender
one     Chulmin  2011    Male
two    Eunkyung  2016  Female
three  Jinwoong  2015    Male
four    Soobeom  2015    Male

 ix[data_df.Year >= 2014] 
            Name  Year  Gender
two    Eunkyung  2016  Female
three  Jinwoong  2015    Male
four    Soobeom  2015    Male

Distinguishing label-based indexing from position-based indexing

## Create a new integer index for data_df with reset_index()
data_df_reset = data_df.reset_index()
data_df_reset = data_df_reset.rename(columns={'index':'old_index'})

## Add 1 to the index values so that the new index starts at 1
data_df_reset.index = data_df_reset.index+1
data_df_reset

Output

	old_index	Name	Year	Gender
1	one	Chulmin	2011	Male
2	two	Eunkyung	2016	Female
3	three	Jinwoong	2015	Male
4	four	Soobeom	2015	Male

## The code below raises an error.
data_df_reset.ix[0,1]

Output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-37-3ba6f9b5f35d> in <module>()
      1 ## The code below raises an error.
----> 2 data_df_reset.ix[0,1]

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    114                 pass
    115 
--> 116             return self._getitem_tuple(key)
    117         else:
    118             ## we by definition only have the 0th axis

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    868     def _getitem_tuple(self, tup):
    869         try:
--> 870             return self._getitem_lowerdim(tup)
    871         except IndexingError:
    872             pass

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
    996         for i, key in enumerate(tup):
    997             if is_label_like(key) or isinstance(key, tuple):
--> 998                 section = self._getitem_axis(key, axis=i)
    999 
   1000                 ## we have yielded a scalar ?

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1114                     return self._get_loc(key, axis=axis)
   1115 
-> 1116             return self._get_label(key, axis=axis)
   1117 
   1118     def _getitem_iterable(self, key, axis=None):

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
    138             raise IndexingError('no slices here, handle elsewhere')
    139 
--> 140         return self.obj._xs(label, axis=axis)
    141 
    142     def _get_loc(self, key, axis=None):

~\Anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
   2982                                                       drop_level=drop_level)
   2983         else:
-> 2984             loc = self.index.get_loc(key)
   2985 
   2986             if isinstance(loc, np.ndarray):

~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3063                 return self._engine.get_loc(key)
   3064             except KeyError:
-> 3065                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   3066 
   3067         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0

data_df_reset.ix[1,1]

Output

'Chulmin'
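
Because .ix no longer exists in pandas 1.0 and later, the lookups above do not run on current pandas. The following is a minimal sketch of the equivalent explicit calls, rebuilding the same data_df and data_df_reset as above. The variable names by_position, by_label, and mixed are only for illustration.

```python
import pandas as pd

data = {'Name': ['Chulmin', 'Eunkyung', 'Jinwoong', 'Soobeom'],
        'Year': [2011, 2016, 2015, 2015],
        'Gender': ['Male', 'Female', 'Male', 'Male']}
data_df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

# Same construction as above: integer index that starts at 1.
data_df_reset = data_df.reset_index().rename(columns={'index': 'old_index'})
data_df_reset.index = data_df_reset.index + 1

# Positional lookup: first row, second column, whatever the index labels are.
by_position = data_df_reset.iloc[0, 1]

# Label lookup: the row labelled 1 and the column named 'Name'.
by_label = data_df_reset.loc[1, 'Name']

# Mixed access (row label, column position): turn the position into a label first.
mixed = data_df.loc['one', data_df.columns[0]]

print(by_position, by_label, mixed)
```

Writing the intent explicitly (label or position) removes the ambiguity that made ix[0, 1] fail on an integer index starting at 1.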

DataFrame iloc[ ] operator

data_df.iloc[0, 0]

Output

'Chulmin'

## The code below raises an error.
data_df.iloc[0, 'Name']

Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    221             try:
--> 222                 self._validate_key(k, i)
    223             except ValueError:

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1970             raise ValueError("Can only index by location with "
-> 1971                              "a [{types}]".format(types=self._valid_types))
   1972 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-40-ab5240d8ed9d> in <module>()
      1 ## The code below raises an error.
----> 2 data_df.iloc[0, 'Name']

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             ## we by definition only have the 0th axis

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   2011     def _getitem_tuple(self, tup):
   2012 
-> 2013         self._has_valid_tuple(tup)
   2014         try:
   2015             return self._getitem_lowerdim(tup)

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    224                 raise ValueError("Location based indexing can only have "
    225                                  "[{types}] types"
--> 226                                  .format(types=self._valid_types))
    227 
    228     def _is_nested_tuple_indexer(self, tup):

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

## The code below raises an error.
data_df.iloc['one', 0]

Output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    221             try:
--> 222                 self._validate_key(k, i)
    223             except ValueError:

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1970             raise ValueError("Can only index by location with "
-> 1971                              "a [{types}]".format(types=self._valid_types))
   1972 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-41-0fe0a94ee06c> in <module>()
      1 ## The code below raises an error.
----> 2 data_df.iloc['one', 0]

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             ## we by definition only have the 0th axis

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   2011     def _getitem_tuple(self, tup):
   2012 
-> 2013         self._has_valid_tuple(tup)
   2014         try:
   2015             return self._getitem_lowerdim(tup)

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _has_valid_tuple(self, key)
    224                 raise ValueError("Location based indexing can only have "
    225                                  "[{types}] types"
--> 226                                  .format(types=self._valid_types))
    227 
    228     def _is_nested_tuple_indexer(self, tup):

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

data_df_reset.iloc[0, 1]

Output

'Chulmin'

DataFrame loc[ ] operator

data_df.loc['one', 'Name']

Output

'Chulmin'

data_df_reset.loc[1, 'Name']

Output

'Chulmin'

## The code below raises an error.
data_df_reset.loc[0, 'Name']

Output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1789                 if not ax.contains(key):
-> 1790                     error()
   1791             except TypeError as e:

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in error()
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 

KeyError: 'the label [0] is not in the [index]'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-45-5fc7023ea88b> in <module>()
      1 ## The code below raises an error.
----> 2 data_df_reset.loc[0, 'Name']

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
   1470             except (KeyError, IndexError):
   1471                 pass
-> 1472             return self._getitem_tuple(key)
   1473         else:
   1474             ## we by definition only have the 0th axis

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
    868     def _getitem_tuple(self, tup):
    869         try:
--> 870             return self._getitem_lowerdim(tup)
    871         except IndexingError:
    872             pass

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
    996         for i, key in enumerate(tup):
    997             if is_label_like(key) or isinstance(key, tuple):
--> 998                 section = self._getitem_axis(key, axis=i)
    999 
   1000                 ## we have yielded a scalar ?

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1909 
   1910         ## fall thru to straight lookup
-> 1911         self._validate_key(key, axis)
   1912         return self._get_label(key, axis=axis)
   1913 

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_key(self, key, axis)
   1796                 raise
   1797             except:
-> 1798                 error()
   1799 
   1800     def _is_scalar_access(self, key):

~\Anaconda3\lib\site-packages\pandas\core\indexing.py in error()
   1783                 raise KeyError(u"the label [{key}] is not in the [{axis}]"
   1784                                .format(key=key,
-> 1785                                        axis=self.obj._get_axis_name(axis)))
   1786 
   1787             try:

KeyError: 'the label [0] is not in the [index]'

print('Label-based ix slicing\n', data_df.ix['one':'two', 'Name'],'\n')
print('Position-based iloc slicing\n', data_df.iloc[0:1, 0],'\n')
print('Label-based loc slicing\n', data_df.loc['one':'two', 'Name'])

Output

Label-based ix slicing
 one     Chulmin
two    Eunkyung
Name: Name, dtype: object 

Position-based iloc slicing
 one    Chulmin
Name: Name, dtype: object 

Label-based loc slicing
 one     Chulmin
two    Eunkyung
Name: Name, dtype: object

print(data_df_reset.loc[1:2 , 'Name'])

Output

1     Chulmin
2    Eunkyung
Name: Name, dtype: object
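
The outputs above reflect a difference in slice semantics: loc includes the end label, while iloc excludes the end position as ordinary Python slicing does. A small sketch on a cut-down version of data_df (only the Name column is rebuilt here) makes the contrast explicit.

```python
import pandas as pd

data_df = pd.DataFrame(
    {'Name': ['Chulmin', 'Eunkyung', 'Jinwoong', 'Soobeom']},
    index=['one', 'two', 'three', 'four'])

# loc slices by label and includes the end label 'two'.
loc_slice = data_df.loc['one':'two', 'Name']

# iloc slices by position and excludes the end position.
iloc_two_rows = data_df.iloc[0:2, 0]
iloc_one_row = data_df.iloc[0:1, 0]

print(list(loc_slice))
print(list(iloc_two_rows))
print(list(iloc_one_row))
```

To select the same two rows, the label slice ends at 'two' while the positional slice ends at 2.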

print(data_df.ix[1:2 , 'Name'])

Output

two    Eunkyung
Name: Name, dtype: object

Boolean indexing

titanic_df = pd.read_csv('titanic_train.csv')
titanic_boolean = titanic_df[titanic_df['Age'] > 60]
print(type(titanic_boolean))
titanic_boolean

Output

<class 'pandas.core.frame.DataFrame'>

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
33	34	0	2	Wheadon, Mr...	male	66.0	0	0	C.A. 24579	10.5000	NaN	S
54	55	0	1	Ostby, Mr. ...	male	65.0	0	1	113509	61.9792	B30	C
96	97	0	1	Goldschmidt...	male	71.0	0	0	PC 17754	34.6542	A5	C
116	117	0	3	Connors, Mr...	male	70.5	0	0	370369	7.7500	NaN	Q
170	171	0	1	Van der hoe...	male	61.0	0	0	111240	33.5000	B19	S
252	253	0	1	Stead, Mr. ...	male	62.0	0	0	113514	26.5500	C87	S
275	276	1	1	Andrews, Mi...	female	63.0	1	0	13502	77.9583	D7	S
280	281	0	3	Duane, Mr. ...	male	65.0	0	0	336439	7.7500	NaN	Q
326	327	0	3	Nysveen, Mr...	male	61.0	0	0	345364	6.2375	NaN	S
438	439	0	1	Fortune, Mr...	male	64.0	1	4	19950	263.0000	C23 C25 C27	S
456	457	0	1	Millet, Mr....	male	65.0	0	0	13509	26.5500	E38	S
483	484	1	3	Turkula, Mr...	female	63.0	0	0	4134	9.5875	NaN	S
493	494	0	1	Artagaveyti...	male	71.0	0	0	PC 17609	49.5042	NaN	C
545	546	0	1	Nicholson, ...	male	64.0	0	0	693	26.0000	NaN	S
555	556	0	1	Wright, Mr....	male	62.0	0	0	113807	26.5500	NaN	S
570	571	1	2	Harris, Mr....	male	62.0	0	0	S.W./PP 752	10.5000	NaN	S
625	626	0	1	Sutton, Mr....	male	61.0	0	0	36963	32.3208	D50	S
630	631	1	1	Barkworth, ...	male	80.0	0	0	27042	30.0000	A23	S
672	673	0	2	Mitchell, M...	male	70.0	0	0	C.A. 24580	10.5000	NaN	S
745	746	0	1	Crosby, Cap...	male	70.0	1	1	WE/P 5735	71.0000	B22	S
829	830	1	1	Stone, Mrs....	female	62.0	0	0	113572	80.0000	B28	NaN
851	852	0	3	Svensson, M...	male	74.0	0	0	347060	7.7750	NaN	S

titanic_df[titanic_df['Age'] > 60][['Name','Age']].head(3)

Output

	Name	Age
33	Wheadon, Mr…	66.0
54	Ostby, Mr. …	65.0
96	Goldschmidt…	71.0

titanic_df.loc[titanic_df['Age'] > 60, ['Name','Age']].head(3)

Output

	Name	Age
33	Wheadon, Mr…	66.0
54	Ostby, Mr. …	65.0
96	Goldschmidt…	71.0

titanic_df[ (titanic_df['Age'] > 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
275	276	1	1	Andrews, Mi…	female	63.0	1	0	13502	77.9583	D7	S
829	830	1	1	Stone, Mrs….	female	62.0	0	0	113572	80.0000	B28	NaN

cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass']==1
cond3 = titanic_df['Sex']=='female'
titanic_df[ cond1 & cond2 & cond3]
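
Boolean conditions can also be combined with | (or) and ~ (not); each comparison needs its own parentheses or its own variable because these operators bind more tightly than the comparisons. The sketch below uses a small hypothetical DataFrame (not rows from titanic_train.csv) so that it runs on its own.

```python
import pandas as pd

# Hypothetical passenger rows for illustration only.
df = pd.DataFrame({
    'Age': [22.0, 63.0, 62.0, 71.0],
    'Pclass': [3, 1, 1, 1],
    'Sex': ['male', 'female', 'female', 'male'],
})

is_old = df['Age'] > 60
is_first = df['Pclass'] == 1
is_female = df['Sex'] == 'female'

old_first_female = df[is_old & is_first & is_female]  # all conditions hold
old_or_female = df[is_old | is_female]                # at least one holds
not_old = df[~is_old]                                 # negation

# DataFrame.query() expresses the same AND filter as a string.
same_filter = df.query("Age > 60 and Pclass == 1 and Sex == 'female'")

print(old_first_female)
```

Naming each condition, as the cond1, cond2, cond3 example above does, keeps long filters readable.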

Sorting, aggregation functions, and applying GroupBy

Sorting a DataFrame or Series - sort_values()

titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
845	846	0	3	Abbing, Mr….	male	42.0	0	0	C.A. 5547	7.55	NaN	S
746	747	0	3	Abbott, Mr….	male	16.0	1	1	C.A. 2673	20.25	NaN	S
279	280	1	3	Abbott, Mrs…	female	35.0	1	1	C.A. 2673	20.25	NaN	S

titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'], ascending=False)
titanic_sorted.head(3)

Output

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
869	0	3	van Melkebe...	male	NaN	0	0	345777	9.5	NaN	S
154	0	3	van Billiar...	male	40.5	0	2	A/5. 851	14.5	NaN	S
283	0	3	de Pelsmaek...	male	16.0	0	0	345778	9.5	NaN	S
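
The ascending parameter also accepts a list, one flag per sort column, so each column can be sorted in its own direction. A minimal sketch with hypothetical rows (the names are shortened placeholders, not the full Titanic values):

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Name': ['Braund', 'Cumings', 'Allen', 'Andrews'],
})

# Pclass ascending, then Name descending within each class.
mixed_order = df.sort_values(by=['Pclass', 'Name'], ascending=[True, False])

print(mixed_order)
```

sort_values() returns a new DataFrame and leaves df unchanged unless inplace=True is passed.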

Applying aggregation functions

titanic_df.count()

Output

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

titanic_df[['Age', 'Fare']].mean()

Output

Age     29.699118
Fare    32.204208
dtype: float64

Using groupby()

titanic_groupby = titanic_df.groupby(by='Pclass')
print(type(titanic_groupby))

Output

<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>

titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

Output

	PassengerId	Survived	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Pclass
1	216	216	216	216	186	216	216	216	216	176	214
2	184	184	184	184	173	184	184	184	184	16	184
3	491	491	491	491	355	491	491	491	491	12	491

titanic_groupby = titanic_df.groupby('Pclass')[['PassengerId', 'Survived']].count()
titanic_groupby

Output

PassengerId	Survived
Pclass		
1	216	216
2	184	184
3	491	491

titanic_df.groupby('Pclass')['Age'].agg(['max', 'min'])

Output

	max	min
Pclass
1	80.0	0.92
2	70.0	0.67
3	74.0	0.42

agg_format={'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)

Output

	Age	SibSp	Fare
Pclass
1	80.0	90	84.154687
2	70.0	74	20.662183
3	74.0	302	13.675550
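
When one column needs several aggregations with explicit output names, named aggregation (available since pandas 0.25) is an alternative to the dictionary form. The sketch below uses hypothetical rows; the output column names age_max, age_min, and fare_mean are chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [38.0, 35.0, 14.0, 22.0, 26.0],
    'Fare': [71.0, 53.0, 30.0, 7.0, 9.0],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby('Pclass').agg(
    age_max=('Age', 'max'),
    age_min=('Age', 'min'),
    fare_mean=('Fare', 'mean'),
)

print(summary)
```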

Handling missing data

Checking for missing data with isna()

titanic_df.isna().head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False

titanic_df.isna( ).sum( )

Output

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Replacing missing data with fillna( )

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)

Output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr....	male	22.0	1	0	A/5 21171	7.2500	C000	S
1	2	1	1	Cumings, Mr...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, ...	female	26.0	0	0	STON/O2. 31...	7.9250	C000	S

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.isna().sum()

Output

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
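
The three column-by-column fillna() calls can also be written as a single call that takes a dictionary of per-column fill values. This is a sketch on a small hypothetical DataFrame; the fill values mirror the ones used above (mean age, 'C000', 'S').

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing values, for illustration only.
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Embarked': ['S', 'C', np.nan],
})

# One fill value per column.
fill_values = {'Age': df['Age'].mean(), 'Cabin': 'C000', 'Embarked': 'S'}

# fillna() returns a new DataFrame; df itself is left unchanged.
filled = df.fillna(value=fill_values)

print(filled.isna().sum())
```

Assigning the result back, as the code above does, is what makes the replacement stick.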

Processing data with apply and lambda expressions

def get_square(a):
    return a**2

print('The square of 3 is:',get_square(3))

Output

The square of 3 is: 9

lambda_square = lambda x : x ** 2
print('The square of 3 is:',lambda_square(3))

Output

The square of 3 is: 9

a=[1,2,3]
squares = map(lambda x : x**2, a)
list(squares)

Output

[1, 4, 9]

titanic_df['Name_len']= titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name','Name_len']].head(3)

Output

	Name	Name_len
0	Braund, Mr….	23
1	Cumings, Mr…	51
2	Heikkinen, …	22

titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else 'Adult' )
titanic_df[['Age','Child_Adult']].head(8)

Output

Age	Child_Adult
22.000000	Adult
38.000000	Adult
26.000000	Adult
35.000000	Adult
35.000000	Adult
29.699118	Adult
54.000000	Adult
2.000000	Child

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x<=15 else ('Adult' if x <= 60 else 
                                                                                  'Elderly'))
titanic_df['Age_cat'].value_counts()

Output

Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64

## Function that assigns a finer-grained category based on age.
def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

## Use the get_category( ) function defined above as the return value of the lambda expression.
## get_category(x) takes an 'Age' column value as input and returns the matching cat.
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age','Age_cat']].head()

Output

Age	Age_cat
22.0	Student
38.0	Adult
26.0	Young Adult
35.0	Young Adult
35.0	Young Adult
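
For range-based categories like these, pandas.cut() is a vectorized alternative to apply(). The sketch below is an assumption about how the same boundaries could be expressed as bins, not the book's method. It restates get_category() compactly and checks that both approaches agree on a few sample ages.

```python
import numpy as np
import pandas as pd

def get_category(age):
    # Same thresholds as the function above.
    if age <= 5: return 'Baby'
    elif age <= 12: return 'Child'
    elif age <= 18: return 'Teenager'
    elif age <= 25: return 'Student'
    elif age <= 35: return 'Young Adult'
    elif age <= 60: return 'Adult'
    else: return 'Elderly'

# Hypothetical sample ages, including values on the boundaries.
ages = pd.Series([2.0, 12.0, 18.0, 22.0, 35.0, 54.0, 61.0])

# Bins are right-inclusive by default, which matches the <= comparisons.
bins = [-np.inf, 5, 12, 18, 25, 35, 60, np.inf]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

by_cut = pd.cut(ages, bins=bins, labels=labels).astype(str)
by_apply = ages.apply(get_category)

print((by_cut == by_apply).all())
```

One difference to keep in mind: pd.cut() leaves a missing age as missing, whereas get_category() falls through to 'Elderly' for NaN because every comparison with NaN is false.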

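05 Summary

Getting started with machine learning using scikit-learn

01. Introduction to scikit-learn and its features

scikit-learn is the most widely used of the Python machine learning libraries.

Public attention has recently shifted toward deep learning libraries such as TensorFlow and Keras, but scikit-learn remains the standard Python ML library that many data analysts rely on.

Features of scikit-learn

It provides an easy and highly Pythonic API, to the point that other Python machine learning packages aim for a scikit-learn-style API.
It provides a wide range of machine learning algorithms along with a convenient framework and API for development.
It is a mature library that has been proven in production over a long period and is used in a very large number of environments.

Installing with conda

conda install scikit-learn

Installing with pip

pip install scikit-learn

Printing the scikit-learn version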

import sklearn

print(sklearn.__version__)

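02. Building a first machine learning model - predicting iris species

Classification is one of the main supervised learning methods. Supervised learning trains a model on features together with label data (the decision values), then predicts the unknown labels of a separate test data set. In other words, the model first learns from data whose correct answers are given and then predicts answers it has not seen.

sklearn.datasets is a collection of modules that generate the data sets bundled with scikit-learn.

sklearn.tree is a collection of classes that implement tree-based ML algorithms.

sklearn.model_selection is a collection of modules for splitting data into training, validation, and test sets and for searching for and evaluating the best hyperparameters.

load_iris() creates the iris data set. The ML algorithm is a decision tree, applied through its implementation, DecisionTreeClassifier.

The train_test_split() function splits the data set into training data and test data.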

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

After loading the iris data set with load_iris(), convert it to a DataFrame to see how the features and data values are organized.

import pandas as pd

## Load the iris data set.
iris = load_iris()

## iris.data holds only the feature data of the iris data set, as a NumPy array.
iris_data = iris.data

## iris.target holds the label (decision value) data of the iris data set, as a NumPy array.
iris_label = iris.target
print('iris target values:', iris_label)
print('iris target names:', iris.target_names)

## Convert the iris data set to a DataFrame for a closer look.
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df.head(3)

Out

iris target values: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
iris target names: ['setosa' 'versicolor' 'virginica']

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	label
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0

Next, split the data into training and test sets. A separate test data set is needed to evaluate how well the model trained on the training data performs.

scikit-learn provides the train_test_split() API for this.

train_test_split() splits the data into training and test sets according to the ratio given by the test_size parameter.

For example, test_size=0.2 assigns 20% of the data to the test set and 80% to the training set.

Call train_test_split() first, then look at its input parameters and return values in more detail.

X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_label, 
                                                    test_size=0.2, random_state=11)

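The first parameter, iris_data, is the feature data set. The second parameter, iris_label, is the label data set. test_size=0.2 is the proportion of the whole data set used as test data. The last parameter, random_state, is a random seed that makes every call produce the same training/test split. If random_state is not specified, each run can produce a different training/test split.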
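
A short sketch to confirm these parameters in practice: with 150 iris samples and test_size=0.2, the split should give 120 training rows and 30 test rows, and repeating the call with the same random_state should reproduce the same test rows. The names X_train2 and X_test2 are introduced here only for the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# First split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# Second split with the same seed, for comparison.
X_train2, X_test2, _, _ = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```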

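With the training data in place, use a decision tree, one of the machine learning classification algorithms, to train and predict. Create an instance of DecisionTreeClassifier, scikit-learn's decision tree class. Calling the fit() method of that object with the training feature data and the training label data performs the training.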

## Create the DecisionTreeClassifier object
dt_clf = DecisionTreeClassifier(random_state=11)

## Train the model
dt_clf.fit(X_train, y_train)

Out

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=11,
            splitter='best')

Calling predict() with the test feature data set returns the trained model's predictions for the test data.

## Predict on the test data set with the trained DecisionTreeClassifier object.
pred = dt_clf.predict(X_test)

scikit-learn provides the accuracy_score() function to measure accuracy. Pass the actual labels as the first parameter and the predicted labels as the second.

from sklearn.metrics import accuracy_score
print('Prediction accuracy: {0:.4f}'.format(accuracy_score(y_test,pred)))

Out

Prediction accuracy: 0.9333
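
Accuracy is simply the fraction of test samples whose predicted label equals the actual label. The sketch below repeats the split and training from above and compares a hand-computed value with accuracy_score(); manual_accuracy and library_accuracy are names introduced here. The exact score may vary across scikit-learn versions, so no number is asserted.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)

# Fraction of test samples where prediction and actual label match.
manual_accuracy = np.mean(pred == y_test)
library_accuracy = accuracy_score(y_test, pred)

print(manual_accuracy, library_accuracy)
```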

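The process used above to classify the iris data set can be summarized as follows.

Split the data set: separate the data into training data and test data.
Train the model: apply an ML algorithm to the training data to train the model.
Predict: use the trained ML model to predict the class (the iris species) of the test data.
Evaluate: compare the predicted values with the actual values of the test data to evaluate the ML model's performance.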

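03. Learning the scikit-learn framework

___Understanding Estimator and the fit( ), predict( ) methods

fit() trains an ML model.
predict() produces predictions from the trained model.
Classes that implement classification algorithms are called Classifier.
Classes that implement regression algorithms are called Regressor.
Classifier and Regressor together are called Estimator.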

Evaluation functions such as cross_val_score() and hyperparameter tuning classes such as GridSearchCV take an Estimator as an argument.
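
A minimal sketch of passing an Estimator to both utilities, using the iris data and DecisionTreeClassifier from the previous section. The fold count cv=3 and the max_depth candidates are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
estimator = DecisionTreeClassifier(random_state=11)

# cross_val_score() receives the unfitted estimator and calls fit()/predict()
# internally on each fold, returning one score per fold.
scores = cross_val_score(estimator, iris.data, iris.target,
                         scoring='accuracy', cv=3)

# GridSearchCV wraps the estimator and tries each parameter combination.
grid = GridSearchCV(estimator, param_grid={'max_depth': [1, 2, 3]}, cv=3)
grid.fit(iris.data, iris.target)

print(scores)
print(grid.best_params_)
```

Because both utilities only rely on fit() and predict(), any Classifier or Regressor can be substituted for the decision tree.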

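In unsupervised learning and feature extraction, fit() does not mean training. It sets up the structure needed to transform the data according to the shape of the input data.

The actual work, such as dimensionality transformation, clustering, or feature extraction, is then performed by transform().

fit_transform() is also provided, which combines fit() and transform().

However, applying fit() and transform() separately differs from applying fit_transform() in one call, so the two should not be swapped without care.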
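
The notes do not say what the difference is. One common reading, assumed here, is that fit_transform() always re-learns its parameters from the data it is given, so calling it on test data discards what was learned from the training data. The sketch uses StandardScaler and tiny hypothetical arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# fit() learns the mean and scale from the training data only.
scaler = StandardScaler()
scaler.fit(X_train)
train_two_steps = scaler.transform(X_train)

# On the same data, fit_transform() gives the same result as fit() then transform().
train_one_step = StandardScaler().fit_transform(X_train)

# Test data reuses the training statistics through transform() ...
test_scaled = scaler.transform(X_test)

# ... whereas fit_transform() on the test data re-learns statistics from it.
test_refit = StandardScaler().fit_transform(X_test)

print(test_scaled, test_refit)
```

Under this reading, fit_transform() suits the training data, and plain transform() with the already fitted object suits the test data.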

___Main modules of scikit-learn

Category	Module	Description
Example data	sklearn.datasets	Example data sets bundled with scikit-learn
Feature processing	sklearn.preprocessing	Various preprocessing functions (encoding strings as numeric codes, normalization, scaling, and so on)
	sklearn.feature_selection	Functions for selecting, in order of priority, the features that most affect the algorithm
	sklearn.feature_extraction	Extracts vectorized features from text or image data, for example Count Vectorizer or Tf-Idf Vectorizer for text. Text feature extraction is in the sklearn.feature_extraction.text module and image feature extraction is in the sklearn.feature_extraction.image module
Feature processing & dimensionality reduction	sklearn.decomposition	Algorithms for dimensionality reduction, such as PCA, NMF, and Truncated SVD
Data splitting, validation & parameter tuning	sklearn.model_selection	APIs for splitting data into training/test sets for cross-validation and for finding the best parameters with grid search
Evaluation	sklearn.metrics	Performance metrics for classification, regression, clustering, and pairwise comparisons, including Accuracy, Precision, Recall, ROC-AUC, and RMSE
ML algorithms	sklearn.ensemble	Ensemble algorithms such as random forest, AdaBoost, and gradient boosting
	sklearn.linear_model	Mainly linear regression, Ridge, Lasso, and logistic regression, plus SGD (Stochastic Gradient Descent) based algorithms
	sklearn.naive_bayes	Naive Bayes algorithms such as Gaussian NB and multinomial NB
	sklearn.neighbors	Nearest neighbor algorithms such as K-NN
	sklearn.svm	Support vector machine algorithms
	sklearn.tree	Decision tree algorithms
	sklearn.cluster	Unsupervised clustering algorithms (K-means, hierarchical, DBSCAN, and so on)
Utility	sklearn.pipeline	Utilities for chaining feature transformations with ML algorithm training and prediction
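
As an illustration of the last row, the sketch below chains a preprocessing step and a classifier with sklearn.pipeline.Pipeline. The step names 'scaler' and 'clf' are arbitrary labels chosen here, and scaling is included only to show the chaining (decision trees do not require it).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

# Each step is a (name, object) pair; the last step is the estimator.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=11)),
])

# fit() fits the scaler on the training data, transforms it, then trains the tree.
pipe.fit(X_train, y_train)

# predict() applies the fitted scaler to the test data before predicting.
pred = pipe.predict(X_test)

print(pred.shape)
```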

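___Built-in example data sets

Example data for practicing classification or regression

API name	Description
datasets.load_boston()	Regression; features and prices of houses in Boston, USA (removed in scikit-learn 1.2)
datasets.load_breast_cancer()	Classification; Wisconsin breast cancer features with malignant/benign labels
datasets.load_diabetes()	Regression; diabetes data set
datasets.load_digits()	Classification; image pixel data for the digits 0 to 9
datasets.load_iris()	Classification; features of iris flowers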

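The fetch functions cover data sets that are too large to ship with the package. They are downloaded from the internet into a scikit_learn_data subdirectory under the home directory and loaded from there afterward.

fetch_covtype(): forest cover type survey data (a classification data set)
fetch_20newsgroups(): newsgroup text data
fetch_olivetti_faces(): face image data
fetch_lfw_people(): face image data
fetch_lfw_pairs(): face image data
fetch_rcv1(): Reuters news corpus
fetch_mldata(): download from the mldata.org website (removed from recent scikit-learn releases; fetch_openml() is its replacement)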

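Sample data generators for classification and clustering

API name	Description
datasets.make_classification()	Generates a data set for classification. It can randomly produce noise such as highly correlated or redundant features.
datasets.make_blobs()	Randomly generates a data set for clustering. Given the number of clusters, it makes it easy to create a variety of clustering data sets.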
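
The two generators in the table can be tried as in the following sketch. The sample counts, feature counts, and random_state values are arbitrary choices for illustration, not values from the book.

```python
from sklearn.datasets import make_blobs, make_classification

# 200 points in 2 dimensions, grouped around 3 cluster centers.
X_blobs, y_blobs = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

# 100 samples with 5 features: 2 informative, 1 redundant (a linear
# combination of the informative ones), and the rest pure noise.
X_clf, y_clf = make_classification(n_samples=100, n_features=5, n_informative=2,
                                   n_redundant=1, n_classes=2, random_state=0)

print(X_blobs.shape, sorted(set(y_blobs.tolist())))
print(X_clf.shape, sorted(set(y_clf.tolist())))
```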

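The returned object usually has the keys data, target, target_names, feature_names, and DESCR. Each key refers to the following:

data: the feature data set
target: the label values for classification, or the numeric target values for regression
target_names: the names of the individual labels
feature_names: the names of the features
DESCR: a description of the data set and of each feature

The code below loads the iris data set.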

from sklearn.datasets import load_iris

iris_data = load_iris()
print(type(iris_data))

Out

<class 'sklearn.utils.Bunch'>

The return value of load_iris() is an instance of the sklearn.utils.Bunch class, which behaves much like a Python dictionary.

keys = iris_data.keys()
print('Keys of the iris data set:', keys)

Out

Keys of the iris data set: dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
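
To illustrate what "similar to a dictionary" means here, the following is a simplified, hypothetical stand-in (SimpleBunch is an invented name, not scikit-learn's actual implementation). It sketches how one object can support both key access and attribute access.

```python
class SimpleBunch(dict):
    """Hypothetical, simplified stand-in for sklearn.utils.Bunch."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; fall back to keys.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Toy values invented for illustration.
toy = SimpleBunch(data=[[5.1, 3.5], [4.9, 3.0]], target=[0, 0],
                  target_names=['setosa'])

by_key = toy['target_names']
by_attr = toy.target_names
print(by_key is by_attr)
print(list(toy.keys()))
```

This is why both iris_data.data and iris_data['data'] appear in the examples that follow.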

The data key refers to the feature values.

The following example prints the values behind the keys of the object returned by load_iris(): 'data', 'target', 'target_names', and 'feature_names'.

print('\n feature_names 의 type:',type(iris_data.feature_names))
print(' feature_names 의 shape:',len(iris_data.feature_names))
print(iris_data.feature_names)

print('\n target_names 의 type:',type(iris_data.target_names))
print(' target_names 의 shape:',len(iris_data.target_names))
print(iris_data.target_names)

print('\n data 의 type:',type(iris_data.data))
print(' data 의 shape:',iris_data.data.shape)
print(iris_data['data'])

print('\n target 의 type:',type(iris_data.target))
print(' target 의 shape:',iris_data.target.shape)
print(iris_data.target)

Out

 feature_names 의 type: <class 'list'>
 feature_names 의 shape: 4
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

 target_names 의 type: <class 'numpy.ndarray'>
 target_names 의 shape: 3
['setosa' 'versicolor' 'virginica']

 data 의 type: <class 'numpy.ndarray'>
 data 의 shape: (150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.9 1.5]
 [5.5 2.3 4.  1.3]
 [6.5 2.8 4.6 1.5]
 [5.7 2.8 4.5 1.3]
 [6.3 3.3 4.7 1.6]
 [4.9 2.4 3.3 1. ]
 [6.6 2.9 4.6 1.3]
 [5.2 2.7 3.9 1.4]
 [5.  2.  3.5 1. ]
 [5.9 3.  4.2 1.5]
 [6.  2.2 4.  1. ]
 [6.1 2.9 4.7 1.4]
 [5.6 2.9 3.6 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.8 2.7 4.1 1. ]
 [6.2 2.2 4.5 1.5]
 [5.6 2.5 3.9 1.1]
 [5.9 3.2 4.8 1.8]
 [6.1 2.8 4.  1.3]
 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [6.4 2.9 4.3 1.3]
 [6.6 3.  4.4 1.4]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.9 4.5 1.5]
 [5.7 2.6 3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [5.5 2.4 3.7 1. ]
 [5.8 2.7 3.9 1.2]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [6.3 2.3 4.4 1.3]
 [5.6 3.  4.1 1.3]
 [5.5 2.5 4.  1.3]
 [5.5 2.6 4.4 1.2]
 [6.1 3.  4.6 1.4]
 [5.8 2.6 4.  1.2]
 [5.  2.3 3.3 1. ]
 [5.6 2.7 4.2 1.3]
 [5.7 3.  4.2 1.2]
 [5.7 2.9 4.2 1.3]
 [6.2 2.9 4.3 1.3]
 [5.1 2.5 3.  1.1]
 [5.7 2.8 4.1 1.3]
 [6.3 3.3 6.  2.5]
 [5.8 2.7 5.1 1.9]
 [7.1 3.  5.9 2.1]
 [6.3 2.9 5.6 1.8]
 [6.5 3.  5.8 2.2]
 [7.6 3.  6.6 2.1]
 [4.9 2.5 4.5 1.7]
 [7.3 2.9 6.3 1.8]
 [6.7 2.5 5.8 1.8]
 [7.2 3.6 6.1 2.5]
 [6.5 3.2 5.1 2. ]
 [6.4 2.7 5.3 1.9]
 [6.8 3.  5.5 2.1]
 [5.7 2.5 5.  2. ]
 [5.8 2.8 5.1 2.4]
 [6.4 3.2 5.3 2.3]
 [6.5 3.  5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.7 2.6 6.9 2.3]
 [6.  2.2 5.  1.5]
 [6.9 3.2 5.7 2.3]
 [5.6 2.8 4.9 2. ]
 [7.7 2.8 6.7 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.1]
 [7.2 3.2 6.  1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.1]
 [7.2 3.  5.8 1.6]
 [7.4 2.8 6.1 1.9]
 [7.9 3.8 6.4 2. ]
 [6.4 2.8 5.6 2.2]
 [6.3 2.8 5.1 1.5]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [6.3 3.4 5.6 2.4]
 [6.4 3.1 5.5 1.8]
 [6.  3.  4.8 1.8]
 [6.9 3.1 5.4 2.1]
 [6.7 3.1 5.6 2.4]
 [6.9 3.1 5.1 2.3]
 [5.8 2.7 5.1 1.9]
 [6.8 3.2 5.9 2.3]
 [6.7 3.3 5.7 2.5]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]

 target 의 type: <class 'numpy.ndarray'>
 target 의 shape: (150,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

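04. Introduction to the Model Selection Module

We start with train_test_split(), which splits the full data set into training and test data sets.

___Splitting training/test data sets - train_test_split()

The following shows the result of training and predicting on the same data set.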

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
dt_clf = DecisionTreeClassifier()
train_data = iris.data
train_label = iris.target
dt_clf.fit(train_data, train_label)

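## Predict on the training data set itself
pred = dt_clf.predict(train_data)
print('Prediction accuracy:',accuracy_score(train_label,pred))

Out

Prediction accuracy: 1.0

Because the model is evaluated on the very data it was trained on, the accuracy comes out as a misleading 100%.

train_test_split parameters

test_size: the proportion of the full data set to sample as the test set. The default is 0.25.
train_size: the proportion of the full data set to sample as the training set.
shuffle: whether to shuffle the data before splitting. The default is True.
random_state: a seed value that makes each call produce the same training/test split. If it is not set, the split is random on every call.

The return value of train_test_split() is a tuple. It contains, in order, the training feature data, the test feature data, the training label data, and the test label data.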

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dt_clf = DecisionTreeClassifier( )
iris_data = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, 
                                                    test_size=0.3, random_state=121)

dt_clf.fit(X_train, y_train)
pred = dt_clf.predict(X_test)
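print('Prediction accuracy: {0:.4f}'.format(accuracy_score(y_test,pred)))

Out

Prediction accuracy: 0.9556

The accuracy is about 95%, but the iris data set has only 150 samples, so a 30% test split is just 45 samples. That is too few to judge predictive performance reliably.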
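
To make the roles of test_size, shuffle, and random_state more concrete, here is a conceptual sketch in plain Python. simple_train_test_split is a hypothetical helper written for this note. It is not how scikit-learn implements train_test_split() and will not reproduce the same split, but it follows the same idea: shuffle the row indices with a seeded generator, then slice.

```python
import random


def simple_train_test_split(X, y, test_size=0.25, shuffle=True, random_state=None):
    """Hypothetical helper mimicking the idea behind train_test_split()."""
    indices = list(range(len(X)))
    if shuffle:
        # A fixed seed reproduces the same shuffle, hence the same split.
        random.Random(random_state).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up data with the same size as the iris data set (150 rows, 3 classes).
X = [[i, i * 2] for i in range(150)]
y = [i // 50 for i in range(150)]

split_a = simple_train_test_split(X, y, test_size=0.3, random_state=121)
split_b = simple_train_test_split(X, y, test_size=0.3, random_state=121)

print(len(split_a[0]), len(split_a[1]))
print(split_a == split_b)
```

The return order matches the tuple described above: training features, test features, training labels, test labels.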

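___Cross-validation

Overfitting is when a model is tuned so closely to the training data that its performance drops on real predictions. Cross-validation mitigates this by running training and evaluation several times on different splits of the data.

K-fold

The most widely used cross-validation technique.

The data set is divided into K equal parts (folds).

The model is trained on K-1 folds and validated on the remaining fold, and this is repeated K times so that each fold serves as the validation set once.

Finally, the K evaluation scores are averaged and used as the overall evaluation result.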
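
The fold bookkeeping described above can be sketched without any library. kfold_indices is a hypothetical helper, and it assumes the sample count divides evenly by the number of folds and that no shuffling is done.

```python
def kfold_indices(n_samples, n_splits):
    """Hypothetical helper: yield (train_idx, valid_idx) for contiguous folds.

    Assumes n_samples is divisible by n_splits to keep the sketch short.
    """
    fold_size = n_samples // n_splits
    all_idx = list(range(n_samples))
    for k in range(n_splits):
        valid_idx = all_idx[k * fold_size:(k + 1) * fold_size]
        train_idx = all_idx[:k * fold_size] + all_idx[(k + 1) * fold_size:]
        yield train_idx, valid_idx


folds = list(kfold_indices(n_samples=150, n_splits=5))
for i, (train_idx, valid_idx) in enumerate(folds, start=1):
    print('fold', i, 'train size', len(train_idx),
          'validation range', valid_idx[0], '-', valid_idx[-1])
```

Each sample appears in exactly one validation fold and in the training portion of the other four.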

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import numpy as np

iris = load_iris()
features = iris.data
label = iris.target
dt_clf = DecisionTreeClassifier(random_state=156)

## Create a KFold object that splits the data into 5 folds, and a list to hold the accuracy of each fold.
kfold = KFold(n_splits=5)
cv_accuracy = []
print('Iris data set size:',features.shape[0])

Out

Iris data set size: 150

With 150 samples and KFold(n_splits=5), each validation fold holds 30 samples.

n_iter = 0

## Calling split() on the KFold object returns, for each fold, arrays of row indices for the training and validation data
for train_index, test_index  in kfold.split(features):
    ## Use the indices returned by kfold.split() to extract the training and validation data
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    ## Train and predict
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)
    n_iter += 1
    ## Measure accuracy on each iteration
    accuracy = np.round(accuracy_score(y_test,pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} Cross-validation accuracy :{1}, training data size: {2}, validation data size: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} Validation set indices:{1}'.format(n_iter,test_index))
    cv_accuracy.append(accuracy)
    
## Average the per-iteration accuracies to get the mean accuracy
print('\n### Mean validation accuracy:', np.mean(cv_accuracy)) 

Out

#1 Cross-validation accuracy :1.0, training data size: 120, validation data size: 30
#1 Validation set indices:[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]

#2 Cross-validation accuracy :0.9667, training data size: 120, validation data size: 30
#2 Validation set indices:[30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
 54 55 56 57 58 59]

#3 Cross-validation accuracy :0.8667, training data size: 120, validation data size: 30
#3 Validation set indices:[60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89]

#4 Cross-validation accuracy :0.9333, training data size: 120, validation data size: 30
#4 Validation set indices:[ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119]

#5 Cross-validation accuracy :0.7333, training data size: 120, validation data size: 30
#5 Validation set indices:[120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]

### Mean validation accuracy: 0.9

The validation sets are taken in order: indices 0~29, 30~59, ..., 120~149.

Stratified K-fold

A K-fold scheme for label data sets with an imbalanced distribution.

import pandas as pd

iris = load_iris()

iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['label']=iris.target
iris_df['label'].value_counts()

Out

2    50
1    50
0    50
Name: label, dtype: int64

kfold = KFold(n_splits=3)
## kfold.split(X) returns the training/validation row indices, which differ on each of the 3 fold iterations.
n_iter =0
for train_index, test_index  in kfold.split(iris_df):
    n_iter += 1
    label_train= iris_df['label'].iloc[train_index]
    label_test= iris_df['label'].iloc[test_index]
    print('### Cross-validation: {0}'.format(n_iter))
    print('Training label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())

Out

### Cross-validation: 1
Training label distribution:
 2    50
1    50
Name: label, dtype: int64
Validation label distribution:
 0    50
Name: label, dtype: int64
### Cross-validation: 2
Training label distribution:
 2    50
0    50
Name: label, dtype: int64
Validation label distribution:
 1    50
Name: label, dtype: int64
### Cross-validation: 3
Training label distribution:
 1    50
0    50
Name: label, dtype: int64
Validation label distribution:
 2    50
Name: label, dtype: int64

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
n_iter=0

for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train= iris_df['label'].iloc[train_index]
    label_test= iris_df['label'].iloc[test_index]
    print('### Cross-validation: {0}'.format(n_iter))
    print('Training label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())

Out

### Cross-validation: 1
Training label distribution:
 2    33
1    33
0    33
Name: label, dtype: int64
Validation label distribution:
 2    17
1    17
0    17
Name: label, dtype: int64
### Cross-validation: 2
Training label distribution:
 2    33
1    33
0    33
Name: label, dtype: int64
Validation label distribution:
 2    17
1    17
0    17
Name: label, dtype: int64
### Cross-validation: 3
Training label distribution:
 2    34
1    34
0    34
Name: label, dtype: int64
Validation label distribution:
 2    16
1    16
0    16
Name: label, dtype: int64

dt_clf = DecisionTreeClassifier(random_state=156)

skfold = StratifiedKFold(n_splits=3)
n_iter=0
cv_accuracy=[]

## StratifiedKFold's split() also requires the label data set as input
for train_index, test_index  in skfold.split(features, label):
    ## Use the indices returned by split() to extract the training and validation data
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]
    ## Train and predict
    dt_clf.fit(X_train , y_train)    
    pred = dt_clf.predict(X_test)

    ## Measure accuracy on each iteration
    n_iter += 1
    accuracy = np.round(accuracy_score(y_test,pred), 4)
    train_size = X_train.shape[0]
    test_size = X_test.shape[0]
    print('\n#{0} Cross-validation accuracy :{1}, training data size: {2}, validation data size: {3}'
          .format(n_iter, accuracy, train_size, test_size))
    print('#{0} Validation set indices:{1}'.format(n_iter,test_index))
    cv_accuracy.append(accuracy)
    
## Per-fold accuracy and mean accuracy
print('\n### Accuracy per fold:', np.round(cv_accuracy, 4))
print('### Mean validation accuracy:', np.mean(cv_accuracy)) 

Out

#1 Cross-validation accuracy :0.9804, training data size: 99, validation data size: 51
#1 Validation set indices:[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  50
  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 100 101
 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116]

#2 Cross-validation accuracy :0.9216, training data size: 99, validation data size: 51
#2 Validation set indices:[ 17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  67
  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83 117 118
 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133]

#3 Cross-validation accuracy :0.9792, training data size: 102, validation data size: 48
#3 Validation set indices:[ 34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  84  85
  86  87  88  89  90  91  92  93  94  95  96  97  98  99 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]

### Accuracy per fold: [0.9804 0.9216 0.9792]
### Mean validation accuracy: 0.9604

cross_val_score( )

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score , cross_validate
from sklearn.datasets import load_iris
import numpy as np

iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)

data = iris_data.data
label = iris_data.target

## The performance metric is accuracy, with 3 cross-validation folds
scores = cross_val_score(dt_clf , data , label , scoring='accuracy',cv=3)
print('Accuracy per fold:',np.round(scores, 4))
print('Mean validation accuracy:', np.round(np.mean(scores), 4))

Out

Accuracy per fold: [0.9804 0.9216 0.9792]
Mean validation accuracy: 0.9604
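
These scores are identical to the manual StratifiedKFold loop earlier. The reason is that, when cv is an integer and the estimator is a classifier, cross_val_score() uses stratified K-fold splitting internally. The sketch below checks this by passing the splitter explicitly; it assumes the same iris data and decision tree settings as above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)

# Integer cv with a classifier: scikit-learn picks StratifiedKFold internally.
scores_int_cv = cross_val_score(dt_clf, iris_data.data, iris_data.target,
                                scoring='accuracy', cv=3)

# The same splitter passed explicitly.
scores_explicit = cross_val_score(dt_clf, iris_data.data, iris_data.target,
                                  scoring='accuracy', cv=StratifiedKFold(n_splits=3))

scores_match = np.allclose(scores_int_cv, scores_explicit)
print(scores_match)
```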

___GridSearchCV - cross-validation and optimal hyperparameter tuning in one step

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
import pandas as pd

## Load the data and split it into training and test data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, 
                                                    test_size=0.2, random_state=121)
dtree = DecisionTreeClassifier()

#### Define the parameters as a dictionary
parameters = {'max_depth':[1,2,3], 'min_samples_split':[2,3]}

## Evaluate each hyperparameter combination in param_grid using 3 train/validation folds.
#### refit=True is the default. When True, the estimator is retrained with the best parameters found.
grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=3, refit=True)

## Train and evaluate each hyperparameter combination in param_grid on the iris training data.
grid_dtree.fit(X_train, y_train)

## Extract the GridSearchCV results and convert them to a DataFrame
scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', \
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

Out

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.700000	5	0.700	0.7	0.70
1	{'max_depth': 1, 'min_samples_split': 3}	0.700000	5	0.700	0.7	0.70
2	{'max_depth': 2, 'min_samples_split': 2}	0.958333	3	0.925	1.0	0.95
3	{'max_depth': 2, 'min_samples_split': 3}	0.958333	3	0.925	1.0	0.95
4	{'max_depth': 3, 'min_samples_split': 2}	0.966667	1	0.950	1.0	0.95
5	{'max_depth': 3, 'min_samples_split': 3}	0.966667	1	0.950	1.0	0.95

print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dtree.best_score_))

Out

GridSearchCV best parameters: {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.9667

## The estimator returned here was already retrained by GridSearchCV's refit
estimator = grid_dtree.best_estimator_

## best_estimator_ has already been trained with the best hyperparameters
pred = estimator.predict(X_test)
print('Test data set accuracy: {0:.4f}'.format(accuracy_score(y_test,pred)))

Out

Test data set accuracy: 0.9667
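
The six rows in the results table come from enumerating every combination in the parameter grid. As a rough sketch of that enumeration (plain Python, not GridSearchCV's internals), the grid from the example expands as follows. The fit count assumes cv=3 and refit=True as in the example.

```python
from itertools import product

parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}
cv = 3

# Cartesian product of all candidate values, one dict per combination.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

for candidate in candidates:
    print(candidate)

# One fit per (candidate, fold) pair, plus one final refit on all training data.
n_fits = len(candidates) * cv + 1
print('candidates:', len(candidates), 'total fits with refit:', n_fits)
```

The cost grows multiplicatively with every extra parameter value, which is worth keeping in mind before widening a grid.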

05. Data Preprocessing

___Data encoding

Label encoding

from sklearn.preprocessing import LabelEncoder

items=['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서']

## Create a LabelEncoder object, then label-encode with fit() and transform().
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
print('Encoded values:',labels)

Out

Encoded values: [0 1 4 5 3 3 2 2]

print('Encoded classes:',encoder.classes_)

Out

Encoded classes: ['TV' '냉장고' '믹서' '선풍기' '전자렌지' '컴퓨터']

print('Decoded original values:',encoder.inverse_transform([4, 5, 2, 0, 1, 1, 3, 3]))

Out

Decoded original values: ['전자렌지' '컴퓨터' '믹서' 'TV' '냉장고' '냉장고' '선풍기' '선풍기']
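
Conceptually, label encoding sorts the distinct values and replaces each item with its position in that sorted list. The NumPy sketch below illustrates the idea on the same items. It is an illustration only, not LabelEncoder's actual code.

```python
import numpy as np

items = ['TV', '냉장고', '전자렌지', '컴퓨터', '선풍기', '선풍기', '믹서', '믹서']

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the position of each original item within that sorted array.
classes, codes = np.unique(items, return_inverse=True)

print('classes:', classes)
print('codes:  ', codes)

# Decoding is an index lookup into the sorted classes.
decoded = classes[codes]
print('round trip ok:', list(decoded) == items)
```

Because the codes come from sort order, their numeric size carries no meaning about the items themselves.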

One-hot encoding

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

items=['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서']

## First convert the strings to numeric codes with LabelEncoder.
encoder = LabelEncoder()
encoder.fit(items)
labels = encoder.transform(items)
## Reshape to 2-D, as OneHotEncoder expects.
labels = labels.reshape(-1,1)

## Apply one-hot encoding.
oh_encoder = OneHotEncoder()
oh_encoder.fit(labels)
oh_labels = oh_encoder.transform(labels)
print('One-hot encoded data')
print(oh_labels.toarray())
print('Shape of the one-hot encoded data')
print(oh_labels.shape)

Out

One-hot encoded data
[[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]]
Shape of the one-hot encoded data
(8, 6)

import pandas as pd

df = pd.DataFrame({'item':['TV','냉장고','전자렌지','컴퓨터','선풍기','선풍기','믹서','믹서'] })
pd.get_dummies(df)

Out

	item_TV	item_냉장고	item_믹서	item_선풍기	item_전자렌지	item_컴퓨터
0	1	0	0	0	0	0
1	0	1	0	0	0	0
2	0	0	0	0	1	0
3	0	0	0	0	0	1
4	0	0	0	1	0	0
5	0	0	0	1	0	0
6	0	0	1	0	0	0
7	0	0	1	0	0	0
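
Both OneHotEncoder and pd.get_dummies() produce one column per distinct value, with a single 1 in each row. As a sketch of the underlying idea (plain NumPy, with the integer codes taken from the label encoding example above), the same matrix can be built by indexing an identity matrix.

```python
import numpy as np

# Integer codes as produced by label encoding in the example above.
codes = np.array([0, 1, 4, 5, 3, 3, 2, 2])
n_classes = codes.max() + 1

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the codes yields one row per sample.
one_hot = np.eye(n_classes, dtype=int)[codes]

print(one_hot)
print(one_hot.shape)
```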

___Feature scaling and normalization

___StandardScaler

from sklearn.datasets import load_iris
import pandas as pd
## Load the iris data set and convert it to a DataFrame.
iris = load_iris()
iris_data = iris.data
iris_df = pd.DataFrame(data=iris_data, columns=iris.feature_names)

print('Mean of each feature')
print(iris_df.mean())
print('\nVariance of each feature')
print(iris_df.var())

Out

Mean of each feature
sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64

Variance of each feature
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64

from sklearn.preprocessing import StandardScaler

## Create a StandardScaler object
scaler = StandardScaler()
## Transform the data set with StandardScaler by calling fit() and transform().
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

## transform() returns the scaled data as a NumPy ndarray, so convert it back to a DataFrame
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
print('Mean of each feature')
print(iris_df_scaled.mean())
print('\nVariance of each feature')
print(iris_df_scaled.var())

Out

Mean of each feature
sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.637024e-15
petal length (cm)   -1.482518e-15
petal width (cm)    -1.623146e-15
dtype: float64

Variance of each feature
sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
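
The variance above is 1.006711 rather than exactly 1. StandardScaler divides by the population standard deviation (ddof=0), while DataFrame.var() reports the sample variance (ddof=1) by default, so with 150 rows the printed value is 150/149. The sketch below reproduces this with synthetic data (the random values are made up, only the sample count matches the iris data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one feature column: 150 samples, like the iris data.
x = rng.normal(loc=5.8, scale=0.8, size=150)

# Standardization as StandardScaler defines it: population std (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

population_var = z.var(ddof=0)   # what the scaler normalizes to 1
sample_var = z.var(ddof=1)       # what pandas DataFrame.var() reports by default

print(round(population_var, 6))
print(round(sample_var, 6))
print(round(150 / 149, 6))
```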

___MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

## Create a MinMaxScaler object
scaler = MinMaxScaler()
## Transform the data set with MinMaxScaler by calling fit() and transform().
scaler.fit(iris_df)
iris_scaled = scaler.transform(iris_df)

## transform() returns the scaled data as a NumPy ndarray, so convert it back to a DataFrame
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
print('Minimum of each feature')
print(iris_df_scaled.min())
print('\nMaximum of each feature')
print(iris_df_scaled.max())

Out

Minimum of each feature
sepal length (cm)    0.0
sepal width (cm)     0.0
petal length (cm)    0.0
petal width (cm)     0.0
dtype: float64

Maximum of each feature
sepal length (cm)    1.0
sepal width (cm)     1.0
petal length (cm)    1.0
petal width (cm)     1.0
dtype: float64

___Caveats when scaling training and test data
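
The notes leave this section empty. A common caveat under this heading, offered here as an assumption about what was intended, is that the scaler should be fitted on the training data only and then reused for the test data. The toy arrays below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 1-D data invented for illustration: train spans 0..10, test spans 0..5.
train_array = np.arange(0, 11).reshape(-1, 1)
test_array = np.arange(0, 6).reshape(-1, 1)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_array)   # learns min=0, max=10

# Recommended: reuse the scale learned from the training data.
test_scaled_ok = scaler.transform(test_array)

# Pitfall: re-fitting on the test data learns min=0, max=5 instead,
# so the same raw value maps to a different scaled value.
test_scaled_bad = MinMaxScaler().fit_transform(test_array)

print(test_scaled_ok.ravel())
print(test_scaled_bad.ravel())
```

In the second case a raw value of 5 becomes 1.0 in the test set but 0.5 in the training set, so the model would see inconsistent inputs.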

06. Predicting Titanic Survivors with scikit-learn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic_df = pd.read_csv('./titanic_train.csv')
titanic_df.head(3)

Out

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S

print('\n #### Training data info ####  \n')
print(titanic_df.info())

Out

 #### Training data info ####  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('N')
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('N')
print('Number of null values in the data set ',titanic_df.isnull().sum().sum())

Out

Number of null values in the data set  0

print(' Sex value distribution :\n',titanic_df['Sex'].value_counts())
print('\n Cabin value distribution :\n',titanic_df['Cabin'].value_counts())
print('\n Embarked value distribution :\n',titanic_df['Embarked'].value_counts())

Out

 Sex value distribution :
 male      577
female    314
Name: Sex, dtype: int64

 Cabin value distribution :
 N              687
G6               4
C23 C25 C27      4
B96 B98          4
C22 C26          3
F2               3
E101             3
F33              3
D                3
C65              2
B22              2
B28              2
E25              2
D36              2
F G73            2
E24              2
E67              2
B51 B53 B55      2
C78              2
B35              2
F4               2
C125             2
E121             2
D20              2
C93              2
E33              2
B18              2
C52              2
D17              2
E8               2
              ... 
D56              1
B82 B84          1
B80              1
A5               1
F E69            1
E38              1
E12              1
C50              1
D30              1
D49              1
C54              1
F G63            1
D46              1
A32              1
F38              1
C110             1
E31              1
E34              1
C87              1
D37              1
B86              1
C111             1
C47              1
D45              1
T                1
C101             1
D28              1
B73              1
B102             1
A24              1
Name: Cabin, Length: 148, dtype: int64

 Embarked value distribution :
 S    644
C    168
Q     77
N      2
Name: Embarked, dtype: int64

titanic_df['Cabin'] = titanic_df['Cabin'].str[:1]
print(titanic_df['Cabin'].head(3))

Out

0    N
1    C
2    N
Name: Cabin, dtype: object

titanic_df.groupby(['Sex','Survived'])['Survived'].count()

Out

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

sns.barplot(x='Sex', y = 'Survived', data=titanic_df)

Out

<matplotlib.axes._subplots.AxesSubplot at 0x1be04d1cbe0>

sns.barplot(x='Pclass', y='Survived', hue='Sex', data=titanic_df)

Out

<matplotlib.axes._subplots.AxesSubplot at 0x1be04e50d30>

## Return a category name for the given age. Used with DataFrame.apply and a lambda.
def get_category(age):
    cat = ''
    if age <= -1: cat = 'Unknown'
    elif age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

## Make the figure larger for the bar plot
plt.figure(figsize=(10,6))

## Category order for the x-axis
group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

## Call get_category() from a lambda for each value of the 'Age' column
## and store the returned category in a new 'Age_cat' column
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
sns.barplot(x='Age_cat', y = 'Survived', hue='Sex', data=titanic_df, order=group_names)
titanic_df.drop('Age_cat', axis=1, inplace=True)
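
As an alternative sketch, the same binning can be expressed with pd.cut. This is not the book's approach, and the sample ages below are made up. One behavioral difference to be aware of: for a missing age, pd.cut yields NaN, whereas get_category() falls through to 'Elderly'. That does not matter in the flow above, because missing ages were already filled with the mean.

```python
import numpy as np
import pandas as pd


def get_category(age):
    # Same thresholds as the function in the text above.
    if age <= -1: return 'Unknown'
    elif age <= 5: return 'Baby'
    elif age <= 12: return 'Child'
    elif age <= 18: return 'Teenager'
    elif age <= 25: return 'Student'
    elif age <= 35: return 'Young Adult'
    elif age <= 60: return 'Adult'
    return 'Elderly'


group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
               'Young Adult', 'Adult', 'Elderly']
# Right-closed bins (a, b] reproduce the chain of `age <= b` comparisons.
bins = [-np.inf, -1, 5, 12, 18, 25, 35, 60, np.inf]

ages = pd.Series([-1, 0.5, 5, 12, 18.5, 25, 35, 60, 61])  # made-up sample ages
by_cut = pd.cut(ages, bins=bins, labels=group_names).astype(str)
by_apply = ages.apply(get_category)

print(by_cut.tolist())
print((by_cut == by_apply).all())
```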

from sklearn import preprocessing

def encode_features(dataDF):
    features = ['Cabin', 'Sex', 'Embarked']
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(dataDF[feature])
        dataDF[feature] = le.transform(dataDF[feature])
        
    return dataDF

titanic_df = encode_features(titanic_df)
titanic_df.head()

Out

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	0	A/5 21171	7.2500	7	3
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	0	PC 17599	71.2833	2	0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	0	STON/O2. 3101282	7.9250	7	3
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	0	113803	53.1000	2	3
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	0	373450	8.0500	7	3

from sklearn.preprocessing import LabelEncoder

## Null handling
def fillna(df):
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df

## Drop attributes the machine learning algorithm does not need
def drop_features(df):
    df.drop(['PassengerId','Name','Ticket'],axis=1,inplace=True)
    return df

## Label encoding
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin','Sex','Embarked']
    for feature in features:
        le = LabelEncoder()
        le = le.fit(df[feature])
        df[feature] = le.transform(df[feature])
    return df

## Call the data preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df

## Reload the original data, then extract the feature data set and the label data set.
titanic_df = pd.read_csv('./titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df= titanic_df.drop('Survived',axis=1)

X_titanic_df = transform_features(X_titanic_df)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_titanic_df, y_titanic_df, \
                                                  test_size=0.2, random_state=11)

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Create scikit-learn classifiers for decision tree, random forest, and logistic regression
dt_clf = DecisionTreeClassifier(random_state=11)
rf_clf = RandomForestClassifier(random_state=11)
lr_clf = LogisticRegression()

## DecisionTreeClassifier: train, predict, evaluate
dt_clf.fit(X_train , y_train)
dt_pred = dt_clf.predict(X_test)
print('DecisionTreeClassifier accuracy: {0:.4f}'.format(accuracy_score(y_test, dt_pred)))

## RandomForestClassifier: train, predict, evaluate
rf_clf.fit(X_train , y_train)
rf_pred = rf_clf.predict(X_test)
print('RandomForestClassifier accuracy:{0:.4f}'.format(accuracy_score(y_test, rf_pred)))

## LogisticRegression: train, predict, evaluate
lr_clf.fit(X_train , y_train)
lr_pred = lr_clf.predict(X_test)
print('LogisticRegression accuracy: {0:.4f}'.format(accuracy_score(y_test, lr_pred)))

Out

DecisionTreeClassifier accuracy: 0.7877
RandomForestClassifier accuracy:0.8324
LogisticRegression accuracy: 0.8659

from sklearn.model_selection import KFold

def exec_kfold(clf, folds=5):
    ## Create a KFold object with the given number of folds and a list to store the accuracy of each fold.
    kfold = KFold(n_splits=folds)
    scores = []
    
    ## Run KFold cross-validation.
    for iter_count , (train_index, test_index) in enumerate(kfold.split(X_titanic_df)):
        ## Use each fold's indices to extract the training and validation data from X_titanic_df
        X_train, X_test = X_titanic_df.values[train_index], X_titanic_df.values[test_index]
        y_train, y_test = y_titanic_df.values[train_index], y_titanic_df.values[test_index]
        
        ## Train the classifier, predict, and compute accuracy
        clf.fit(X_train, y_train) 
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        scores.append(accuracy)
        print("Cross-validation {0} accuracy: {1:.4f}".format(iter_count, accuracy))     
    
    ## Compute the mean accuracy over all folds.
    mean_score = np.mean(scores)
    print("Mean accuracy: {0:.4f}".format(mean_score)) 

## Call exec_kfold
exec_kfold(dt_clf , folds=5) 

Out

Cross-validation 0 accuracy: 0.7542
Cross-validation 1 accuracy: 0.7809
Cross-validation 2 accuracy: 0.7865
Cross-validation 3 accuracy: 0.7697
Cross-validation 4 accuracy: 0.8202
Mean accuracy: 0.7823

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt_clf, X_titanic_df , y_titanic_df , cv=5)
for iter_count,accuracy in enumerate(scores):
    print("Cross-validation {0} accuracy: {1:.4f}".format(iter_count, accuracy))

print("Mean accuracy: {0:.4f}".format(np.mean(scores)))

Out

Cross-validation 0 accuracy: 0.7430
Cross-validation 1 accuracy: 0.7765
Cross-validation 2 accuracy: 0.7809
Cross-validation 3 accuracy: 0.7753
Cross-validation 4 accuracy: 0.8418
Mean accuracy: 0.7835

from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':[2,3,5,10],
             'min_samples_split':[2,3,5], 'min_samples_leaf':[1,5,8]}

grid_dclf = GridSearchCV(dt_clf , param_grid=parameters , scoring='accuracy' , cv=5)
grid_dclf.fit(X_train , y_train)

print('GridSearchCV best hyperparameters :',grid_dclf.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dclf.best_score_))
best_dclf = grid_dclf.best_estimator_

## Predict and evaluate with the estimator trained on GridSearchCV's best hyperparameters.
dpredictions = best_dclf.predict(X_test)
accuracy = accuracy_score(y_test , dpredictions)
print('DecisionTreeClassifier accuracy on the test set : {0:.4f}'.format(accuracy))

Out

GridSearchCV best hyperparameters : {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.7992
DecisionTreeClassifier accuracy on the test set : 0.8715