Data Wrangling

2022-06-22

Week 3: Data Wrangling

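  1. Data description
    • Display all column names (hint: the columns attribute)
    • Display the data type of each column (hint: the dtypes attribute)
  2. Missing data (Missing Value)
    • Count and display the number of rows containing NaN in each column (hint: isna())
    • Count and display the number of rows containing NULL in each column (hint: isnull())
    • Count and display the number of rows containing "?" in each column
    • Expected output example for 2(a, b) (count and display the total NULL values)
  3. Handling missing values (Data Imputation)
    • Replace NaN, NULL, or "?" values with:
      • the median of the age column
      • the mean of the fare column
      • "0" (string) for the cabin column
  4. Data filtering (Data Filtering)
    • Drop the "body" and "homedest" columns, which have many missing values.
  5. Data visualization (Data Visualization using seaborn)
    • Use a bar plot to plot the number of "survived" and "non survived" passengers in the "survived" column.
    • Use a bar plot to plot the number of male and female passengers in the "sex" column.
    • Use a bar plot to plot the number of passengers in each "pclass".
    • Plot a histogram of the "age" column.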
! pip install pandas
! pip install numpy
! pip install matplotlib

out

Collecting pandas
  Using cached pandas-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.9/site-packages (from pandas) (2021.3)
Collecting numpy>=1.18.5
  Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.9/site-packages (from pandas) (2.8.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Installing collected packages: numpy, pandas
Successfully installed numpy-1.22.4 pandas-1.4.2
Requirement already satisfied: numpy in /opt/conda/lib/python3.9/site-packages (1.22.4)
Collecting matplotlib
  Using cached matplotlib-3.5.2-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.9/site-packages (from matplotlib) (1.22.4)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.9/site-packages (from matplotlib) (2.8.2)
Collecting fonttools>=4.22.0
  Using cached fonttools-4.33.3-py3-none-any.whl (930 kB)
Requirement already satisfied: pyparsing>=2.2.1 in /opt/conda/lib/python3.9/site-packages (from matplotlib) (2.4.7)
Collecting kiwisolver>=1.0.1
  Using cached kiwisolver-1.4.3-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.6 MB)
Collecting cycler>=0.10
  Using cached cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting pillow>=6.2.0
  Using cached Pillow-9.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.9/site-packages (from matplotlib) (21.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Installing collected packages: pillow, kiwisolver, fonttools, cycler, matplotlib
Successfully installed cycler-0.11.0 fonttools-4.33.3 kiwisolver-1.4.3 matplotlib-3.5.2 pillow-9.1.1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('titanic-openml_01.csv')

3.1 Data description

3.1.1 Display all column names (hint: the columns attribute)

df.columns

out

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'homedest'],
      dtype='object')

3.1.2 Display the data type of each column (hint: the dtypes attribute)

df.dtypes

out

pclass        int64
survived      int64
name         object
sex          object
age          object
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
boat         object
body         object
homedest     object
dtype: object
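
The age column is reported as object even though it holds numbers. A likely reason, given the "?" placeholders handled later in this notebook, is that a single non-numeric string forces pandas to keep the whole column as text. The sketch below reproduces this on a small made-up Series (the values are illustrative, not taken from the file) and shows pd.to_numeric with errors="coerce" as one way to get a numeric column.

```python
import pandas as pd

# Made-up values that mimic the 'age' column: numbers stored as text,
# with "?" standing in for an unknown age.
age_raw = pd.Series(["29", "0.9167", "?", "30"], name="age")

# errors="coerce" converts every value that cannot be parsed into NaN
# instead of raising an exception.
age_num = pd.to_numeric(age_raw, errors="coerce")

print(age_raw.dtype)                # a text dtype, not a numeric one
print(age_num.dtype)                # a float dtype
print(int(age_num.isna().sum()))    # number of values that became NaN
```

The notebook itself takes a different route (replace "?" with NaN, then astype(float)); both end with a float column.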

3.2 Missing data (Missing Value)

3.2.1 Count and display the number of rows containing NaN in each column (hint: isna())

df.isna().sum()

out

pclass      0
survived    0
name        0
sex         0
age         0
sibsp       0
parch       0
ticket      0
fare        0
cabin       0
embarked    0
boat        0
body        0
homedest    0
dtype: int64

3.2.2 Count and display the number of rows containing NULL in each column (hint: isnull())

df.isnull().sum()

out

pclass      0
survived    0
name        0
sex         0
age         0
sibsp       0
parch       0
ticket      0
fare        0
cabin       0
embarked    0
boat        0
body        0
homedest    0
dtype: int64

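3.2.3 Count and display the number of rows containing "?" in each column

# The file marks missing values with "?", so convert those to NaN before counting.
dfRe = df.replace('?', np.nan)
dfRe.isna().sum()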

out

pclass        0
survived      0
name          0
sex           0
age         139
sibsp         0
parch         0
ticket        0
fare          0
cabin       717
embarked      2
boat        583
body        905
homedest    258
dtype: int64
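
The zero counts in 3.2.1 and 3.2.2 follow from the file using the string "?" as its missing-value marker, which pandas treats as ordinary text. As a sketch of two alternatives, the block below counts "?" directly with a comparison and also parses it as NaN at load time through the na_values parameter of pd.read_csv. It uses a small in-memory CSV with made-up rows rather than titanic-openml_01.csv.

```python
import io
import pandas as pd

# Made-up CSV text that uses "?" for unknown values, like the Titanic file.
csv_text = "age,cabin,fare\n29,B5,211.3375\n?,?,7.8958\n30,?,151.55\n"

# Default parsing keeps "?" as a plain string, so nothing counts as missing.
raw = pd.read_csv(io.StringIO(csv_text))

# Counting the "?" cells directly, without replacing anything.
question_marks = (raw == "?").sum()

# na_values tells the parser to treat "?" as NaN while reading.
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.isna().sum())
print(question_marks)
print(parsed.isna().sum())
```

Parsing with na_values also lets pandas infer a numeric dtype for columns such as age, so a later astype(float) is not needed.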

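3.3 Handling missing values (Data Imputation)

3.3.1 Replace NaN, NULL, or "?" values with:

3.3.1.1 The median of the age column

# "?" has already been replaced with NaN, so the column can be cast to float.
dfRe['age'] = dfRe['age'].astype(float)
# Fill the missing ages with the column median (NaN is skipped when computing it).
dfRe['age'] = dfRe['age'].fillna(dfRe['age'].median())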
dfRe

out

pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN
997 3 1 Masselmani, Mrs. Fatima female 29.0000 0 0 2649 7.2250 NaN C C NaN NaN
998 3 0 Matinoff, Mr. Nicola male 29.0000 0 0 349255 7.8958 NaN C NaN NaN NaN
999 3 1 McCarthy, Miss. Catherine 'Katie' female 29.0000 0 0 383123 7.7500 NaN Q 15 16 NaN NaN

1000 rows × 14 columns

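3.3.1.2 The mean of the fare column

# fare has no missing values in this file (see the counts in 3.2.3), so this leaves the data unchanged.
dfRe['fare'] = dfRe['fare'].fillna(dfRe['fare'].mean())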
dfRe

out

pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN
997 3 1 Masselmani, Mrs. Fatima female 29.0000 0 0 2649 7.2250 NaN C C NaN NaN
998 3 0 Matinoff, Mr. Nicola male 29.0000 0 0 349255 7.8958 NaN C NaN NaN NaN
999 3 1 McCarthy, Miss. Catherine 'Katie' female 29.0000 0 0 383123 7.7500 NaN Q 15 16 NaN NaN

1000 rows × 14 columns

3.3.1.3 "0" (string) for the cabin column

# Fill with the string '0', as the task asks, rather than the integer 0.
dfRe['cabin'] = dfRe['cabin'].fillna('0')
dfRe

out

pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 0 C NaN NaN NaN
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 0 S NaN NaN NaN
997 3 1 Masselmani, Mrs. Fatima female 29.0000 0 0 2649 7.2250 0 C C NaN NaN
998 3 0 Matinoff, Mr. Nicola male 29.0000 0 0 349255 7.8958 0 C NaN NaN NaN
999 3 1 McCarthy, Miss. Catherine 'Katie' female 29.0000 0 0 383123 7.7500 0 Q 15 16 NaN NaN

1000 rows × 14 columns
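
The three imputation steps can also be written as one fillna call that takes a dict mapping each column to its own fill value. The following is a minimal sketch on a made-up frame (the names toy, fill_values and filled are illustrative and do not appear in the notebook).

```python
import numpy as np
import pandas as pd

# Made-up frame with the same three columns and a few gaps.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "fare": [10.0, 30.0, np.nan, 20.0],
    "cabin": ["B5", np.nan, "C22 C26", np.nan],
})

# One fill value per column; median() and mean() skip NaN by default.
fill_values = {
    "age": toy["age"].median(),
    "fare": toy["fare"].mean(),
    "cabin": "0",  # a string, not the integer 0
}

# Columns not listed in the dict would be left untouched.
filled = toy.fillna(fill_values)
print(filled)
```

Computing every fill value before calling fillna keeps each statistic based on the original data rather than on already-imputed values.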

3.4 Data filtering (Data Filtering)

3.4.1 Drop the "body" and "homedest" columns, which have many missing values.

dfRe = dfRe.drop(['body', 'homedest'], axis=1)

dfRe

out

pclass survived name sex age sibsp parch ticket fare cabin embarked boat
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 0 C NaN
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 0 S NaN
997 3 1 Masselmani, Mrs. Fatima female 29.0000 0 0 2649 7.2250 0 C C
998 3 0 Matinoff, Mr. Nicola male 29.0000 0 0 349255 7.8958 0 C NaN
999 3 1 McCarthy, Miss. Catherine 'Katie' female 29.0000 0 0 383123 7.7500 0 Q 15 16

1000 rows × 12 columns
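
The columns to drop are named by the assignment. A more general approach, sketched here under the assumption of a 50% cut-off (the threshold variable is hypothetical and not part of the assignment), is to compute the share of missing values per column and drop every column above that share. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'body' and 'homedest' are mostly empty, 'boat' only partly.
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "boat": ["2", np.nan, "C", "15 16"],
    "body": [np.nan, 135.0, np.nan, np.nan],
    "homedest": [np.nan, np.nan, np.nan, "Montreal, PQ"],
})

# isna().mean() gives the fraction of missing values in each column.
missing_ratio = toy.isna().mean()

threshold = 0.5  # hypothetical cut-off, chosen only for this sketch
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(trimmed.columns.tolist())
```

On the counts shown in 3.2.3 (cabin 717, boat 583, body 905, homedest 258 out of 1000 rows), a 50% cut-off would select cabin, boat and body rather than body and homedest, so the threshold has to be chosen with the task in mind.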

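3.5 Data visualization (Data Visualization using seaborn)

3.5.1 Use a bar plot to plot the number of "survived" and "non survived" passengers in the "survived" column.

# sort_index() orders the counts as 0, 1 so they line up with the tick labels below.
counts = dfRe['survived'].value_counts().sort_index()
x = np.arange(len(counts))

plt.bar(x, counts)
plt.xticks(x, ['non survived', 'survived'])

plt.show()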

out

[Figure: bar chart of passenger counts for "non survived" and "survived"]
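
One thing to keep in mind when pairing value_counts() with hand-written tick labels: value_counts() returns the counts sorted by frequency, not by label, so the first bar is whichever value happens to be most common. The sketch below shows the difference on a made-up Series and derives the labels from the index instead of typing them in a fixed order (the label_names mapping is illustrative).

```python
import pandas as pd

# Made-up data in which survivors (1) outnumber non-survivors (0).
survived = pd.Series([1, 1, 1, 0, 1, 0], name="survived")

by_freq = survived.value_counts()                 # most frequent value first
by_label = survived.value_counts().sort_index()   # ordered by the value itself

print(by_freq.index.tolist())    # [1, 0]
print(by_label.index.tolist())   # [0, 1]

# Deriving tick labels from the index keeps bars and labels in sync
# regardless of which ordering is used.
label_names = {0: "non survived", 1: "survived"}
labels = [label_names[value] for value in by_label.index]
print(labels)                    # ['non survived', 'survived']
```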

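3.5.2 Use a bar plot to plot the number of male and female passengers in the "sex" column.

# Take the tick labels from the counts themselves so bars and labels cannot get out of sync.
counts = dfRe['sex'].value_counts()
x = np.arange(len(counts))

plt.bar(x, counts)
plt.xticks(x, counts.index)

plt.show()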

out

[Figure: bar chart of passenger counts by sex]

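3.5.3 Use a bar plot to plot the number of passengers in each "pclass".

# Sort by class so the bars appear in the order 1, 2, 3.
counts = dfRe['pclass'].value_counts().sort_index()
counts

x = np.arange(len(counts))

plt.bar(x, counts)
plt.xticks(x, counts.index.astype(str))

plt.show()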

out

[Figure: bar chart of passenger counts by pclass]

3.5.4 Plot a histogram of the "age" column.

dfRe['age']

plt.hist(dfRe['age'])

plt.show()

out

[Figure: histogram of the age column]
