Feature Engineering

2022-06-22

피처 엔지니어링 (Feature Engineering)

  1. Read the material on related to ML Workflows (i.e., Feature Engineering section) on the Teams channel folder (mandatory), and answer these following questions: (Teams 채널 폴더(필수)에서 ML 워크플로 관련 자료(예: 기능 엔지니어링 섹션)를 읽고 다음 질문에 답하세요.)

    1. What is Feature Engineering? Feature engineering은 무엇인가요? (피쳐 엔지니어링이란 무엇입니까? 피처 엔지니어링)

    2. Why do we need Feature Engineering? 왜 하는 가요? (기능 엔지니어링이 필요한 이유는 무엇입니까?)

    3. What is the goal of feature engineering? 목표는 무엇인가요? (피쳐 엔지니어링의 목표는 무엇입니까)

  2. 특징 추출 (Feature Extraction)
    1. Create a new column called “status” to categorize the status of passenger with the following conditions: (다음 조건으로 승객의 상태를 분류하기 위해 “상태”라는 새 열을 만듭니다.)
      1. If age > 65 , then passenger is “elderly” (연령 > 65인 경우 승객은 “고령자”입니다.)
      2. If age >= 18 & age <= 65, then passenger is “adult” (연령 >= 18 & 연령 <= 65인 경우 승객은 “성인”입니다.)
      3. The rest is “child” passenger (나머지는 “어린이” 승객입니다.)
    2. Create a new column called “title” to extract the title of passenger from “name” column (“이름” 열에서 승객의 직위를 추출하기 위해 “제목”이라는 새 열을 만듭니다.)
      1. You might extract the title such as Mr, Miss, Lady, Master, etc (Mr, Miss, Lady, Master 등과 같은 제목을 추출할 수 있습니다.)
  3. 데이터 인코딩 (Data Encoding)
    1. Encode value from “embarked” column into numerical value and save it as new column “embarked_enc” (“embarked” 열의 값을 숫자 값으로 인코딩하고 새 열 “embarked_enc”로 저장)
    2. For example: S -> 0, C -> 1 (예: S -> 0, C -> 1)
  4. 데이터 구간화 (Data Binning)
    1. Perform a data binning operation on the “fare” column (hint: use pandas cut/qcut function) (“fare” 열에 대한 데이터 비닝 작업 수행(힌트: pandas cut/qcut 기능 사용))
    2. Cut the fare values into 3 categories: “economy”, “business”, and “president” (hint: you can define the fare range (범위) by yourself) (요금 값을 “이코노미”, “비즈니스”, “대통령”의 3가지 범주로 나눕니다. (힌트: 요금 범위(도달)는 스스로 정의할 수 있습니다))
    3. Make a new column “fare_type” to represent the fare categories (요금 범주를 나타내는 새 열 “fare_type”을 만듭니다.)
  5. 상관 계수 (Correlation Analysis)
    1. Perform a correlation analysis and show the result ! (hint : use the Spearman Rank Correlation) (상관 분석을 수행하고 결과를 보여줍니다! (힌트 : Spearman Rank Correlation 사용))
    2. (optional) Visualize the correlation analysis using heatmap (hint: you can use seaborn heatmap) ((선택사항) 히트맵을 사용하여 상관관계 분석 시각화(힌트: seaborn 히트맵을 사용할 수 있음))
  6. CSV 파일로 내보내기 (CSV file data export)
    1. Please export the final data (i.e., preprocessed data) into a single .csv file ! (format : titanic_preprocessed_YOURNAME) (최종 데이터(즉, 전처리된 데이터)를 하나의 .csv 파일로 내보내십시오! (형식: titanic_preprocessed_YOURNAME))
    2. Save all the data into “dataset” folder at your directory (if you don’t have “dataset” folder, please make it in advance) (모든 데이터를 디렉토리의 “dataset” 폴더에 저장합니다. (“dataset” 폴더가 없다면 미리 만들어두세요))
    3. Upload all your results to server (모든 결과를 서버에 업로드)

Teams 채널 폴더(**필수)에서 ML 워크플로 관련 자료(예: 기능 엔지니어링 섹션)를 읽고 다음 질문에 답하세요.

  • What is Feature Engineering? Feature engineering은 무엇인가요?
  • 피쳐 엔지니어링이란 무엇입니까? 피처 엔지니어링
답: 기계 학습 알고리즘에 사용할 수 있는 기능으로 변환하는 전처리 단계

Transforms raw data into a feature vector
Determines useful features for the training process.
  • Why do we need Feature Engineering? 왜 하는 가요?
  • 기능 엔지니어링이 필요한 이유는 무엇입니까?
답:
1. 머신 러닝 알고리즘과 호환되고 가장 적합한 입력 데이터 세트를 준비합니다.
2. 기계 학습 모델의 성능 개선

To prepare input data that is compatible with model
To improve the performance model (by giving more information to model)
  • What is the goal of feature engineering? 목표는 무엇인가요?
  • 피쳐 엔지니어링의 목표는 무엇입니까
답:
1. 비즈니스 문제에 맞춰 분석
2. 불필요한 데이터 제거
3. 모델의 확장성 촉진

To reduce the number of inputs of variables to reduce the computational cost 
To cut down the noise in data
In some cases, to improve the performance of the model
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv('C:/Users/ygjung/Documents/titanic-openml_01.csv')
df
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest
0 1 1 Allen, Miss. Elisabeth Walton female 29 0 0 24160 211.3375 B5 S 2 ? St Louis, MO
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 ? Montreal, PQ / Chesterville, ON
2 1 0 Allison, Miss. Helen Loraine female 2 1 2 113781 151.5500 C22 C26 S ? ? Montreal, PQ / Chesterville, ON
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30 1 2 113781 151.5500 C22 C26 S ? 135 Montreal, PQ / Chesterville, ON
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25 1 2 113781 151.5500 C22 C26 S ? ? Montreal, PQ / Chesterville, ON
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35 0 0 349213 7.8958 ? C ? ? ?
996 3 0 Markun, Mr. Johann male 33 0 0 349257 7.8958 ? S ? ? ?
997 3 1 Masselmani, Mrs. Fatima female ? 0 0 2649 7.2250 ? C C ? ?
998 3 0 Matinoff, Mr. Nicola male ? 0 0 349255 7.8958 ? C ? ? ?
999 3 1 McCarthy, Miss. Catherine 'Katie' female ? 0 0 383123 7.7500 ? Q 15 16 ? ?

1000 rows × 14 columns

특징 추출 (Feature Extraction)

  • Create a new column called “status” to categorize the status of passenger with the following conditions:
  • 다음 조건으로 승객의 상태를 분류하기 위해 “상태”라는 새 열을 만듭니다.
    • If age > 65 , then passenger is “elderly”
    • 연령 > 65인 경우 승객은 “고령자”입니다.
    • If age >= 18 & age <= 65, then passenger is “adult”
    • 연령 >= 18 & 연령 <= 65인 경우 승객은 “성인”입니다.
    • The rest is “child” passenger
    • 나머지는 “어린이” 승객입니다.
dfRe = df.replace('?', np.NaN)

dfRe['age'] = dfRe['age'].astype(float)

dfRe['status'] = "child"
dfRe.loc[dfRe['age'] > 65, 'status'] = "elderly"
dfRe.loc[(dfRe['age'] >= 18) & (dfRe['age'] <= 65), 'status'] = "adult"

dfRe
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest status
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO adult
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON child
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON child
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON adult
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON adult
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN adult
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN adult
997 3 1 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C C NaN NaN child
998 3 0 Matinoff, Mr. Nicola male NaN 0 0 349255 7.8958 NaN C NaN NaN NaN child
999 3 1 McCarthy, Miss. Catherine 'Katie' female NaN 0 0 383123 7.7500 NaN Q 15 16 NaN NaN child

1000 rows × 15 columns

  • Create a new column called “title” to extract the title of passenger from “name” column
  • “이름” 열에서 승객의 직위를 추출하기 위해 “제목”이라는 새 열을 만듭니다.
    • You might extract the title such as Mr, Miss, Lady, Master, etc
    • Mr, Miss, Lady, Master 등과 같은 제목을 추출할 수 있습니다.
dfRe['title'] = "etc"

dfRe['name'] = dfRe['name'].astype(str)

dfRe.loc[dfRe['name'].str.contains('Mr', case=True, regex=False), 'title'] = "Mr"
dfRe.loc[dfRe['name'].str.contains('Miss', case=True, regex=False), 'title'] = "Miss"
dfRe.loc[dfRe['name'].str.contains('Lady', case=True, regex=False), 'title'] = "Lady"
dfRe.loc[dfRe['name'].str.contains('Master', case=True, regex=False), 'title'] = "Master"
dfRe
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest status title
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO adult Miss
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON child Master
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON child Miss
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON adult Mr
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON adult Mr
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN adult Mr
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN adult Mr
997 3 1 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C C NaN NaN child Mr
998 3 0 Matinoff, Mr. Nicola male NaN 0 0 349255 7.8958 NaN C NaN NaN NaN child Mr
999 3 1 McCarthy, Miss. Catherine 'Katie' female NaN 0 0 383123 7.7500 NaN Q 15 16 NaN NaN child Miss

1000 rows × 16 columns

데이터 인코딩 (Data Encoding)

  • Encode value from “embarked” column into numerical value and save it as new column “embarked_enc”
  • “embarked” 열의 값을 숫자 값으로 인코딩하고 새 열 “embarked_enc”로 저장
  • For example: S -> 0, C -> 1
  • 예: S -> 0, C -> 1
dfRe['embarked_enc'] = 2

dfRe['embarked'] = dfRe['embarked'].astype(str)

dfRe.loc[dfRe['embarked'] != dfRe['embarked'], 'embarked'] = ''

dfRe.loc[dfRe['embarked'].str.contains('S', case=True, regex=False), 'embarked_enc'] = 0
dfRe.loc[dfRe['embarked'].str.contains('C', case=True, regex=False), 'embarked_enc'] = 1
dfRe
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest status title embarked_enc
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO adult Miss 0
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON child Master 0
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON child Miss 0
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON adult Mr 0
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON adult Mr 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN adult Mr 1
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN adult Mr 0
997 3 1 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C C NaN NaN child Mr 1
998 3 0 Matinoff, Mr. Nicola male NaN 0 0 349255 7.8958 NaN C NaN NaN NaN child Mr 1
999 3 1 McCarthy, Miss. Catherine 'Katie' female NaN 0 0 383123 7.7500 NaN Q 15 16 NaN NaN child Miss 2

1000 rows × 17 columns

데이터 구간화 (Data Binning)

  • Perform a data binning operation on the “fare” column (hint: use pandas cut/qcut function)
  • “fare” 열에 대한 데이터 비닝 작업 수행(힌트: pandas cut/qcut 기능 사용)
  • Cut the fare values into 3 categories: “economy”, “business”, and “president” (hint: you can define the fare range (범위) by yourself)
  • 요금 값을 “이코노미”, “비즈니스”, “대통령”의 3가지 범주로 나눕니다. (힌트: 요금 범위(도달)는 스스로 정의할 수 있습니다)
  • Make a new column “fare_type” to represent the fare categories
  • 요금 범주를 나타내는 새 열 “fare_type”을 만듭니다.
dfRe['fare_type'] = pd.cut(dfRe['fare'], 3, labels=['economy', 'business', 'president'])
dfRe
pclass survived name sex age sibsp parch ticket fare cabin embarked boat body homedest status title embarked_enc fare_type
0 1 1 Allen, Miss. Elisabeth Walton female 29.0000 0 0 24160 211.3375 B5 S 2 NaN St Louis, MO adult Miss 0 business
1 1 1 Allison, Master. Hudson Trevor male 0.9167 1 2 113781 151.5500 C22 C26 S 11 NaN Montreal, PQ / Chesterville, ON child Master 0 economy
2 1 0 Allison, Miss. Helen Loraine female 2.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON child Miss 0 economy
3 1 0 Allison, Mr. Hudson Joshua Creighton male 30.0000 1 2 113781 151.5500 C22 C26 S NaN 135 Montreal, PQ / Chesterville, ON adult Mr 0 economy
4 1 0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 1 2 113781 151.5500 C22 C26 S NaN NaN Montreal, PQ / Chesterville, ON adult Mr 0 economy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 3 0 Markoff, Mr. Marin male 35.0000 0 0 349213 7.8958 NaN C NaN NaN NaN adult Mr 1 economy
996 3 0 Markun, Mr. Johann male 33.0000 0 0 349257 7.8958 NaN S NaN NaN NaN adult Mr 0 economy
997 3 1 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C C NaN NaN child Mr 1 economy
998 3 0 Matinoff, Mr. Nicola male NaN 0 0 349255 7.8958 NaN C NaN NaN NaN child Mr 1 economy
999 3 1 McCarthy, Miss. Catherine 'Katie' female NaN 0 0 383123 7.7500 NaN Q 15 16 NaN NaN child Miss 2 economy

1000 rows × 18 columns

상관 계수 (Correlation Analysis)

  • Perform a correlation analysis and show the result ! (hint : use the Spearman Rank Correlation)
  • 상관 분석을 수행하고 결과를 보여줍니다! (힌트 : Spearman Rank Correlation 사용)
# dfRe['fare'] = dfRe['fare'].astype(int)

# x = dfRe.embarked_enc.values
# y = dfRe.fare.values

#np.cov(x, y)[0, 1]

dfRe.corr()
pclass survived age sibsp parch fare embarked_enc
pclass 1.000000 -0.307329 -0.404092 0.034437 0.018783 -0.538500 -0.004776
survived -0.307329 1.000000 -0.079531 0.035458 0.132062 0.254126 0.060786
age -0.404092 -0.079531 1.000000 -0.185850 -0.153986 0.169605 0.088103
sibsp 0.034437 0.035458 -0.185850 1.000000 0.396863 0.155269 -0.121563
parch 0.018783 0.132062 -0.153986 0.396863 1.000000 0.224579 -0.102599
fare -0.538500 0.254126 0.169605 0.155269 0.224579 1.000000 0.105584
embarked_enc -0.004776 0.060786 0.088103 -0.121563 -0.102599 0.105584 1.000000
  • (optional) Visualize the correlation analysis using heatmap (hint: you can use seaborn heatmap)
  • (선택사항) 히트맵을 사용하여 상관관계 분석 시각화(힌트: seaborn 히트맵을 사용할 수 있음)
sns.heatmap(df.corr())
<AxesSubplot:>

png

CSV 파일로 내보내기 (CSV file data export)

  • Please export the final data (i.e., preprocessed data) into a single .csv file ! (format : titanic_preprocessed_YOURNAME)
  • Save all the data into “dataset” folder at your directory (if you don’t have “dataset” folder, please make it in advance)
  • Upload all your results to server
  • 최종 데이터(즉, 전처리된 데이터)를 하나의 .csv 파일로 내보내십시오! (형식: titanic_preprocessed_YOURNAME)
  • 모든 데이터를 디렉토리의 “dataset” 폴더에 저장합니다. (“dataset” 폴더가 없다면 미리 만들어두세요)
  • 모든 결과를 서버에 업로드
dfRe.to_csv('./dataset/titanic_preprocessed_young.csv')

results matching ""

    No results matching ""

    99 other / uml

    04 react / JSX