1. Exploring the hr dataset

In [ ]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [ ]:

hr_df = pd.read_csv('/content/drive/MyDrive/KDT/4. 머신러닝과 딥러닝/hr.csv')

In [ ]:

hr_df.head()

Out[ ]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	avg_training_score
0	65438	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	49
1	65141	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	60
2	7513	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	50
3	2542	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	50
4	48945	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	73

In [ ]:

hr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  awards_won?           54808 non-null  int64  
 11  avg_training_score    54808 non-null  int64  
 12  is_promoted           54808 non-null  int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB

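employee_id: arbitrary employee ID
department: department
region: region
education: education level
gender: gender
recruitment_channel: recruitment channel
no_of_trainings: number of trainings received
age: age
previous_year_rating: previous year's performance rating
length_of_service: years of service
awards_won?: whether the employee has won an award
avg_training_score: average training score
is_promoted: whether the employee was promoted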

In [ ]:

hr_df.describe()

Out[ ]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	awards_won?	avg_training_score	is_promoted
count	54808.000000	54808.000000	54808.000000	50684.000000	54808.000000	54808.000000	54808.000000	54808.000000
mean	39195.830627	1.253011	34.803915	3.329256	5.865512	0.023172	63.386750	0.085170
std	22586.581449	0.609264	7.660169	1.259993	4.265094	0.150450	13.371559	0.279137
min	1.000000	1.000000	20.000000	1.000000	1.000000	0.000000	39.000000	0.000000
25%	19669.750000	1.000000	29.000000	3.000000	3.000000	0.000000	51.000000	0.000000
50%	39225.500000	1.000000	33.000000	3.000000	5.000000	0.000000	60.000000	0.000000
75%	58730.500000	1.000000	39.000000	4.000000	7.000000	0.000000	76.000000	0.000000
max	78298.000000	10.000000	60.000000	5.000000	37.000000	1.000000	99.000000	1.000000

In [ ]:

sns.barplot(x='previous_year_rating', y='is_promoted', data=hr_df)

Out[ ]:

<Axes: xlabel='previous_year_rating', ylabel='is_promoted'>
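By default, sns.barplot draws the mean of y for each x category, so with a 0/1 target each bar is the promotion rate of that group. A minimal sketch of the same aggregation with groupby, using made-up toy values rather than rows from hr.csv:

```python
import pandas as pd

# Toy data (made up for illustration, not taken from hr.csv)
toy = pd.DataFrame({
    'previous_year_rating': [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0],
    'is_promoted':          [0,   0,   0,   1,   1,   1,   0,   1],
})

# Mean of a 0/1 column per group = share of 1s in that group
rate = toy.groupby('previous_year_rating')['is_promoted'].mean()
print(rate)
```

Checking the group sizes next to these rates (as the value_counts() cells do) helps judge how much a bar for a small group can be trusted.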

In [ ]:

sns.lineplot(x='previous_year_rating', y='is_promoted', data=hr_df)

Out[ ]:

<Axes: xlabel='previous_year_rating', ylabel='is_promoted'>

In [ ]:

sns.lineplot(x='avg_training_score', y='is_promoted', data=hr_df)

Out[ ]:

<Axes: xlabel='avg_training_score', ylabel='is_promoted'>

In [ ]:

sns.barplot(x='recruitment_channel', y='is_promoted', data=hr_df)

Out[ ]:

<Axes: xlabel='recruitment_channel', ylabel='is_promoted'>

In [ ]:

hr_df['recruitment_channel'].value_counts()

Out[ ]:

other       30446
sourcing    23220
referred     1142
Name: recruitment_channel, dtype: int64

In [ ]:

sns.barplot(x='gender', y='is_promoted', data=hr_df)

Out[ ]:

<Axes: xlabel='gender', ylabel='is_promoted'>

In [ ]:

hr_df['gender'].value_counts()

Out[ ]:

m    38496
f    16312
Name: gender, dtype: int64

In [ ]:

sns.barplot(x='department', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)

Out[ ]:

(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
 [Text(0, 0, 'Sales & Marketing'),
  Text(1, 0, 'Operations'),
  Text(2, 0, 'Technology'),
  Text(3, 0, 'Analytics'),
  Text(4, 0, 'R&D'),
  Text(5, 0, 'Procurement'),
  Text(6, 0, 'Finance'),
  Text(7, 0, 'HR'),
  Text(8, 0, 'Legal')])

In [ ]:

hr_df['department'].value_counts()

Out[ ]:

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

In [ ]:

plt.figure(figsize=(14, 10))
sns.barplot(x='region', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)

Out[ ]:

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]),
 [Text(0, 0, 'region_7'),
  Text(1, 0, 'region_22'),
  Text(2, 0, 'region_19'),
  Text(3, 0, 'region_23'),
  Text(4, 0, 'region_26'),
  Text(5, 0, 'region_2'),
  Text(6, 0, 'region_20'),
  Text(7, 0, 'region_34'),
  Text(8, 0, 'region_1'),
  Text(9, 0, 'region_4'),
  Text(10, 0, 'region_29'),
  Text(11, 0, 'region_31'),
  Text(12, 0, 'region_15'),
  Text(13, 0, 'region_14'),
  Text(14, 0, 'region_11'),
  Text(15, 0, 'region_5'),
  Text(16, 0, 'region_28'),
  Text(17, 0, 'region_17'),
  Text(18, 0, 'region_13'),
  Text(19, 0, 'region_16'),
  Text(20, 0, 'region_25'),
  Text(21, 0, 'region_10'),
  Text(22, 0, 'region_27'),
  Text(23, 0, 'region_30'),
  Text(24, 0, 'region_12'),
  Text(25, 0, 'region_21'),
  Text(26, 0, 'region_32'),
  Text(27, 0, 'region_6'),
  Text(28, 0, 'region_33'),
  Text(29, 0, 'region_8'),
  Text(30, 0, 'region_24'),
  Text(31, 0, 'region_3'),
  Text(32, 0, 'region_9'),
  Text(33, 0, 'region_18')])

In [ ]:

hr_df.isna().mean()

Out[ ]:

employee_id             0.000000
department              0.000000
region                  0.000000
education               0.043953
gender                  0.000000
recruitment_channel     0.000000
no_of_trainings         0.000000
age                     0.000000
previous_year_rating    0.075244
length_of_service       0.000000
awards_won?             0.000000
avg_training_score      0.000000
is_promoted             0.000000
dtype: float64

In [ ]:

hr_df['education'].value_counts()

Out[ ]:

Bachelor's          36669
Master's & above    14925
Below Secondary       805
Name: education, dtype: int64

In [ ]:

hr_df['previous_year_rating'].value_counts()

Out[ ]:

3.0    18618
5.0    11741
4.0     9877
1.0     6223
2.0     4225
Name: previous_year_rating, dtype: int64

In [ ]:

hr_df = hr_df.dropna()

In [ ]:

hr_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48660 entries, 0 to 54807
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           48660 non-null  int64  
 1   department            48660 non-null  object 
 2   region                48660 non-null  object 
 3   education             48660 non-null  object 
 4   gender                48660 non-null  object 
 5   recruitment_channel   48660 non-null  object 
 6   no_of_trainings       48660 non-null  int64  
 7   age                   48660 non-null  int64  
 8   previous_year_rating  48660 non-null  float64
 9   length_of_service     48660 non-null  int64  
 10  awards_won?           48660 non-null  int64  
 11  avg_training_score    48660 non-null  int64  
 12  is_promoted           48660 non-null  int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 5.2+ MB
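dropna() removes every row that has at least one missing value, which is why the row count falls from 54808 to 48660 while the index still runs up to 54807. A small sketch on a made-up frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in each of two different rows
toy = pd.DataFrame({
    'education': ["Bachelor's", np.nan, "Master's & above", "Bachelor's"],
    'previous_year_rating': [3.0, 5.0, np.nan, 4.0],
})

print(toy.isna().mean())   # share of missing values per column
cleaned = toy.dropna()     # drops rows 1 and 2 (each has one NaN)
print(cleaned.index.tolist())
```

The original index labels are kept after dropping; reset_index(drop=True) would renumber them if needed.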

In [ ]:

for i in ['department', 'region', 'education', 'gender', 'recruitment_channel']:
    print(i, hr_df[i].nunique())

department 9
region 34
education 3
gender 2
recruitment_channel 3

In [ ]:

hr_df = pd.get_dummies(hr_df, columns=['department', 'region', 'education', 'gender', 'recruitment_channel'])
hr_df.head(3)

Out[ ]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	avg_training_score	...	education_Bachelor's	education_Master's & above	gender_f	gender_m	recruitment_channel_other	recruitment_channel_sourcing
0	65438	1	35	5.0	8	49	...	0	1	1	0	0	1
1	65141	1	30	5.0	4	60	...	1	0	0	1	1	0
2	7513	1	34	3.0	7	50	...	1	0	0	1	0	1

3 rows × 59 columns
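The 59 columns come from the 8 columns that were not encoded plus 9 + 34 + 3 + 2 + 3 = 51 dummy columns. A minimal sketch of what get_dummies does, on a made-up frame; dtype=int is passed here on the assumption that a newer pandas version, which returns True/False by default, may be in use:

```python
import pandas as pd

# Toy frame (made up for illustration)
toy = pd.DataFrame({'gender': ['f', 'm', 'm'], 'age': [35, 30, 34]})

# dtype=int keeps 0/1 output; newer pandas versions default to True/False
encoded = pd.get_dummies(toy, columns=['gender'], dtype=int)
print(encoded.columns.tolist())

# drop_first=True drops one level per feature, since it is implied by the others
reduced = pd.get_dummies(toy, columns=['gender'], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

drop_first is optional and was not used above; it is shown only as one way to avoid fully redundant columns such as gender_f and gender_m.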

In [ ]:

pd.set_option('display.max_columns', 60)

In [ ]:

hr_df.head(3)

Out[ ]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	avg_training_score	department_Operations	department_Sales & Marketing	region_region_19	region_region_22	region_region_7	education_Bachelor's	education_Master's & above	gender_f	gender_m	recruitment_channel_other	recruitment_channel_sourcing
0	65438	1	35	5.0	8	49	0	1	0	0	1	0	1	1	0	0	1
1	65141	1	30	5.0	4	60	1	0	0	1	0	1	0	0	1	1	0
2	7513	1	34	3.0	7	50	0	1	1	0	0	1	0	0	1	0	1

In [ ]:

from sklearn.model_selection import train_test_split

In [ ]:

X_train, X_test, y_train, y_test = train_test_split(hr_df.drop('is_promoted', axis=1), hr_df['is_promoted'], test_size=0.2, random_state=10)
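Because is_promoted is imbalanced, one option (not used in the split above) is the stratify argument of train_test_split, which keeps the class ratio the same in both parts. A sketch on a made-up target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 zeros, 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the share of each class in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)
print(y_tr.mean(), y_te.mean())
```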

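2. Logistic Regression

A standard algorithm for problems that decide between two outcomes (binary classification)
Documentation
When there are three or more classes, the decision is made with an OvR (One-vs-Rest) or OvO (One-vs-One) strategy

OvR is preferred in most cases; OvO is used when the data is heavily skewed toward one side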
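The difference between the two strategies shows up in how many binary models get trained. A rough sketch using the wrappers in sklearn.multiclass on synthetic four-class data (the data and parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with 4 classes
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# OvR trains one binary model per class; OvO trains one per pair of classes
print(len(ovr.estimators_), len(ovo.estimators_))
```

With k classes, OvR fits k models and OvO fits k(k-1)/2, each OvO model seeing only the samples of its two classes.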

In [ ]:

from sklearn.linear_model import LogisticRegression

In [ ]:

lr = LogisticRegression()

In [ ]:

lr.fit(X_train, y_train)

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
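The warning itself suggests two remedies: scale the data or raise max_iter. A minimal sketch of both, on synthetic data that mimics one very large-scale column (like employee_id) next to a small-scale one (like a rating); the data and the max_iter value are assumptions, not what this notebook ran:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: a large-scale ID-like column and a 1-5 rating-like column
X = np.column_stack([rng.integers(1, 80000, 500), rng.integers(1, 6, 500)]).astype(float)
y = (X[:, 1] >= 4).astype(int)

# Scale first, then fit; max_iter is raised as the warning suggests
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused on the test data.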

In [ ]:

pred = lr.predict(X_test)

In [ ]:

from sklearn.metrics import accuracy_score, confusion_matrix

In [ ]:

accuracy_score(y_test, pred)

Out[ ]:

0.9114262227702425

In [ ]:

hr_df['is_promoted'].value_counts()

Out[ ]:

0    44428
1     4232
Name: is_promoted, dtype: int64
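These counts put the accuracy of 0.911 in context: a model that always answers "not promoted" would already be right for about 91.3% of the rows. A quick check using only the counts shown above:

```python
# Class counts taken from the value_counts() output above
n_not_promoted = 44428
n_promoted = 4232

# Accuracy of a model that always predicts 0 ("not promoted")
baseline_accuracy = n_not_promoted / (n_not_promoted + n_promoted)
print(round(baseline_accuracy, 4))
```

This is why accuracy alone says little on imbalanced data and the confusion matrix is needed.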

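3. Confusion matrix

A table of prediction outcomes from which evaluation metrics such as precision and recall (sensitivity) are computed

TN(8869)        FP(0)
FN(862)         TP(1)

TN: actually not promoted, predicted not promoted
FN: actually promoted, predicted not promoted
FP: actually not promoted, predicted promoted
TP: actually promoted, predicted promoted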

In [ ]:

confusion_matrix(y_test, pred)

Out[ ]:

array([[8869,    0],
       [ 862,    1]])
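In scikit-learn the rows of the matrix are the actual classes and the columns are the predicted classes, so for a binary problem ravel() yields TN, FP, FN, TP in that order. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made up): 1 = promoted, 0 = not promoted
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```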

In [ ]:

sns.heatmap(confusion_matrix(y_test, pred), annot=True, cmap='Blues')

Out[ ]:

<Axes: >

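3-1. Precision

TP / (TP + FP)
Computed over the samples the model predicted as positive
Of those predicted as 1, how many are actually 1?

3-2. Recall

TP / (TP + FN)
The proportion of actual positive samples that were correctly detected
Of those that are actually 1, how many were predicted as 1?
Also called sensitivity or TPR (True Positive Rate)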
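Both formulas can be checked by hand against the confusion matrix from this notebook. A short sketch using those four counts:

```python
# Counts from the confusion matrix above
tn, fp, fn, tp = 8869, 0, 862, 1

precision = tp / (tp + fp)   # of the predicted positives, share that are correct
recall = tp / (tp + fn)      # of the actual positives, share that were found
print(precision, recall)
```

The model predicted "promoted" only once and was right, so precision is perfect, yet it found just 1 of the 863 employees who were actually promoted.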

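3-3. f1 score

The harmonic mean of precision and recall

$$2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}=\frac{TP}{TP+\frac{FN+FP}{2}}$$

Precision   Recall      Arithmetic mean   Harmonic mean
0.4         0.6         0.5               0.48
0.3         0.7         0.5               0.42
0.5         0.5         0.5               0.5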
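The table can be reproduced in a few lines, which also shows why the harmonic mean is used: it drops as precision and recall move apart, while the arithmetic mean stays at 0.5. The helper name f1_from is hypothetical, introduced only for this sketch:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rows of the table: precision, recall, arithmetic mean, harmonic mean
for p, r in [(0.4, 0.6), (0.3, 0.7), (0.5, 0.5)]:
    print(p, r, round((p + r) / 2, 2), round(f1_from(p, r), 2))

# Same score from the counts of the confusion matrix: TP=1, FP=0, FN=862
tp, fp, fn = 1, 0, 862
f1 = tp / (tp + (fn + fp) / 2)
print(f1)
```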

In [ ]:

from sklearn.metrics import precision_score, recall_score, f1_score

In [ ]:

precision_score(y_test, pred)

Out[ ]:

1.0

In [ ]:

recall_score(y_test, pred)

Out[ ]:

0.0011587485515643105

In [ ]:

f1_score(y_test, pred)

Out[ ]:

0.0023148148148148147

In [ ]:

lr.coef_    # coefficients (slopes) for the 58 columns

Out[ ]:

array([[-5.42682567e-06, -2.11566320e-01, -1.24739314e-01,
         4.04217840e-01,  8.39462548e-02,  1.19382822e-01,
         1.24469097e-02, -4.53116409e-02, -1.59556720e-02,
        -1.69079211e-02, -1.06814883e-02,  3.10169499e-02,
         3.47912379e-03, -1.69516987e-02, -1.37996914e-02,
        -1.51941604e-02, -2.88706914e-03, -2.41993427e-03,
        -1.84270273e-02, -6.87835341e-03, -2.43612077e-03,
        -5.00401302e-03, -7.32654108e-03, -7.67662052e-03,
         7.71412885e-03, -3.21353065e-04, -6.68708228e-03,
         2.55409975e-02, -1.02529819e-02, -7.25376472e-03,
         9.23097034e-03,  9.63606751e-03, -9.39043583e-03,
         5.16728257e-03, -2.15726175e-02, -1.16101632e-02,
         9.39154969e-03, -1.82813463e-02,  1.50261133e-03,
        -1.97860883e-03, -2.28665036e-02, -1.35441186e-02,
        -5.26900930e-03, -6.03421587e-03,  3.29129457e-02,
        -1.48312957e-02, -1.18516648e-02,  2.54568502e-02,
        -2.39518611e-03, -9.66357577e-03, -2.00440936e-01,
        -1.40474208e-02,  1.14182158e-01, -1.53689321e-02,
        -8.49372671e-02, -6.62000218e-02,  4.04855793e-03,
        -3.81547353e-02]])
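A bare array of 58 numbers is hard to read, so it can help to pair each coefficient with its column name. A sketch on synthetic data (the column names score and noise are made up); on the notebook's model the same idea would be pd.Series(lr.coef_[0], index=X_train.columns):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (made up): only 'score' drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'score': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y = (X['score'] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; pair it with column names
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

Coefficient sizes are only comparable when the features share a similar scale, which is likely why the employee_id coefficient above is so close to zero.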

In [ ]:

X_train

Out[ ]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	awards_won?	avg_training_score	department_Analytics	department_Finance	department_HR	department_Legal	department_Operations	department_Procurement	department_R&D	department_Sales & Marketing	department_Technology	region_region_1	region_region_10	region_region_11	region_region_12	region_region_13	region_region_14	region_region_15	region_region_16	region_region_17	region_region_18	region_region_19	region_region_2	region_region_20	region_region_21	region_region_22	region_region_23	region_region_24	region_region_25	region_region_26	region_region_27	region_region_28	region_region_29	region_region_3	region_region_30	region_region_31	region_region_32	region_region_33	region_region_34	region_region_4	region_region_5	region_region_6	region_region_7	region_region_8	region_region_9	education_Bachelor's	education_Below Secondary	education_Master's & above	gender_f	gender_m	recruitment_channel_other	recruitment_channel_referred	recruitment_channel_sourcing
26382	45970	1	38	4.0	6	0	61	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
13184	68958	1	57	5.0	15	0	71	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	1	0	0
53060	63576	1	36	3.0	9	0	72	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	1
23528	13968	1	34	2.0	7	0	50	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0
29663	61739	1	34	3.0	2	0	84	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
45161	17581	1	46	2.0	14	0	51	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	1	0	0	1
31616	3937	1	40	3.0	15	0	80	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0
32957	29303	1	29	4.0	3	0	56	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1
45163	52256	2	33	1.0	4	0	48	0	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
19932	13866	1	36	3.0	9	0	65	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	1	1	0	0

38928 rows × 58 columns

In [ ]:

# two independent variables, one dependent variable
TempX = hr_df[['age', 'length_of_service']]
tempY = hr_df['is_promoted']

In [ ]:

temp_lr = LogisticRegression()

In [ ]:

temp_lr.fit(TempX, tempY)

In [ ]:

temp_df = pd.DataFrame({'age':[20, 27, 30], 'length_of_service':[1, 3, 6]})

In [ ]:

temp_df

Out[ ]:

	age	length_of_service
0	20	1
1	27	3
2	30	6

In [ ]:

pred = temp_lr.predict(temp_df)

In [ ]:

pred

Out[ ]:

array([0, 0, 0])

In [ ]:

temp_lr.coef_

Out[ ]:

array([[-0.01074458, -0.00053409]])

In [ ]:

temp_lr.intercept_

Out[ ]:

array([-1.96818509])

In [ ]:

proba = temp_lr.predict_proba(temp_df)
proba

Out[ ]:

array([[0.89876806, 0.10123194],
       [0.9055003 , 0.0944997 ],
       [0.90835617, 0.09164383]])
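The second column of predict_proba can be recomputed by hand: multiply the inputs by coef_, add intercept_, and pass the result through the sigmoid. A sketch using the rounded values printed above, so the result matches only up to rounding:

```python
import numpy as np

# Values copied from temp_lr.coef_ and temp_lr.intercept_ above
coef = np.array([-0.01074458, -0.00053409])
intercept = -1.96818509

X_new = np.array([[20, 1], [27, 3], [30, 6]])   # age, length_of_service

z = X_new @ coef + intercept          # linear score (log-odds)
p_promoted = 1 / (1 + np.exp(-z))     # sigmoid -> P(is_promoted = 1)
print(p_promoted)
```

predict() returns 1 only when this probability exceeds 0.5, which is why all three rows come out as 0.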

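4. Cross Validation

A technique for addressing the problem that the measured performance depends on how train_test_split happens to shuffle and split the data
K-Fold cross validation is the most commonly used form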

In [ ]:

from sklearn.model_selection import KFold

In [ ]:

kf = KFold(n_splits=5)

In [ ]:

kf

Out[ ]:

KFold(n_splits=5, random_state=None, shuffle=False)

In [ ]:

hr_df

Out[ ]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	awards_won?	avg_training_score	is_promoted	department_Analytics	department_Finance	department_HR	department_Legal	department_Operations	department_Procurement	department_R&D	department_Sales & Marketing	department_Technology	region_region_1	region_region_10	region_region_11	region_region_12	region_region_13	region_region_14	region_region_15	region_region_16	region_region_17	region_region_18	region_region_19	region_region_2	region_region_20	region_region_21	region_region_22	region_region_23	region_region_24	region_region_25	region_region_26	region_region_27	region_region_28	region_region_29	region_region_3	region_region_30	region_region_31	region_region_32	region_region_33	region_region_34	region_region_4	region_region_5	region_region_6	region_region_7	region_region_8	region_region_9	education_Bachelor's	education_Below Secondary	education_Master's & above	gender_f	gender_m	recruitment_channel_other	recruitment_channel_referred	recruitment_channel_sourcing
0	65438	1	35	5.0	8	0	49	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	1	0	0	0	1
1	65141	1	30	5.0	4	0	60	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
2	7513	1	34	3.0	7	0	50	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1
3	2542	2	39	1.0	10	0	50	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
4	48945	1	45	3.0	2	0	73	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54802	6915	2	31	1.0	2	0	49	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
54803	3030	1	48	3.0	17	0	78	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1
54804	74592	1	37	2.0	6	0	56	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	1	0	0
54805	13918	1	27	5.0	3	0	79	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0
54807	51526	1	27	1.0	5	0	49	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	1	0	0

48660 rows × 59 columns

In [ ]:

for train_index, test_index in kf.split(range(len(hr_df))):
    print(train_index, test_index)
    print(len(train_index), len(test_index))

[ 9732  9733  9734 ... 48657 48658 48659] [   0    1    2 ... 9729 9730 9731]
38928 9732
[    0     1     2 ... 48657 48658 48659] [ 9732  9733  9734 ... 19461 19462 19463]
38928 9732
[    0     1     2 ... 48657 48658 48659] [19464 19465 19466 ... 29193 29194 29195]
38928 9732
[    0     1     2 ... 48657 48658 48659] [29196 29197 29198 ... 38925 38926 38927]
38928 9732
[    0     1     2 ... 38925 38926 38927] [38928 38929 38930 ... 48657 48658 48659]
38928 9732

In [ ]:

kf = KFold(n_splits=5, random_state=10, shuffle=True)

In [ ]:

kf

Out[ ]:

KFold(n_splits=5, random_state=10, shuffle=True)

In [ ]:

for train_index, test_index in kf.split(range(len(hr_df))):
    print(train_index, test_index)
    print(len(train_index), len(test_index))

[    2     3     4 ... 48656 48657 48659] [    0     1     5 ... 48652 48653 48658]
38928 9732
[    0     1     2 ... 48657 48658 48659] [   18    23    29 ... 48639 48641 48645]
38928 9732
[    0     1     2 ... 48657 48658 48659] [   12    15    17 ... 48647 48650 48654]
38928 9732
[    0     1     2 ... 48654 48656 48658] [    3    24    31 ... 48655 48657 48659]
38928 9732
[    0     1     3 ... 48657 48658 48659] [    2     4     6 ... 48640 48644 48656]
38928 9732
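KFold splits by position only and does not look at the target. For an imbalanced target like is_promoted, one alternative (not used in this notebook) is StratifiedKFold, which keeps the class ratio the same in every fold. A sketch on a made-up target:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 90 zeros, 10 ones
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
# Count the positive samples that land in each test fold
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)
```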

In [ ]:

acc_list = []

for train_index, test_index in kf.split(range(len(hr_df))):
    X = hr_df.drop('is_promoted', axis=1)
    y = hr_df['is_promoted']

    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train = y.iloc[train_index]
    y_test = y.iloc[test_index]

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    acc_list.append(accuracy_score(y_test, pred))

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [ ]:

acc_list

Out[ ]:

[0.9114262227702425,
 0.9094739005343198,
 0.9173859432799013,
 0.914406083025072,
 0.9125565145910398]

In [ ]:

np.array(acc_list).mean()

Out[ ]:

0.913049732840115
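The manual loop above can be condensed with cross_val_score, which takes the same KFold object through its cv argument. A sketch on synthetic data standing in for the HR frame (the data and max_iter value are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the HR data
X, y = make_classification(n_samples=500, n_features=8, random_state=10)

kf = KFold(n_splits=5, shuffle=True, random_state=10)
# One accuracy value per fold, same as appending to acc_list by hand
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf, scoring='accuracy')
print(scores, scores.mean())
```

Changing scoring to 'f1' or 'recall' would report the metrics that matter more for the imbalanced target discussed earlier.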

The reason to use cross validation is not to make the result look better but to get a validation you can trust
