Hindungi Barks While Coding (Woof! Woof! Woof!)
(Python) Logistic Regression
1. Exploring the hr dataset
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [ ]:
hr_df = pd.read_csv('/content/drive/MyDrive/KDT/4. 머신러닝과 딥러닝/hr.csv')
In [ ]:
hr_df.head()
Out[ ]:
employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 |
In [ ]:
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 employee_id 54808 non-null int64
1 department 54808 non-null object
2 region 54808 non-null object
3 education 52399 non-null object
4 gender 54808 non-null object
5 recruitment_channel 54808 non-null object
6 no_of_trainings 54808 non-null int64
7 age 54808 non-null int64
8 previous_year_rating 50684 non-null float64
9 length_of_service 54808 non-null int64
10 awards_won? 54808 non-null int64
11 avg_training_score 54808 non-null int64
12 is_promoted 54808 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.4+ MB
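- employee_id: arbitrary employee ID
- department: department
- region: region
- education: education level
- gender: gender
- recruitment_channel: how the employee was recruited
- no_of_trainings: number of trainings completed
- age: age
- previous_year_rating: performance rating for the previous year
- length_of_service: years of service
- awards_won?: whether the employee has won an award (0 or 1)
- avg_training_score: average score in training evaluations
- is_promoted: whether the employee was promoted (the target)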
In [ ]:
hr_df.describe()
Out[ ]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|
count | 54808.000000 | 54808.000000 | 54808.000000 | 50684.000000 | 54808.000000 | 54808.000000 | 54808.000000 | 54808.000000 |
mean | 39195.830627 | 1.253011 | 34.803915 | 3.329256 | 5.865512 | 0.023172 | 63.386750 | 0.085170 |
std | 22586.581449 | 0.609264 | 7.660169 | 1.259993 | 4.265094 | 0.150450 | 13.371559 | 0.279137 |
min | 1.000000 | 1.000000 | 20.000000 | 1.000000 | 1.000000 | 0.000000 | 39.000000 | 0.000000 |
25% | 19669.750000 | 1.000000 | 29.000000 | 3.000000 | 3.000000 | 0.000000 | 51.000000 | 0.000000 |
50% | 39225.500000 | 1.000000 | 33.000000 | 3.000000 | 5.000000 | 0.000000 | 60.000000 | 0.000000 |
75% | 58730.500000 | 1.000000 | 39.000000 | 4.000000 | 7.000000 | 0.000000 | 76.000000 | 0.000000 |
max | 78298.000000 | 10.000000 | 60.000000 | 5.000000 | 37.000000 | 1.000000 | 99.000000 | 1.000000 |
In [ ]:
sns.barplot(x='previous_year_rating', y='is_promoted', data=hr_df)
Out[ ]:
<Axes: xlabel='previous_year_rating', ylabel='is_promoted'>
In [ ]:
sns.lineplot(x='previous_year_rating', y='is_promoted', data=hr_df)
Out[ ]:
<Axes: xlabel='previous_year_rating', ylabel='is_promoted'>
In [ ]:
sns.lineplot(x='avg_training_score', y='is_promoted', data=hr_df)
Out[ ]:
<Axes: xlabel='avg_training_score', ylabel='is_promoted'>
In [ ]:
sns.barplot(x='recruitment_channel', y='is_promoted', data=hr_df)
Out[ ]:
<Axes: xlabel='recruitment_channel', ylabel='is_promoted'>
In [ ]:
hr_df['recruitment_channel'].value_counts()
Out[ ]:
other 30446
sourcing 23220
referred 1142
Name: recruitment_channel, dtype: int64
In [ ]:
sns.barplot(x='gender', y='is_promoted', data=hr_df)
Out[ ]:
<Axes: xlabel='gender', ylabel='is_promoted'>
In [ ]:
hr_df['gender'].value_counts()
Out[ ]:
m 38496
f 16312
Name: gender, dtype: int64
In [ ]:
sns.barplot(x='department', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)
Out[ ]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8]),
[Text(0, 0, 'Sales & Marketing'),
Text(1, 0, 'Operations'),
Text(2, 0, 'Technology'),
Text(3, 0, 'Analytics'),
Text(4, 0, 'R&D'),
Text(5, 0, 'Procurement'),
Text(6, 0, 'Finance'),
Text(7, 0, 'HR'),
Text(8, 0, 'Legal')])
In [ ]:
hr_df['department'].value_counts()
Out[ ]:
Sales & Marketing 16840
Operations 11348
Technology 7138
Procurement 7138
Analytics 5352
Finance 2536
HR 2418
Legal 1039
R&D 999
Name: department, dtype: int64
In [ ]:
plt.figure(figsize=(14, 10))
sns.barplot(x='region', y='is_promoted', data=hr_df)
plt.xticks(rotation=45)
Out[ ]:
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]),
[Text(0, 0, 'region_7'),
Text(1, 0, 'region_22'),
Text(2, 0, 'region_19'),
Text(3, 0, 'region_23'),
Text(4, 0, 'region_26'),
Text(5, 0, 'region_2'),
Text(6, 0, 'region_20'),
Text(7, 0, 'region_34'),
Text(8, 0, 'region_1'),
Text(9, 0, 'region_4'),
Text(10, 0, 'region_29'),
Text(11, 0, 'region_31'),
Text(12, 0, 'region_15'),
Text(13, 0, 'region_14'),
Text(14, 0, 'region_11'),
Text(15, 0, 'region_5'),
Text(16, 0, 'region_28'),
Text(17, 0, 'region_17'),
Text(18, 0, 'region_13'),
Text(19, 0, 'region_16'),
Text(20, 0, 'region_25'),
Text(21, 0, 'region_10'),
Text(22, 0, 'region_27'),
Text(23, 0, 'region_30'),
Text(24, 0, 'region_12'),
Text(25, 0, 'region_21'),
Text(26, 0, 'region_32'),
Text(27, 0, 'region_6'),
Text(28, 0, 'region_33'),
Text(29, 0, 'region_8'),
Text(30, 0, 'region_24'),
Text(31, 0, 'region_3'),
Text(32, 0, 'region_9'),
Text(33, 0, 'region_18')])
In [ ]:
hr_df.isna().mean()
Out[ ]:
employee_id 0.000000
department 0.000000
region 0.000000
education 0.043953
gender 0.000000
recruitment_channel 0.000000
no_of_trainings 0.000000
age 0.000000
previous_year_rating 0.075244
length_of_service 0.000000
awards_won? 0.000000
avg_training_score 0.000000
is_promoted 0.000000
dtype: float64
In [ ]:
hr_df['education'].value_counts()
Out[ ]:
Bachelor's 36669
Master's & above 14925
Below Secondary 805
Name: education, dtype: int64
In [ ]:
hr_df['previous_year_rating'].value_counts()
Out[ ]:
3.0 18618
5.0 11741
4.0 9877
1.0 6223
2.0 4225
Name: previous_year_rating, dtype: int64
In [ ]:
hr_df = hr_df.dropna()
In [ ]:
hr_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48660 entries, 0 to 54807
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 employee_id 48660 non-null int64
1 department 48660 non-null object
2 region 48660 non-null object
3 education 48660 non-null object
4 gender 48660 non-null object
5 recruitment_channel 48660 non-null object
6 no_of_trainings 48660 non-null int64
7 age 48660 non-null int64
8 previous_year_rating 48660 non-null float64
9 length_of_service 48660 non-null int64
10 awards_won? 48660 non-null int64
11 avg_training_score 48660 non-null int64
12 is_promoted 48660 non-null int64
dtypes: float64(1), int64(7), object(5)
memory usage: 5.2+ MB
In [ ]:
for col in ['department', 'region', 'education', 'gender', 'recruitment_channel']:
    print(col, hr_df[col].nunique())
department 9
region 34
education 3
gender 2
recruitment_channel 3
In [ ]:
hr_df = pd.get_dummies(hr_df, columns=['department', 'region', 'education', 'gender', 'recruitment_channel'])
hr_df.head(3)
Out[ ]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | department_Analytics | department_Finance | ... | region_region_8 | region_region_9 | education_Bachelor's | education_Below Secondary | education_Master's & above | gender_f | gender_m | recruitment_channel_other | recruitment_channel_referred | recruitment_channel_sourcing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 65141 | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 7513 | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 rows × 59 columns
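To see how `pd.get_dummies` arrives at 59 columns, here is a minimal sketch on a hypothetical three-row frame. The toy values and the `dtype=int` argument are assumptions for illustration and are not part of the original notebook. Each encoded column is named `<column>_<category>`, which is why the region columns come out as `region_region_7` and so on.

```python
import pandas as pd

# Hypothetical toy frame that mimics three of the hr columns.
toy = pd.DataFrame({
    "age": [35, 30, 34],
    "gender": ["f", "m", "m"],
    "region": ["region_7", "region_22", "region_19"],
})

# dtype=int forces 0/1 output; newer pandas versions default to True/False.
encoded = pd.get_dummies(toy, columns=["gender", "region"], dtype=int)
encoded_columns = list(encoded.columns)
print(encoded_columns)

# Column count for the real data: 13 original columns, minus the 5 encoded
# ones, plus one new column per category (9 + 34 + 3 + 2 + 3).
expected_total = 13 - 5 + (9 + 34 + 3 + 2 + 3)
print(expected_total)
```

Keeping every category column, as the notebook does, makes the dummy columns of one variable sum to 1; passing `drop_first=True` is a common way to avoid that redundancy, though it would change the column count shown here.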
In [ ]:
pd.set_option('display.max_columns', 60)
In [ ]:
hr_df.head(3)
Out[ ]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | region_region_1 | region_region_10 | region_region_11 | region_region_12 | region_region_13 | region_region_14 | region_region_15 | region_region_16 | region_region_17 | region_region_18 | region_region_19 | region_region_2 | region_region_20 | region_region_21 | region_region_22 | region_region_23 | region_region_24 | region_region_25 | region_region_26 | region_region_27 | region_region_28 | region_region_29 | region_region_3 | region_region_30 | region_region_31 | region_region_32 | region_region_33 | region_region_34 | region_region_4 | region_region_5 | region_region_6 | region_region_7 | region_region_8 | region_region_9 | education_Bachelor's | education_Below Secondary | education_Master's & above | gender_f | gender_m | recruitment_channel_other | recruitment_channel_referred | recruitment_channel_sourcing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 65141 | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 7513 | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
In [ ]:
from sklearn.model_selection import train_test_split
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(hr_df.drop('is_promoted', axis=1), hr_df['is_promoted'], test_size=0.2, random_state=10)
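2. Logistic Regression
- A representative algorithm for problems that choose between two outcomes (binary classification)
- Documentation
- When there are three or more classes, the decision is made with an OvR (One-vs-Rest) or OvO (One-vs-One) strategy
OvR is preferred in most cases; OvO is sometimes used when the data is heavily skewed toward one class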
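As a rough illustration of the two strategies, the sketch below wraps `LogisticRegression` in scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` on a synthetic four-class dataset. The dataset and all parameter values are assumptions for demonstration; the hr data itself is binary and needs neither wrapper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class problem (illustrative only, not the hr dataset).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, random_state=0,
)

# OvR trains one binary classifier per class: class k vs. everything else.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# OvO trains one binary classifier per pair of classes: k * (k - 1) / 2.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print(n_ovr, n_ovo)
```

With k classes, OvR fits k models while OvO fits k(k-1)/2, so OvO grows faster in model count but each of its models sees only the rows of two classes.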
In [ ]:
from sklearn.linear_model import LogisticRegression
In [ ]:
lr = LogisticRegression()
In [ ]:
lr.fit(X_train, y_train)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression()
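The warning above itself suggests two remedies: raising `max_iter` or scaling the features. Below is a minimal sketch of both, on synthetic data with one deliberately large-valued column standing in for something like `employee_id`. The data, the pipeline, and `max_iter=1000` are assumptions and not what the original notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data; the first column is inflated to mimic an
# unscaled ID-like feature with values in the tens of thousands.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] = X[:, 0] * 20000 + 40000

# Standardize every feature, then fit with a higher iteration cap.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# n_iter_ reports how many lbfgs iterations were actually used.
n_iter = int(model.named_steps["logisticregression"].n_iter_[0])
train_accuracy = model.score(X, y)
print(n_iter, train_accuracy)
```

Applied to the hr data, this would give results that differ from the outputs shown in this post, since those come from the unscaled default model.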
In [ ]:
pred = lr.predict(X_test)
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix
In [ ]:
accuracy_score(y_test, pred)
Out[ ]:
0.9114262227702425
In [ ]:
hr_df['is_promoted'].value_counts()
Out[ ]:
0 44428
1 4232
Name: is_promoted, dtype: int64
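These counts put the 0.911 accuracy in context. As a quick check using only the numbers printed above, a trivial model that always predicts 0 (not promoted) would already score about the same on the cleaned dataset as a whole:

```python
# Class counts copied from the value_counts() output above.
not_promoted = 44428
promoted = 4232

# Accuracy of a trivial model that always predicts class 0.
baseline_accuracy = not_promoted / (not_promoted + promoted)
print(round(baseline_accuracy, 4))
```

Because roughly 91% of the rows belong to class 0, accuracy alone says little about how well promotions are detected, which is why the metrics in the next section matter.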
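3. Confusion Matrix
- A table of prediction outcomes from which evaluation metrics such as precision and recall (sensitivity) are computed
- Rows are the actual classes and columns are the predicted classes
TN(8869) FP(0)
FN(862) TP(1)
- TN: not promoted, and predicted as not promoted
- FN: promoted, but predicted as not promoted
- FP: not promoted, but predicted as promoted
- TP: promoted, and predicted as promoted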
In [ ]:
confusion_matrix(y_test, pred)
Out[ ]:
array([[8869, 0],
[ 862, 1]])
In [ ]:
sns.heatmap(confusion_matrix(y_test, pred), annot=True, cmap='Blues')
Out[ ]:
<Axes: >
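3-1. Precision
- TP / (TP + FP)
- Computed only over the samples the model predicted as positive
- Of the samples predicted as 1, how many are actually 1?
3-2. Recall
- TP / (TP + FN)
- The proportion of actual positive samples that were correctly detected
- Of the samples that are actually 1, how many were predicted as 1?
- Also called sensitivity or TPR (True Positive Rate)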
3-3. F1 score
- The harmonic mean of precision and recall
Precision | Recall | Arithmetic mean | Harmonic mean |
---|---|---|---|
0.4 | 0.6 | 0.5 | 0.48 |
0.3 | 0.7 | 0.5 | 0.42 |
0.5 | 0.5 | 0.5 | 0.5 |
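The table can be reproduced with a few lines of plain Python. The helper names below are hypothetical and only serve this illustration; the point is that the harmonic mean drops as precision and recall move apart, while the arithmetic mean stays at 0.5.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    return (precision + recall) / 2


def harmonic_mean(precision: float, recall: float) -> float:
    # Defined as 0 when both inputs are 0 to avoid division by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pairs = [(0.4, 0.6), (0.3, 0.7), (0.5, 0.5)]
rows = [(p, r, arithmetic_mean(p, r), harmonic_mean(p, r)) for p, r in pairs]
for p, r, am, hm in rows:
    print(p, r, round(am, 2), round(hm, 2))
```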
In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score
In [ ]:
precision_score(y_test, pred)
Out[ ]:
1.0
In [ ]:
recall_score(y_test, pred)
Out[ ]:
0.0011587485515643105
In [ ]:
f1_score(y_test, pred)
Out[ ]:
0.0023148148148148147
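The three scores above follow directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the matrix printed earlier, which can be a useful sanity check on the formulas.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are actual classes,
# columns are predicted classes (scikit-learn's layout).
cm = np.array([[8869, 0],
               [862, 1]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Precision is a perfect 1.0 only because the model predicted a single positive and it happened to be correct; recall shows that 862 of the 863 promoted employees in the test set were missed.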
In [ ]:
lr.coef_  # coefficients (slopes) for the 58 feature columns
Out[ ]:
array([[-5.42682567e-06, -2.11566320e-01, -1.24739314e-01,
4.04217840e-01, 8.39462548e-02, 1.19382822e-01,
1.24469097e-02, -4.53116409e-02, -1.59556720e-02,
-1.69079211e-02, -1.06814883e-02, 3.10169499e-02,
3.47912379e-03, -1.69516987e-02, -1.37996914e-02,
-1.51941604e-02, -2.88706914e-03, -2.41993427e-03,
-1.84270273e-02, -6.87835341e-03, -2.43612077e-03,
-5.00401302e-03, -7.32654108e-03, -7.67662052e-03,
7.71412885e-03, -3.21353065e-04, -6.68708228e-03,
2.55409975e-02, -1.02529819e-02, -7.25376472e-03,
9.23097034e-03, 9.63606751e-03, -9.39043583e-03,
5.16728257e-03, -2.15726175e-02, -1.16101632e-02,
9.39154969e-03, -1.82813463e-02, 1.50261133e-03,
-1.97860883e-03, -2.28665036e-02, -1.35441186e-02,
-5.26900930e-03, -6.03421587e-03, 3.29129457e-02,
-1.48312957e-02, -1.18516648e-02, 2.54568502e-02,
-2.39518611e-03, -9.66357577e-03, -2.00440936e-01,
-1.40474208e-02, 1.14182158e-01, -1.53689321e-02,
-8.49372671e-02, -6.62000218e-02, 4.04855793e-03,
-3.81547353e-02]])
In [ ]:
X_train
Out[ ]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | region_region_1 | region_region_10 | region_region_11 | region_region_12 | region_region_13 | region_region_14 | region_region_15 | region_region_16 | region_region_17 | region_region_18 | region_region_19 | region_region_2 | region_region_20 | region_region_21 | region_region_22 | region_region_23 | region_region_24 | region_region_25 | region_region_26 | region_region_27 | region_region_28 | region_region_29 | region_region_3 | region_region_30 | region_region_31 | region_region_32 | region_region_33 | region_region_34 | region_region_4 | region_region_5 | region_region_6 | region_region_7 | region_region_8 | region_region_9 | education_Bachelor's | education_Below Secondary | education_Master's & above | gender_f | gender_m | recruitment_channel_other | recruitment_channel_referred | recruitment_channel_sourcing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
26382 | 45970 | 1 | 38 | 4.0 | 6 | 0 | 61 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
13184 | 68958 | 1 | 57 | 5.0 | 15 | 0 | 71 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
53060 | 63576 | 1 | 36 | 3.0 | 9 | 0 | 72 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
23528 | 13968 | 1 | 34 | 2.0 | 7 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
29663 | 61739 | 1 | 34 | 3.0 | 2 | 0 | 84 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45161 | 17581 | 1 | 46 | 2.0 | 14 | 0 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
31616 | 3937 | 1 | 40 | 3.0 | 15 | 0 | 80 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
32957 | 29303 | 1 | 29 | 4.0 | 3 | 0 | 56 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
45163 | 52256 | 2 | 33 | 1.0 | 4 | 0 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
19932 | 13866 | 1 | 36 | 3.0 | 9 | 0 | 65 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
38928 rows × 58 columns
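The coefficient array is easier to read when each value is paired with its column name. The sketch below does this for the first seven coefficients and columns, copied from the outputs above. On the full model, something like `pd.Series(lr.coef_[0], index=X_train.columns)` should give the same pairing for all 58 columns.

```python
import pandas as pd

# First seven coefficients and column names, copied from the outputs above.
coef = pd.Series(
    [-5.42682567e-06, -2.11566320e-01, -1.24739314e-01, 4.04217840e-01,
     8.39462548e-02, 1.19382822e-01, 1.24469097e-02],
    index=["employee_id", "no_of_trainings", "age", "previous_year_rating",
           "length_of_service", "awards_won?", "avg_training_score"],
)

# Order by absolute size, largest first.
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

The magnitudes should be read with caution: the features are on very different scales and the solver stopped before converging, so a small coefficient (such as the one for `employee_id`) does not by itself mean a feature is unimportant.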
In [ ]:
# two independent variables, one dependent variable
TempX = hr_df[['age', 'length_of_service']]
tempY = hr_df['is_promoted']
In [ ]:
temp_lr = LogisticRegression()
In [ ]:
temp_lr.fit(TempX, tempY)
LogisticRegression()
In [ ]:
temp_df = pd.DataFrame({'age':[20, 27, 30], 'length_of_service':[1, 3, 6]})
In [ ]:
temp_df
Out[ ]:
age | length_of_service | |
---|---|---|
0 | 20 | 1 |
1 | 27 | 3 |
2 | 30 | 6 |
In [ ]:
pred = temp_lr.predict(temp_df)
In [ ]:
pred
Out[ ]:
array([0, 0, 0])
In [ ]:
temp_lr.coef_
Out[ ]:
array([[-0.01074458, -0.00053409]])
In [ ]:
temp_lr.intercept_
Out[ ]:
array([-1.96818509])
In [ ]:
proba = temp_lr.predict_proba(temp_df)
proba
Out[ ]:
array([[0.89876806, 0.10123194],
[0.9055003 , 0.0944997 ],
[0.90835617, 0.09164383]])
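The second column of `predict_proba` can be reproduced by hand: logistic regression passes the linear score `coef · x + intercept` through the sigmoid function. The sketch below uses the coefficient and intercept values printed above; the `sigmoid` helper is written out here for illustration.

```python
import numpy as np

# Values copied from temp_lr.coef_ and temp_lr.intercept_ above.
coef = np.array([-0.01074458, -0.00053409])
intercept = -1.96818509

# The three rows of temp_df: (age, length_of_service).
X_new = np.array([[20, 1], [27, 3], [30, 6]])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = X_new @ coef + intercept   # linear score per row
p_promoted = sigmoid(z)        # probability of class 1
print(p_promoted)
```

All three probabilities are below 0.5, which is why `predict` returned 0 for every row.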
4. Cross-Validation
- A technique that addresses the problem of measured performance depending on how train_test_split happens to shuffle and split the data
- K-fold cross-validation is the most commonly used variant
In [ ]:
from sklearn.model_selection import KFold
In [ ]:
kf = KFold(n_splits=5)
In [ ]:
kf
Out[ ]:
KFold(n_splits=5, random_state=None, shuffle=False)
In [ ]:
hr_df
Out[ ]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | region_region_1 | region_region_10 | region_region_11 | region_region_12 | region_region_13 | region_region_14 | region_region_15 | region_region_16 | region_region_17 | region_region_18 | region_region_19 | region_region_2 | region_region_20 | region_region_21 | region_region_22 | region_region_23 | region_region_24 | region_region_25 | region_region_26 | region_region_27 | region_region_28 | region_region_29 | region_region_3 | region_region_30 | region_region_31 | region_region_32 | region_region_33 | region_region_34 | region_region_4 | region_region_5 | region_region_6 | region_region_7 | region_region_8 | region_region_9 | education_Bachelor's | education_Below Secondary | education_Master's & above | gender_f | gender_m | recruitment_channel_other | recruitment_channel_referred | recruitment_channel_sourcing | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
1 | 65141 | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 7513 | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 2542 | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
4 | 48945 | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54802 | 6915 | 2 | 31 | 1.0 | 2 | 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
54803 | 3030 | 1 | 48 | 3.0 | 17 | 0 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
54804 | 74592 | 1 | 37 | 2.0 | 6 | 0 | 56 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
54805 | 13918 | 1 | 27 | 5.0 | 3 | 0 | 79 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
54807 | 51526 | 1 | 27 | 1.0 | 5 | 0 | 49 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
48660 rows × 59 columns
In [ ]:
for train_index, test_index in kf.split(range(len(hr_df))):
    print(train_index, test_index)
    print(len(train_index), len(test_index))
[ 9732 9733 9734 ... 48657 48658 48659] [ 0 1 2 ... 9729 9730 9731]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [ 9732 9733 9734 ... 19461 19462 19463]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [19464 19465 19466 ... 29193 29194 29195]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [29196 29197 29198 ... 38925 38926 38927]
38928 9732
[ 0 1 2 ... 38925 38926 38927] [38928 38929 38930 ... 48657 48658 48659]
38928 9732
In [ ]:
kf = KFold(n_splits=5, random_state=10, shuffle=True)
In [ ]:
kf
Out[ ]:
KFold(n_splits=5, random_state=10, shuffle=True)
In [ ]:
for train_index, test_index in kf.split(range(len(hr_df))):
    print(train_index, test_index)
    print(len(train_index), len(test_index))
[ 2 3 4 ... 48656 48657 48659] [ 0 1 5 ... 48652 48653 48658]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [ 18 23 29 ... 48639 48641 48645]
38928 9732
[ 0 1 2 ... 48657 48658 48659] [ 12 15 17 ... 48647 48650 48654]
38928 9732
[ 0 1 2 ... 48654 48656 48658] [ 3 24 31 ... 48655 48657 48659]
38928 9732
[ 0 1 3 ... 48657 48658 48659] [ 2 4 6 ... 48640 48644 48656]
38928 9732
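With only about 9% positives, plain `KFold` can leave folds with slightly different promotion rates. `StratifiedKFold`, which the original notebook does not use, is a common alternative that keeps the class ratio roughly equal in every fold. A sketch on a hypothetical 100-sample label vector:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 90 negatives and 10 positives (not the hr data).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)

# Count the positives that land in each test fold.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)
```

Unlike `KFold.split`, the stratified splitter needs the labels as its second argument so that it can balance the classes across folds.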
In [ ]:
# features and target are the same for every fold, so build them once
X = hr_df.drop('is_promoted', axis=1)
y = hr_df['is_promoted']
acc_list = []  # accuracy of each fold
for train_index, test_index in kf.split(X):
    # select this fold's rows by position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lr = LogisticRegression()  # fresh model for each fold
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    acc_list.append(accuracy_score(y_test, pred))
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
In [ ]:
acc_list
Out[ ]:
[0.9114262227702425,
0.9094739005343198,
0.9173859432799013,
0.914406083025072,
0.9125565145910398]
In [ ]:
np.array(acc_list).mean()
Out[ ]:
0.913049732840115
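The manual loop above can usually be condensed with scikit-learn's `cross_val_score`, which handles the splitting, fitting, and scoring in one call. A minimal sketch on synthetic data follows; the dataset and `max_iter=1000` are assumptions, so the numbers will not match the hr results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the hr features and labels (illustrative only).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=10)

# One accuracy score per fold; a fresh clone of the model is fit each time.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=kf, scoring="accuracy")
print(scores, scores.mean())
```

Passing `scoring="f1"` or `scoring="recall"` instead would evaluate the same folds with the metrics from section 3, which are more informative than accuracy on imbalanced data like this.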
The purpose of cross-validation is not to improve the score but to obtain a more trustworthy evaluation.