흰둥이는 코드를 짤 때 짖어 (왈!왈!왈!왈!왈!왈!왈!왈!왈!왈!왈!)
(Python) Random Forest
1. Exploring the hotel dataset
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [ ]:
hotel_df = pd.read_csv('/content/drive/MyDrive/KDT/4. 머신러닝과 딥러닝/hotel.csv')
In [ ]:
hotel_df
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status_date | name | email | phone-number | credit_card |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.00 | 0 | 0 | 2015-07-01 | Ernest Barnes | Ernest.Barnes31@outlook.com | 669-792-1661 | ************4322 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.00 | 0 | 0 | 2015-07-01 | Andrea Baker | Andrea_Baker94@aol.com | 858-637-6955 | ************9157 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.00 | 0 | 0 | 2015-07-02 | Rebecca Parker | Rebecca_Parker@comcast.net | 652-885-2745 | ************3734 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.00 | 0 | 0 | 2015-07-02 | Laura Murray | Laura_M@gmail.com | 364-656-8427 | ************5677 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | 0 | Transient | 98.00 | 0 | 1 | 2015-07-03 | Linda Hines | LHines@verizon.com | 713-226-5883 | ************5498 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | ... | 0 | Transient | 96.14 | 0 | 0 | 2017-09-06 | Claudia Johnson | Claudia.J@yahoo.com | 403-092-5582 | ************8647 |
119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | ... | 0 | Transient | 225.43 | 0 | 2 | 2017-09-07 | Wesley Aguilar | WAguilar@xfinity.com | 238-763-0612 | ************4333 |
119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 0 | Transient | 157.71 | 0 | 4 | 2017-09-07 | Mary Morales | Mary_Morales@hotmail.com | 395-518-4100 | ************1821 |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | ... | 0 | Transient | 104.40 | 0 | 0 | 2017-09-07 | Caroline Conley MD | MD_Caroline@comcast.net | 531-528-1017 | ************7860 |
119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | ... | 0 | Transient | 151.20 | 0 | 2 | 2017-09-07 | Ariana Michael | Ariana_M@xfinity.com | 422-804-6403 | ************4482 |
119390 rows × 32 columns
In [ ]:
pd.set_option('display.max_columns', 100)
In [ ]:
hotel_df.head()
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status_date | name | email | phone-number | credit_card |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2015-07-01 | Ernest Barnes | Ernest.Barnes31@outlook.com | 669-792-1661 | ************4322 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2015-07-01 | Andrea Baker | Andrea_Baker94@aol.com | 858-637-6955 | ************9157 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 2015-07-02 | Rebecca Parker | Rebecca_Parker@comcast.net | 652-885-2745 | ************3734 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 2015-07-02 | Laura Murray | Laura_M@gmail.com | 364-656-8427 | ************5677 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 | 2015-07-03 | Linda Hines | LHines@verizon.com | 713-226-5883 | ************5498 |
In [ ]:
hotel_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 distribution_channel 119390 non-null object
15 is_repeated_guest 119390 non-null int64
16 previous_cancellations 119390 non-null int64
17 previous_bookings_not_canceled 119390 non-null int64
18 reserved_room_type 119390 non-null object
19 assigned_room_type 119390 non-null object
20 booking_changes 119390 non-null int64
21 deposit_type 119390 non-null object
22 days_in_waiting_list 119390 non-null int64
23 customer_type 119390 non-null object
24 adr 119390 non-null float64
25 required_car_parking_spaces 119390 non-null int64
26 total_of_special_requests 119390 non-null int64
27 reservation_status_date 119390 non-null object
28 name 119390 non-null object
29 email 119390 non-null object
30 phone-number 119390 non-null object
31 credit_card 119390 non-null object
dtypes: float64(2), int64(16), object(14)
memory usage: 29.1+ MB
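- hotel: hotel type (Resort Hotel or City Hotel)
- is_canceled: whether the booking was canceled (1 = canceled, 0 = not canceled)
- lead_time: number of days between the booking date and check-in (how far in advance the booking was made)
- arrival_date_year: year of arrival
- arrival_date_month: month of arrival
- arrival_date_week_number: week number of arrival
- arrival_date_day_of_month: day of the month of arrival
- stays_in_weekend_nights: number of weekend nights in the stay
- stays_in_week_nights: number of weeknights in the stay
- adults: number of adults
- children: number of children
- babies: number of babies
- meal: meal plan
- country: guest's country (codes such as PRT, GBR)
- distribution_channel: channel through which the booking was made
- is_repeated_guest: whether the guest has booked before
- previous_cancellations: number of previous bookings the guest canceled
- previous_bookings_not_canceled: number of previous bookings the guest completed without canceling
- reserved_room_type: room type requested
- assigned_room_type: room type actually assigned
- booking_changes: number of changes made to the booking after it was placed
- deposit_type: deposit type (No Deposit, Non Refund, Refundable)
- days_in_waiting_list: number of days the booking spent on the waiting list
- customer_type: customer type
- adr: average daily rate (the per-night price, which rises or falls depending on the date)
- required_car_parking_spaces: number of parking spaces requested
- total_of_special_requests: number of special requests
- reservation_status_date: date on which the reservation status was last updated
- name: name
- email: email address
- phone-number: phone number
- credit_card: card number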
In [ ]:
hotel_df.drop(['credit_card', 'email', 'name', 'phone-number', 'reservation_status_date'], axis=1, inplace=True)
In [ ]:
hotel_df.head()
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 |
In [ ]:
hotel_df.describe()
Out[ ]:
| | is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
In [ ]:
sns.displot(hotel_df['lead_time'])
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7f5e470750f0>
In [ ]:
sns.boxplot(y=hotel_df['lead_time'])
Out[ ]:
<Axes: ylabel='lead_time'>
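The box plot above draws points beyond 1.5 times the interquartile range (seaborn's default `whis=1.5`) as individual outliers. The same rule can be computed by hand; the following is a minimal sketch on a small made-up `lead_time` sample, not on the real data.

```python
import pandas as pd

# Hypothetical sample standing in for hotel_df['lead_time'].
lead_time = pd.Series([0, 7, 13, 14, 18, 69, 160, 342, 737])

q1 = lead_time.quantile(0.25)
q3 = lead_time.quantile(0.75)
iqr = q3 - q1

# Values outside [lower, upper] are what the box plot shows as separate points.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = lead_time[(lead_time < lower) | (lead_time > upper)]

print(lower, upper, outliers.tolist())
```

Running the same lines on `hotel_df['lead_time']` would give the cutoffs for the actual bookings.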
In [ ]:
sns.barplot(x=hotel_df['distribution_channel'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='distribution_channel', ylabel='is_canceled'>
In [ ]:
hotel_df['distribution_channel'].value_counts()
Out[ ]:
TA/TO 97870
Direct 14645
Corporate 6677
GDS 193
Undefined 5
Name: distribution_channel, dtype: int64
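`sns.barplot` aggregates `y` with the mean by default, so each bar above is the cancellation rate of that channel. Because `Undefined` has only 5 rows and `GDS` only 193, it may help to read the rate next to the group size. A rough sketch of that table, using a tiny made-up sample in place of `hotel_df`:

```python
import pandas as pd

# Hypothetical mini-sample; the real hotel_df has about 119k rows.
sample = pd.DataFrame({
    'distribution_channel': ['TA/TO', 'TA/TO', 'TA/TO', 'TA/TO', 'Direct', 'Direct', 'Undefined'],
    'is_canceled':          [1, 1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the cancellation rate; count shows how many rows back it.
summary = (
    sample.groupby('distribution_channel')['is_canceled']
          .agg(cancel_rate='mean', bookings='count')
          .sort_values('bookings', ascending=False)
)
print(summary)
```

A rate computed from a handful of bookings is far noisier than one computed from tens of thousands, which is why the two columns are shown side by side.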
In [ ]:
sns.barplot(x=hotel_df['hotel'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='hotel', ylabel='is_canceled'>
In [ ]:
sns.barplot(x=hotel_df['arrival_date_year'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='arrival_date_year', ylabel='is_canceled'>
In [ ]:
plt.figure(figsize=(15, 5))
sns.barplot(x=hotel_df['arrival_date_month'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='arrival_date_month', ylabel='is_canceled'>
In [ ]:
import calendar
In [ ]:
print(calendar.month_name[1])
print(calendar.month_name[2])
print(calendar.month_name[3])
January
February
March
In [ ]:
months = []
for i in range(1, 13):
    # calendar.month_name[0] is an empty string, so start from index 1
    months.append(calendar.month_name[i])
In [ ]:
months
Out[ ]:
['January',
'February',
'March',
'April',
'May',
'June',
'July',
'August',
'September',
'October',
'November',
'December']
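The same list can also be built without an explicit loop, since `calendar.month_name` supports slicing. A short sketch (month names follow the current locale, assumed here to be English):

```python
import calendar

# Index 0 of calendar.month_name is an empty string, so slice from 1.
months = calendar.month_name[1:]
print(months)
```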
In [ ]:
plt.figure(figsize=(15, 5))
sns.barplot(x=hotel_df['arrival_date_month'], y=hotel_df['is_canceled'], order=months)
Out[ ]:
<Axes: xlabel='arrival_date_month', ylabel='is_canceled'>
In [ ]:
sns.barplot(x=hotel_df['is_repeated_guest'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='is_repeated_guest', ylabel='is_canceled'>
In [ ]:
sns.barplot(x=hotel_df['deposit_type'], y=hotel_df['is_canceled'])
Out[ ]:
<Axes: xlabel='deposit_type', ylabel='is_canceled'>
In [ ]:
hotel_df['deposit_type'].value_counts()
Out[ ]:
No Deposit 104641
Non Refund 14587
Refundable 162
Name: deposit_type, dtype: int64
In [ ]:
plt.figure(figsize=(15, 15))
sns.heatmap(hotel_df.corr(numeric_only=True), cmap='coolwarm', vmax=1, vmin=-1, annot=True)
Out[ ]:
<Axes: >
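A large annotated heatmap can be hard to read when only the target matters. One option is to pull out the `is_canceled` column of the correlation matrix and sort it by absolute value. The sketch below uses a small invented DataFrame and assumes a pandas version that supports `numeric_only` in `corr` (1.5 or later).

```python
import pandas as pd

# Hypothetical mini-sample with one non-numeric column, like hotel_df.
sample = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel', 'City Hotel'],
    'is_canceled': [0, 1, 1, 0, 1, 0],
    'lead_time': [7, 300, 150, 14, 200, 30],
    'total_of_special_requests': [2, 0, 0, 1, 0, 1],
})

# numeric_only=True skips object columns such as 'hotel'.
corr_with_target = (
    sample.corr(numeric_only=True)['is_canceled']
          .drop('is_canceled')                      # the target correlates perfectly with itself
          .sort_values(key=abs, ascending=False)    # strongest relationships first
)
print(corr_with_target)
```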
In [ ]:
hotel_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 distribution_channel 119390 non-null object
15 is_repeated_guest 119390 non-null int64
16 previous_cancellations 119390 non-null int64
17 previous_bookings_not_canceled 119390 non-null int64
18 reserved_room_type 119390 non-null object
19 assigned_room_type 119390 non-null object
20 booking_changes 119390 non-null int64
21 deposit_type 119390 non-null object
22 days_in_waiting_list 119390 non-null int64
23 customer_type 119390 non-null object
24 adr 119390 non-null float64
25 required_car_parking_spaces 119390 non-null int64
26 total_of_special_requests 119390 non-null int64
dtypes: float64(2), int64(16), object(9)
memory usage: 24.6+ MB
In [ ]:
hotel_df = hotel_df.dropna()
In [ ]:
hotel_df
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.00 | 0 | 0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.00 | 0 | 0 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.00 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | 0.0 | 0 | BB | BEL | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 96.14 | 0 | 0 |
119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | 0.0 | 0 | BB | FRA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 0 | Transient | 225.43 | 0 | 2 |
119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | DEU | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 0 | Transient | 157.71 | 0 | 4 |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 104.40 | 0 | 0 |
119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | 0.0 | 0 | HB | DEU | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 151.20 | 0 | 2 |
118898 rows × 27 columns
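`dropna()` removed 119390 - 118898 = 492 rows, which matches the missing counts in the `info()` output (4 in `children`, 488 in `country`). Before dropping rows, it can be useful to check where the gaps are. A minimal sketch on an invented frame:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with gaps in the same two columns as hotel_df.
sample = pd.DataFrame({
    'children': [0.0, np.nan, 1.0, 0.0],
    'country': ['PRT', 'GBR', None, 'PRT'],
    'adults': [2, 2, 1, 2],
})

missing_per_column = sample.isna().sum()            # gaps per column
rows_with_missing = sample.isna().any(axis=1).sum()  # rows dropna() would remove
cleaned = sample.dropna()

print(missing_per_column)
print(rows_with_missing, len(cleaned))
```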
In [ ]:
hotel_df[hotel_df['adults'] == 0]
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2224 | Resort Hotel | 0 | 1 | 2015 | October | 41 | 6 | 0 | 3 | 0 | 0.0 | 0 | SC | PRT | Corporate | 0 | 0 | 0 | A | I | 1 | No Deposit | 0 | Transient-Party | 0.00 | 0 | 0 |
2409 | Resort Hotel | 0 | 0 | 2015 | October | 42 | 12 | 0 | 0 | 0 | 0.0 | 0 | SC | PRT | Corporate | 0 | 0 | 0 | A | I | 0 | No Deposit | 0 | Transient | 0.00 | 0 | 0 |
3181 | Resort Hotel | 0 | 36 | 2015 | November | 47 | 20 | 1 | 2 | 0 | 0.0 | 0 | SC | ESP | TA/TO | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient-Party | 0.00 | 0 | 0 |
3684 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 1 | 4 | 0 | 0.0 | 0 | SC | PRT | TA/TO | 0 | 0 | 0 | A | A | 1 | No Deposit | 122 | Transient-Party | 0.00 | 0 | 0 |
3708 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 2 | 4 | 0 | 0.0 | 0 | SC | PRT | TA/TO | 0 | 0 | 0 | A | C | 1 | No Deposit | 122 | Transient-Party | 0.00 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
117204 | City Hotel | 0 | 296 | 2017 | July | 30 | 27 | 1 | 3 | 0 | 2.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | B | A | 0 | No Deposit | 0 | Transient | 98.85 | 0 | 1 |
117274 | City Hotel | 0 | 276 | 2017 | July | 31 | 30 | 2 | 1 | 0 | 2.0 | 0 | BB | DEU | TA/TO | 0 | 0 | 0 | B | B | 1 | No Deposit | 0 | Transient | 93.64 | 0 | 2 |
117303 | City Hotel | 0 | 291 | 2017 | July | 30 | 29 | 2 | 2 | 0 | 2.0 | 0 | BB | PRT | TA/TO | 0 | 0 | 0 | B | A | 0 | No Deposit | 0 | Transient | 98.85 | 0 | 1 |
117453 | City Hotel | 0 | 159 | 2017 | July | 31 | 31 | 1 | 3 | 0 | 2.0 | 0 | SC | FRA | TA/TO | 0 | 0 | 0 | A | A | 1 | No Deposit | 0 | Transient | 121.88 | 0 | 1 |
118200 | City Hotel | 0 | 10 | 2017 | August | 32 | 12 | 2 | 2 | 0 | 3.0 | 0 | BB | MAR | Direct | 0 | 0 | 0 | B | A | 1 | No Deposit | 0 | Transient-Party | 6.00 | 0 | 1 |
393 rows × 27 columns
In [ ]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
hotel_df = hotel_df.assign(people=hotel_df['adults'] + hotel_df['children'] + hotel_df['babies'])
In [ ]:
hotel_df.head()
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 | 2.0 |
In [ ]:
hotel_df[hotel_df['people'] == 0]
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2224 | Resort Hotel | 0 | 1 | 2015 | October | 41 | 6 | 0 | 3 | 0 | 0.0 | 0 | SC | PRT | Corporate | 0 | 0 | 0 | A | I | 1 | No Deposit | 0 | Transient-Party | 0.00 | 0 | 0 | 0.0 |
2409 | Resort Hotel | 0 | 0 | 2015 | October | 42 | 12 | 0 | 0 | 0 | 0.0 | 0 | SC | PRT | Corporate | 0 | 0 | 0 | A | I | 0 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 0.0 |
3181 | Resort Hotel | 0 | 36 | 2015 | November | 47 | 20 | 1 | 2 | 0 | 0.0 | 0 | SC | ESP | TA/TO | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient-Party | 0.00 | 0 | 0 | 0.0 |
3684 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 1 | 4 | 0 | 0.0 | 0 | SC | PRT | TA/TO | 0 | 0 | 0 | A | A | 1 | No Deposit | 122 | Transient-Party | 0.00 | 0 | 0 | 0.0 |
3708 | Resort Hotel | 0 | 165 | 2015 | December | 53 | 30 | 2 | 4 | 0 | 0.0 | 0 | SC | PRT | TA/TO | 0 | 0 | 0 | A | C | 1 | No Deposit | 122 | Transient-Party | 0.00 | 0 | 0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
115029 | City Hotel | 0 | 107 | 2017 | June | 26 | 27 | 0 | 3 | 0 | 0.0 | 0 | BB | CHE | TA/TO | 0 | 0 | 0 | A | A | 1 | No Deposit | 0 | Transient | 100.80 | 0 | 0 | 0.0 |
115091 | City Hotel | 0 | 1 | 2017 | June | 26 | 30 | 0 | 1 | 0 | 0.0 | 0 | SC | PRT | Direct | 0 | 0 | 0 | E | K | 0 | No Deposit | 0 | Transient | 0.00 | 1 | 1 | 0.0 |
116251 | City Hotel | 0 | 44 | 2017 | July | 28 | 15 | 1 | 1 | 0 | 0.0 | 0 | SC | SWE | TA/TO | 0 | 0 | 0 | A | K | 2 | No Deposit | 0 | Transient | 73.80 | 0 | 0 | 0.0 |
116534 | City Hotel | 0 | 2 | 2017 | July | 28 | 15 | 2 | 5 | 0 | 0.0 | 0 | SC | RUS | TA/TO | 0 | 0 | 0 | A | K | 1 | No Deposit | 0 | Transient-Party | 22.86 | 0 | 1 | 0.0 |
117087 | City Hotel | 0 | 170 | 2017 | July | 30 | 27 | 0 | 2 | 0 | 0.0 | 0 | BB | BRA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 0.0 |
170 rows × 28 columns
In [ ]:
hotel_df = hotel_df[hotel_df['people'] != 0]
In [ ]:
hotel_df
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 2.0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 2.0 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 | 1.0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 | 1.0 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.00 | 0 | 1 | 2.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | 0.0 | 0 | BB | BEL | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 96.14 | 0 | 0 | 2.0 |
119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | 0.0 | 0 | BB | FRA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 0 | Transient | 225.43 | 0 | 2 | 3.0 |
119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | DEU | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 0 | Transient | 157.71 | 0 | 4 | 2.0 |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 104.40 | 0 | 0 | 2.0 |
119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | 0.0 | 0 | HB | DEU | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 151.20 | 0 | 2 | 2.0 |
118728 rows × 28 columns
In [ ]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
hotel_df = hotel_df.assign(total_nights=hotel_df['stays_in_week_nights'] + hotel_df['stays_in_weekend_nights'])
In [ ]:
hotel_df[hotel_df['total_nights'] == 0]
Out[ ]:
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 |
167 | Resort Hotel | 0 | 111 | 2015 | July | 28 | 6 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | TA/TO | 0 | 0 | 0 | A | H | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 2 | 2.0 | 0 |
168 | Resort Hotel | 0 | 0 | 2015 | July | 28 | 6 | 0 | 0 | 1 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | E | H | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 1.0 | 0 |
196 | Resort Hotel | 0 | 8 | 2015 | July | 28 | 7 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 1 | 2.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
115483 | City Hotel | 0 | 15 | 2017 | July | 27 | 6 | 0 | 0 | 1 | 0.0 | 0 | SC | FRA | Direct | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient-Party | 0.0 | 0 | 0 | 1.0 | 0 |
117701 | City Hotel | 0 | 0 | 2017 | August | 32 | 8 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | TA/TO | 1 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 |
118029 | City Hotel | 0 | 0 | 2017 | August | 33 | 14 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 1 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 |
118631 | City Hotel | 0 | 78 | 2017 | August | 34 | 23 | 0 | 0 | 1 | 0.0 | 0 | BB | PRT | TA/TO | 0 | 0 | 0 | A | K | 7 | No Deposit | 0 | Transient-Party | 0.0 | 0 | 0 | 1.0 | 0 |
118963 | City Hotel | 0 | 1 | 2017 | August | 35 | 27 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 |
640 rows × 29 columns
In [ ]:
hotel_df['arrival_date_month'].apply(lambda x: 'spring' if x in ['March', 'April', 'May'] else 'summer' if x in ['June', 'July', 'August'] else 'fall' if x in ['September', 'October', 'November'] else 'winter')
Out[ ]:
0 summer
1 summer
2 summer
3 summer
4 summer
...
119385 summer
119386 summer
119387 summer
119388 summer
119389 summer
Name: arrival_date_month, Length: 118728, dtype: object
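The chained conditional inside the lambda works, but it is hard to scan. One possible alternative is a small named function with the same logic; `month_to_season` is a hypothetical helper name, not something from the original notebook.

```python
def month_to_season(month_name):
    """Map an English month name to a season, mirroring the lambda above."""
    if month_name in ('March', 'April', 'May'):
        return 'spring'
    if month_name in ('June', 'July', 'August'):
        return 'summer'
    if month_name in ('September', 'October', 'November'):
        return 'fall'
    # Like the lambda, anything else (including unexpected values) falls through to winter.
    return 'winter'


print(month_to_season('July'), month_to_season('December'))
```

It can be passed to `apply` in place of the lambda, for example `hotel_df['arrival_date_month'].apply(month_to_season)`.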
In [ ]:
season_dic = {'spring':[3, 4, 5], 'summer':[6, 7, 8], 'fall':[9, 10, 11], 'winter':[12, 1, 2]}
In [ ]:
new_season_dic = {}
for i in season_dic:
    for j in season_dic[i]:
        # map each month name (e.g. 'March') to its season name
        new_season_dic[calendar.month_name[j]] = i
In [ ]:
new_season_dic
Out[ ]:
{'March': 'spring',
'April': 'spring',
'May': 'spring',
'June': 'summer',
'July': 'summer',
'August': 'summer',
'September': 'fall',
'October': 'fall',
'November': 'fall',
'December': 'winter',
'January': 'winter',
'February': 'winter'}
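The nested loop that builds this lookup table can also be written as a single dict comprehension. A sketch that reproduces the same dictionary:

```python
import calendar

season_dic = {'spring': [3, 4, 5], 'summer': [6, 7, 8], 'fall': [9, 10, 11], 'winter': [12, 1, 2]}

# Invert season -> month numbers into month name -> season.
new_season_dic = {
    calendar.month_name[month]: season
    for season, month_numbers in season_dic.items()
    for month in month_numbers
}
print(new_season_dic)
```

One difference from the lambda version is worth keeping in mind: `Series.map` with a dict yields `NaN` for values that are not keys, whereas the lambda silently labels anything unrecognized as winter.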
In [ ]:
hotel_df['season'] = hotel_df['arrival_date_month'].map(new_season_dic)
<ipython-input-41-71c40adf4d6e>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
hotel_df['season'] = hotel_df['arrival_date_month'].map(new_season_dic)
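The SettingWithCopyWarning above usually appears when a DataFrame was produced by filtering another one and is then assigned to. A minimal sketch, assuming `hotel_df` came from such a filter earlier in the notebook (the toy frame and the names `raw`, `hotel_small` and `month_to_season` are illustrative, not from the original):

```python
import pandas as pd

# Toy stand-in for the raw hotel data (values are made up for illustration).
raw = pd.DataFrame({
    "arrival_date_month": ["July", "August", "January"],
    "adults": [2, 0, 1],
})

# Filtering returns an object pandas may treat as a slice of `raw`.
# Calling .copy() makes it an independent DataFrame, so adding a column
# afterwards is unambiguous.
hotel_small = raw[raw["adults"] > 0].copy()

month_to_season = {"July": "summer", "August": "summer", "January": "winter"}
hotel_small["season"] = hotel_small["arrival_date_month"].map(month_to_season)
print(hotel_small)
```

The original `raw` frame is left untouched; only the copy gains the new column.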
In [ ]:
hotel_df.head()
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 | 2.0 | 2 | summer |
In [ ]:
hotel_df['expected_room_type'] = (hotel_df['reserved_room_type'] == hotel_df['assigned_room_type']).astype(int)
<ipython-input-43-d7dcc6110bd8>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
hotel_df['expected_room_type'] = (hotel_df['reserved_room_type'] == hotel_df['assigned_room_type']).astype(int)
In [ ]:
hotel_df.head()
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | expected_room_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer | 1 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer | 1 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer | 0 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer | 1 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 | 2.0 | 2 | summer | 1 |
In [ ]:
hotel_df['cancel_rate'] = hotel_df['previous_cancellations'] / (hotel_df['previous_cancellations'] + hotel_df['previous_bookings_not_canceled'])
<ipython-input-45-01c32095fb5f>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
hotel_df['cancel_rate'] = hotel_df['previous_cancellations'] / (hotel_df['previous_cancellations'] + hotel_df['previous_bookings_not_canceled'])
In [ ]:
hotel_df.head()
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | expected_room_type | cancel_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer | 1 | NaN |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.0 | 0 | 0 | 2.0 | 0 | summer | 1 | NaN |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer | 0 | NaN |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.0 | 0 | 0 | 1.0 | 1 | summer | 1 | NaN |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.0 | 0 | 1 | 2.0 | 2 | summer | 1 | NaN |
In [ ]:
hotel_df[hotel_df['cancel_rate'].isna()] # guests with no previous bookings at all (0 / 0 gives NaN)
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | expected_room_type | cancel_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 2.0 | 0 | summer | 1 | NaN |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | 0 | Transient | 0.00 | 0 | 0 | 2.0 | 0 | summer | 1 | NaN |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 | 1.0 | 1 | summer | 0 | NaN |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 75.00 | 0 | 0 | 1.0 | 1 | summer | 1 | NaN |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 98.00 | 0 | 1 | 2.0 | 2 | summer | 1 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
119385 | City Hotel | 0 | 23 | 2017 | August | 35 | 30 | 2 | 5 | 2 | 0.0 | 0 | BB | BEL | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 96.14 | 0 | 0 | 2.0 | 7 | summer | 1 | NaN |
119386 | City Hotel | 0 | 102 | 2017 | August | 35 | 31 | 2 | 5 | 3 | 0.0 | 0 | BB | FRA | TA/TO | 0 | 0 | 0 | E | E | 0 | No Deposit | 0 | Transient | 225.43 | 0 | 2 | 3.0 | 7 | summer | 1 | NaN |
119387 | City Hotel | 0 | 34 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | DEU | TA/TO | 0 | 0 | 0 | D | D | 0 | No Deposit | 0 | Transient | 157.71 | 0 | 4 | 2.0 | 7 | summer | 1 | NaN |
119388 | City Hotel | 0 | 109 | 2017 | August | 35 | 31 | 2 | 5 | 2 | 0.0 | 0 | BB | GBR | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 104.40 | 0 | 0 | 2.0 | 7 | summer | 1 | NaN |
119389 | City Hotel | 0 | 205 | 2017 | August | 35 | 29 | 2 | 7 | 2 | 0.0 | 0 | HB | DEU | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 0 | Transient | 151.20 | 0 | 2 | 2.0 | 9 | summer | 1 | NaN |
109523 rows × 32 columns
In [ ]:
hotel_df[~hotel_df['cancel_rate'].isna()] # guests with a booking history (cancel_rate is 0.0 if they never canceled)
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | expected_room_type | cancel_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13808 | Resort Hotel | 0 | 6 | 2016 | January | 5 | 26 | 0 | 2 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 0 | 1 | A | D | 1 | No Deposit | 0 | Transient | 27.0 | 0 | 0 | 1.0 | 2 | winter | 0 | 0.0 |
13813 | Resort Hotel | 0 | 1 | 2016 | February | 6 | 2 | 0 | 2 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 0 | 1 | A | D | 1 | No Deposit | 0 | Transient | 27.0 | 0 | 0 | 1.0 | 2 | winter | 0 | 0.0 |
13814 | Resort Hotel | 0 | 6 | 2016 | November | 47 | 14 | 1 | 0 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 0 | 2 | A | A | 0 | No Deposit | 0 | Transient | 27.0 | 0 | 0 | 1.0 | 1 | fall | 1 | 0.0 |
13815 | Resort Hotel | 0 | 6 | 2017 | January | 3 | 17 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 0 | 3 | A | A | 0 | No Deposit | 0 | Transient | 35.0 | 0 | 0 | 1.0 | 1 | winter | 1 | 0.0 |
13817 | Resort Hotel | 0 | 1 | 2017 | February | 8 | 21 | 0 | 2 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 0 | 1 | A | A | 0 | No Deposit | 0 | Transient-Party | 35.0 | 0 | 0 | 1.0 | 2 | winter | 1 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
117424 | City Hotel | 0 | 3 | 2017 | August | 35 | 31 | 0 | 1 | 2 | 1.0 | 0 | BB | PRT | Corporate | 1 | 0 | 1 | A | A | 0 | No Deposit | 0 | Transient | 95.0 | 0 | 4 | 3.0 | 1 | summer | 1 | 0.0 |
117841 | City Hotel | 0 | 7 | 2017 | August | 35 | 30 | 0 | 2 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 0 | 1 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 2 | 1.0 | 2 | summer | 1 | 0.0 |
118581 | City Hotel | 0 | 11 | 2017 | August | 34 | 25 | 0 | 2 | 2 | 0.0 | 0 | BB | FRA | TA/TO | 0 | 0 | 1 | D | D | 1 | No Deposit | 0 | Group | 125.0 | 0 | 0 | 2.0 | 2 | summer | 1 | 0.0 |
118651 | City Hotel | 0 | 189 | 2017 | August | 35 | 27 | 2 | 0 | 2 | 0.0 | 0 | BB | ITA | TA/TO | 0 | 0 | 1 | A | A | 1 | No Deposit | 0 | Transient-Party | 119.0 | 0 | 3 | 2.0 | 2 | summer | 1 | 0.0 |
118654 | City Hotel | 0 | 189 | 2017 | August | 35 | 27 | 2 | 0 | 2 | 0.0 | 0 | BB | ITA | TA/TO | 0 | 0 | 1 | A | A | 1 | No Deposit | 0 | Transient | 119.0 | 0 | 2 | 2.0 | 2 | summer | 1 | 0.0 |
9205 rows × 32 columns
In [ ]:
hotel_df[hotel_df['cancel_rate'] > 0] # guests with at least one previous cancellation
Out[ ]:
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | season | expected_room_type | cancel_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13825 | Resort Hotel | 0 | 6 | 2016 | March | 13 | 21 | 1 | 0 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 1 | 1 | A | A | 0 | No Deposit | 0 | Transient | 40.0 | 0 | 0 | 1.0 | 1 | spring | 1 | 0.500000 |
13826 | Resort Hotel | 0 | 7 | 2016 | June | 26 | 21 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 1 | 2 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 1.0 | 1 | summer | 1 | 0.333333 |
13827 | Resort Hotel | 0 | 8 | 2016 | September | 40 | 27 | 0 | 2 | 2 | 0.0 | 0 | BB | PRT | Corporate | 0 | 1 | 3 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 2.0 | 2 | fall | 1 | 0.250000 |
13855 | Resort Hotel | 0 | 5 | 2015 | November | 48 | 25 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 1 | 1 | A | A | 0 | No Deposit | 0 | Transient | 25.0 | 0 | 0 | 1.0 | 1 | fall | 1 | 0.500000 |
13856 | Resort Hotel | 0 | 0 | 2015 | December | 52 | 22 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 0 | 1 | 2 | A | A | 0 | No Deposit | 0 | Transient | 25.0 | 0 | 0 | 1.0 | 1 | winter | 1 | 0.333333 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
111356 | City Hotel | 0 | 10 | 2017 | June | 25 | 22 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 1 | 4 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 1.0 | 1 | summer | 1 | 0.200000 |
111357 | City Hotel | 0 | 20 | 2017 | July | 28 | 11 | 0 | 3 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 1 | 5 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 1.0 | 3 | summer | 1 | 0.166667 |
111358 | City Hotel | 0 | 8 | 2017 | July | 30 | 25 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 1 | 6 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 1 | 0 | 1.0 | 1 | summer | 1 | 0.142857 |
111359 | City Hotel | 0 | 13 | 2017 | August | 35 | 29 | 0 | 1 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 1 | 7 | A | A | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 1.0 | 1 | summer | 1 | 0.125000 |
111925 | City Hotel | 1 | 6 | 2017 | July | 29 | 17 | 1 | 0 | 1 | 0.0 | 0 | BB | PRT | Corporate | 1 | 1 | 1 | A | D | 0 | No Deposit | 0 | Transient | 65.0 | 0 | 0 | 1.0 | 1 | summer | 0 | 0.500000 |
6442 rows × 32 columns
In [ ]:
hotel_df['cancel_rate'] = hotel_df['cancel_rate'].fillna(-1)
<ipython-input-50-599e806aa6a0>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
hotel_df['cancel_rate'] = hotel_df['cancel_rate'].fillna(-1)
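The NaN values come from dividing 0 by 0 for guests with no booking history, and `fillna(-1)` turns them into a sentinel that a tree model can split on. A small sketch with made-up rows (the column names match the post; the `history` frame itself is hypothetical):

```python
import numpy as np
import pandas as pd

history = pd.DataFrame({
    "previous_cancellations": [0, 0, 1],
    "previous_bookings_not_canceled": [0, 2, 1],
})

total = history["previous_cancellations"] + history["previous_bookings_not_canceled"]

# pandas does not raise on 0 / 0; the result is NaN.
rate = history["previous_cancellations"] / total

# -1 lies outside the valid 0..1 range, so it marks "no history" unambiguously.
rate_filled = rate.fillna(-1)
print(rate_filled.tolist())
```

A sentinel outside the valid range works for tree-based models; for linear models a separate indicator column is often a safer choice.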
In [ ]:
hotel_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 118728 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 118728 non-null object
1 is_canceled 118728 non-null int64
2 lead_time 118728 non-null int64
3 arrival_date_year 118728 non-null int64
4 arrival_date_month 118728 non-null object
5 arrival_date_week_number 118728 non-null int64
6 arrival_date_day_of_month 118728 non-null int64
7 stays_in_weekend_nights 118728 non-null int64
8 stays_in_week_nights 118728 non-null int64
9 adults 118728 non-null int64
10 children 118728 non-null float64
11 babies 118728 non-null int64
12 meal 118728 non-null object
13 country 118728 non-null object
14 distribution_channel 118728 non-null object
15 is_repeated_guest 118728 non-null int64
16 previous_cancellations 118728 non-null int64
17 previous_bookings_not_canceled 118728 non-null int64
18 reserved_room_type 118728 non-null object
19 assigned_room_type 118728 non-null object
20 booking_changes 118728 non-null int64
21 deposit_type 118728 non-null object
22 days_in_waiting_list 118728 non-null int64
23 customer_type 118728 non-null object
24 adr 118728 non-null float64
25 required_car_parking_spaces 118728 non-null int64
26 total_of_special_requests 118728 non-null int64
27 people 118728 non-null float64
28 total_nights 118728 non-null int64
29 season 118728 non-null object
30 expected_room_type 118728 non-null int64
31 cancel_rate 118728 non-null float64
dtypes: float64(4), int64(18), object(10)
memory usage: 29.9+ MB
In [ ]:
hotel_df['country'].dtype # dtype('O')
Out[ ]:
dtype('O')
In [ ]:
hotel_df['people'].dtype # dtype('float64')
Out[ ]:
dtype('float64')
In [ ]:
obj_list = []
for i in hotel_df.columns:
    if hotel_df[i].dtype == 'O':
        obj_list.append(i)
In [ ]:
obj_list
Out[ ]:
['hotel',
'arrival_date_month',
'meal',
'country',
'distribution_channel',
'reserved_room_type',
'assigned_room_type',
'deposit_type',
'customer_type',
'season']
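The same list of object-dtype columns can be collected without an explicit loop. A hedged sketch on a toy frame (`toy`, `obj_cols` and `unique_counts` are illustrative names; `select_dtypes` and `nunique` are standard pandas methods):

```python
import pandas as pd

# Toy frame; dtype="object" is set explicitly so the example behaves the
# same regardless of how a given pandas version infers string columns.
toy = pd.DataFrame({
    "meal": pd.Series(["BB", "HB", "BB"], dtype="object"),
    "season": pd.Series(["summer", "summer", "winter"], dtype="object"),
    "adults": [2, 1, 2],
})

# Column names whose dtype is object.
obj_cols = toy.select_dtypes(include="object").columns.tolist()

# Number of distinct values per object column, as a plain dict.
unique_counts = toy[obj_cols].nunique().to_dict()
print(obj_cols, unique_counts)
```

Columns with many distinct values (such as `country` in the post) are the ones that would add the most dummy columns when one-hot encoded.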
In [ ]:
for i in obj_list:
    # nunique() must be called with parentheses to get the count of distinct values
    print(i, hotel_df[i].nunique())
In [ ]:
hotel_df['meal'].value_counts()
Out[ ]:
BB 91789
HB 14429
SC 10547
Undefined 1165
FB 798
Name: meal, dtype: int64
In [ ]:
hotel_df.drop(['country', 'meal'], axis=1, inplace=True)
<ipython-input-58-55d6171d5808>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
hotel_df.drop(['country', 'meal'], axis=1, inplace=True)
In [ ]:
obj_list.remove('country')
obj_list.remove('meal')
In [ ]:
hotel_df = pd.get_dummies(hotel_df, columns=obj_list)
In [ ]:
hotel_df.head()
Out[ ]:
is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | people | total_nights | expected_room_type | cancel_rate | hotel_City Hotel | hotel_Resort Hotel | arrival_date_month_April | arrival_date_month_August | arrival_date_month_December | arrival_date_month_February | arrival_date_month_January | arrival_date_month_July | arrival_date_month_June | arrival_date_month_March | arrival_date_month_May | arrival_date_month_November | arrival_date_month_October | arrival_date_month_September | distribution_channel_Corporate | distribution_channel_Direct | distribution_channel_GDS | distribution_channel_TA/TO | distribution_channel_Undefined | reserved_room_type_A | reserved_room_type_B | reserved_room_type_C | reserved_room_type_D | reserved_room_type_E | reserved_room_type_F | reserved_room_type_G | reserved_room_type_H | reserved_room_type_L | assigned_room_type_A | assigned_room_type_B | assigned_room_type_C | assigned_room_type_D | assigned_room_type_E | assigned_room_type_F | assigned_room_type_G | assigned_room_type_H | assigned_room_type_I | assigned_room_type_K | assigned_room_type_L | deposit_type_No Deposit | deposit_type_Non Refund | deposit_type_Refundable | customer_type_Contract | customer_type_Group | customer_type_Transient | customer_type_Transient-Party | season_fall | season_spring | season_summer | season_winter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 342 | 2015 | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 3 | 0 | 0.0 | 0 | 0 | 2.0 | 0 | 1 | -1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 737 | 2015 | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 4 | 0 | 0.0 | 0 | 0 | 2.0 | 0 | 1 | -1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 7 | 2015 | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 75.0 | 0 | 0 | 1.0 | 1 | 0 | -1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 13 | 2015 | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 75.0 | 0 | 0 | 1.0 | 1 | 1 | -1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 14 | 2015 | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 98.0 | 0 | 1 | 2.0 | 2 | 1 | -1.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
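The table above shows the dummy columns as 0/1 integers. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so the sketch below passes `dtype=int` to get 0/1 values again (the `toy` frame is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "season": ["summer", "winter", "summer"],
    "adr": [75.0, 27.0, 98.0],
})

# One-hot encode only the listed column; untouched columns come first,
# followed by one new column per category.
encoded = pd.get_dummies(toy, columns=["season"], dtype=int)
print(encoded)
```

scikit-learn tree models accept either boolean or integer dummies, so this affects the display more than the training result.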
In [ ]:
from sklearn.model_selection import train_test_split
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(hotel_df.drop('is_canceled', axis=1), hotel_df['is_canceled'], test_size=0.3, random_state=10)
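2. Random Forest
- A Decision Tree is a strong model, but it tends to overfit the training data (techniques such as pruning reduce this side effect, but not enough)
- A random forest trains many trees (Decision Trees) and combines their classification results to reach a final prediction
- It is a Bagging ensemble model built on Decision Trees
- It is very popular: easy to use and generally strong in performance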
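2-1. Ensemble Models
- A technique that uses several machine learning models together to find the best answer
- Voting
  - Combines models built with different algorithms
  - The final result is decided by a vote among the models
- Bagging
  - Uses the same algorithm, trained on different sample combinations
  - The samples are drawn with replacement, so the same row can appear more than once, and the results are aggregated
- Boosting
  - Trains models in sequence, assigning weights so that each one corrects the errors of the previous one
  - Performs very well, but can be overly sensitive to mislabeled data and outliers
- Stacking
  - A meta model makes one more prediction from the predictions of several base models
  - Used to push performance to the limit, but can cause overfitting (especially on small datasets)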
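The bagging idea described above can be sketched by hand: draw bootstrap samples, fit one Decision Tree per sample, and take a majority vote. This is an illustration on synthetic data, not scikit-learn's internal implementation (its `RandomForestClassifier` averages predicted probabilities rather than counting hard votes); all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the hotel features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(15):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: row indices drawn with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote over the individual trees' 0/1 predictions.
votes = np.stack([t.predict(X) for t in trees])  # shape (15, 300)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
train_acc = (bagged_pred == y).mean()
print(train_acc)
```

Each tree sees a slightly different dataset, so their individual overfitting errors tend to cancel out in the vote.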
In [ ]:
from sklearn.ensemble import RandomForestClassifier
In [ ]:
rf = RandomForestClassifier()
In [ ]:
rf.fit(X_train, y_train)
RandomForestClassifier()
In [ ]:
pred1 = rf.predict(X_test)
In [ ]:
pred1
Out[ ]:
array([0, 0, 0, ..., 0, 0, 0])
In [ ]:
proba1 = rf.predict_proba(X_test)
proba1
Out[ ]:
array([[0.955, 0.045],
[1. , 0. ],
[0.72 , 0.28 ],
...,
[1. , 0. ],
[0.9 , 0.1 ],
[0.96 , 0.04 ]])
In [ ]:
# Prediction for the first test sample
proba1[0]
Out[ ]:
array([0.955, 0.045])
In [ ]:
# Probability of cancellation (class 1) for every test sample
proba1[:, 1]
Out[ ]:
array([0.045, 0. , 0.28 , ..., 0. , 0.1 , 0.04 ])
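For a binary problem, `predict` amounts to picking the class with the larger probability, which is a 0.5 cutoff on the cancellation column. The sketch below applies that cutoff, and a lower hypothetical one, to the first three rows of `proba1` shown above (the 0.25 threshold is an assumption chosen only for illustration):

```python
import numpy as np

# First three rows of proba1 from the output above.
proba = np.array([
    [0.955, 0.045],
    [1.0, 0.0],
    [0.72, 0.28],
])

cancel_proba = proba[:, 1]

# Default behaviour: class 1 only when its probability exceeds 0.5.
default_pred = (cancel_proba > 0.5).astype(int)

# A lower cutoff flags more bookings as likely cancellations.
sensitive_pred = (cancel_proba >= 0.25).astype(int)
print(default_pred, sensitive_pred)
```

Lowering the threshold raises recall for cancellations at the cost of more false alarms, which is the trade-off the ROC curve visualizes.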
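3. ROC Curve
- A tool for measuring the performance of a binary classifier
- The curve plots TPR (sensitivity) on the y-axis against FPR (1 - specificity) on the x-axis as the decision threshold changes
- FPR (False Positive Rate)
  - Equal to 1 - specificity
  - FP / (TN + FP)
  - The share of actual negatives that are wrongly classified as positive
- TPR (True Positive Rate)
  - Sensitivity: the share of actual positives that are detected
  - TP / (FN + TP)
  - Actually positive and correctly classified as positive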
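4. AUC (Area Under the ROC Curve)
- The area under the ROC curve, that is, between the curve and the x-axis (FPR axis)
- AUC can range from 0 to 1; 0.5 corresponds to random guessing, so useful models fall between 0.5 and 1, and a larger value means better predictions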
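AUC can also be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way on four made-up samples; `pairwise_auc` is a hypothetical helper written for this illustration, while the post itself uses scikit-learn's `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, scores)
# Inverting the scores reverses the ranking and pushes AUC below 0.5.
inverted_auc = pairwise_auc(y_true, [1 - s for s in scores])
print(auc, inverted_auc)  # 0.75 0.25
```

A model with inverted scores lands below 0.5, which is why 0.5 is treated as the practical floor for a useful classifier.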
In [ ]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
In [ ]:
accuracy_score(y_test, pred1)
Out[ ]:
0.8630225441477863
In [ ]:
confusion_matrix(y_test, pred1)
Out[ ]:
array([[20779, 1579],
[ 3300, 9961]])
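The TPR and FPR formulas from the ROC section can be checked against this confusion matrix. scikit-learn lays it out as `[[TN, FP], [FN, TP]]` for labels 0 and 1; the sketch below only does the arithmetic, and the variable names are illustrative.

```python
# Values copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 20779, 1579
fn, tp = 3300, 9961

tpr = tp / (tp + fn)            # sensitivity / recall for class 1
fpr = fp / (fp + tn)            # share of negatives flagged as positive
specificity = tn / (tn + fp)    # equals 1 - FPR
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"TPR={tpr:.4f} FPR={fpr:.4f} precision={precision:.4f} accuracy={accuracy:.4f}")
```

The TPR of about 0.75 and precision of about 0.86 agree with the class 1 row of the classification report, and the accuracy agrees with `accuracy_score`.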
In [ ]:
print(classification_report(y_test, pred1))
precision recall f1-score support
0 0.86 0.93 0.89 22358
1 0.86 0.75 0.80 13261
accuracy 0.86 35619
macro avg 0.86 0.84 0.85 35619
weighted avg 0.86 0.86 0.86 35619
In [ ]:
roc_auc_score(y_test, proba1[:, 1])
Out[ ]:
0.9303955154719541
In [ ]:
# Hyperparameter tuning (max_depth=30)
rf2 = RandomForestClassifier(max_depth=30, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])
# Before tuning (default settings): 0.9303955154719541
# With max_depth=30: 0.9320285483491656
Out[ ]:
0.9320285483491656
In [ ]:
# Hyperparameter tuning (max_depth=50)
rf2 = RandomForestClassifier(max_depth=50, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])
# Before tuning (default settings): 0.9303955154719541
# With max_depth=30: 0.9320285483491656
# With max_depth=50: 0.9303745518246758
Out[ ]:
0.9303745518246758
In [ ]:
# Hyperparameter tuning (min_samples_split=5)
rf2 = RandomForestClassifier(min_samples_split=5, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])
# Before tuning (default settings): 0.9303955154719541
# With min_samples_split=5: 0.931436154565479
Out[ ]:
0.931436154565479
In [ ]:
# Hyperparameter tuning (min_samples_split=7)
rf2 = RandomForestClassifier(min_samples_split=7, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])
# Before tuning (default settings): 0.9303955154719541
# With min_samples_split=5: 0.931436154565479
# With min_samples_split=7: 0.9312578210627522
Out[ ]:
0.9312578210627522
In [ ]:
# Combine the best values found so far
rf2 = RandomForestClassifier(max_depth=30, min_samples_split=5, random_state=10)
rf2.fit(X_train, y_train)
proba2 = rf2.predict_proba(X_test)
roc_auc_score(y_test, proba2[:, 1])
# Before tuning (default settings): 0.9303955154719541
# With max_depth=30: 0.9320285483491656
# With min_samples_split=5: 0.931436154565479
# With max_depth=30, min_samples_split=5: 0.9320878540030826
Out[ ]:
0.9320878540030826
5. Finding the Best Hyperparameter Values
- GridSearchCV: tries every combination of the hyperparameter values you specify and picks the best one using cross-validation
In [ ]:
from sklearn.model_selection import GridSearchCV
In [ ]:
params = {
    'max_depth': [None, 10, 30, 50],
    'min_samples_split': [2, 3, 5, 7, 10]
}
In [ ]:
rf3 = RandomForestClassifier(random_state=10)
In [ ]:
grid_df = GridSearchCV(rf3, params, cv=5) # cv: number of cross-validation folds
In [ ]:
grid_df.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=10),
param_grid={'max_depth': [None, 10, 30, 50],
'min_samples_split': [2, 3, 5, 7, 10]})
In [ ]:
grid_df.best_params_
Out[ ]:
{'max_depth': 30, 'min_samples_split': 2}
In [ ]:
rf3 = RandomForestClassifier(max_depth=30, min_samples_split=2, random_state=10)
rf3.fit(X_train, y_train)
proba3 = rf3.predict_proba(X_test)
roc_auc_score(y_test, proba3[:, 1])
# With max_depth=30, min_samples_split=2: 0.9320285483491656
Out[ ]:
0.9320285483491656
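When `scoring` is not set, GridSearchCV ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. That may be why the search picked `min_samples_split=2` even though the manual AUC comparison slightly favored 5. A minimal sketch, on synthetic data and with a deliberately small grid (both are assumptions to keep it fast), of ranking by AUC instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "max_depth": [3, None],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=10),
    param_grid,
    cv=3,
    scoring="roc_auc",  # rank candidates by cross-validated AUC, not accuracy
)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_  # mean cross-validated AUC of the best candidate
print(best_params, round(best_auc, 4))
```

`best_score_` is a cross-validated average on the training data, so it will not match the test-set AUC exactly.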
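6. Feature Importances
- A measure of how much each feature contributed to separating the classes when the Decision Trees split their nodes
- A value of 0 means the feature was never chosen for a split; in scikit-learn the importances are normalized to sum to 1, so a value close to 1 means that feature accounts for nearly all of the impurity reduction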
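The sketch below fits a small forest on synthetic data and checks two properties of `feature_importances_`: every value is non-negative, and together they sum to 1. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=30, random_state=10).fit(X, y)

importances = forest.feature_importances_  # one value per input column
total = importances.sum()

# Column indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
print(importances.round(3), total, ranking)
```

Because the values are shares of a total, an importance is only meaningful relative to the other features in the same model.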
In [ ]:
proba3 = rf3.predict_proba(X_test)
proba3
Out[ ]:
array([[0.96333333, 0.03666667],
[0.98747831, 0.01252169],
[0.67944316, 0.32055684],
...,
[1. , 0. ],
[0.90383261, 0.09616739],
[0.97 , 0.03 ]])
In [ ]:
roc_auc_score(y_test, proba3[:, 1])
Out[ ]:
0.9320285483491656
In [ ]:
rf3.feature_importances_
Out[ ]:
array([1.27137138e-01, 2.14563953e-02, 4.68663037e-02, 5.96305446e-02,
2.16746957e-02, 3.25152451e-02, 1.02347526e-02, 5.40307416e-03,
8.84445875e-04, 2.02141842e-03, 2.44460558e-02, 3.14871409e-03,
2.18776128e-02, 3.06142123e-03, 9.51803193e-02, 2.23244236e-02,
5.75968205e-02, 1.36759899e-02, 3.53588232e-02, 2.80447856e-02,
3.80039740e-02, 7.62870375e-03, 6.72543460e-03, 3.44290597e-03,
4.03857944e-03, 2.04861894e-03, 2.63415501e-03, 1.84105707e-03,
3.95932072e-03, 3.62772728e-03, 3.05844877e-03, 3.56107858e-03,
2.38180451e-03, 3.26791942e-03, 2.91845948e-03, 3.01678974e-03,
9.42477811e-03, 2.46969262e-04, 1.08307976e-02, 0.00000000e+00,
5.74350945e-03, 9.03907596e-04, 6.54901146e-04, 4.05826969e-03,
2.30412987e-03, 1.25861929e-03, 1.10018824e-03, 3.65382816e-04,
3.43650984e-05, 1.10160554e-02, 1.51175712e-03, 1.19112472e-03,
5.19366703e-03, 2.61859311e-03, 1.46746839e-03, 1.15967758e-03,
3.98805744e-04, 1.35937049e-04, 1.43906056e-04, 2.28125116e-06,
6.99747351e-02, 9.55569434e-02, 4.52910589e-04, 3.08796348e-03,
4.79153629e-04, 1.67417487e-02, 1.19682713e-02, 3.45258733e-03,
4.02311808e-03, 4.35533780e-03, 3.44818237e-03])
In [ ]:
1.27137138e-01
Out[ ]:
0.127137138
In [ ]:
feat_imp = pd.DataFrame({
    'features': X_train.columns,
    'importances': rf3.feature_importances_
})
In [ ]:
feat_imp
Out[ ]:
features | importances | |
---|---|---|
0 | lead_time | 0.127137 |
1 | arrival_date_year | 0.021456 |
2 | arrival_date_week_number | 0.046866 |
3 | arrival_date_day_of_month | 0.059631 |
4 | stays_in_weekend_nights | 0.021675 |
... | ... | ... |
66 | customer_type_Transient-Party | 0.011968 |
67 | season_fall | 0.003453 |
68 | season_spring | 0.004023 |
69 | season_summer | 0.004355 |
70 | season_winter | 0.003448 |
71 rows × 2 columns
In [ ]:
top10 = feat_imp.sort_values('importances', ascending=False).head(10)
top10
Out[ ]:
features | importances | |
---|---|---|
0 | lead_time | 0.127137 |
61 | deposit_type_Non Refund | 0.095557 |
14 | adr | 0.095180 |
60 | deposit_type_No Deposit | 0.069975 |
3 | arrival_date_day_of_month | 0.059631 |
16 | total_of_special_requests | 0.057597 |
2 | arrival_date_week_number | 0.046866 |
20 | cancel_rate | 0.038004 |
18 | total_nights | 0.035359 |
5 | stays_in_week_nights | 0.032515 |
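One-hot encoding splits a single original column across several dummy columns, so its importance is split as well. The sketch below sums the two `deposit_type` rows from the top-10 table back into one value. Only a few rows are copied, the third dummy (`deposit_type_Refundable`) is not in the top 10 and is left out, and `base_feature` is a hypothetical helper.

```python
# A few rows copied from the top-10 table above.
top_rows = {
    "lead_time": 0.127137,
    "deposit_type_Non Refund": 0.095557,
    "adr": 0.095180,
    "deposit_type_No Deposit": 0.069975,
}

# Original columns that were one-hot encoded (subset of obj_list).
encoded_columns = ["deposit_type"]


def base_feature(name):
    """Map a dummy column name back to the column it came from."""
    for col in encoded_columns:
        if name.startswith(col + "_"):
            return col
    return name


grouped = {}
for name, value in top_rows.items():
    key = base_feature(name)
    grouped[key] = grouped.get(key, 0.0) + value

print(grouped)
```

Summed this way, `deposit_type` (about 0.166) outweighs `lead_time` (about 0.127), which the per-dummy ranking does not show directly.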
In [ ]:
plt.figure(figsize=(5, 10))
sns.barplot(x='importances', y='features', data=top10)
Out[ ]:
<Axes: xlabel='importances', ylabel='features'>