ANN - Regression: Bike Rental Prediction
Background
Prediction of bicycle rental volume using an ANN.
Loading the dataset
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
bike = pd.read_csv("bike-sharing-daily.csv")
bike.isnull().sum() # check for missing values
instant 0
dteday 0
season 0
yr 0
mnth 0
holiday 0
weekday 0
workingday 0
weathersit 0
temp 0
hum 0
windspeed 0
casual 0
registered 0
cnt 0
dtype: int64
The results above show that there are no missing values.
# Removing unnecessary columns
bike.drop(labels = ["instant"], axis = 1, inplace=True) # inplace: apply changes to 'bike'
# Time-series indexing
bike.dteday = pd.to_datetime(bike.dteday, format="%m/%d/%Y") # formatting datetime
bike.index = pd.DatetimeIndex(bike.dteday) # indexing the datetime
bike.drop(labels=["dteday"], axis = 1, inplace=True) # removing the duplicate 'dteday' column
Visualizing the dataset
bike["cnt"].asfreq("W").plot(linewidth = 3) # by week
plt.title("Bike Usage Per Week")
plt.xlabel("Week")
plt.ylabel("Bike Rental")
bike["cnt"].asfreq("M").plot(linewidth = 3) # by month
plt.title("Bike Usage Per Month")
plt.xlabel("Month")
plt.ylabel("Bike Rental")
bike["cnt"].asfreq("Q").plot(linewidth = 3) # by quarter
plt.title("Bike Usage Per Quarter")
plt.xlabel("Quarter")
plt.ylabel("Bike Rental")
# Checking several visualizations at a glance
sns.pairplot(bike)
Building the ANN
EDA
X_numerical = bike[ ["temp", "hum", "windspeed", "cnt"] ]
X_numerical
sns.pairplot(X_numerical) # correlation between independent variables
X_numerical.corr() # correlation analysis
sns.heatmap(X_numerical.corr(), annot = True) # correlation heatmap
annot: display the numerical value in each cell
Preprocessing
X_cat = bike[ ["season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit"] ]
X_cat
We will train an ANN using the independent variables listed above.
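Because these columns hold categorical codes (e.g., season 1–4, weekday 0–6), they are one-hot encoded first so that the network does not interpret the codes as ordered magnitudes.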
# converting categorical data
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
X_cat.shape
(731, 32)
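The 32 columns correspond to one indicator per observed category: 4 seasons + 2 years + 12 months + 2 holiday values + 7 weekdays + 2 workingday values + 3 weather situations.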
X_cat = pd.DataFrame(X_cat) # convert to a DataFrame for easier inspection
X_cat
X_numerical = X_numerical.reset_index() # the datetime was previously set as the index
X_numerical
# integrating all the X candidates
X_all = pd.concat( [X_cat, X_numerical], axis = 1)
# removing unnecessary variables
X_all.drop("dteday", axis = 1, inplace = True)
X = X_all.iloc[:, :-1].values
y = X_all.iloc[:, -1:].values
X.shape, type(X)
((731, 35), numpy.ndarray)
y.shape, type(y)
((731, 1), numpy.ndarray)
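X therefore has 35 features: the 32 one-hot columns plus the three numerical columns temp, hum, and windspeed, while the last column of X_all, cnt, becomes the target y.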
Feature Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
y = scaler.fit_transform(y)
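Only the target is scaled here: the one-hot features are already 0/1, and the numerical features in the UCI bike-sharing data come pre-normalized. If the features also needed scaling, a minimal sketch with a separate scaler instance (keeping scaler reserved for inverse-transforming y later) would be:
# Optional sketch: scale the features with their own scaler
X_scaler = MinMaxScaler()
X = X_scaler.fit_transform(X)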
Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
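Since no random_state is passed, each run produces a different split. Fixing the seed (42 below is an arbitrary choice) makes the results reproducible:
# Sketch: reproducible split with a fixed random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)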
Designing the model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units = 100, activation = "relu", input_shape = (35, )))
model.add(tf.keras.layers.Dense(units = 100, activation = "relu"))
model.add(tf.keras.layers.Dense(units = 100, activation = "relu"))
model.add(tf.keras.layers.Dense(units = 1, activation="linear"))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 100) 3600
dense_3 (Dense) (None, 100) 10100
dense_4 (Dense) (None, 100) 10100
dense_5 (Dense) (None, 1) 101
=================================================================
Total params: 23,901
Trainable params: 23,901
Non-trainable params: 0
_________________________________________________________________
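The output layer uses a linear activation because the target is a continuous rental count. The parameter counts follow from (inputs + 1) × units for each Dense layer: (35 + 1) × 100 = 3,600 for the first hidden layer, (100 + 1) × 100 = 10,100 for each of the next two, and (100 + 1) × 1 = 101 for the output, giving 23,901 in total.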
Training the model
model.compile(optimizer="Adam", loss="mean_squared_error")
epochs_hist = model.fit(X_train, y_train, epochs= 50, batch_size = 50, validation_split=0.2)
Epoch 1/50
10/10 [==============================] - 1s 43ms/step - loss: 0.1554 - val_loss: 0.0687
Epoch 2/50
10/10 [==============================] - 0s 13ms/step - loss: 0.0343 - val_loss: 0.0355
Epoch 3/50
10/10 [==============================] - 0s 13ms/step - loss: 0.0188 - val_loss: 0.0208
Epoch 4/50
...
Epoch 49/50
10/10 [==============================] - 0s 21ms/step - loss: 0.0019 - val_loss: 0.0129
Epoch 50/50
10/10 [==============================] - 0s 19ms/step - loss: 0.0025 - val_loss: 0.0115
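The scaled test-set loss can also be checked directly with Keras; a minimal sketch:
# Sketch: mean squared error on the held-out test set (still in scaled units)
test_loss = model.evaluate(X_test, y_test)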
epochs_hist.history.keys()
dict_keys(['loss', 'val_loss'])
The training history provides two evaluation metrics, 'loss' and 'val_loss'.
'loss' is the loss computed on the training data itself, while 'val_loss' is the loss on the 20% validation split held out from the training set.
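In the log above, the final training loss (≈0.002) ends well below the validation loss (≈0.012); if that gap keeps widening while the training loss falls, it is a sign of overfitting.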
Visualizing the training and test results
plt.plot(epochs_hist.history["loss"])
plt.plot(epochs_hist.history["val_loss"])
plt.title("Model Loss Progress During Traning")
plt.xlabel("Epoch")
plt.ylabel("Traning Loss and Validation Loss")
plt.legend(["Traning Loss", "Validation Loss"])
y_predict = model.predict(X_test)
plt.plot(y_test, y_predict, "^", color = "r")
plt.xlabel("True Values")
plt.ylabel("Model Predictions")
The plot above is still in normalized units, so let's convert the values back to their original scale.
y_predict_org = scaler.inverse_transform(y_predict)
y_test_org = scaler.inverse_transform(y_test)
Evaluating the model
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
RMSE = float(format(np.sqrt(mean_squared_error(y_test_org, y_predict_org)), ".3f"))
MSE = mean_squared_error(y_test_org, y_predict_org)
MAE = mean_absolute_error(y_test_org, y_predict_org)
r2 = r2_score(y_test_org, y_predict_org)
n, k = len(y_test_org), X_test.shape[1]  # test-sample count and number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
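Here n is the number of test samples and k the number of predictors; adjusted R² discounts R² for model complexity, so it drops when uninformative features are added and is always at most R².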
There are several metrics for evaluating the performance of a regression model.
Let's check the model's performance using each of them.
print(f"RMSE = {RMSE}, MSE = {MSE}, MAE = {MAE}, R2 = {r2}, Adjusted R2 = {adj_r2}")
RMSE = 1070.871,
MSE = 1146764.4761727168,
MAE = 807.9534366633617,
R2 = 0.7237822988637166,
Adjusted R2 = 41.32778436589738
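As a quick consistency check, RMSE is just the square root of MSE: √1146764.48 ≈ 1070.87, matching the printed values.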
For the error metrics (RMSE, MSE, MAE), lower values indicate better performance, while R² and adjusted R² are better the closer they are to 1.
A more detailed description of each metric is omitted here.