
Background

Prediction of bicycle rental volume using an ANN

Loading the dataset

import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
bike = pd.read_csv("bike-sharing-daily.csv")

bike.isnull().sum() # check for missing values
    instant       0
    dteday        0
    season        0
    yr            0
    mnth          0
    holiday       0
    weekday       0
    workingday    0
    weathersit    0
    temp          0
    hum           0
    windspeed     0
    casual        0
    registered    0
    cnt           0
    dtype: int64

The result above shows that there are no missing values.

# remove unnecessary columns
bike.drop(labels = ["instant"], axis = 1, inplace=True) # inplace: apply changes to 'bike'

# time series: use the date as the index
bike.dteday = pd.to_datetime(bike.dteday, format="%m/%d/%Y") # formatting datetime
bike.index = pd.DatetimeIndex(bike.dteday) # indexing the datetime
bike.drop(labels=["dteday"], axis = 1, inplace=True) # removing the duplicate 'dteday' column

Visualizing the dataset

bike["cnt"].asfreq("W").plot(linewidth = 3) # by week
plt.title("Bike Usage Per Week")
plt.xlabel("Week")
plt.ylabel("Bike Rental")

image

bike["cnt"].asfreq("M").plot(linewidth = 3) # by month
plt.title("Bike Usage Per Month")
plt.xlabel("Month")
plt.ylabel("Bike Rental")

image

bike["cnt"].asfreq("Q").plot(linewidth = 3) # by quarter
plt.title("Bike Usage Per Quarter")
plt.xlabel("Quarter")
plt.ylabel("Bike Rental")

image

# view several visualizations at a glance with a pairplot
sns.pairplot(bike)

image

Building the ANN

EDA

X_numerical = bike[ ["temp", "hum", "windspeed", "cnt"] ]
X_numerical

image

sns.pairplot(X_numerical) # pairwise relationships between the numerical variables

image

X_numerical.corr() # correlation analysis

image

sns.heatmap(X_numerical.corr(), annot = True) # correlation heatmap

image

annot: display the numerical value in each cell

Preprocessing

X_cat = bike[ ["season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit"] ]
X_cat

image

We are going to train the ANN with the independent variables listed above.

# converting categorical data
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
X_cat.shape
    (731, 32)
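The 32 columns come from one-hot encoding each categorical variable: 4 season values + 2 yr + 12 mnth + 2 holiday + 7 weekday + 2 workingday + 3 weathersit = 32.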
X_cat = pd.DataFrame(X_cat) # convert back to a DataFrame to inspect the data as a table
X_cat

image

X_numerical = X_numerical.reset_index() # the datetime was previously set as the index
X_numerical

image

# integrating all the X candidates
X_all = pd.concat( [X_cat, X_numerical], axis = 1)

# removing unnecessary variables
X_all.drop("dteday", axis = 1, inplace = True)
X = X_all.iloc[:, :-1].values
y = X_all.iloc[:, -1:].values
X.shape, type(X)
    ((731, 35), numpy.ndarray)
y.shape, type(y)
    ((731, 1), numpy.ndarray)        
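So X consists of the 32 one-hot columns plus temp, hum, and windspeed (35 features in total), and the last column, cnt, becomes the target y.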

Feature Scaling

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
y = scaler.fit_transform(y)
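Note that only the target y is scaled here; the one-hot columns are already 0/1, and in this dataset the numerical features (temp, hum, windspeed) already come normalized to the [0, 1] range.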

Splitting the dataset into Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Designing the model

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units = 100, activation = "relu", input_shape = (35, )))
model.add(tf.keras.layers.Dense(units = 100, activation = "relu"))
model.add(tf.keras.layers.Dense(units = 100, activation = "relu"))
model.add(tf.keras.layers.Dense(units = 1, activation="linear"))

model.summary()
    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                Output Shape              Param #   
    =================================================================
    dense_2 (Dense)             (None, 100)               3600      
                                                                    
    dense_3 (Dense)             (None, 100)               10100     
                                                                    
    dense_4 (Dense)             (None, 100)               10100     
                                                                    
    dense_5 (Dense)             (None, 1)                 101       
                                                                    
    =================================================================
    Total params: 23,901
    Trainable params: 23,901
    Non-trainable params: 0
    _________________________________________________________________
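The parameter counts follow directly from the layer sizes: (35 + 1) × 100 = 3,600 for the first layer, (100 + 1) × 100 = 10,100 for each hidden layer, and 100 + 1 = 101 for the output layer.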

Training the model

model.compile(optimizer="Adam", loss="mean_squared_error")
epochs_hist = model.fit(X_train, y_train, epochs= 50, batch_size = 50, validation_split=0.2)
    Epoch 1/50
    10/10 [==============================] - 1s 43ms/step - loss: 0.1554 - val_loss: 0.0687
    Epoch 2/50
    10/10 [==============================] - 0s 13ms/step - loss: 0.0343 - val_loss: 0.0355
    Epoch 3/50
    10/10 [==============================] - 0s 13ms/step - loss: 0.0188 - val_loss: 0.0208
    Epoch 4/50
    ...
    Epoch 49/50
    10/10 [==============================] - 0s 21ms/step - loss: 0.0019 - val_loss: 0.0129
    Epoch 50/50
    10/10 [==============================] - 0s 19ms/step - loss: 0.0025 - val_loss: 0.0115
epochs_hist.history.keys()
    dict_keys(['loss', 'val_loss'])

The training history contains two evaluation metrics, 'loss' and 'val_loss'.

'loss' is the loss computed on the training data, and 'val_loss' is the loss on the validation split held out from the training set (validation_split=0.2).
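For example, the final value of each curve can be read directly from the history dictionary:

final_train_loss = epochs_hist.history["loss"][-1]   # training loss of the last epoch
final_val_loss = epochs_hist.history["val_loss"][-1]  # validation loss of the last epoch
print(f"final training loss: {final_train_loss:.4f}, final validation loss: {final_val_loss:.4f}")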

Visualizing the training and test results

plt.plot(epochs_hist.history["loss"])
plt.plot(epochs_hist.history["val_loss"])
plt.title("Model Loss Progress During Traning")
plt.xlabel("Epoch")
plt.ylabel("Traning Loss and Validation Loss")
plt.legend(["Traning Loss", "Validation Loss"])

image

y_predict = model.predict(X_test)
plt.plot(y_test, y_predict, "^", color = "r")
plt.xlabel("True Values")
plt.ylabel("Model Predictions")

image

It can be seen that the values in the plot above are still on the normalized scale.

So, let's convert them back to the original units.

y_predict_org = scaler.inverse_transform(y_predict)
y_test_org = scaler.inverse_transform(y_test)
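The scatter plot below (now in the original units) was presumably produced with code along these lines:

# true vs. predicted rentals in the original (un-scaled) units
plt.plot(y_test_org, y_predict_org, "^", color = "r")
plt.xlabel("True Values")
plt.ylabel("Model Predictions")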

image

Evaluating the model

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
RMSE = float(format(np.sqrt(mean_squared_error(y_test_org, y_predict_org)), ".3f"))
MSE = mean_squared_error(y_test_org, y_predict_org)
MAE = mean_absolute_error(y_test_org, y_predict_org)
r2 = r2_score(y_test_org, y_predict_org)
n = len(y_test_org)     # number of test samples
k = X_test.shape[1]     # number of predictors (35)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

There are several metrics for evaluating the performance of the model.

Let's check the model performance with each of them.
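For reference, here is a minimal NumPy sketch of how each metric is defined (regression_metrics is a hypothetical helper, not part of the original notebook); n is the number of samples and k the number of predictors:

def regression_metrics(y_true, y_pred, k):
    """Compute RMSE, MSE, MAE, R2 and adjusted R2 by hand; k = number of predictors."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    n = len(y_true)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                              # root mean squared error
    mae = np.mean(np.abs(errors))                    # mean absolute error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes additional predictors
    return rmse, mse, mae, r2, adj_r2

# e.g. regression_metrics(y_test_org, y_predict_org, k=X_test.shape[1])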

print(f"RMSE = {RMSE}, MSE = {MSE}, MAE = {MAE}, R2 = {r2}, Adjusted R2 = {adj_r2}")
    RMSE = 1070.871, MSE = 1146764.4761727168, MAE = 807.9534366633617, R2 = 0.7237822988637166, Adjusted R2 = 0.6366866273342578

For the error metrics (RMSE, MSE, MAE), lower values indicate better performance, while for R² and adjusted R², values closer to 1 are better.

A more detailed description of each metric is omitted here.
