ANN - Regression: House Sales Prediction
Learning Goals
House Price Prediction Using ANN
Loading the dataset
[Notice] Download Dataset (Kaggle)
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
house_df = pd.read_csv("kc-house-data.csv")
house_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
EDA
f, ax = plt.subplots(figsize = (20, 20))
sns.heatmap(house_df.corr(numeric_only=True), annot=True)  # numeric_only (pandas >= 1.5) excludes the non-numeric 'date' column
plt.show()
plt.figure()
sns.scatterplot(x = "sqft_living", y="price", data=house_df)
plt.show()
As a first taste, I plotted the relationship between 'sqft_living' and the target variable 'price'.
The grid above shows the pairwise correlations between all numeric variables.
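To quantify this, a quick sketch (using the same correlation matrix) ranks each feature by its linear correlation with 'price':
corr_with_price = house_df.corr(numeric_only=True)["price"].drop("price")
print(corr_with_price.sort_values(ascending=False))  # 'sqft_living' and 'grade' typically top this list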
house_df.hist(bins=20, figsize=(20, 20))
plt.show()
Preprocessing
selected_features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors', 'sqft_above', 'sqft_basement', 'waterfront', 'view', 'condition', 'grade', 'yr_built',
'yr_renovated', 'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']  # 'sqft_above' was listed twice in the original; the duplicate is removed here
X = house_df[selected_features]
y = house_df["price"]
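A quick sanity check on the design matrix (a minimal sketch):
print(X.shape)  # expect (21613, 18) with the duplicate column removed
print(X.isnull().sum().sum())  # expect 0; the info() output above shows no missing values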
Feature Scaling
from sklearn.preprocessing import MinMaxScaler
# Use separate scalers for the features and the target, so the target scaler
# can be reused later to invert the predictions back to dollars
scaler_x = MinMaxScaler()
X_scaled = scaler_x.fit_transform(X)
y = y.values.reshape(-1, 1)
scaler_y = MinMaxScaler()
y_scaled = scaler_y.fit_transform(y)
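As a sanity check, MinMaxScaler should map every column into the [0, 1] range:
print(X_scaled.min(), X_scaled.max())  # expect 0.0 1.0
print(y_scaled.min(), y_scaled.max())  # expect 0.0 1.0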
Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size = 0.2)
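Note that train_test_split is unseeded here, so the exact rows in each split (and the metrics below) will vary from run to run; pass random_state for reproducibility. Verifying the split sizes:
print(X_train.shape, X_test.shape)  # roughly 17290 training rows and 4323 test rows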
Building the model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units = 100, activation="relu", input_shape=(X_train.shape[1], )))  # 18 input features
model.add(tf.keras.layers.Dense(units = 100, activation="relu"))
model.add(tf.keras.layers.Dense(units = 100, activation="relu"))
model.add(tf.keras.layers.Dense(units = 1, activation="linear"))
model.compile(optimizer = "Adam", loss = "mean_squared_error")
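Before fitting, it is worth inspecting the architecture and parameter counts:
model.summary()  # three hidden layers of 100 units each, plus a single linear output unit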
epochs_hist = model.fit(X_train, y_train, epochs = 50, batch_size = 50, validation_split = 0.2)
Epoch 1/50
277/277 [==============================] - 1s 2ms/step - loss: 8.2607e-04 - val_loss: 4.8727e-04
Epoch 2/50
277/277 [==============================] - 0s 1ms/step - loss: 5.2934e-04 - val_loss: 4.1871e-04
Epoch 3/50
277/277 [==============================] - 0s 1ms/step - loss: 4.4549e-04 - val_loss: 4.3053e-04
...
Epoch 49/50
277/277 [==============================] - 0s 2ms/step - loss: 1.9294e-04 - val_loss: 2.6583e-04
Epoch 50/50
277/277 [==============================] - 0s 1ms/step - loss: 1.9350e-04 - val_loss: 2.8596e-04
Increasing the number of epochs generally drives the loss lower and lets the model make more accurate predictions. However, training too long can lead to overfitting, so the appropriate number of epochs has to be found by adjusting it manually and watching the validation loss.
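One way to avoid hand-tuning the epoch count (a sketch, not part of the original run; hist is a hypothetical alternative to epochs_hist above) is Keras's EarlyStopping callback, which halts training once the validation loss stops improving:
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
# Train with a generous epoch budget; training stops once val_loss stalls for 5 epochs
hist = model.fit(X_train, y_train, epochs=200, batch_size=50, validation_split=0.2, callbacks=[early_stop])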
epochs_hist.history.keys()
dict_keys(['loss', 'val_loss'])
Visualizing the results
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])
plt.title('Model Loss Progress During Training')
plt.ylabel('Training and Validation Loss')
plt.xlabel('Epoch number')
plt.legend(['Training Loss', 'Validation Loss'])
plt.show()
If the number of epochs had been 100 instead of 50, the validation loss would likely have shown less variance.
For reference, if too few input features are selected, the validation loss bounces around sharply, as in the distribution shown below.
This is like ignoring hours of sleep the night before an exam even though it is a key factor in exam scores, then belatedly adding it to the model's training, after which the model predicts new data more accurately.
Predicting the Test set
y_predict = model.predict(X_test)
plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlabel("True Value (Ground Truth)")
plt.ylabel("Model Predictions")
plt.title('ANN Regression Predictions')
plt.show()
As the plot shows, the values are still on the normalized scale.
Let's convert them back to the original units (dollars).
y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig = scaler_y.inverse_transform(y_test)
plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel("True Value (Ground Truth)")
plt.ylabel("Model Predictions")
plt.title('ANN Regression Predictions')
plt.show()
Evaluating the model performance
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE = mean_squared_error(y_test_orig, y_predict_orig)
MAE = mean_absolute_error(y_test_orig, y_predict_orig)
r2 = r2_score(y_test_orig, y_predict_orig)
n, k = X_test.shape  # n: number of test samples, k: number of input features (undefined in the original)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 135724.235
MSE = 18421068040.530396
MAE = 79133.06868421813
R2 = 0.8656443648009154
Adjusted R2 = 0.8654264066441615
Model performance can be evaluated using the various loss metrics above.
The better the model, the smaller each loss value will be.
Manually vary factors such as the number of epochs and observe how each metric changes to find the optimal settings.
Note that a confusion matrix applies to classification rather than regression; for a regression model like this one, a residual analysis is a more suitable complement to the metrics above.
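As a sketch of such a residual analysis (using the arrays computed above):
residuals = y_test_orig - y_predict_orig
plt.scatter(y_predict_orig, residuals, s=5)
plt.axhline(0, color='k')  # a good model's residuals center on zero with no pattern
plt.xlabel("Predicted Price")
plt.ylabel("Residual (True - Predicted)")
plt.title("Residual Plot")
plt.show()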