


Learning Goals

Understanding Facebook Prophet for time series prediction

PART 1: Chicago Crime Rate



Explore questions such as when thieves are most often caught and when the crime rate peaks, then use Prophet to forecast future crime counts.

Observing the dataset

Dataset contains the following columns:

  • ID: Unique identifier for the record.
  • Case Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  • Date: Date when the incident occurred.
  • Block: Address where the incident occurred.
  • IUCR: The Illinois Uniform Crime Reporting code.
  • Primary Type: The primary description of the IUCR code.
  • Description: The secondary description of the IUCR code, a subcategory of the primary description.
  • Location Description: Description of the location where the incident occurred.
  • Arrest: Indicates whether an arrest was made.
  • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car.
  • District: Indicates the police district where the incident occurred.
  • Ward: The ward (City Council district) where the incident occurred.
  • Community Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: Year the incident occurred.
  • Updated On: Date and time the record was last updated.
  • Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

Datasource: https://www.kaggle.com/currie32/crimes-in-chicago

Loading the dataset

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import random
import seaborn as sns
from prophet import Prophet # the package was published as fbprophet before v1.0

# training and testing datasets
chicago_df_1 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', on_bad_lines='skip') # on_bad_lines='skip': ignore malformed lines (replaces error_bad_lines=False, which was removed in pandas 2.0)
chicago_df_2 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', on_bad_lines='skip')
chicago_df_3 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', on_bad_lines='skip')

chicago_df = pd.concat([chicago_df_1, chicago_df_2, chicago_df_3], ignore_index=False, axis=0) # concatenate the dataframes row-wise
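
The effect of skipping bad lines can be seen on a tiny in-memory CSV. This is a minimal sketch: the sample rows are made up, and `on_bad_lines="skip"` is the spelling for pandas 1.3 and later (older releases used `error_bad_lines=False`).

```python
import io
import pandas as pd

# Made-up sample: the second data row has one field too many.
sample = io.StringIO("ID,Primary Type\n1,THEFT\n2,BATTERY,extra\n3,ASSAULT\n")

# Rows with too many fields are dropped instead of raising a parser error.
df = pd.read_csv(sample, on_bad_lines="skip")

print(df["ID"].tolist())
```

Only the well-formed rows survive, so it is worth comparing row counts against the source files when many lines are skipped.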

Organizing the dataset

# Drop unnecessary columns
chicago_df.drop(['Unnamed: 0', 'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location', 'District', 'Latitude' , 'Longitude'], inplace=True, axis=1)

inplace=True: modifies chicago_df itself, removing the listed (unnecessary) columns instead of returning a new DataFrame.

axis=1: the labels refer to columns, so entire columns are dropped.
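
The difference between the two arguments is easy to check on a toy frame. The sketch below uses made-up data standing in for chicago_df; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for chicago_df; the values are made up.
toy = pd.DataFrame({
    "ID": [1, 2],
    "Case Number": ["A1", "B2"],
    "Primary Type": ["THEFT", "BATTERY"],
})

# Without inplace, drop returns a new DataFrame and leaves `toy` unchanged.
trimmed = toy.drop(["Case Number"], axis=1)

# With inplace=True, `toy` itself is modified and drop returns None.
result = toy.drop(["ID"], axis=1, inplace=True)

print(list(trimmed.columns))
print(list(toy.columns))
print(result)
```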

The code below is the preprocessing needed for the time series work in this project.

# Parse the Date column into datetime values
chicago_df.Date = pd.to_datetime(chicago_df.Date, format='%m/%d/%Y %I:%M:%S %p')

# Use Date as the index
chicago_df.index = pd.DatetimeIndex(chicago_df.Date)

DatetimeIndex: an index for time series data whose rows are timestamps recorded at specific moments.
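
As a small illustration of what the format string and the index buy you, the sketch below parses three made-up timestamps written in the dataset's 12-hour style and then selects a month by partial date string. The sample rows are assumptions, not records from the dataset.

```python
import pandas as pd

# Made-up rows using the same 12-hour timestamp style as the Date column.
raw = pd.DataFrame({
    "Date": ["01/15/2006 11:30:00 PM", "01/16/2006 12:05:00 AM", "03/02/2006 09:00:00 AM"],
    "Primary Type": ["THEFT", "BATTERY", "THEFT"],
})

# %I is the 12-hour clock and %p the AM/PM marker.
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y %I:%M:%S %p")
raw.index = pd.DatetimeIndex(raw["Date"])

# With a DatetimeIndex, a partial date string selects a whole month.
january = raw.loc["2006-01"]
hours = raw.index.hour.tolist()

print(len(january))
print(hours)
```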

Data Visualization

sns.heatmap(chicago_df.isnull(), cbar = False, cmap = 'YlGnBu')


# Which crime types occurred most often (top 15)
plt.figure(figsize = (15, 10))
sns.countplot(y= 'Primary Type', data = chicago_df, order = chicago_df['Primary Type'].value_counts().iloc[:15].index)


‘MOTOR VEHICLE THEFT’: roughly 200,000 vehicles were stolen.

# Which locations saw the most crime (top 15)
plt.figure(figsize = (15, 10))
sns.countplot(y= 'Location Description', data = chicago_df, order = chicago_df['Location Description'].value_counts().iloc[:15].index)


The plot shows that the most incidents occurred on the ‘street’.

# How many crimes occurred in each year
plt.plot(chicago_df.resample('Y').size()) # resample by year ('Y') and count the rows (incidents) that fall in each year
plt.title('Crimes Count Per Year')
plt.ylabel('Number of Crimes')


    2005-12-31    455811
    2006-12-31    794684
    2007-12-31    621848
    2008-12-31    852053
    2009-12-31    783900
    2010-12-31    700691
    2011-12-31    352066
    2012-12-31    335670
    2013-12-31    306703
    2014-12-31    274527
    2015-12-31    262995
    2016-12-31    265462
    2017-12-31     11357
# How many crimes occurred in each month
plt.plot(chicago_df.resample('M').size()) # same pattern as the yearly plot, resampled by month ('M')
plt.title('Crimes Count Per Month')
plt.ylabel('Number of Crimes')


# How many crimes occurred in each quarter
plt.plot(chicago_df.resample('Q').size()) # same pattern as the yearly plot, resampled by quarter ('Q')
plt.title('Crimes Count Per Quarter')
plt.ylabel('Number of Crimes')


Data Preprocessing

chicago_prophet = chicago_df.resample('M').size().reset_index() # reset_index turns the DatetimeIndex back into an ordinary column
chicago_prophet.columns = ['Date', 'Crime Count']
chicago_prophet_df = pd.DataFrame(chicago_prophet)
chicago_prophet_df_final = chicago_prophet_df.rename(columns={'Date':'ds', 'Crime Count':'y'})
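
The monthly count and the rename to Prophet's `ds`/`y` columns can be sketched on a toy frame. The timestamps below are made up. An offset object is used in place of the 'M' alias because recent pandas versions prefer 'ME' for month end, so this form should not depend on the installed version.

```python
import pandas as pd

# Made-up incident timestamps; one row per incident, as in chicago_df.
events = pd.DataFrame(
    {"Primary Type": ["THEFT", "BATTERY", "THEFT", "ASSAULT"]},
    index=pd.DatetimeIndex(
        ["2006-01-15 23:30", "2006-01-16 00:05", "2006-03-02 09:00", "2006-03-20 14:00"],
        name="Date",
    ),
)

# Count incidents per calendar month; months without incidents get a count of 0.
monthly = events.resample(pd.offsets.MonthEnd()).size().reset_index()

# Prophet expects the date column to be called 'ds' and the value column 'y'.
monthly.columns = ["ds", "y"]

print(monthly)
```

Each row is labelled with the last day of its month, which is why the forecast table's history rows fall on month ends.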



m = Prophet() # model used to forecast future 'Crime' counts
m.fit(chicago_prophet_df_final) # fit on the monthly ds/y frame; predict() requires a fitted model

# Forecasting into the future
future = m.make_future_dataframe(periods=365) # extend the frame by 365 daily rows (about one year) beyond the history
forecast = m.predict(future)
ds trend yhat_lower yhat_upper trend_lower trend_upper additive_terms additive_terms_lower additive_terms_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2005-01-31 60454.550849 38579.873129 72458.344880 60454.550849 60454.550849 -4762.896867 -4762.896867 -4762.896867 -4762.896867 -4762.896867 -4762.896867 0.0 0.0 0.0 55691.653982
1 2005-02-28 60322.147047 34221.041714 67421.454566 60322.147047 60322.147047 -9500.949898 -9500.949898 -9500.949898 -9500.949898 -9500.949898 -9500.949898 0.0 0.0 0.0 50821.197149
2 2005-03-31 60175.557124 42823.370306 75698.921616 60175.557124 60175.557124 -1224.296867 -1224.296867 -1224.296867 -1224.296867 -1224.296867 -1224.296867 0.0 0.0 0.0 58951.260257
3 2005-04-30 60033.695908 44555.300811 79384.772219 60033.695908 60033.695908 1182.976012 1182.976012 1182.976012 1182.976012 1182.976012 1182.976012 0.0 0.0 0.0 61216.671919
4 2005-05-31 59887.105985 49410.229024 81613.777868 59887.105985 59887.105985 5498.632207 5498.632207 5498.632207 5498.632207 5498.632207 5498.632207 0.0 0.0 0.0 65385.738191
5 2005-06-30 59745.244769 47098.245873 78678.030782 59745.244769 59745.244769 3577.501610 3577.501610 3577.501610 3577.501610 3577.501610 3577.501610 0.0 0.0 0.0 63322.746379
6 2005-07-31 59598.654838 47806.382068 80665.905762 59598.654838 59598.654838 4583.361194 4583.361194 4583.361194 4583.361194 4583.361194 4583.361194 0.0 0.0 0.0 64182.016032
7 2005-08-31 59452.064908 47368.296071 80756.033012 59452.064908 59452.064908 4499.375562 4499.375562 4499.375562 4499.375562 4499.375562 4499.375562 0.0 0.0 0.0 63951.440470
8 2005-09-30 59310.203685 44535.626188 77006.014932 59310.203685 59310.203685 1749.549105 1749.549105 1749.549105 1749.549105 1749.549105 1749.549105 0.0 0.0 0.0 61059.752790
9 2005-10-31 59163.613755 45431.878281 78473.609767 59163.613755 59163.613755 2397.346677 2397.346677 2397.346677 2397.346677 2397.346677 2397.346677 0.0 0.0 0.0 61560.960432
10 2005-11-30 59021.752529 39855.843322 73045.342043 59021.752529 59021.752529 -2065.033670 -2065.033670 -2065.033670 -2065.033670 -2065.033670 -2065.033670 0.0 0.0 0.0 56956.718858
11 2005-12-31 58875.162595 35993.544575 71131.998977 58875.162595 58875.162595 -5992.119657 -5992.119657 -5992.119657 -5992.119657 -5992.119657 -5992.119657 0.0 0.0 0.0 52883.042938
12 2006-01-31 58728.572661 36668.886036 70839.471318 58728.572661 58728.572661 -4772.659269 -4772.659269 -4772.659269 -4772.659269 -4772.659269 -4772.659269 0.0 0.0 0.0 53955.913392
13 2006-02-28 58596.168850 32430.561382 65806.466898 58596.168850 58596.168850 -9503.051717 -9503.051717 -9503.051717 -9503.051717 -9503.051717 -9503.051717 0.0 0.0 0.0 49093.117133
14 2006-03-31 58449.578916 39193.541836 71919.946197 58449.578916 58449.578916 -1224.434198 -1224.434198 -1224.434198 -1224.434198 -1224.434198 -1224.434198 0.0 0.0 0.0 57225.144718
15 2006-04-30 58307.717686 42625.987661 77057.400277 58307.717686 58307.717686 1187.100547 1187.100547 1187.100547 1187.100547 1187.100547 1187.100547 0.0 0.0 0.0 59494.818233
16 2006-05-31 58161.127748 45990.584105 80277.571713 58161.127748 58161.127748 5451.418874 5451.418874 5451.418874 5451.418874 5451.418874 5451.418874 0.0 0.0 0.0 63612.546621
17 2006-06-30 58019.266517 45778.759963 79069.169730 58019.266517 58019.266517 3564.138248 3564.138248 3564.138248 3564.138248 3564.138248 3564.138248 0.0 0.0 0.0 61583.404765
18 2006-07-31 57872.676579 47002.201998 79782.570772 57872.676579 57872.676579 4563.254349 4563.254349 4563.254349 4563.254349 4563.254349 4563.254349 0.0 0.0 0.0 62435.930927
19 2006-08-31 57726.086594 45447.102442 79160.703585 57726.086594 57726.086594 4479.990711 4479.990711 4479.990711 4479.990711 4479.990711 4479.990711 0.0 0.0 0.0 62206.077306
20 2006-09-30 57584.225319 41994.861861 76377.523816 57584.225319 57584.225319 1829.842795 1829.842795 1829.842795 1829.842795 1829.842795 1829.842795 0.0 0.0 0.0 59414.068114
21 2006-10-31 57437.635335 43300.623636 77873.098798 57437.635335 57437.635335 2439.830765 2439.830765 2439.830765 2439.830765 2439.830765 2439.830765 0.0 0.0 0.0 59877.466100
22 2006-11-30 57295.774060 38210.477095 73388.386714 57295.774060 57295.774060 -2045.360906 -2045.360906 -2045.360906 -2045.360906 -2045.360906 -2045.360906 0.0 0.0 0.0 55250.413154
23 2006-12-31 57149.184075 34675.761615 67310.501555 57149.184075 57149.184075 -6013.413267 -6013.413267 -6013.413267 -6013.413267 -6013.413267 -6013.413267 0.0 0.0 0.0 51135.770808
24 2007-01-31 56994.480254 35873.268789 69294.200682 56994.480254 56994.480254 -4783.036614 -4783.036614 -4783.036614 -4783.036614 -4783.036614 -4783.036614 0.0 0.0 0.0 52211.443640
25 2007-02-28 56854.747771 29801.353956 62906.688914 56854.747771 56854.747771 -9501.921423 -9501.921423 -9501.921423 -9501.921423 -9501.921423 -9501.921423 0.0 0.0 0.0 47352.826348
26 2007-03-31 56700.043950 38154.033064 72707.777440 56700.043950 56700.043950 -1225.266871 -1225.266871 -1225.266871 -1225.266871 -1225.266871 -1225.266871 0.0 0.0 0.0 55474.777078
27 2007-04-30 56550.330574 40491.474353 74706.900200 56550.330574 56550.330574 1190.223261 1190.223261 1190.223261 1190.223261 1190.223261 1190.223261 0.0 0.0 0.0 57740.553835
28 2007-05-31 56395.626753 45020.807451 77993.605388 56395.626753 56395.626753 5402.206493 5402.206493 5402.206493 5402.206493 5402.206493 5402.206493 0.0 0.0 0.0 61797.833246
29 2007-06-30 56230.582698 43009.450294 77307.962960 56230.582698 56230.582698 3551.457755 3551.457755 3551.457755 3551.457755 3551.457755 3551.457755 0.0 0.0 0.0 59782.040452
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
480 2018-01-02 10430.753220 -10645.820107 20038.001128 10287.038125 10556.745840 -5814.639079 -5814.639079 -5814.639079 -5814.639079 -5814.639079 -5814.639079 0.0 0.0 0.0 4616.114140
481 2018-01-03 10417.686680 -12131.618633 21312.023151 10273.332005 10544.184093 -5725.854616 -5725.854616 -5725.854616 -5725.854616 -5725.854616 -5725.854616 0.0 0.0 0.0 4691.832063
482 2018-01-04 10404.620140 -11264.733448 21481.058839 10259.615204 10531.622345 -5640.534387 -5640.534387 -5640.534387 -5640.534387 -5640.534387 -5640.534387 0.0 0.0 0.0 4764.085753
483 2018-01-05 10391.553601 -11293.427706 21583.702828 10245.910236 10519.060598 -5560.875533 -5560.875533 -5560.875533 -5560.875533 -5560.875533 -5560.875533 0.0 0.0 0.0 4830.678068
484 2018-01-06 10378.487061 -10571.821161 21827.517378 10232.514272 10506.498851 -5488.665623 -5488.665623 -5488.665623 -5488.665623 -5488.665623 -5488.665623 0.0 0.0 0.0 4889.821439
485 2018-01-07 10365.420522 -11213.789750 21737.667858 10218.899641 10493.937103 -5425.240870 -5425.240870 -5425.240870 -5425.240870 -5425.240870 -5425.240870 0.0 0.0 0.0 4940.179652
486 2018-01-08 10352.353982 -11068.421737 22756.628202 10205.056042 10481.449293 -5371.460788 -5371.460788 -5371.460788 -5371.460788 -5371.460788 -5371.460788 0.0 0.0 0.0 4980.893194
487 2018-01-09 10339.287442 -11751.596928 22353.659111 10191.219798 10469.065920 -5327.699997 -5327.699997 -5327.699997 -5327.699997 -5327.699997 -5327.699997 0.0 0.0 0.0 5011.587446
488 2018-01-10 10326.220903 -12214.265224 21555.904799 10177.503749 10456.812418 -5293.857323 -5293.857323 -5293.857323 -5293.857323 -5293.857323 -5293.857323 0.0 0.0 0.0 5032.363580
489 2018-01-11 10313.154363 -13379.641545 21442.105207 10163.873381 10444.834064 -5269.381789 -5269.381789 -5269.381789 -5269.381789 -5269.381789 -5269.381789 0.0 0.0 0.0 5043.772575
490 2018-01-12 10300.087824 -11407.652737 22014.507343 10150.269457 10432.577518 -5253.314515 -5253.314515 -5253.314515 -5253.314515 -5253.314515 -5253.314515 0.0 0.0 0.0 5046.773309
491 2018-01-13 10287.021284 -11767.060141 23105.051714 10136.676775 10420.474065 -5244.345070 -5244.345070 -5244.345070 -5244.345070 -5244.345070 -5244.345070 0.0 0.0 0.0 5042.676214
492 2018-01-14 10273.954744 -12453.383825 22490.477042 10122.988290 10408.031781 -5240.880304 -5240.880304 -5240.880304 -5240.880304 -5240.880304 -5240.880304 0.0 0.0 0.0 5033.074441
493 2018-01-15 10260.888205 -11792.410668 21383.620828 10109.359927 10395.589498 -5241.123306 -5241.123306 -5241.123306 -5241.123306 -5241.123306 -5241.123306 0.0 0.0 0.0 5019.764899
494 2018-01-16 10247.821665 -10884.794995 21807.204144 10095.446589 10383.147214 -5243.159787 -5243.159787 -5243.159787 -5243.159787 -5243.159787 -5243.159787 0.0 0.0 0.0 5004.661878
495 2018-01-17 10234.755126 -11117.490200 21908.112254 10081.328118 10370.704930 -5245.048924 -5245.048924 -5245.048924 -5245.048924 -5245.048924 -5245.048924 0.0 0.0 0.0 4989.706201
496 2018-01-18 10221.688586 -11471.233652 22612.102358 10067.579986 10358.232019 -5244.915571 -5244.915571 -5244.915571 -5244.915571 -5244.915571 -5244.915571 0.0 0.0 0.0 4976.773015
497 2018-01-19 10208.622046 -12616.078056 22604.293475 10053.863185 10345.699704 -5241.040633 -5241.040633 -5241.040633 -5241.040633 -5241.040633 -5241.040633 0.0 0.0 0.0 4967.581413
498 2018-01-20 10195.555507 -11085.109200 21862.639163 10040.146384 10333.148577 -5231.946498 -5231.946498 -5231.946498 -5231.946498 -5231.946498 -5231.946498 0.0 0.0 0.0 4963.609009
499 2018-01-21 10182.488967 -12116.756382 22124.561839 10026.429583 10320.597449 -5216.474494 -5216.474494 -5216.474494 -5216.474494 -5216.474494 -5216.474494 0.0 0.0 0.0 4966.014473
500 2018-01-22 10169.422428 -11799.236660 22128.247318 10012.712782 10308.046322 -5193.851637 -5193.851637 -5193.851637 -5193.851637 -5193.851637 -5193.851637 0.0 0.0 0.0 4975.570791
501 2018-01-23 10156.355888 -11397.718675 21488.804694 9998.995980 10295.688169 -5163.744193 -5163.744193 -5163.744193 -5163.744193 -5163.744193 -5163.744193 0.0 0.0 0.0 4992.611695
502 2018-01-24 10143.289348 -12170.143585 21137.408443 9985.289993 10283.354063 -5126.296039 -5126.296039 -5126.296039 -5126.296039 -5126.296039 -5126.296039 0.0 0.0 0.0 5016.993309
503 2018-01-25 10130.222809 -11855.103144 21615.603916 9971.447744 10271.019957 -5082.150230 -5082.150230 -5082.150230 -5082.150230 -5082.150230 -5082.150230 0.0 0.0 0.0 5048.072579
504 2018-01-26 10117.156269 -12025.376518 22220.684386 9957.493001 10258.685851 -5032.452727 -5032.452727 -5032.452727 -5032.452727 -5032.452727 -5032.452727 0.0 0.0 0.0 5084.703542
505 2018-01-27 10104.089730 -12326.781502 21628.736366 9943.939751 10246.351744 -4978.837801 -4978.837801 -4978.837801 -4978.837801 -4978.837801 -4978.837801 0.0 0.0 0.0 5125.251929
506 2018-01-28 10091.023190 -12000.100061 22722.886587 9930.400801 10234.017638 -4923.395183 -4923.395183 -4923.395183 -4923.395183 -4923.395183 -4923.395183 0.0 0.0 0.0 5167.628007
507 2018-01-29 10077.956651 -11021.460045 21882.372752 9916.765027 10221.727326 -4868.619657 -4868.619657 -4868.619657 -4868.619657 -4868.619657 -4868.619657 0.0 0.0 0.0 5209.336993
508 2018-01-30 10064.890111 -11496.619265 20772.074491 9903.067605 10209.526120 -4817.344316 -4817.344316 -4817.344316 -4817.344316 -4817.344316 -4817.344316 0.0 0.0 0.0 5247.545795
509 2018-01-31 10051.823571 -13360.948215 21954.650132 9889.370183 10197.222692 -4772.659269 -4772.659269 -4772.659269 -4772.659269 -4772.659269 -4772.659269 0.0 0.0 0.0 5279.164302
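
The table switches from month-end rows to daily rows because `make_future_dataframe` adds daily dates unless a frequency is given. The pandas-only sketch below approximates that behaviour. `sketch_future_frame` is a hypothetical helper, not Prophet's code, and the January 2005 to January 2017 history is inferred from the table above.

```python
import pandas as pd

def sketch_future_frame(history_ds, periods, freq="D"):
    """Hypothetical helper: history dates plus `periods` new dates after the last one."""
    last = history_ds.max()
    extra = pd.date_range(start=last, periods=periods + 1, freq=freq)
    extra = extra[extra > last][:periods]
    return pd.DataFrame({"ds": list(history_ds) + list(extra)})

# Month-end history, built from month starts so the snippet does not depend
# on the 'M' vs 'ME' alias change in pandas.
month_starts = pd.Series(pd.date_range("2005-01-01", "2017-01-01", freq="MS"))
history = month_starts + pd.offsets.MonthEnd(0)

future = sketch_future_frame(history, periods=365)

print(len(history), len(future))
print(future["ds"].iloc[-1])
```

The row count and the final date agree with the table's last index (509) and last date. If monthly forecasts are wanted instead, the `freq` argument of `make_future_dataframe` with a smaller `periods` (for example 12) is the likely route.
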
figure = m.plot(forecast, xlabel='Date', ylabel='Crime Rate')


As the plot shows, Prophet also covers the period after 2017, beyond the end of the observed data.

# Show what the predicted trend and seasonal components look like
figure3 = m.plot_components(forecast)


For Chicago, the graph shows the crime rate rising until July (summer) and then falling as the weather turns colder toward winter.
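
The components relate to the forecast in a simple way here. With the default additive seasonality, and with the multiplicative columns at 0.0 as in the forecast table, `yhat` should equal `trend + additive_terms`. The check below uses values copied from the first and last rows of that table.

```python
# Values copied from rows 0 and 509 of the forecast table.
rows = [
    {"ds": "2005-01-31", "trend": 60454.550849, "additive_terms": -4762.896867, "yhat": 55691.653982},
    {"ds": "2018-01-31", "trend": 10051.823571, "additive_terms": -4772.659269, "yhat": 5279.164302},
]

differences = []
for row in rows:
    rebuilt = row["trend"] + row["additive_terms"]
    differences.append(abs(rebuilt - row["yhat"]))
    print(row["ds"], round(rebuilt, 6), row["yhat"])
```

In this model the yearly seasonality is the only additive term, so the `yearly` column matches `additive_terms` in every row of the table.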

PART 2: Avocado Market



Use Facebook Prophet to forecast future prices.

Observing the dataset

Some relevant columns in the dataset:

  • Date - The date of the observation
  • AveragePrice - the average price of a single avocado
  • type - conventional or organic
  • year - the year
  • region - the city or region of the observation
  • Total Volume - Total number of avocados sold
  • 4046 - Total number of avocados with PLU 4046 sold
  • 4225 - Total number of avocados with PLU 4225 sold
  • 4770 - Total number of avocados with PLU 4770 sold

Loading the dataset

# import libraries 
import pandas as pd # Import Pandas for data manipulation using dataframes
import numpy as np # Import Numpy for data statistical analysis 
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import random
import seaborn as sns
from prophet import Prophet # the package was published as fbprophet before v1.0

avocado_df = pd.read_csv('avocado.csv')

# Avocado prices over time
avocado_df = avocado_df.sort_values("Date") # sort chronologically
plt.plot(avocado_df['Date'], avocado_df['AveragePrice'])


# Number of observations per region
sns.countplot(x = 'region', data = avocado_df)
plt.xticks(rotation = 45)


# Number of observations per year
sns.countplot(x = 'year', data = avocado_df)
plt.xticks(rotation = 45)



avocado_prophet_df = avocado_df[['Date', 'AveragePrice']] # keep only the columns Prophet needs
avocado_prophet_df = avocado_prophet_df.rename(columns={'Date':'ds', 'AveragePrice':'y'}) # rename to the column names Prophet expects
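
Because the dataset has separate rows per region and type, the frame above holds several `y` values for the same `ds`. If a single series is preferred, one option is to average the price per date before fitting. This is a sketch on made-up rows, and the averaging is an assumption, not the step the post itself performs.

```python
import pandas as pd

# Made-up rows in the shape of avocado.csv: several regions and types per date.
toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"],
    "AveragePrice": [1.00, 1.50, 1.10, 1.70],
    "type": ["conventional", "organic", "conventional", "organic"],
    "region": ["Albany", "Albany", "Albany", "Albany"],
})

# Collapse to one mean price per date, then rename for Prophet.
per_date = (
    toy.groupby("Date", as_index=False)["AveragePrice"]
    .mean()
    .rename(columns={"Date": "ds", "AveragePrice": "y"})
)
per_date["ds"] = pd.to_datetime(per_date["ds"])

print(per_date)
```

Filtering to a single region or type before fitting would be another reasonable choice, depending on which price is of interest.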



# Applying Prophet
m = Prophet()
m.fit(avocado_prophet_df) # fit on the ds/y frame; predict() requires a fitted model

# Forecasting into the future
future = m.make_future_dataframe(periods=365) # forecast avocado prices for the next year (365 daily rows)
forecast = m.predict(future)

figure = m.plot(forecast, xlabel='Date', ylabel='Price')


figure3 = m.plot_components(forecast)

