Data Preprocessing Techniques
'raw_data' is a small sample dataset. The sections below use it to walk through common preprocessing tasks.
Describing the dataset
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 0 to 29
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   customer_id     23 non-null     int64
 1   date            23 non-null     datetime64[ns]
 2   age             23 non-null     int32
 3   gender          23 non-null     object
 4   country         23 non-null     object
 5   item purchased  23 non-null     float64
 6   value           23 non-null     int64
 7   monthly visits  23 non-null     float64
dtypes: datetime64[ns](1), float64(2), int32(1), int64(2), object(2)
memory usage: 1.5+ KB
raw_data["country"].unique(), raw_data["country"].nunique()
array(['US', 'India', 'France', 'Sweden', 'USA', 'Germany', 'Chile',
'Saudi Arabia', 'Japan', 'Norway', 'Spain', 'United Kingdom',
'Switzerland', 'Russia'], dtype=object)
14
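The output above lists both 'US' and 'USA', which most likely refer to the same country and would inflate the unique count. A minimal sketch of one way to merge such aliases, assuming a hand-written mapping (`alias_map` and the toy series are made up for illustration):

```python
import pandas as pd

# Hypothetical sample of country labels containing an alias
countries = pd.Series(["US", "India", "USA", "France", "US"])

# Assumed mapping from alias to the label we want to keep
alias_map = {"USA": "US"}

cleaned = countries.replace(alias_map)  # values not in the mapping are left unchanged
print(cleaned.unique(), cleaned.nunique())
```

On the real data the same call would be applied to raw_data["country"] before any encoding step.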
out_list = ["value", "monthly visits", "item purchased"] # numeric columns to check for outliers
raw_data[out_list] # show only these columns
import numpy as np
from scipy import stats
np.abs(stats.zscore(raw_data[out_list])) # absolute z-score of every value, computed per column
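Absolute z-scores are usually compared against a cutoff to drop outlier rows. A minimal sketch, assuming a cutoff of 3 (a common convention, not something stated in this post) and a made-up frame `toy`. The z-score is computed with plain pandas here so the sketch needs no SciPy; ddof=0 matches the default of stats.zscore.

```python
import pandas as pd

# Hypothetical data: the last row has an obvious outlier in "value"
toy = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 500],
    "monthly visits": [3.0, 4.0, 3.5, 4.5, 4.0, 3.0, 3.5, 4.0, 4.5, 3.0, 4.0, 3.5],
})
out_list = ["value", "monthly visits"]

# Absolute z-score per column (population standard deviation, ddof=0)
z = ((toy[out_list] - toy[out_list].mean()) / toy[out_list].std(ddof=0)).abs()

threshold = 3                        # assumed cutoff
mask = (z < threshold).all(axis=1)   # keep rows where every column is below the cutoff
filtered = toy[mask]

print(len(toy), len(filtered))
```

Note that with very small samples a single extreme value can never reach a z-score of 3, so the cutoff may need to be lowered.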
raw_data.drop( ["customer_id", "date"], axis = 1, inplace = True ) # remove columns
Missing Values
# Method 1
raw_data.isna().sum() # number of missing values in each column
raw_data.dropna(inplace=True) # drop every row that contains a NaN
# Method 2
for c in df_missing.columns:
    missing = df_missing[c].isnull().sum()
    if missing > 0:
        print("{} has {} missing values".format(c, missing))
df_missing.fillna("DONE") # returns a copy with every NaN replaced by the placeholder "DONE"; df_missing itself is unchanged
# df_missing["Sales"].fillna( df_missing["Sales"].mean() )
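The commented line above fills a numeric column with its mean. A small sketch of that idea on a made-up frame (`df_toy`, the "Region" column and the "Unknown" placeholder are assumptions for illustration), assigning the result back so the change is kept:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
df_toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 300.0],
    "Region": ["East", None, "West"],
})

# Numeric column: fill with the column mean (mean() skips NaN by default)
df_toy["Sales"] = df_toy["Sales"].fillna(df_toy["Sales"].mean())

# Text column: fill with an explicit placeholder instead of a statistic
df_toy["Region"] = df_toy["Region"].fillna("Unknown")

print(df_toy)
```

Filling per column like this avoids putting a string placeholder into numeric columns, which would change their dtype to object.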
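from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy= "mean") # replace NaN with the column mean
num_cols = raw_data.select_dtypes(include="number").columns # the mean strategy only works on numeric columns
raw_data[num_cols] = imputer.fit_transform(raw_data[num_cols]) # fit and apply, keeping raw_data a DataFrame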
Organizing the data
raw_data.replace("ERR", np.nan, inplace = True) # replace 'ERR' with NaN
raw_data["gender"] = raw_data["gender"].str.strip() # strip leading and trailing whitespace (assign back to keep the result)
raw_data["gender"] = raw_data["gender"].map({"Male": 0, "Female": 1}) # encode the categories as integers
0 0
1 1
4 0
6 0
7 0
8 1
9 0
11 1
12 0
13 0
14 1
15 0
16 1
18 0
19 1
20 0
21 1
23 0
24 1
25 0
27 0
28 0
29 1
Name: gender, dtype: int64
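One thing to watch with map: any value that is not a key in the dictionary becomes NaN, so stray whitespace silently produces missing values. A small sketch with made-up values showing why the strip step should come first:

```python
import pandas as pd

# Hypothetical values, two of them padded with spaces
gender = pd.Series([" Male", "Female ", "Male"])
mapping = {"Male": 0, "Female": 1}

unstripped = gender.map(mapping)             # padded entries match no key and become NaN
stripped = gender.str.strip().map(mapping)   # every entry matches after stripping

print(unstripped.isna().sum())  # 2
print(stripped.tolist())        # [0, 1, 0]
```

Checking isna().sum() right after a map call is a cheap way to catch unmapped categories.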
raw_data["age"] = raw_data["age"].astype(int) # convert type
def classify(visits):
    if visits < 500:
        return "Normal"
    else:
        return "Active"
raw_data["label"] = raw_data["monthly visits"].apply(classify) # label each customer by monthly visit count
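For a simple two-way split like this, a vectorized alternative is numpy.where, which avoids calling a Python function per row. A sketch on made-up values, using the same threshold of 500:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly visit counts
visits = pd.Series([120.0, 500.0, 873.5])

# Below 500 is "Normal", everything else (including exactly 500) is "Active"
label = pd.Series(np.where(visits < 500, "Normal", "Active"), index=visits.index)

print(label.tolist())  # ['Normal', 'Active', 'Active']
```

The apply version is easier to extend when the rule grows beyond two categories.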
raw_data = pd.get_dummies(raw_data, columns = ["country"]) # one-hot encode 'country': one indicator column per country, original column removed
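To see what the encoding produces, here is a small sketch on a made-up frame (`toy` and its values are assumptions). Each distinct country becomes a column named country_<value>:

```python
import pandas as pd

# Hypothetical frame with a categorical column
toy = pd.DataFrame({"country": ["US", "India", "US"], "value": [1, 2, 3]})

encoded = pd.get_dummies(toy, columns=["country"])

print(encoded.columns.tolist())  # ['value', 'country_India', 'country_US']
```

The indicator columns are boolean or small integers depending on the pandas version, so cast them explicitly if a later step needs a specific dtype.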
Concat, Merge, Join
df_1, df_2
# Entries with no matching data are filled with NaN.
df_cat = pd.concat([df_1, df_2, df_3]) # stack vertically (row-wise)
df_cat = pd.concat([df_1, df_2, df_3], axis = 1) # place side by side (column-wise)
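A small sketch of both directions on two made-up frames (`a` and `b` are assumptions), showing where the NaN values come from. The ignore_index=True argument is an addition here to get a clean 0..n-1 index:

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
b = pd.DataFrame({"id": [3], "y": [30]})

# Row-wise: columns are unioned, so "y" is NaN for rows from a and "x" is NaN for rows from b
rows = pd.concat([a, b], ignore_index=True)

# Column-wise: rows are aligned on the index, so b's columns are NaN where b has no row
cols = pd.concat([a, b], axis=1)

print(rows.shape)  # (3, 3)
print(cols.shape)  # (2, 4)
```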
pd.merge(df_1, df_2, on = "Customer Name", how = "inner")
pd.merge(df_2, df_1, on = "Customer Name", how = "outer")
pd.merge(df_1, df_2, on = "Customer Name", how = "outer").drop_duplicates()
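The difference between inner and outer merges is easiest to see on a tiny example. A sketch with two made-up frames (`left`, `right` and the "Region" column are assumptions). The indicator=True argument adds a _merge column that records where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"Customer Name": ["Ann", "Bob"], "Sales": [100, 200]})
right = pd.DataFrame({"Customer Name": ["Bob", "Cho"], "Region": ["East", "West"]})

# Inner: only customers present in both frames
inner = pd.merge(left, right, on="Customer Name", how="inner")

# Outer: every customer from either frame, NaN where one side has no match
outer = pd.merge(left, right, on="Customer Name", how="outer", indicator=True)

print(inner)
print(outer)
```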
df_1.join(df_2, how="left"), df_1.join(df_2, how="right"), df_1.join(df_2, how="inner"), df_1.join(df_2, how="outer")
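Unlike merge with on=, join aligns the two frames on their index by default. A sketch with made-up frames indexed by customer name (names and columns are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"Sales": [100, 200]}, index=["Ann", "Bob"])
right = pd.DataFrame({"Region": ["East", "West"]}, index=["Bob", "Cho"])

left_join = left.join(right, how="left")     # keeps the index of left
inner_join = left.join(right, how="inner")   # keeps only labels found in both
outer_join = left.join(right, how="outer")   # keeps labels from either frame

print(left_join)
print(inner_join)
print(outer_join)
```

If both frames share a column name, join raises an error unless lsuffix or rsuffix is given, so it works most smoothly when the shared key is the index.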
Indexing
df_2.set_index( ["Customer Name"], inplace = True)
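Once a column is the index, rows can be looked up by label with loc. A minimal sketch on a made-up frame (`toy` and its values are assumptions), using the non-inplace form so the original frame stays untouched:

```python
import pandas as pd

toy = pd.DataFrame({"Customer Name": ["Ann", "Bob"], "Sales": [100, 200]})

indexed = toy.set_index("Customer Name")   # returns a new frame; toy is unchanged
print(indexed.loc["Bob", "Sales"])         # 200

restored = indexed.reset_index()           # moves the index back into a regular column
```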