
`raw_data` is a temporary dataset that we will use to walk through several common preprocessing tasks.


Describing the dataset

raw_data.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 23 entries, 0 to 29
    Data columns (total 8 columns):
    #   Column          Non-Null Count  Dtype         
    ---  ------          --------------  -----         
    0   customer_id     23 non-null     int64         
    1   date            23 non-null     datetime64[ns]
    2   age             23 non-null     int32         
    3   gender          23 non-null     object        
    4   country         23 non-null     object        
    5   item purchased  23 non-null     float64       
    6   value           23 non-null     int64         
    7   monthly visits  23 non-null     float64       
    dtypes: datetime64[ns](1), float64(2), int32(1), int64(2), object(2)
    memory usage: 1.5+ KB
raw_data["country"].unique(), raw_data["country"].nunique()
    array(['US', 'India', 'France', 'Sweden', 'USA', 'Germany', 'Chile',
        'Saudi Arabia', 'Japan', 'Norway', 'Spain', 'United Kingdom',
        'Switzerland', 'Russia'], dtype=object)

    14
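The unique values above include both 'US' and 'USA'. If the two labels refer to the same country (an assumption; the post does not say so), they can be merged before any encoding step. A minimal sketch on a hypothetical toy frame, where only the `country` column name comes from the post:

```python
import pandas as pd

# Hypothetical toy data; only the column name 'country' is taken from the post.
demo = pd.DataFrame({"country": ["US", "India", "USA", "France", "US"]})

# Assumption: 'US' and 'USA' denote the same country, so map one onto the other.
demo["country"] = demo["country"].replace({"US": "USA"})

print(demo["country"].unique())
print(demo["country"].nunique())  # 3 distinct labels remain
```

Doing this before `pd.get_dummies` avoids ending up with two separate indicator columns for what is arguably one category.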
out_list = ["value", "monthly visits", "item purchased"] # numeric columns to inspect
raw_data[ out_list ] # show only these columns


import numpy as np
from scipy import stats
np.abs(stats.zscore(raw_data[ out_list ])) # absolute z-score of each value, computed per column
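The z-scores are usually a stepping stone to dropping outlier rows. The sketch below shows one way that filter could look. The toy data and the threshold of 3 are assumptions (the threshold is a common rule of thumb, not something the post states), and the z-score is computed by hand with the population standard deviation (`ddof=0`), which is also the default of `stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names follow the post's out_list.
demo = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 500],
    "monthly visits": [5.0, 6.0, 5.0, 7.0, 6.0, 5.0, 6.0, 7.0, 5.0, 6.0, 7.0, 6.0],
})
out_list = ["value", "monthly visits"]

# Population z-score per column (ddof=0, the same default as scipy.stats.zscore).
z = (demo[out_list] - demo[out_list].mean()) / demo[out_list].std(ddof=0)

# Assumption: treat |z| >= 3 in any column as an outlier.
threshold = 3
cleaned = demo[(np.abs(z) < threshold).all(axis=1)]

print(len(demo), "->", len(cleaned))  # the row with value 500 is dropped
```

Note that with very few rows a single extreme value cannot reach a large z-score, so the threshold may need to be lowered for small samples.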


raw_data.drop( ["customer_id", "date"], axis = 1, inplace = True ) # remove columns

Missing Values

# Method 1
raw_data.isna().sum() # count the missing values in each column

raw_data.dropna(inplace=True) # drop every row that contains NaN
# Method 2
for c in df_missing.columns:
    missing = df_missing[c].isnull().sum()
    if missing > 0:
        print("{} has {} missing values".format(c, missing))

df_missing.fillna("DONE")
# df_missing["Sales"].fillna( df_missing["Sales"].mean() )
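from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy= "mean") # replace NaN with the column mean (numeric columns only)

num_cols = raw_data.select_dtypes(include = "number").columns # the mean strategy fails on text columns such as 'gender'
raw_data[num_cols] = imputer.fit_transform(raw_data[num_cols]) # fit_transform returns a NumPy array, so write it back into the numeric columns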
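For comparison, the same mean-filling idea can be sketched with pandas alone. Everything in the example is hypothetical toy data, and the "Unknown" placeholder for the text column is an assumption chosen for illustration, not something the post prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in numeric and text columns.
demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "value": [100.0, 200.0, np.nan],
    "gender": ["Male", None, "Female"],
})

# Fill numeric gaps with each column's own mean.
num_cols = demo.select_dtypes(include="number").columns
demo[num_cols] = demo[num_cols].fillna(demo[num_cols].mean())

# A text column has no mean; here the gap gets a placeholder label instead.
demo["gender"] = demo["gender"].fillna("Unknown")

print(demo)
```

Separating numeric and text columns like this keeps the mean from ever being requested for a column where it is undefined.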

Organizing the data

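raw_data.replace("ERR", np.nan, inplace = True) # replace 'ERR' with NaN

raw_data["gender"] = raw_data["gender"].str.strip() # strip leading and trailing whitespace (str.strip() is not in-place, so assign the result back)
raw_data["gender"] = raw_data["gender"].map({"Male": 0, "Female": 1}) # encode the categories as integers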
    0     0
    1     1
    4     0
    6     0
    7     0
    8     1
    9     0
    11    1
    12    0
    13    0
    14    1
    15    0
    16    1
    18    0
    19    1
    20    0
    21    1
    23    0
    24    1
    25    0
    27    0
    28    0
    29    1
    Name: gender, dtype: int64
raw_data["age"] = raw_data["age"].astype(int) # convert type
def classify(visits):
    # label a customer by the number of monthly visits
    if visits < 500:
        return "Normal"
    else:
        return "Active"

raw_data["label"] = raw_data["monthly visits"].apply(classify)
raw_data = pd.get_dummies(raw_data, columns = ["country"]) # one-hot encode 'country' into one column per value
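A small end-to-end sketch of the encoding steps may help show why the order matters. The toy values are hypothetical; the point is that `map()` returns NaN for any label that does not match a dictionary key exactly, so whitespace has to be stripped (and the result assigned back) first.

```python
import pandas as pd

# Hypothetical toy data with stray whitespace around some labels.
demo = pd.DataFrame({
    "gender": [" Male", "Female ", "Male"],
    "country": ["USA", "India", "USA"],
})

# Mapping the unstripped labels: padded values match no key and become NaN.
unstripped = demo["gender"].map({"Male": 0, "Female": 1})
print(unstripped.isna().sum())  # 2

# Strip first, then map.
demo["gender"] = demo["gender"].str.strip().map({"Male": 0, "Female": 1})

# One indicator column per distinct country value.
encoded = pd.get_dummies(demo, columns=["country"])
print(encoded.columns.tolist())
```

The resulting frame keeps `gender` as an integer column and replaces `country` with `country_India` and `country_USA`.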


Concat, Merge, Join

df_1, df_2


# Entries with no matching data are filled with NaN.
df_cat = pd.concat([df_1, df_2, df_3]) # stack vertically (append rows)
df_cat = pd.concat([df_1, df_2, df_3], axis = 1) # place side by side (append columns)
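Since the contents of `df_1`, `df_2`, and `df_3` are not reproduced here, the following sketch uses two hypothetical frames (`df_a`, `df_b`) to show where the NaN values come from in each direction.

```python
import pandas as pd

# Hypothetical frames that share only the 'Customer Name' column.
df_a = pd.DataFrame({"Customer Name": ["Ann", "Bob"], "Sales": [10, 20]})
df_b = pd.DataFrame({"Customer Name": ["Cid"], "Region": ["EU"]})

# Row-wise: columns are unioned, and cells a frame does not have become NaN.
stacked = pd.concat([df_a, df_b], ignore_index=True)
print(stacked.shape)  # (3, 3)

# Column-wise: rows are aligned on the index, so the shorter frame is padded with NaN.
side = pd.concat([df_a, df_b], axis=1)
print(side.shape)  # (2, 4)
```

`ignore_index=True` is optional; without it the stacked frame keeps the original row labels, which can repeat.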
pd.merge(df_1, df_2, on = "Customer Name", how = "inner")


pd.merge(df_2, df_1, on = "Customer Name", how = "outer")


pd.merge(df_1, df_2, on = "Customer Name", how = "outer").drop_duplicates()


df_1.join(df_2, how="left"), df_1.join(df_2, how="right"), df_1.join(df_2, how="inner"), df_1.join(df_2, how="outer")
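To make the difference between `merge` and `join` concrete, here is a sketch on two hypothetical frames (`left`, `right`); only the key name "Customer Name" is taken from the post. `merge` matches rows on a column, while `join` aligns on the index.

```python
import pandas as pd

# Hypothetical frames keyed by 'Customer Name'.
left = pd.DataFrame({"Customer Name": ["Ann", "Bob", "Cid"], "Sales": [10, 20, 30]})
right = pd.DataFrame({"Customer Name": ["Bob", "Cid", "Dee"], "Region": ["EU", "US", "EU"]})

# inner keeps only names present in both frames; outer keeps every name.
inner = pd.merge(left, right, on="Customer Name", how="inner")
outer = pd.merge(left, right, on="Customer Name", how="outer")
print(len(inner), len(outer))  # 2 4

# join() aligns on the index, so the key is moved into the index first.
joined = left.set_index("Customer Name").join(right.set_index("Customer Name"), how="left")
print(joined)
```

When both frames still carry a column with the same name, `join` raises an error unless `lsuffix` or `rsuffix` is given, which is one reason to set the key as the index beforehand.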




Indexing

df_2.set_index( ["Customer Name"], inplace = True)
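Once a column is the index, rows can be looked up by label. A short sketch with a hypothetical frame, using the non-in-place form of `set_index` so the original stays untouched:

```python
import pandas as pd

# Hypothetical toy data.
demo = pd.DataFrame({"Customer Name": ["Ann", "Bob"], "Sales": [10, 20]})

indexed = demo.set_index("Customer Name")  # returns a new frame
bob_sales = indexed.loc["Bob", "Sales"]    # label-based lookup
print(bob_sales)  # 20

restored = indexed.reset_index()           # move the index back into a column
print(restored.columns.tolist())
```

`reset_index()` is the inverse step and is handy before a `merge` that needs the key as an ordinary column again.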

