Data Preprocessing Techniques


'raw_data' is a temporary dataset that we use to walk through various preprocessing tasks.

Describing the dataset

raw_data.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 23 entries, 0 to 29
    Data columns (total 8 columns):
    #   Column          Non-Null Count  Dtype         
    ---  ------          --------------  -----         
    0   customer_id     23 non-null     int64         
    1   date            23 non-null     datetime64[ns]
    2   age             23 non-null     int32         
    3   gender          23 non-null     object        
    4   country         23 non-null     object        
    5   item purchased  23 non-null     float64       
    6   value           23 non-null     int64         
    7   monthly visits  23 non-null     float64       
    dtypes: datetime64[ns](1), float64(2), int32(1), int64(2), object(2)
    memory usage: 1.5+ KB

raw_data["country"].unique(), raw_data["country"].nunique()

    array(['US', 'India', 'France', 'Sweden', 'USA', 'Germany', 'Chile',
        'Saudi Arabia', 'Japan', 'Norway', 'Spain', 'United Kingdom',
        'Switzerland', 'Russia'], dtype=object)

    14
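Note that both 'US' and 'USA' appear in the output above, so the same country may be counted twice. A minimal sketch of merging such variants is shown here; the sample Series and the mapping are assumptions for illustration, not part of the original dataset.

```python
import pandas as pd

# Hypothetical sample; only the 'US'/'USA' pair is taken from the output above.
countries = pd.Series(["US", "India", "USA", "France", "US"], name="country")

# Map variant spellings onto one canonical label (this mapping is an assumption).
canonical = {"USA": "US"}
cleaned = countries.replace(canonical)

print(cleaned.unique())
print(cleaned.nunique())
```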

import numpy as np
from scipy import stats

out_list = ["value", "monthly visits", "item purchased"] # numeric columns to check for outliers
raw_data[ out_list ] # show only these columns

np.abs(stats.zscore(raw_data[ out_list ])) # convert the data to absolute z-scores
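The absolute z-scores are typically used to filter out outliers. The following is a rough sketch of that step, assuming a cutoff of 3 (a common convention, not something stated in this post) and a small made-up DataFrame.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: the last 'value' entry is an obvious outlier.
df = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 500],
    "monthly visits": [3.0, 4.0, 3.5, 4.5, 4.0, 3.0, 3.5, 4.0, 4.5, 3.0, 3.5, 4.0],
})
cols = ["value", "monthly visits"]

# Column-wise absolute z-scores as a plain NumPy array.
z = np.abs(stats.zscore(df[cols].to_numpy(), axis=0))

threshold = 3  # assumed cutoff
mask = (z < threshold).all(axis=1)  # keep rows where every column is within the cutoff
filtered = df[mask]

print(len(df), "->", len(filtered))
```

With very small samples a single extreme value inflates the standard deviation, so its z-score may stay below the cutoff; it may be worth checking the scores before choosing a threshold.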

raw_data.drop( ["customer_id", "date"], axis = 1, inplace = True ) # remove columns

Missing Values

# Method 1
raw_data.isna().sum() # count the missing values in each column

raw_data.dropna(inplace=True) # drop every row that contains NaN

# Method 2
for c in df_missing.columns:
    missing = df_missing[c].isnull().sum()
    if missing > 0:
        print("{} has {} missing values".format(c, missing))

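df_missing.fillna("DONE") # returns a new DataFrame with NaN replaced by "DONE"; df_missing itself is unchanged
# df_missing["Sales"].fillna( df_missing["Sales"].mean() ) # or fill a numeric column with its mean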

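from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean") # replace NaN with the column mean

# strategy="mean" only works on numeric columns, and fit_transform returns a NumPy array,
# so apply it to the numeric columns and assign the result back
num_cols = raw_data.select_dtypes(include = "number").columns
raw_data[num_cols] = imputer.fit_transform(raw_data[num_cols])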
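To make the imputation step concrete, here is a small self-contained sketch. The DataFrame below is made up for illustration; it restricts the imputer to numeric columns so that the text column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per numeric column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "value": [100.0, 200.0, np.nan],
    "gender": ["Male", "Female", "Male"],
})

# Mean imputation is only defined for numeric columns.
num_cols = df.select_dtypes(include="number").columns
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array, so write it back into the same columns.
df[num_cols] = imputer.fit_transform(df[num_cols])

print(df)
```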

Organizing the data

raw_data.replace("ERR", np.nan, inplace = True) # replace 'ERR' with NaN

raw_data["gender"] = raw_data["gender"].str.strip() # strip leading/trailing whitespace; str.strip() is not in-place, so assign the result back

raw_data["gender"] = raw_data["gender"].map({"Male": 0, "Female": 1}) # encode the categories as integers

    0
    1
    0
    0
    0
    1
    0
    1
    0
    0
    1
    0
    1
    0
    1
    0
    1
    0
    1
    0
    0
    0
    1
    Name: gender, dtype: int64

raw_data["age"] = raw_data["age"].astype(int) # convert type

def classify(visits):
    # label a customer by monthly visits: below 500 is "Normal", otherwise "Active"
    if visits < 500:
        return "Normal"
    else:
        return "Active"

raw_data["label"] = raw_data["monthly visits"].apply(classify)
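For a simple two-way split like this, a vectorized alternative to apply() is np.where. The sketch below uses made-up visit counts and the same 500 threshold; it is one possible variant, not the approach used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly visit counts around the 500 threshold.
visits = pd.Series([120.0, 499.0, 500.0, 830.0], name="monthly visits")

# np.where evaluates the condition for the whole column at once.
label = pd.Series(
    np.where(visits < 500, "Normal", "Active"),
    index=visits.index,
    name="label",
)

print(label.tolist())
```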

raw_data = pd.get_dummies(raw_data, columns = ["country"]) # one-hot encode 'country' into one indicator column per country
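As a small illustration of what get_dummies produces, the sketch below one-hot encodes a made-up three-row DataFrame. The new columns are named with the original column name as a prefix.

```python
import pandas as pd

# Hypothetical data with a categorical 'country' column.
df = pd.DataFrame({"age": [30, 41, 25], "country": ["US", "India", "US"]})

# 'country' is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["country"])

print(encoded.columns.tolist())
```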

Concat, Merge, Join

df_1, df_2

# Entries with no matching data are filled with NaN.
df_cat = pd.concat([df_1, df_2, df_3]) # stack vertically (row-wise)
df_cat = pd.concat([df_1, df_2, df_3], axis = 1) # place side by side (column-wise)
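The NaN filling can be seen with two small frames whose columns and lengths differ. The frames a and b below are hypothetical and only serve to show the resulting shapes.

```python
import pandas as pd

# Hypothetical frames: they share column 'x' but differ otherwise.
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5], "z": [6]})

# Row-wise: columns are unioned, missing cells become NaN.
rows = pd.concat([a, b], ignore_index=True)

# Column-wise: rows are aligned on the index, missing cells become NaN.
cols = pd.concat([a, b], axis=1)

print(rows)
print(cols)
```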

pd.merge(df_1, df_2, on = "Customer Name", how = "inner")

pd.merge(df_2, df_1, on = "Customer Name", how = "outer")

pd.merge(df_1, df_2, on = "Customer Name", how = "outer").drop_duplicates()

# join() aligns rows on the index; if both frames share column names, pass lsuffix / rsuffix
df_1.join(df_2, how="left"), df_1.join(df_2, how="right"), df_1.join(df_2, how="inner"), df_1.join(df_2, how="outer")
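The difference between the merge modes is easier to see on small data. In this sketch the contents of df_1 and df_2 are made up (only the "Customer Name" key comes from the calls above), and indicator=True is added to show where each row came from.

```python
import pandas as pd

# Hypothetical contents; only the key column name matches the post.
df_1 = pd.DataFrame({"Customer Name": ["Ann", "Bob", "Cho"], "Sales": [10, 20, 30]})
df_2 = pd.DataFrame({"Customer Name": ["Bob", "Cho", "Dee"], "Region": ["EU", "ASIA", "US"]})

# inner: only customers present in both frames.
inner = pd.merge(df_1, df_2, on="Customer Name", how="inner")

# outer: all customers; '_merge' records the source of each row.
outer = pd.merge(df_1, df_2, on="Customer Name", how="outer", indicator=True)

# join() works on the index, so the key is moved into the index first.
joined = df_1.set_index("Customer Name").join(df_2.set_index("Customer Name"), how="left")

print(inner)
print(outer)
print(joined)
```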

Indexing

df_2.set_index( ["Customer Name"], inplace = True)
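Once a column is the index, rows can be looked up by label with .loc, and reset_index() turns the index back into a regular column. A short sketch with a made-up df_2:

```python
import pandas as pd

# Hypothetical contents for df_2.
df_2 = pd.DataFrame({"Customer Name": ["Bob", "Cho"], "Region": ["EU", "ASIA"]})

# Non-inplace variant: returns a new frame indexed by customer name.
indexed = df_2.set_index("Customer Name")

# Label-based lookup of a single cell.
region = indexed.loc["Cho", "Region"]

# Undo the indexing and get the column back.
restored = indexed.reset_index()

print(region)
print(restored.columns.tolist())
```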


쭌스🎄
