NLP: Data Preprocessing

Loading the libraries and dataset

import pandas as pd
import numpy as np

raw_data = pd.read_csv("Corona_NLP_train.csv", encoding = "latin-1") # load the dataset

raw_data.shape # check the shape

raw_data.head() # view the dataset briefly

raw_data.drop(["UserName", "ScreenName", "Location", "TweetAt"], axis=1) # preview without the unnecessary columns (returns a new DataFrame; raw_data itself is unchanged)

raw_data = raw_data[["OriginalTweet", "Sentiment"]].copy() # keep only the necessary columns
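The two column operations above can look interchangeable, so here is a minimal sketch of how they relate. It uses a tiny hand-made DataFrame with the same column names as the dataset; the row values are made up for illustration. `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned, while selecting the two columns directly gives the same result.

```python
import pandas as pd

# Toy stand-in for the real CSV; the values are hypothetical.
df = pd.DataFrame({
    "UserName": [1, 2],
    "ScreenName": [10, 20],
    "Location": ["A", "B"],
    "TweetAt": ["16-03-2020", "16-03-2020"],
    "OriginalTweet": ["first tweet", "second tweet"],
    "Sentiment": ["Positive", "Neutral"],
})

# drop() returns a new DataFrame; df keeps all six columns.
dropped = df.drop(["UserName", "ScreenName", "Location", "TweetAt"], axis=1)

# Selecting the wanted columns directly; .copy() makes the result independent of df.
selected = df[["OriginalTweet", "Sentiment"]].copy()

print(len(df.columns))           # number of columns still in the original
print(dropped.equals(selected))  # whether both approaches agree
```

Either approach works; selecting by name tends to be shorter when only a few columns are kept.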

How to preprocess data for NLP

import nltk
from nltk.tokenize import RegexpTokenizer # tokenizer
from nltk.corpus import stopwords # stop words
from nltk.stem.porter import PorterStemmer # stemming

tokenizer = RegexpTokenizer(r"\w+") # split the text into word tokens

nltk.download("stopwords") # fetch the stop-word corpus (only needed once)
stop_words = stopwords.words("english") # store the English stop-word list

stemmer = PorterStemmer() # stemmer

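# create the preprocessing function
def text_preprocess(text):
    text = tokenizer.tokenize(text) # split into word tokens
    # drop stop words, then lowercase; the check runs on the original casing,
    # so capitalized stop words such as "I" are kept (as "i")
    text = [i.lower() for i in text if i not in stop_words]
    text = [stemmer.stem(i) for i in text] # reduce each token to its stem
    return text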

txt = "@tea_lover: I love tea!!! I drink at least 2 cups of tea everyday #tea"
text_preprocess(txt)

    ['tea_lov',
    'i',
    'love',
    'tea',
    'i',
    'drink',
    'least',
    '2',
    'cup',
    'tea',
    'everyday',
    'tea']
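The output above still contains 'i' because the stop-word check runs before lowercasing, and the stop-word list is lowercase. A minimal sketch of a variant that lowercases first is shown below. To stay self-contained it uses Python's `re` module with the same `\w+` pattern and a small hand-picked stop-word set (`STOP_WORDS`, an assumption for illustration, not the NLTK list), and it skips stemming. The function name `tokenize_lower_first` is hypothetical.

```python
import re

# Hypothetical, hand-picked subset for illustration only;
# the post itself uses nltk's stopwords.words("english").
STOP_WORDS = {"i", "at", "of", "the", "a", "is", "my", "me", "you"}

def tokenize_lower_first(text):
    # Same r"\w+" pattern as the RegexpTokenizer above
    tokens = re.findall(r"\w+", text)
    # Lowercase before the stop-word check so "I" matches "i"
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

txt = "@tea_lover: I love tea!!! I drink at least 2 cups of tea everyday #tea"
print(tokenize_lower_first(txt))
# ['tea_lover', 'love', 'tea', 'drink', 'least', '2', 'cups', 'tea', 'everyday', 'tea']
```

In `text_preprocess`, the same effect could be had by lowercasing each token before comparing it against `stop_words`.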

Exercise with real dataset

raw_data

raw_data["OriginalTweet"] = raw_data["OriginalTweet"].apply(text_preprocess) # tokenize, filter, and stem every tweet

raw_data["OriginalTweet"] # view the result

    [menyrbi, phil_gahan, chrisitv, http, co, ifz9...
    [advic, talk, neighbour, famili, exchang, phon...
    [coronaviru, australia, woolworth, give, elder...
    [my, food, stock, one, empti, pleas, panic, th...
    [me, readi, go, supermarket, covid19, outbreak...
                          ...
    [airlin, pilot, offer, stock, supermarket, she...
    [respons, complaint, provid, cite, covid, 19, ...
    [you, know, itâ, get, tough, kameronwild, rati...
    [is, wrong, smell, hand, sanit, start, turn, c...
    [tartiicat, well, new, use, rift, s, go, 700, ...
    Name: OriginalTweet, Length: 41157, dtype: object
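The first row of the result still contains URL fragments such as http and co, along with account handles, because the `\w+` tokenizer splits them into ordinary word tokens. One possible extra step, not part of the original pipeline, is to strip URLs and @mentions before tokenizing. The sketch below assumes simple regular expressions are good enough for this; the names `URL_RE`, `MENTION_RE`, and `strip_urls_and_mentions` are hypothetical, and the sample tweet is made up.

```python
import re

# Assumed patterns: anything starting with http:// or https:// up to the next
# whitespace, and an @ followed by word characters.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    # Replace matches with a space so neighbouring words do not get glued together
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return text

sample = "@tea_lover check https://t.co/abc123 I love tea #tea"
cleaned = strip_urls_and_mentions(sample)
tokens = re.findall(r"\w+", cleaned)
print(tokens)
# ['check', 'I', 'love', 'tea', 'tea']
```

If this step were adopted, it would run on the raw tweet text before `text_preprocess` is applied.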
