NLP: Data Preprocessing
Loading the libraries and dataset
import pandas as pd
import numpy as np
raw_data = pd.read_csv("Corona_NLP_train.csv", encoding = "latin-1") # load the dataset
raw_data.shape # check the shape
raw_data.head() # view the dataset briefly
raw_data.drop(["UserName", "ScreenName", "Location", "TweetAt"], axis = 1) # remove unnecessary columns
raw_data = raw_data [ ["OriginalTweet", "Sentiment"] ] # load the necessary columns
How to preprocess data for NLP
import nltk
from nltk.tokenize import RegexpTokenizer # tokenizer
from nltk.corpus import stopwords # 불용어
from nltk.stem.porter import PorterStemmer # 어간추출
tokenizer = RegexpTokenizer(r"\w+") # tokenize the text by word
nltk.download("stopwords")
stop_words = stopwords.words("english") # 불용어 데이터 저장
stemmer = PorterStemmer() # 어간추출
# create the preprocessing function
def text_preprocess(text):
text = tokenizer.tokenize(text)
text = [i.lower() for i in text if i not in stop_words]
text = [stemmer.stem(i) for i in text]
return text
txt = "@tea_lover: I love tea!!! I drink at least 2 cups of tea everyday #tea"
text_preprocess(txt)
['tea_lov',
'i',
'love',
'tea',
'i',
'drink',
'least',
'2',
'cup',
'tea',
'everyday',
'tea']
Exercise with real dataset
raw_data
raw_data["OriginalTweet"] = raw_data["OriginalTweet"].apply(lambda x: text_preprocess(x))
0 [menyrbi, phil_gahan, chrisitv, http, co, ifz9...
1 [advic, talk, neighbour, famili, exchang, phon...
2 [coronaviru, australia, woolworth, give, elder...
3 [my, food, stock, one, empti, pleas, panic, th...
4 [me, readi, go, supermarket, covid19, outbreak...
...
41152 [airlin, pilot, offer, stock, supermarket, she...
41153 [respons, complaint, provid, cite, covid, 19, ...
41154 [you, know, itâ, get, tough, kameronwild, rati...
41155 [is, wrong, smell, hand, sanit, start, turn, c...
41156 [tartiicat, well, new, use, rift, s, go, 700, ...
Name: OriginalTweet, Length: 41157, dtype: object
댓글남기기