Text pre-processing (cranfieldDocs)

Hyo__ni 2023. 10. 27. 15:19

This post walks through text pre-processing on the cranfieldDocs files (.txt).

The pre-processing methods used are as follows.

Remove markups
Convert to lowercase
Extract only the contents of a specific tag
Remove punctuation, number
Tokenization

Practice

Import the required packages, open a 'cranfieldDocs' file, and concatenate its lines into one long string.

from bs4 import BeautifulSoup
import string
from nltk.stem import PorterStemmer

# Read the whole file into one string
with open('/Users/juhyeon/python-workspace/cranfieldDocs/cranfield0001', 'r') as f:
    doc = f.read()

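After reading, the entire file is held in `doc` as a single string.

Next, remove the markup and convert all of the content to lowercase.

To keep only the content inside a specific <tag>, parse the document with BeautifulSoup and look up just that element (.find('text')).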

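# step1: Remove markups
cleantext = BeautifulSoup(doc, 'lxml').text
# step2: convert to lowercase (applied to the markup-free text from step1)
lowertext = cleantext.lower()

# Find only the <TEXT> tag and take its contents
# (HTML parsers lowercase tag names, so the lookup uses 'text')
soup = BeautifulSoup(doc, 'html5lib')
found_text = soup.find('text').text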
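To make the tag-extraction step concrete without any parser dependency, here is a minimal standard-library sketch. The sample document and the helper name `extract_tag` are hypothetical, and the regex is a swapped-in illustration rather than the BeautifulSoup call used above; it assumes the tag is not nested and carries no attributes.

```python
import re

# Hypothetical sample shaped like a tagged document; not real corpus data.
sample_doc = """<DOC>
<DOCNO>0</DOCNO>
<TITLE>sample title</TITLE>
<TEXT>
Sample body text, with 2 numbers and punctuation.
</TEXT>
</DOC>"""


def extract_tag(doc: str, tag: str) -> str:
    """Return the text between <tag> and </tag>, or '' if the tag is absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    # IGNORECASE lets 'text' match <TEXT>; DOTALL lets '.' span newlines.
    match = re.search(pattern, doc, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


body = extract_tag(sample_doc, "text")
print(body)
```

For real files a proper parser such as BeautifulSoup is the more robust choice, since a regex breaks on nested or malformed tags.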

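Next, remove punctuation and numbers, then tokenize.

string.punctuation is '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~', so these punctuation marks can be filtered out.

(Note that `x not in string.punctuation` is a substring test on that string, so in practice only tokens that are a single punctuation mark are removed reliably: ! is removed, but !!!!! is not.)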

# tokenization (remove punctuation, numbers)
## keep only the characters that are not digits, joined back into one string
lowertext = ''.join(i for i in lowertext if not i.isdigit())
tokenedtext = lowertext.split()
tokenedtext2 = [x for x in tokenedtext if x not in string.punctuation]
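To see what this filter does and does not remove, the following small sketch runs it on a hypothetical sample sentence (the sentence and variable names are made up for illustration). It also shows `str.translate` as one possible alternative that deletes punctuation characters even when they are attached to a word.

```python
import string

# Hypothetical sample: some punctuation stands alone, some is attached to words.
sample = "wing , body ! flow, !!!!! () end."
tokens = sample.split()

# Same filter as above: `in` on a str is a substring test, so only tokens
# that occur as a contiguous run inside string.punctuation are dropped.
filtered = [t for t in tokens if t not in string.punctuation]
print(filtered)   # ['wing', 'body', 'flow,', '!!!!!', 'end.']

# Alternative sketch: delete every punctuation character, then split.
table = str.maketrans("", "", string.punctuation)
stripped = sample.translate(table).split()
print(stripped)   # ['wing', 'body', 'flow', 'end']
```

One caveat of the `translate` variant is that it deletes rather than replaces, so a hyphenated word such as wing-body would be fused into wingbody.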

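In ordinary documents, however, punctuation is attached directly to words with no space in between, so the filter above cannot remove it.

The 're' package handles this case: the pattern below tokenizes the text into words and returns the punctuation marks . , ! ? ; as separate tokens, while any other special characters are dropped.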

import re

lowertext = ''.join(i for i in lowertext if not i.isdigit())
tokenedtext = re.findall(r"[\w']+|[.,!?;]", lowertext)
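As a quick check of what this pattern returns, here is a minimal sketch on a hypothetical sentence (the sample text is made up). It also shows one way to discard the punctuation tokens afterwards, which the pattern itself does not do.

```python
import re

# Hypothetical sample sentence with attached punctuation and an apostrophe.
sample = "the wing-body flow, at mach; isn't it?"

# Words (including apostrophes) or one of the marks . , ! ? ; as its own token.
tokens = re.findall(r"[\w']+|[.,!?;]", sample)
print(tokens)
# ['the', 'wing', 'body', 'flow', ',', 'at', 'mach', ';', "isn't", 'it', '?']

# Drop the punctuation tokens to keep only the words.
words = [t for t in tokens if t not in {".", ",", "!", "?", ";"}]
print(words)
# ['the', 'wing', 'body', 'flow', 'at', 'mach', "isn't", 'it']
```

The hyphen matches neither alternative, so it is skipped and wing-body is split into two tokens.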

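'\W+' matches one or more characters that are not letters, digits, or the underscore, which covers punctuation as well as whitespace.

So re.sub can replace every such run with a single space, after which the text can be tokenized on whitespace.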

lowertext = ''.join(i for i in lowertext if not i.isdigit())
lowertext = re.sub(r'\W+', ' ', lowertext)
tokenedtext = lowertext.split()
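The effect of this substitution is easiest to see on a small input. The sketch below uses a hypothetical sample string (made up for illustration) and shows two side effects worth knowing: contractions are split at the apostrophe, and underscores survive because `\w` includes them.

```python
import re

# Hypothetical sample with a hyphen, a comma, a contraction, and an underscore.
sample = "boundary-layer flow, isn't it? x_y"

# Replace each run of non-word characters with one space, then split.
cleaned = re.sub(r"\W+", " ", sample)
tokens = cleaned.split()
print(tokens)
# ['boundary', 'layer', 'flow', 'isn', 't', 'it', 'x_y']
```

The fragments isn and t left over from isn't both appear in the NLTK English stopword list shown further down, so they disappear in the stopword step.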

For stopword removal, use the stopword list that ships with nltk. The code below iterates over 'tokenedtext' and stores every token that is not a stopword in 'tokenedtext3'.

import nltk
from nltk.corpus import stopwords
 
nltk.download('stopwords')
stopWords = stopwords.words('english')
tokenedtext3 = [x for x in tokenedtext if x not in stopWords]
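Because `stopWords` is a list, each `not in` check scans it from the start. A common variation is to convert the list to a set first, for example `set(stopwords.words('english'))`. The sketch below illustrates the idea with a small hard-coded subset of stopwords (a hypothetical stand-in so that it runs without downloading the NLTK corpus) and made-up tokens.

```python
# Hypothetical subset of English stopwords, hard-coded for illustration only.
stop_set = {"the", "of", "in", "a", "is", "and"}

# Made-up token list standing in for `tokenedtext`.
tokens = ["the", "flow", "of", "air", "in", "a", "boundary", "layer"]

# Set membership is a hash lookup, so the filter stays fast on large corpora.
kept = [t for t in tokens if t not in stop_set]
print(kept)   # ['flow', 'air', 'boundary', 'layer']
```

The result is the same as with a list; only the lookup cost per token changes.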

The stopwords provided by nltk are listed below.

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", 
"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 
'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 
'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 
'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 
'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', 
"should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 
'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', 
"hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', 
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', 
"wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Stemming can be done with the PorterStemmer from the nltk package (imported at the top).

ps = PorterStemmer()
stemmedtext = [ps.stem(x) for x in tokenedtext3]

Loading all cranfieldDocs files and pre-processing them at once

Set the path where the files are stored and get the list of file names in that folder. There are 100 cranfield files in total.

import os
doc_dir = '/Users/juhyeon/python-workspace/cranfieldDocs/'

filenames = os.listdir(doc_dir)
print(len(filenames))   # 100 files

# Build the full path of every file in the folder
docpath = []
for filename in filenames:
    path = os.path.join(doc_dir, filename)
    docpath.append(path)
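`os.listdir` returns names in arbitrary order and includes every entry in the folder, including hidden files and subfolders. If a stable order matters (for example, so that index 0 is always the same document), a variation like the following sketch may help. The temporary folder, the placeholder file names, and the name `doc_paths` are all hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as doc_dir:
    # Create placeholder entries: two documents, one hidden file, one subfolder.
    for name in ["cranfield0002", "cranfield0001", ".DS_Store"]:
        with open(os.path.join(doc_dir, name), "w") as f:
            f.write("placeholder")
    os.mkdir(os.path.join(doc_dir, "subdir"))

    # Keep regular, non-hidden files only, in sorted order.
    doc_paths = sorted(
        os.path.join(doc_dir, name)
        for name in os.listdir(doc_dir)
        if not name.startswith(".")
        and os.path.isfile(os.path.join(doc_dir, name))
    )
    names = [os.path.basename(p) for p in doc_paths]

print(names)   # ['cranfield0001', 'cranfield0002']
```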

Define a 'pre_processing' function that applies the pre-processing methods covered above in sequence.

def pre_processing(file_path):

    # Read the whole file into one string
    with open(file_path, 'r', encoding='windows-1252') as f:
        doc = f.read()

    # pre-processing
    cleantext = BeautifulSoup(doc, 'html5lib')                  ## step1: find the <TEXT> tag and keep only its contents
    onlytext = cleantext.find('text').text

    lowertext = onlytext.lower()                                ## step2: convert to lowercase
    str_text = ''.join(i for i in lowertext if not i.isdigit()) ## step3: remove digits
    str_text = re.sub(r'\W+', ' ', str_text)                    ## step4: remove punctuation
    tokenedtext = str_text.split()                              ## tokenize on whitespace
    result = [x for x in tokenedtext if x not in stopWords]     ## step5: remove stopwords

    return result

Create a 'whole_doc' list to hold the pre-processed files, and append each file's result as it finishes.

whole_doc = []

for doc in docpath:
    token = pre_processing(doc)
    whole_doc.append(token)
    print(f'{doc} finished!')

Once all pre-processing is complete, each document is a list of tokens, which can be inspected as follows.

print(whole_doc[0])

print(whole_doc[1])

[github]

https://github.com/juooo1117/practice_AI_Learning/blob/main/Text_preprocessing_cranfield.ipynb
