Text Classification - Sentiment Analysis

juooo1117

Hyo__ni 2023. 12. 13. 11:24

Classification

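A classification model is built from labeled training data; new data is then fed into the model to predict its class (category).

As a variety of deep-learning-based methods have appeared, ANN architectures optimized for text classification have begun to emerge.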
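
The train-then-predict workflow can be sketched with a tiny multinomial Naive Bayes classifier in plain Python. This is only an illustration: the toy reviews, the labels, and the helper names (`train_naive_bayes`, `predict`) are assumptions made up for this example, not part of the practice notebooks.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Count documents per class and words per class from labeled texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-probability for a new text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids log(0) for unseen class/word pairs
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data
train_texts = ["great food and friendly staff",
               "terrible service and cold food",
               "loved the friendly atmosphere",
               "cold and terrible experience"]
train_labels = ["positive", "negative", "positive", "negative"]

model = train_naive_bayes(train_texts, train_labels)
prediction = predict(model, "friendly staff and great atmosphere")
print(prediction)
```

The model is fit once on the labeled data and then reused for any new text, which is the same pattern the practice sections below follow with larger datasets.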

Sentiment Analysis

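Sentiment analysis, also called opinion mining, identifies the subjective sentiment expressed in a text by means of natural language processing and text analytics.

It usually means classifying a given text into a positive or a negative category; a neutral category is sometimes added. (Many variants exist.)

Classification models are used most often, but sentiment analysis is also performed with topic models, clustering, and information extraction.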
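
When reviews come with a numeric rating, the positive / negative / neutral labels are often derived from that rating. A minimal sketch, assuming a 1 to 5 star scale; the thresholds and the function name `stars_to_sentiment` are illustrative choices, not something taken from the notebooks.

```python
def stars_to_sentiment(stars, use_neutral=True):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    # 3 stars: neutral in the three-class variant, dropped (None) in the binary one
    return "neutral" if use_neutral else None

labels = [stars_to_sentiment(s) for s in [5, 1, 3, 4, 2]]
print(labels)
```

Setting `use_neutral=False` gives the binary variant, where ambiguous 3-star reviews are simply excluded from training.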

[Practice] - Sentiment Analysis with Yelp review

dataset: 10,000 reviews
dataset columns: business_id, date, review_id, stars (1 to 5 rating for the business), text (review text), type (type of text), user_id
text preprocessing order: remove punctuation → remove all stopwords → return the cleaned text as a list of words
document vectorization with CountVectorizer → creates an object that converts text into a bag of words (BoW)
https://github.com/juooo1117/practice_AI_Learning/blob/main/%08sentiment_analysis_yelpReview.ipynb
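
The preprocessing order above could look roughly like the following. This is a sketch, not the notebook's actual function: the name `text_process` is reused from the code below, and the tiny `STOPWORDS` set is a hypothetical stand-in for a full stopword list.

```python
import string

# Hypothetical, deliberately small stopword list for illustration only
STOPWORDS = {"a", "an", "the", "and", "is", "was", "it", "this", "to", "of", "i"}

def text_process(text):
    """Remove punctuation, drop stopwords, return the remaining words as a list."""
    # 1. remove punctuation character by character
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. remove stopwords (case-insensitive) and 3. return a list of words
    return [word for word in no_punct.split() if word.lower() not in STOPWORDS]

cleaned = text_process("The food was great, and the staff is friendly!")
print(cleaned)
```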

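from sklearn.feature_extraction.text import CountVectorizer

# text_process is the preprocessing function described above; x holds the review texts
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
# Check the vocabulary size -> 26435
len(bow_transformer.vocabulary_)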
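
To see what a bag-of-words representation contains, here is a plain-Python sketch of the idea: build a vocabulary, then count how often each vocabulary word occurs in each document. The helpers `fit_vocabulary` and `transform` and the two toy documents are assumptions for illustration; CountVectorizer itself returns a sparse matrix rather than nested lists.

```python
from collections import Counter

def fit_vocabulary(docs, analyzer):
    """Map every distinct token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in docs for tok in analyzer(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary, analyzer):
    """Turn each document into a vector of token counts."""
    rows = []
    for doc in docs:
        counts = Counter(analyzer(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

def simple_analyzer(doc):
    # Stand-in for a real preprocessing function: lowercase and split on whitespace
    return doc.lower().split()

docs = ["good food good service", "bad service"]
vocabulary = fit_vocabulary(docs, simple_analyzer)
bow = transform(docs, vocabulary, simple_analyzer)
print(vocabulary)
print(bow)
```

Word order is discarded; only the counts per vocabulary word remain, which is why the representation is called a "bag" of words.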

[Practice] - Text Classification with IMDB dataset (movie review)

IMDB text classification following the official TensorFlow tutorial: positive/negative binary classification
Uses TensorFlow 2.x
https://github.com/juooo1117/practice_AI_Learning/blob/main/ReviewText_BinaryClassification.ipynb

Data preprocessing

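The tf.keras.layers.TextVectorization layer is used to standardize, tokenize, and vectorize the data.

- Standardization: preprocessing the text, typically by removing punctuation or HTML elements, to simplify the dataset

- Tokenization: splitting a string into tokens

- Vectorization: converting tokens into numbers so they can be fed into a neural network

- No built-in standardizer is suited to removing HTML tags, so a custom standardization function is defined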

import re
import string
import tensorflow as tf
from tensorflow.keras import layers

# Custom standardization: lowercase, replace the '<br />' tag, remove punctuation
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')

# Layer for text vectorization
## handles data standardization, tokenization, and vectorization
max_features = 10000     # maximum vocabulary size
sequence_length = 250    # each review is padded or truncated to this many tokens
vectorize_layer = layers.TextVectorization(standardize=custom_standardization,
                                           max_tokens=max_features,
                                           output_mode='int',
                                           output_sequence_length=sequence_length)
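
To make the three steps concrete, here is a plain-Python sketch of what such a layer does conceptually. It follows the TextVectorization convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens, but the helper names (`standardize`, `build_vocab`, `vectorize`) and the toy texts are assumptions, and the exact IDs TensorFlow assigns may differ.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocab(texts, max_tokens):
    """Assign IDs to the most frequent tokens; 0 = padding, 1 = out-of-vocabulary."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Two slots are reserved, so at most max_tokens - 2 real tokens get an ID
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx + 2 for idx, tok in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Tokenize on whitespace, map tokens to IDs, then truncate or pad."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

train_texts = ["Great movie!<br />Great cast.", "Boring movie."]
vocab = build_vocab(train_texts, max_tokens=10)
encoded = vectorize("Great plot, boring cast", vocab, sequence_length=6)
print(vocab)
print(encoded)
```

The word "plot" never appeared in the training texts, so it maps to the out-of-vocabulary index 1, and the trailing zeros are padding up to the fixed sequence length.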

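Model Definition (binary classifier)

- Embedding layer: takes the integer-encoded reviews and looks up the embedding vector for each word index. The vectors are learned as the model trains, and the output has shape (batch, sequence, embedding).

- GlobalAveragePooling1D layer: averages over the sequence dimension and returns a fixed-length output vector for each sample, which lets the model handle inputs of different lengths.

- Fully connected layer: the last layer is a fully connected (dense) layer with a single output node; applying a sigmoid to its output gives a real number between 0 and 1.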

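embedding_dim = 16
model = tf.keras.Sequential([layers.Embedding(max_features + 1, embedding_dim),   # word index -> 16-dim vector
                             layers.Dropout(0.2),
                             layers.GlobalAveragePooling1D(),                     # average over the sequence dimension
                             layers.Dropout(0.2),
                             layers.Dense(1)])                                    # single output node; no activation set, so it emits a raw logit
model.summary()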
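
The forward pass through these layers can be traced by hand on a toy input. The sketch below uses plain Python with made-up numbers: the 2-dimensional `embedding_table`, the `weights`, and the `bias` are hypothetical values rather than trained parameters, and dropout is left out because it is inactive at inference time.

```python
import math

def global_average_pooling(sequence_embeddings):
    """Average the embedding vectors over the sequence dimension."""
    n = len(sequence_embeddings)
    dim = len(sequence_embeddings[0])
    return [sum(vec[d] for vec in sequence_embeddings) / n for d in range(dim)]

def sigmoid(z):
    """Squash a logit into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical embedding table: token ID -> 2-dim vector (ID 0 is padding)
embedding_table = {0: [0.0, 0.0], 1: [0.5, -0.5], 2: [1.0, 0.0], 3: [0.0, 1.0]}

token_ids = [2, 3, 1, 0]                             # one review, sequence length 4
embedded = [embedding_table[i] for i in token_ids]   # shape (sequence, embedding) = (4, 2)
pooled = global_average_pooling(embedded)            # shape (embedding,) = (2,)

# Dense layer with one output node: weighted sum plus bias gives the logit
weights, bias = [2.0, -2.0], 0.0
logit = sum(w * x for w, x in zip(weights, pooled)) + bias
probability = sigmoid(logit)
print(pooled, logit, probability)
```

Whatever the sequence length, `pooled` always has `embedding_dim` entries, which is what allows a single dense node to follow it. Note that this sketch averages over the padding position as well.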
