Information Retrieval: Sentiment and Hate Speech Analysis from Tweets

In this notebook, I analyzed whether tweets classified as hate speech tend to carry negative sentiment. First, I trained a classifier for each dataset on top of a pre-trained BERT model. Then, I ran cross-inference, applying each model to the other dataset, to see whether hate speech tends to have negative sentiment.

Data Preparation

In [ ]:
!pip install transformers -q
In [ ]:
import random
import pandas as pd
import numpy as np
import re
import pickle
from itertools import product
import matplotlib.pyplot as plt
from string import punctuation
np.random.seed(seed=42)
In [ ]:
sentiment_dataset = pd.read_csv("sentiment.csv", sep=';', encoding="utf-8", engine="python")
hate_speech_dataset = pd.read_csv("hatespeech.txt", sep='\t', engine="python")
In [ ]:
sentiment_dataset
Out[ ]:
Tweets Label
0 rt @mrtampi: agus makin santai.\nahok makin sa... negatif
1 pilkada dki jangan pilih pki!! berbahaya!! pil... negatif
2 pdip sengaja becah belah rakyat warga dki tida... negatif
3 rt @gunromli: sylviana kesyikan ngomong sendir... negatif
4 rt @gunromli: sylviana kesyikan ngomong sendir... negatif
... ... ...
1501 rt @zul_hasan: sebelum debat kompak dukung agu... positif
1502 #debat2pilkadadki apik kabeh, yo seng adil men... positif
1503 rt @erixputra: rakyat adalah bos kami. kami ad... positif
1504 ahok \u2013 djarot waspadai politik uang dalam... positif
1505 @jokowi harusnya sdh tahu dari awal ini semaki... positif

1506 rows × 2 columns

In [ ]:
hate_speech_dataset
Out[ ]:
Label Tweet
0 Non_HS RT @spardaxyz: Fadli Zon Minta Mendagri Segera...
1 Non_HS RT @baguscondromowo: Mereka terus melukai aksi...
2 Non_HS Sylvi: bagaimana gurbernur melakukan kekerasan...
3 Non_HS Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...
4 Non_HS RT @lisdaulay28: Waspada KTP palsu.....kawal P...
... ... ...
708 HS Muka Si BABi Ahok Tuh Yg Mirip SERBET Lantai.....
709 HS Betul bang hancurkan merka bang, musnahkan chi...
710 HS Sapa Yg bilang Ahok anti korupsi!?, klo grombo...
711 HS Gw juga ngimpi SENTILIN BIJI BABI AHOK, pcetar...
712 HS Mudah2an gw ketemu sama SI BABI IWAN BOPENG DI...

713 rows × 2 columns

Preprocessing Text

Before feeding the data into the model, we first inspect what the individual tweets look like.

In [ ]:
# Print a given number of randomly chosen tweets
def print_tweet(tweets, num_of_print):
  for i in range(num_of_print):
    index = random.randint(0, len(tweets) - 1)
    print(f"---{index}: ")
    print(tweets[index])
In [ ]:
print("Dataset from Hate Speech:")
print_tweet(hate_speech_dataset["Tweet"], 5)

print("\n\n")

print("Dataset from Sentiment:")
print_tweet(sentiment_dataset["Tweets"], 5)
Dataset from Hate Speech:
---203: 
Seharusnya mah basuki sudah bebas ya kan? #FreeAhok https://t.co/Mw3k0vYweF
---267: 
RT @findririn: Dan toleransi itu, ketika muslim bisa menjalankan perintah agamanya, dan non muslim juga bisa menjalankan agamanya.
---132: 
@Boojaee sayang hyung cuma pencitraan :* wkwk
---314: 
RT @digembok: Ya allah... Beneran... Pak ahok akhirnya samperin si emak...
---263: 
Allah SWT tdk butuh anda. Tapi Anda yang butuh Allah,Rezeki, hidup matimu. Allah yg tentukan. Mau khianati Allah? yakin gak nyesel? #SidangAhok



Dataset from Sentiment:
---1062: 
rt @agusfansclub_: bagi mas agus menjadi gub adlh panggilan terbesar dlm hidupnya,krn beliau mrsa mjd calon alternatif utk mjd jakarta lbh\u2026
---11: 
pdip sengaja becah belah rakyat warga dki tidak pilih ahok!! pria as... https:\/\/t.co\/315l4axm8h #debat2pilkadadki https:\/\/t.co\/ybh5dtxnw2
---1478: 
rt @ronaldferdynand: hanya beliau yg berani menyatakan dirinya pelayan dan akan melayani masy dki, sangat mulia, terima kasih pak @basuki_b\u2026
---597: 
rt @dennyja_world: saya melihat data lsi soal pilkada dki yang mengejutkan (survei okt 2016): ahok potensial kalah!
---614: 
rt @dennyja_world: saya melihat data lsi soal pilkada dki yang mengejutkan (survei okt 2016): ahok potensial kalah!

From the tweets above, we can see some characteristics of the text:

  • Twitter-specific entities such as mentions and hashtags appear often. '@agusfansclub_' and '#FreeAhok' are two examples
  • 'RT' or 'rt', which marks a tweet that was retweeted from another user
  • Hyperlinks are sometimes attached to the tweet
  • Unicode characters printed as raw escape sequences. 'b\u2026' is one example. This appears mostly in the sentiment dataset
  • Emoticons (":)", ":*", etc.) used by the user

To help the model, I removed these elements from the text. I also removed most punctuation, keeping only the exclamation mark, comma, dash, dot, and question mark, because they might influence the BERT computation. Runs of the same kept symbol (for example "!!!") are collapsed into a single one. Finally, I replaced newlines and collapsed the multiple spaces that are left behind after removing entities.

In [ ]:
# Preprocessing Functions
replace_new_line = lambda tweet: re.sub(r"\n", " ", tweet)
remove_multiple_spaces = lambda tweet: re.sub(r"\s\s+", " ", tweet)
remove_entities = lambda tweet: re.sub(r"[@#](\S+):?", "", tweet)
remove_RT_word = lambda tweet: re.sub(r"\bRT\b|\brt\b", "", tweet)
remove_URL = lambda tweet:  re.sub(r"https?\S+", "", tweet)
remove_multiple_symbols = lambda symbol, tweet: re.sub(f"[{symbol}][{symbol}]+", f"{symbol}", tweet)
remove_punctuation = lambda tweet: re.sub(r"[$%&'()*+/:;<=>[\]^_`{|}~]", "", tweet)
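# Despite its name, this strips every non-ASCII character
remove_non_unicode = lambda tweet: re.sub(r'[^\x00-\x7F]+', '', tweet)
# Strips escape sequences left as literal text (e.g. "\u2026"), together with any characters attached to them
remove_undetected_unicode = lambda tweet: re.sub(r'\\\S+', '', tweet)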
lowercase_tweet = lambda tweet: str.lower(tweet)
In [ ]:
def cleaning_tweet(tweet):
  final_tweet = remove_entities(tweet)
  final_tweet = remove_RT_word(final_tweet)
  final_tweet = remove_URL(final_tweet)
  final_tweet = lowercase_tweet(final_tweet)
  final_tweet = remove_punctuation(final_tweet)
  final_tweet = remove_non_unicode(final_tweet)
  final_tweet = remove_undetected_unicode(final_tweet)
  
  for symbol in ['!', ',', '-', '.', '?']:
    final_tweet = remove_multiple_symbols(symbol, final_tweet)
  
  final_tweet = replace_new_line(final_tweet)
  final_tweet = remove_multiple_spaces(final_tweet)
  final_tweet = str.strip(final_tweet)

  return final_tweet
In [ ]:
def before_after_tweet(tweets, num_of_print):
  for i in range(num_of_print):
      index = random.randint(0, len(tweets) - 1)
      print("*" * 10)
      print(f"---{index}: ")
      print(f"Before: {tweets[index]}")
      print(f"After: {cleaning_tweet(tweets[index])}")
In [ ]:
print("Dataset from Hate Speech:")
before_after_tweet(hate_speech_dataset["Tweet"], 5)

print("\n\n")

print("Dataset from Sentiment:")
before_after_tweet(sentiment_dataset["Tweets"], 5)
Dataset from Hate Speech:
**********
---164: 
Before: RT @TeddyGusnaidi: Bro.. Kalau yg dimaksud Fatwa Penista agama, MUI gak pernah mengeluarkan Fatwa. https://t.co/x9uXhqJKCa
After: bro. kalau yg dimaksud fatwa penista agama, mui gak pernah mengeluarkan fatwa.
**********
---344: 
Before: Terima kasih pak atas segala kebaikan untuk jakarta yang sudah kalian tancapkan di hati warga jakarta
After: terima kasih pak atas segala kebaikan untuk jakarta yang sudah kalian tancapkan di hati warga jakarta
**********
---529: 
Before: Hasil Akhir, Menyatakan @jokowi Melindungi Ahok TERDAKWA PENISTA AGAMA !!!
After: hasil akhir, menyatakan melindungi ahok terdakwa penista agama !
**********
---696: 
Before: uma surveynya si Botak yg menangkan Ahok. Karena apa? Karena dia Cina Kristen. Bukti kalo si Botak dukung sesama Cina Kristen
After: uma surveynya si botak yg menangkan ahok. karena apa? karena dia cina kristen. bukti kalo si botak dukung sesama cina kristen
**********
---88: 
Before: Mas Agus saya akui malam ini....rambut kamu bagus! #DebatFinalPilkadaJKT
After: mas agus saya akui malam ini.rambut kamu bagus!



Dataset from Sentiment:
**********
---992: 
Before: hidup misteri men kita tidak akan tau akan seperti apa\njalan hidup kita kedepannya , ngeri - ngeri sedap mas\nagus yudhoyono kerenss
After: hidup misteri men kita tidak akan tau akan seperti apa hidup kita kedepannya , ngeri - ngeri sedap mas yudhoyono kerenss
**********
---927: 
Before: rt @chocoliq: berasa kan manfaat nya? drpd dulu mending sekarang \n#programbadja\n#debat2pilkadadki \n#ahokdjarot \n#tetapahokdjarot\u2026 
After: berasa kan manfaat nya? drpd dulu mending sekarang
**********
---856: 
Before: pilkada dki jangan pilih pki!! berbahaya!! pilih 1 atau 3 saja nu dukung polda bali usu... https:\/\/t.co\/v4198chkhz #debat2pilkadadki #imlek
After: pilkada dki jangan pilih pki! berbahaya! pilih 1 atau 3 saja nu dukung polda bali usu.
**********
---184: 
Before: rt @iqlimas: strategi terkonyol dalam sejarah acara debat, sangat tidak intelek dan memalukan #debat2pilkadadki  https:\/\/t.co\/at6gnwtus6
After: strategi terkonyol dalam sejarah acara debat, sangat tidak intelek dan memalukan
**********
---465: 
Before: ini pilkada dki bosss, bukan pilkades, pil kb, pilkadut, bkn jg konser boy band...., hadeh.... anak muda alay.....,\u2026 https:\/\/t.co\/xddehzyhbn
After: ini pilkada dki bosss, bukan pilkades, pil kb, pilkadut, bkn jg konser boy band., hadeh. anak muda alay.,

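From the output above, we can see that the text looks much cleaner after preprocessing. Some artifacts remain: a dot can end up glued between two words ("ini.rambut"), mixed punctuation can be left over ("anak muda alay.,"), and a literal "\n" sequence takes the word attached to it along when it is removed (in tweet 992, "jalan" and "agus" are dropped). Overall, though, the text has a neat representation.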
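The leftover punctuation artifacts could be tidied with one extra step. The following is a minimal sketch, not part of the original pipeline: `tidy_punctuation` is a hypothetical helper, and it assumes the text has already been lowercased and stripped of URLs by `cleaning_tweet`.

```python
import re

def tidy_punctuation(text):
    """Hypothetical post-processing step applied after cleaning_tweet."""
    # Add a space after punctuation glued between two letters:
    # "ini.rambut" -> "ini. rambut". Digits are left alone so "1.5" survives.
    text = re.sub(r"(?<=[a-z])([.,!?])(?=[a-z])", r"\1 ", text)
    # Collapse a run of mixed punctuation such as ".," into its first symbol.
    text = re.sub(r"([.,!?])[.,!?]+", r"\1", text)
    return text

print(tidy_punctuation("mas agus saya akui malam ini.rambut kamu bagus!"))
print(tidy_punctuation("hadeh. anak muda alay.,"))
```

Whether this helps the classifier is untested here; it only makes the cleaned text more regular.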

In [ ]:
hate_speech_dataset["preprocessed_tweet"] = hate_speech_dataset["Tweet"].apply(cleaning_tweet)
sentiment_dataset["preprocessed_tweet"] = sentiment_dataset["Tweets"].apply(cleaning_tweet)

Deep Learning Model

The classification steps follow https://www.tensorflow.org/tutorials/text/classify_text_with_bert

The hate speech model consists of the following layers, in order:

  • Pre-trained BERT layer using BERT Indonesia. Credits to Cahya Wirawan, who made this model available at https://huggingface.co/cahya/bert-base-indonesian-522M. It is used in the same way as the BERT model from TensorFlow. Its pooler_output is fed into the next layer.
  • Dense layer with 128 nodes and ReLU activation
  • Dropout layer 0.4
  • Dense layer with 256 nodes and ReLU activation
  • Dropout layer 0.5
  • Dense layer with 1 node and Sigmoid activation (output)

The sentiment model consists of the following layers, in order:

  • Pre-trained BERT layer using BERT Indonesia
  • Dense layer with 64 nodes (no activation)
  • Dense layer with 256 nodes and ReLU activation
  • Dropout layer 0.2
  • Dense layer with 512 nodes and ReLU activation
  • Dropout layer 0.4
  • Dense layer with 1 node and Sigmoid activation (output)
In [ ]:
from keras.layers import Input, Dropout, Dense
from keras import Sequential
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf

def build_bert_model():
  model_name='cahya/bert-base-indonesian-522M'
  tokenizer = BertTokenizer.from_pretrained(model_name)
  model = TFBertModel.from_pretrained(model_name)

  return model, tokenizer

def get_pooled_output_from_text(text, model, tokenizer):
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  pooled_output = output['pooler_output']

  return tf.reshape(pooled_output, [-1])

def build_model():
  model = Sequential()
  model.add(Dense(128, activation="relu", input_shape=(768, )))
  model.add(Dropout(0.4))
  model.add(Dense(256, activation="relu"))
  model.add(Dropout(0.5))
  model.add(Dense(1, activation="sigmoid"))
  return model

def build_model_2():
  model = Sequential()
  model.add(Dense(64, input_shape=(768, )))
  model.add(Dense(256, activation="relu"))
  model.add(Dropout(0.2))
  model.add(Dense(512, activation="relu"))
  model.add(Dropout(0.4))
  model.add(Dense(1, activation="sigmoid"))
  return model

bert_model, bert_tokenizer = build_bert_model()





Some layers from the model checkpoint at cahya/bert-base-indonesian-522M were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at cahya/bert-base-indonesian-522M.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
In [ ]:
%%time
pooling_output_hate_speech = [
      get_pooled_output_from_text(tweet, bert_model, bert_tokenizer) for tweet in hate_speech_dataset["preprocessed_tweet"].to_list()
]
pooling_output_hate_speech = np.array(pooling_output_hate_speech)

# Cache the pooled outputs in a pickle file, since computing them takes a few minutes
with open("hate_speech_pooling_out", "wb") as f:
  pickle.dump(pooling_output_hate_speech, f)
CPU times: user 3min 34s, sys: 8.55 s, total: 3min 43s
Wall time: 2min 29s
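Since the pooled outputs are saved to disk, a later run can load them instead of recomputing. Below is a minimal sketch of that idea; `load_or_compute` is a hypothetical helper that does not appear in the notebook, and it assumes the cached file was written by a trusted run of this same notebook (pickle files should not be loaded from untrusted sources).

```python
import os
import pickle
import numpy as np

def load_or_compute(path, compute_fn):
    """Return the array cached at `path`; compute and save it if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    # compute_fn is only called on a cache miss
    result = np.asarray(compute_fn())
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In this notebook, `compute_fn` could be a lambda wrapping the list comprehension that calls `get_pooled_output_from_text` for every preprocessed tweet.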
In [ ]:
hate_speech_label_list = hate_speech_dataset["Label"].to_list()
hate_speech_label = np.array([1 if label == "HS" else 0 for label in hate_speech_label_list])
In [ ]:
model_for_hate_speech = build_model()
In [ ]:
hs_indexes = np.arange(len(pooling_output_hate_speech))
np.random.shuffle(hs_indexes)
In [ ]:
TEST_LEN = int(0.2 * len(pooling_output_hate_speech))
hs_train_idx = hs_indexes[:-TEST_LEN]
hs_test_idx = hs_indexes[-TEST_LEN:]
In [ ]:
hate_speech_X_train, hate_speech_y_train = pooling_output_hate_speech[hs_train_idx], hate_speech_label[hs_train_idx]
hate_speech_X_test, hate_speech_y_test = pooling_output_hate_speech[hs_test_idx], hate_speech_label[hs_test_idx]
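The shuffle-and-slice split above is repeated later for the sentiment dataset, so it could be wrapped in one function. This is a sketch under assumptions: `split_indices` is a hypothetical name, and it uses its own seeded generator rather than the global `np.random.seed(42)` state, so it will not reproduce the exact split used in this notebook.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) from a shuffled range of n_samples."""
    rng = np.random.default_rng(seed)
    indexes = rng.permutation(n_samples)
    test_len = int(test_fraction * n_samples)
    if test_len == 0:
        # Guard: indexes[:-0] would be empty, which would drop all training data
        return indexes, indexes[:0]
    return indexes[:-test_len], indexes[-test_len:]

train_idx, test_idx = split_indices(713)
print(len(train_idx), len(test_idx))  # 571 142
```

With 713 hate speech tweets this gives 571 training rows, which matches the 18 steps per epoch (batch size 32) shown in the training log.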
In [ ]:
model_for_hate_speech.compile(optimizer="adam", loss=tf.keras.losses.BinaryCrossentropy(), metrics=["accuracy"])
In [ ]:
hs_result = model_for_hate_speech.fit(hate_speech_X_train, hate_speech_y_train, epochs=50, validation_data=(hate_speech_X_test, hate_speech_y_test))
Epoch 1/50
18/18 [==============================] - 1s 13ms/step - loss: 0.7233 - accuracy: 0.5751 - val_loss: 0.6366 - val_accuracy: 0.5986
Epoch 2/50
18/18 [==============================] - 0s 5ms/step - loss: 0.6053 - accuracy: 0.6938 - val_loss: 0.5828 - val_accuracy: 0.6901
Epoch 3/50
18/18 [==============================] - 0s 5ms/step - loss: 0.5842 - accuracy: 0.6546 - val_loss: 0.5074 - val_accuracy: 0.7606
Epoch 4/50
18/18 [==============================] - 0s 6ms/step - loss: 0.5231 - accuracy: 0.7486 - val_loss: 0.4604 - val_accuracy: 0.8239
Epoch 5/50
18/18 [==============================] - 0s 5ms/step - loss: 0.4367 - accuracy: 0.8036 - val_loss: 0.4172 - val_accuracy: 0.8099
Epoch 6/50
18/18 [==============================] - 0s 5ms/step - loss: 0.4181 - accuracy: 0.8114 - val_loss: 0.4226 - val_accuracy: 0.8099
Epoch 7/50
18/18 [==============================] - 0s 5ms/step - loss: 0.4427 - accuracy: 0.7979 - val_loss: 0.4071 - val_accuracy: 0.8169
Epoch 8/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3480 - accuracy: 0.8672 - val_loss: 0.3896 - val_accuracy: 0.8239
Epoch 9/50
18/18 [==============================] - 0s 5ms/step - loss: 0.4067 - accuracy: 0.8278 - val_loss: 0.3959 - val_accuracy: 0.8239
Epoch 10/50
18/18 [==============================] - 0s 5ms/step - loss: 0.4004 - accuracy: 0.8498 - val_loss: 0.4018 - val_accuracy: 0.8310
Epoch 11/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3719 - accuracy: 0.8383 - val_loss: 0.3957 - val_accuracy: 0.8380
Epoch 12/50
18/18 [==============================] - 0s 6ms/step - loss: 0.3328 - accuracy: 0.8588 - val_loss: 0.3763 - val_accuracy: 0.8239
Epoch 13/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3786 - accuracy: 0.8495 - val_loss: 0.3776 - val_accuracy: 0.8380
Epoch 14/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3571 - accuracy: 0.8489 - val_loss: 0.3772 - val_accuracy: 0.8239
Epoch 15/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3331 - accuracy: 0.8598 - val_loss: 0.3741 - val_accuracy: 0.8239
Epoch 16/50
18/18 [==============================] - 0s 6ms/step - loss: 0.2999 - accuracy: 0.8867 - val_loss: 0.3990 - val_accuracy: 0.8239
Epoch 17/50
18/18 [==============================] - 0s 6ms/step - loss: 0.3108 - accuracy: 0.8780 - val_loss: 0.4007 - val_accuracy: 0.8239
Epoch 18/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3049 - accuracy: 0.8749 - val_loss: 0.3905 - val_accuracy: 0.8380
Epoch 19/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2945 - accuracy: 0.8760 - val_loss: 0.3735 - val_accuracy: 0.8380
Epoch 20/50
18/18 [==============================] - 0s 6ms/step - loss: 0.2841 - accuracy: 0.8930 - val_loss: 0.3681 - val_accuracy: 0.8380
Epoch 21/50
18/18 [==============================] - 0s 6ms/step - loss: 0.3014 - accuracy: 0.8780 - val_loss: 0.3954 - val_accuracy: 0.8380
Epoch 22/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2634 - accuracy: 0.8882 - val_loss: 0.3862 - val_accuracy: 0.8451
Epoch 23/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3073 - accuracy: 0.8699 - val_loss: 0.4277 - val_accuracy: 0.8310
Epoch 24/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3362 - accuracy: 0.8511 - val_loss: 0.3880 - val_accuracy: 0.8451
Epoch 25/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2502 - accuracy: 0.9171 - val_loss: 0.3932 - val_accuracy: 0.8310
Epoch 26/50
18/18 [==============================] - 0s 6ms/step - loss: 0.2750 - accuracy: 0.8877 - val_loss: 0.4009 - val_accuracy: 0.8099
Epoch 27/50
18/18 [==============================] - 0s 14ms/step - loss: 0.3146 - accuracy: 0.8441 - val_loss: 0.3781 - val_accuracy: 0.8380
Epoch 28/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2711 - accuracy: 0.8977 - val_loss: 0.4069 - val_accuracy: 0.8310
Epoch 29/50
18/18 [==============================] - 0s 4ms/step - loss: 0.2639 - accuracy: 0.8908 - val_loss: 0.3750 - val_accuracy: 0.8169
Epoch 30/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2509 - accuracy: 0.8958 - val_loss: 0.3965 - val_accuracy: 0.8239
Epoch 31/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2411 - accuracy: 0.8846 - val_loss: 0.4502 - val_accuracy: 0.8310
Epoch 32/50
18/18 [==============================] - 0s 5ms/step - loss: 0.3598 - accuracy: 0.8650 - val_loss: 0.3747 - val_accuracy: 0.8380
Epoch 33/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2722 - accuracy: 0.8798 - val_loss: 0.3791 - val_accuracy: 0.8239
Epoch 34/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2554 - accuracy: 0.9026 - val_loss: 0.3909 - val_accuracy: 0.8169
Epoch 35/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2946 - accuracy: 0.8677 - val_loss: 0.3716 - val_accuracy: 0.8451
Epoch 36/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2629 - accuracy: 0.8767 - val_loss: 0.4093 - val_accuracy: 0.8239
Epoch 37/50
18/18 [==============================] - 0s 5ms/step - loss: 0.1967 - accuracy: 0.9360 - val_loss: 0.3985 - val_accuracy: 0.8239
Epoch 38/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2198 - accuracy: 0.9093 - val_loss: 0.3594 - val_accuracy: 0.8239
Epoch 39/50
18/18 [==============================] - 0s 4ms/step - loss: 0.2106 - accuracy: 0.9134 - val_loss: 0.3953 - val_accuracy: 0.8310
Epoch 40/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2245 - accuracy: 0.9174 - val_loss: 0.4030 - val_accuracy: 0.8169
Epoch 41/50
18/18 [==============================] - 0s 6ms/step - loss: 0.2255 - accuracy: 0.9033 - val_loss: 0.3830 - val_accuracy: 0.8380
Epoch 42/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2254 - accuracy: 0.9029 - val_loss: 0.3965 - val_accuracy: 0.8169
Epoch 43/50
18/18 [==============================] - 0s 5ms/step - loss: 0.1850 - accuracy: 0.9268 - val_loss: 0.3840 - val_accuracy: 0.8099
Epoch 44/50
18/18 [==============================] - 0s 6ms/step - loss: 0.2171 - accuracy: 0.9109 - val_loss: 0.3798 - val_accuracy: 0.8310
Epoch 45/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2031 - accuracy: 0.9178 - val_loss: 0.4183 - val_accuracy: 0.8451
Epoch 46/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2200 - accuracy: 0.9131 - val_loss: 0.4373 - val_accuracy: 0.8310
Epoch 47/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2058 - accuracy: 0.9119 - val_loss: 0.4122 - val_accuracy: 0.8451
Epoch 48/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2305 - accuracy: 0.8964 - val_loss: 0.3832 - val_accuracy: 0.8380
Epoch 49/50
18/18 [==============================] - 0s 5ms/step - loss: 0.1685 - accuracy: 0.9380 - val_loss: 0.3961 - val_accuracy: 0.8451
Epoch 50/50
18/18 [==============================] - 0s 5ms/step - loss: 0.2267 - accuracy: 0.8954 - val_loss: 0.4026 - val_accuracy: 0.8028
In [ ]:
plt.figure(figsize=(12, 8))
plt.plot(hs_result.epoch, hs_result.history['accuracy'], label="Train")
plt.plot(hs_result.epoch, hs_result.history['val_accuracy'], label="Validation")
plt.title("Hate Speech: Accuracy")
plt.legend()
plt.show()

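For the hate speech model, training accuracy climbs to around 90% or slightly above in the last ten epochs, while validation accuracy levels off at about 80-85% from epoch 8 onward and validation loss stays between roughly 0.36 and 0.45. The gap between the two curves is moderate, so the model does a reasonably good job, with some overfitting in the later epochs.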

In [ ]:
%%time
pooling_output_sentiment = [
      get_pooled_output_from_text(tweet, bert_model, bert_tokenizer) for tweet in sentiment_dataset["preprocessed_tweet"].to_list()
]
pooling_output_sentiment = np.array(pooling_output_sentiment)

# Cache the pooled outputs in a pickle file, since computing them takes a few minutes
with open("sentiment_pooling_out", "wb") as f:
  pickle.dump(pooling_output_sentiment, f)
CPU times: user 7min 48s, sys: 22.4 s, total: 8min 10s
Wall time: 5min 29s
In [ ]:
model_for_sentiment = build_model_2()

sentiment_list = sentiment_dataset["Label"].to_list()
sentiment_label = np.array([1 if label == "positif" else 0 for label in sentiment_list])
In [ ]:
s_indexes = np.arange(len(pooling_output_sentiment))
np.random.shuffle(s_indexes)
In [ ]:
TEST_LEN = int(0.2 * len(pooling_output_sentiment))
s_train_idx = s_indexes[:-TEST_LEN]
s_test_idx = s_indexes[-TEST_LEN:]
In [ ]:
sentiment_X_train, sentiment_y_train = pooling_output_sentiment[s_train_idx], sentiment_label[s_train_idx]
sentiment_X_test, sentiment_y_test = pooling_output_sentiment[s_test_idx], sentiment_label[s_test_idx]
In [ ]:
model_for_sentiment.compile(optimizer="adam", loss=tf.keras.losses.BinaryCrossentropy(), metrics=["accuracy"])
s_result = model_for_sentiment.fit(sentiment_X_train, sentiment_y_train, epochs=50, validation_data=(sentiment_X_test, sentiment_y_test))
Epoch 1/50
38/38 [==============================] - 1s 10ms/step - loss: 0.7099 - accuracy: 0.5293 - val_loss: 0.7515 - val_accuracy: 0.5183
Epoch 2/50
38/38 [==============================] - 0s 6ms/step - loss: 0.6365 - accuracy: 0.6258 - val_loss: 0.5856 - val_accuracy: 0.7010
Epoch 3/50
38/38 [==============================] - 0s 6ms/step - loss: 0.5680 - accuracy: 0.7139 - val_loss: 0.6702 - val_accuracy: 0.6312
Epoch 4/50
38/38 [==============================] - 0s 6ms/step - loss: 0.5618 - accuracy: 0.7082 - val_loss: 0.5233 - val_accuracy: 0.7375
Epoch 5/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4837 - accuracy: 0.7549 - val_loss: 0.5535 - val_accuracy: 0.6877
Epoch 6/50
38/38 [==============================] - 0s 6ms/step - loss: 0.5038 - accuracy: 0.7542 - val_loss: 0.5351 - val_accuracy: 0.7409
Epoch 7/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4701 - accuracy: 0.7716 - val_loss: 0.4962 - val_accuracy: 0.7508
Epoch 8/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4428 - accuracy: 0.7926 - val_loss: 0.5282 - val_accuracy: 0.7243
Epoch 9/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4421 - accuracy: 0.7848 - val_loss: 0.5442 - val_accuracy: 0.7076
Epoch 10/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4303 - accuracy: 0.7999 - val_loss: 0.5613 - val_accuracy: 0.7276
Epoch 11/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4232 - accuracy: 0.7966 - val_loss: 0.5535 - val_accuracy: 0.7508
Epoch 12/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4489 - accuracy: 0.7890 - val_loss: 0.5729 - val_accuracy: 0.6910
Epoch 13/50
38/38 [==============================] - 0s 7ms/step - loss: 0.4322 - accuracy: 0.7973 - val_loss: 0.5430 - val_accuracy: 0.7542
Epoch 14/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3729 - accuracy: 0.8245 - val_loss: 0.5183 - val_accuracy: 0.7542
Epoch 15/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4042 - accuracy: 0.8127 - val_loss: 0.5297 - val_accuracy: 0.7508
Epoch 16/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3569 - accuracy: 0.8262 - val_loss: 0.5133 - val_accuracy: 0.7342
Epoch 17/50
38/38 [==============================] - 0s 6ms/step - loss: 0.4058 - accuracy: 0.8152 - val_loss: 0.5477 - val_accuracy: 0.7309
Epoch 18/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3793 - accuracy: 0.8414 - val_loss: 0.5528 - val_accuracy: 0.7708
Epoch 19/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3981 - accuracy: 0.8155 - val_loss: 0.5207 - val_accuracy: 0.7641
Epoch 20/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3370 - accuracy: 0.8568 - val_loss: 0.5634 - val_accuracy: 0.7409
Epoch 21/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3626 - accuracy: 0.8361 - val_loss: 0.5071 - val_accuracy: 0.7741
Epoch 22/50
38/38 [==============================] - 0s 10ms/step - loss: 0.3140 - accuracy: 0.8536 - val_loss: 0.6077 - val_accuracy: 0.7741
Epoch 23/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3667 - accuracy: 0.8390 - val_loss: 0.5697 - val_accuracy: 0.7674
Epoch 24/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3396 - accuracy: 0.8422 - val_loss: 0.5404 - val_accuracy: 0.7940
Epoch 25/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3205 - accuracy: 0.8606 - val_loss: 0.5336 - val_accuracy: 0.7973
Epoch 26/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2811 - accuracy: 0.8829 - val_loss: 0.6373 - val_accuracy: 0.7442
Epoch 27/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3697 - accuracy: 0.8362 - val_loss: 0.5912 - val_accuracy: 0.7542
Epoch 28/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2982 - accuracy: 0.8674 - val_loss: 0.6035 - val_accuracy: 0.7508
Epoch 29/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2920 - accuracy: 0.8669 - val_loss: 0.5903 - val_accuracy: 0.7874
Epoch 30/50
38/38 [==============================] - 0s 6ms/step - loss: 0.3196 - accuracy: 0.8570 - val_loss: 0.5100 - val_accuracy: 0.7641
Epoch 31/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2972 - accuracy: 0.8747 - val_loss: 0.5269 - val_accuracy: 0.7542
Epoch 32/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2954 - accuracy: 0.8716 - val_loss: 0.5896 - val_accuracy: 0.7807
Epoch 33/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2582 - accuracy: 0.8849 - val_loss: 0.5754 - val_accuracy: 0.7475
Epoch 34/50
38/38 [==============================] - 0s 7ms/step - loss: 0.2791 - accuracy: 0.8967 - val_loss: 0.6769 - val_accuracy: 0.7575
Epoch 35/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2635 - accuracy: 0.8764 - val_loss: 0.5536 - val_accuracy: 0.7641
Epoch 36/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2613 - accuracy: 0.8932 - val_loss: 0.5833 - val_accuracy: 0.7907
Epoch 37/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2928 - accuracy: 0.8722 - val_loss: 0.6896 - val_accuracy: 0.7807
Epoch 38/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2279 - accuracy: 0.9067 - val_loss: 0.7814 - val_accuracy: 0.7874
Epoch 39/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2477 - accuracy: 0.9001 - val_loss: 0.6946 - val_accuracy: 0.7641
Epoch 40/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2323 - accuracy: 0.8976 - val_loss: 0.6186 - val_accuracy: 0.7741
Epoch 41/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2367 - accuracy: 0.8938 - val_loss: 0.7529 - val_accuracy: 0.7243
Epoch 42/50
38/38 [==============================] - 0s 7ms/step - loss: 0.2460 - accuracy: 0.8909 - val_loss: 0.7541 - val_accuracy: 0.7741
Epoch 43/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2201 - accuracy: 0.9014 - val_loss: 0.6642 - val_accuracy: 0.8007
Epoch 44/50
38/38 [==============================] - 0s 6ms/step - loss: 0.1976 - accuracy: 0.9129 - val_loss: 0.7279 - val_accuracy: 0.7774
Epoch 45/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2637 - accuracy: 0.8783 - val_loss: 0.6788 - val_accuracy: 0.7907
Epoch 46/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2119 - accuracy: 0.9141 - val_loss: 0.6793 - val_accuracy: 0.7741
Epoch 47/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2661 - accuracy: 0.8695 - val_loss: 0.6615 - val_accuracy: 0.7874
Epoch 48/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2241 - accuracy: 0.9083 - val_loss: 0.6113 - val_accuracy: 0.7774
Epoch 49/50
38/38 [==============================] - 0s 6ms/step - loss: 0.2065 - accuracy: 0.9083 - val_loss: 0.5671 - val_accuracy: 0.7641
Epoch 50/50
38/38 [==============================] - 0s 7ms/step - loss: 0.2297 - accuracy: 0.9056 - val_loss: 0.9079 - val_accuracy: 0.7874
In [ ]:
plt.figure(figsize=(12, 8))
plt.plot(s_result.epoch, s_result.history['accuracy'], label="Train")
plt.plot(s_result.epoch, s_result.history['val_accuracy'], label="Validation")
plt.title("Sentiment: Accuracy")
plt.legend()
plt.show()

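The sentiment model is nearly as good as the hate speech model. Training accuracy ends at around 85% - 90%, while validation accuracy is nearly 80%. Validation loss, however, is lowest at epoch 7 (0.4962) and rises to 0.9079 by epoch 50, which suggests the model overfits when trained for the full 50 epochs.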
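One way to read the training log is to look for the epoch with the lowest validation loss. The sketch below assumes a plain list of per-epoch values; `best_epoch` is a hypothetical helper, and the numbers are the first eight `val_loss` values copied from the sentiment log above.

```python
def best_epoch(val_losses):
    """Return (1-based epoch number, loss) for the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]

# val_loss for epochs 1-8 of the sentiment model, taken from the log above
sentiment_val_loss = [0.7515, 0.5856, 0.6702, 0.5233, 0.5535, 0.5351, 0.4962, 0.5282]
print(best_epoch(sentiment_val_loss))  # (7, 0.4962)
```

In the notebook, the full list is available as `s_result.history["val_loss"]`. Keras also provides `tf.keras.callbacks.EarlyStopping` (with `monitor`, `patience`, and `restore_best_weights` arguments), which could stop training automatically; it was not used in this notebook.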

Cross-inferencing

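For cross-inferencing, I ran each trained model on the other dataset: the hate speech model predicts labels for the sentiment tweets, and the sentiment model predicts labels for the hate speech tweets. Note that the cells below use the training split of each dataset (sentiment_X_train and hate_speech_X_train) as input. The result is shown as a confusion-matrix-style plot.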

In [ ]:
# For labelling
label_hate_speech = ["not_hs", "hs"]
label_sentiment = ["negative", "positive"]
In [ ]:
# Cross-inferencing for the sentiment dataset (feed sentiment data to the hate speech model)
ci_sentiment = (model_for_hate_speech.predict(sentiment_X_train) > 0.5).astype("int32")
ci_sentiment = ci_sentiment.reshape(-1)
In [ ]:
# Cross-inferencing for the hate speech dataset (feed hate speech data to the sentiment model)
ci_hate_speech = (model_for_sentiment.predict(hate_speech_X_train) > 0.5).astype("int32")
ci_hate_speech = ci_hate_speech.reshape(-1)
In [ ]:
comparison_matrix = np.zeros((2, 2), dtype="int32")

# Note for comparison_matrix:
# (0, 0) -> not hate speech and negative sentiment
# (1, 0) -> hate speech and negative sentiment
# (0, 1) -> not hate speech and positive sentiment
# (1, 1) -> hate speech and positive sentiment

# Calculate from Hate Speech dataset and its cross-inferencing result
for hs, s in zip(hate_speech_y_train, ci_hate_speech):
  comparison_matrix[hs, s] += 1

# Calculate from Sentiment dataset and its cross-inferencing result
for hs, s in zip(ci_sentiment, sentiment_y_train):
  comparison_matrix[hs, s] += 1
In [ ]:
plt.figure(figsize=(8,8))
plt.imshow(comparison_matrix, interpolation="nearest", cmap=plt.cm.Greens)
plt.title("Comparison Matrix")
plt.colorbar()
tick_marks = np.arange(2)
plt.xticks(tick_marks, label_sentiment, rotation=45)
plt.yticks(tick_marks, label_hate_speech)
threshold = comparison_matrix.max() / 2
for i, j in product(range(comparison_matrix.shape[0]), range(comparison_matrix.shape[1])):
    plt.text(j, i, format(comparison_matrix[i, j], 'd'), 
             horizontalalignment="center", color="white" if comparison_matrix[i, j] > threshold else "black")
    
plt.tight_layout()
plt.ylabel("Hate Speech")
plt.xlabel("Sentiment")
plt.show()

The steps for creating the visualization above are taken from https://towardsdatascience.com/a-simple-cnn-multi-image-classifier-31c463324fa.

In raw counts, the plot above shows that hate speech does not tend to have negative sentiment. As we can see, "hate speech and negative sentiment" has a count of 333, which is not greater than "not hate speech but negative sentiment", i.e. 490. However, we can see that tweets that are not hate speech tend to have positive sentiment. We can also conclude that these tweets overall are not hate speech, because the number of non-hate-speech tweets is much larger than the number of hate speech tweets.
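Because the two classes differ a lot in size, raw counts can hide how large a share of each class is negative. A row-normalized view makes that comparison direct. The sketch below assumes the same layout as `comparison_matrix` (rows: not_hs, hs; columns: negative, positive); `row_proportions` is a hypothetical helper, and the counts in the example are invented for illustration, not the notebook's results.

```python
import numpy as np

def row_proportions(matrix):
    """Divide each row by its total so every row sums to 1 (empty rows stay 0)."""
    matrix = np.asarray(matrix, dtype=float)
    row_totals = matrix.sum(axis=1, keepdims=True)
    return np.divide(matrix, row_totals, out=np.zeros_like(matrix), where=row_totals > 0)

# Hypothetical counts, NOT the values from the plot above
example = np.array([[40, 60],   # not_hs: negative, positive
                    [30, 10]])  # hs:     negative, positive
print(row_proportions(example))
```

In this invented example the hate speech row has fewer negative tweets in absolute terms (30 against 40) but a higher negative share (0.75 against 0.40). Applying `row_proportions(comparison_matrix)` in the notebook would show which case holds for the real data.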