This notebook implements a text-classification model with an Embedding layer and an LSTM for Amazon product reviews, i.e. whether a review is considered good or bad. The model achieves high accuracy (> 90%) on both the training and validation sets, yet it is overfitted.
import zipfile
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import tensorflow
from nltk.corpus import stopwords
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from google.colab import files
from sklearn.model_selection import train_test_split
%matplotlib inline
# Install Kaggle to download the dataset
!pip install -q kaggle
# Upload the Kaggle JSON credentials here
files.upload()
# Move the Kaggle JSON into ~/.kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Download the dataset
!kaggle datasets download -d datafiniti/consumer-reviews-of-amazon-products
This notebook uses the Amazon Product Reviews dataset. The columns analyzed are the review text and its rating.
# Extract the zip archive
file_path = 'consumer-reviews-of-amazon-products.zip'
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall()
# Load file
df = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
df.head()
# Keep only the review text and rating columns
data = df.loc[:, ['reviews.text', 'reviews.rating']]
data.head()
df.groupby('reviews.rating')['reviews.text'].count().plot(kind='pie')
The pie chart shows that the data is highly imbalanced: 5-star reviews far outnumber those rated 4 or below, and 4-star reviews in turn far outnumber those rated 3 or below. Therefore, the ratings are first mapped to categories, with the assumption that a rating below 3 is 'bad', above 3 is 'good', and exactly 3 is 'neutral'.
Although there are 3 categories, this notebook will only classify 'good' and 'bad' ratings.
def rating_to_stat(rate):
    # Map a numeric rating to a 'bad'/'good'/'neutral' category
    if rate < 3:
        return 'bad'
    elif rate > 3:
        return 'good'
    return 'neutral'

data['reviews.stat'] = data['reviews.rating'].apply(rating_to_stat)
data.head()
data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
The chart above shows that the category distribution is still very imbalanced, though relatively better than before. Therefore, to balance the training and evaluation data, only 1600 'good' samples will be kept (downsampling the majority class).
# Separate the categories and downsample the majority ('good') class
bad_data = data[data['reviews.stat'] == 'bad']
# Sample without replacement so no review is duplicated
good_data = data[data['reviews.stat'] == 'good'].sample(n=1600)
final_data = pd.concat([good_data, bad_data], axis=0)
final_data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
The chart above shows that the dataset is now better balanced between 'good' and 'bad' samples.
# Print dataset information
# and the counts of 'good' and 'bad' samples
print("Dataset information:\n")
print(final_data.info())
print()
print("Class distribution:\n", final_data.groupby('reviews.stat')['reviews.text'].count())
# Encoding: map the string labels to integers (0 = 'bad', 1 = 'good')
final_data['reviews.stat.label'] = [0 if stat == 'bad' else 1 for stat in final_data['reviews.stat']]
Stop words are removed because they carry little signal for the model.
# Stop word removal
nltk.download('stopwords')
stopwords_list_en = stopwords.words('english')
def remove_stop_words(word_seq):
    # Keep only words that are not English stop words
    return " ".join(filter(lambda word: word not in stopwords_list_en, word_seq))

final_data['text_excl_swords'] = final_data['reviews.text'].apply(lambda words: remove_stop_words(text_to_word_sequence(words)))
final_data.head()
# Separate features and labels
X = final_data['text_excl_swords'].values
y = final_data['reviews.stat.label'].values
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Tokenization: fit the tokenizer on the training set only,
# so no information leaks from the test set
tokenizer = Tokenizer(num_words=5000, oov_token='x')
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq)
# Pad the test set to the same length as the training set
X_test_pad = pad_sequences(X_test_seq, maxlen=X_train_pad.shape[1])
# Model definition
model = Sequential([
    Embedding(input_dim=5000, output_dim=32),
    LSTM(64),
    Dense(64, activation='relu'),
    Dropout(0.6),
    Dense(32, activation='relu'),
    Dropout(0.4),
    Dense(16, activation='relu'),
    Dropout(0.25),
    Dense(1, activation='sigmoid')
])
model.summary()
model.compile(loss='binary_crossentropy', optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.0008), metrics=['accuracy'])
ep = 0
counter = 0
# Custom early-stopping callback: stop training once both train and
# validation accuracy have reached 0.90 in 5 epochs
class ModelCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        global counter
        global ep
        curr_train_acc = round(logs['accuracy'], 2)
        curr_val_acc = round(logs['val_accuracy'], 2)
        ep += 1
        if curr_train_acc >= 0.90 and curr_val_acc >= 0.90:
            counter += 1
            if counter == 5:
                print("Training stopped at epoch {} with train accuracy: {} and validation accuracy: {}".format(epoch + 1, curr_train_acc, curr_val_acc))
                self.model.stop_training = True
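As an aside, Keras ships a built-in `EarlyStopping` callback that could replace a custom one like this. A common configuration (a sketch, not what this notebook uses) monitors the validation loss and restores the weights of the best epoch:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 5 consecutive epochs,
# then roll back to the weights of the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss',
                           patience=5,
                           restore_best_weights=True)
# It would then be passed to model.fit(..., callbacks=[early_stop])
```

Monitoring `val_loss` rather than accuracy tends to catch overfitting earlier, since the validation loss often starts rising while accuracy still looks flat.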
%%time
history = model.fit(X_train_pad,
y_train,
epochs=25,
validation_data=(X_test_pad, y_test),
callbacks=[ModelCallback()],
verbose=2)
plt.plot(range(1,ep+1), history.history['accuracy'], label='Training')
plt.plot(range(1,ep+1), history.history['val_accuracy'], label='Validation')
plt.title("Train Accuracy - Validation Accuracy")
plt.legend()
plt.show()
plt.clf()
plt.plot(range(1,ep+1), history.history['loss'], label='Training')
plt.plot(range(1,ep+1), history.history['val_loss'], label='Validation')
plt.title("Train Loss - Validation Loss")
plt.legend()
plt.show()
The accuracy and loss plots above show that the model achieves high accuracy, over 90%, on both the training and validation data. However, the model is overfitted: the gap between training and validation loss is large, and the validation loss rises with every epoch.
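This visual diagnosis can also be expressed in code. Below is a minimal sketch (the `is_overfitting` helper and its thresholds are assumptions, not part of this notebook) that flags a run when the final validation loss dwarfs the training loss and has been rising for the last few epochs:

```python
def is_overfitting(train_loss, val_loss, gap_ratio=1.5, rising_epochs=3):
    """Heuristic overfitting check on per-epoch loss histories."""
    # Large gap: final validation loss far above final training loss
    large_gap = val_loss[-1] > gap_ratio * train_loss[-1]
    # Rising tail: validation loss increased over the last few epochs
    tail = val_loss[-(rising_epochs + 1):]
    rising = all(later > earlier for earlier, later in zip(tail, tail[1:]))
    return large_gap and rising

# Curves shaped like the plots above: training loss keeps falling
# while validation loss climbs
print(is_overfitting([0.6, 0.4, 0.2, 0.1, 0.05],
                     [0.55, 0.45, 0.50, 0.60, 0.70]))  # True
```

In a notebook, the two lists would come straight from `history.history['loss']` and `history.history['val_loss']`.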