This notebook implements a text-classification model with an Embedding layer and an LSTM for Amazon product reviews, i.e. whether a review is considered good or bad. The model achieves high accuracy (> 90%) on both the training and validation sets, yet it is overfitted.
import zipfile
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import tensorflow
from nltk.corpus import stopwords
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from google.colab import files
from sklearn.model_selection import train_test_split
%matplotlib inline
# Install Kaggle to download the dataset
!pip install -q kaggle
# Upload the Kaggle JSON credentials here
files.upload()
# Move the Kaggle JSON into ~/.kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Download the dataset
!kaggle datasets download -d datafiniti/consumer-reviews-of-amazon-products
This notebook uses the Amazon Product Reviews dataset. The columns analyzed are the review text and its rating.
# Extract the zip archive
file_path = 'consumer-reviews-of-amazon-products.zip'
with zipfile.ZipFile(file_path, 'r') as zip_ref:
    zip_ref.extractall()
# Load file
df = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
df.head()
# Keep only the review text and rating columns
data = df.loc[:, ['reviews.text', 'reviews.rating']]
data.head()
df.groupby('reviews.rating')['reviews.text'].count().plot(kind='pie')
The pie chart shows that the data is highly imbalanced: 5-star reviews far outnumber those rated 4 or below, and 4-star reviews in turn far outnumber those rated 3 or below. Therefore, the ratings are first mapped to categories, with the assumption that a rating below 3 is 'bad', above 3 is 'good', and exactly 3 is 'neutral'.
Although there are 3 categories, this notebook will only classify 'good' and 'bad' ratings.
def rating_to_stat(rate):
    # Map a numeric rating to a 'bad'/'good'/'neutral' category
    if rate < 3:
        return 'bad'
    elif rate > 3:
        return 'good'
    return 'neutral'

data['reviews.stat'] = data['reviews.rating'].apply(rating_to_stat)
data.head()
data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
The chart above shows that the category distribution is still very imbalanced, though relatively better than before. Therefore, to balance the training and evaluation data, only 1600 'good' samples will be kept (downsampling the majority class).
# Separate the categories and downsample the majority ('good') class
bad_data = data[data['reviews.stat'] == 'bad']
# Sample without replacement so no review is duplicated
good_data = data[data['reviews.stat'] == 'good'].sample(n=1600)
final_data = pd.concat([good_data, bad_data], axis=0)
final_data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
The chart above shows that the dataset is now better balanced between 'good' and 'bad' samples.
# Print dataset information
# and the counts of 'good' and 'bad' samples
print("Dataset information:\n")
print(final_data.info())
print()
print("Class distribution:\n", final_data.groupby('reviews.stat')['reviews.text'].count())
# Encoding: map the string labels to integers (0 = 'bad', 1 = 'good')
final_data['reviews.stat.label'] = [0 if stat == 'bad' else 1 for stat in final_data['reviews.stat']]
Stop words are removed because they carry little signal for the model.
# Stop word removal
nltk.download('stopwords')
stopwords_list_en = stopwords.words('english')
def remove_stop_words(word_seq):
    # Keep only words that are not English stop words
    return " ".join(filter(lambda word: word not in stopwords_list_en, word_seq))

final_data['text_excl_swords'] = final_data['reviews.text'].apply(lambda words: remove_stop_words(text_to_word_sequence(words)))
final_data.head()
# Separate features and labels
X = final_data['text_excl_swords'].values
y = final_data['reviews.stat.label'].values
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Tokenization: fit the tokenizer on the training set only,
# so no information leaks from the test set
tokenizer = Tokenizer(num_words=5000, oov_token='x')
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq)
# Pad the test set to the same length as the training set
X_test_pad = pad_sequences(X_test_seq, maxlen=X_train_pad.shape[1])
# Model definition
model = Sequential([
    Embedding(input_dim=5000, output_dim=32),
    LSTM(64),
    Dense(64, activation='relu'),
    Dropout(0.6),
    Dense(32, activation='relu'),
    Dropout(0.4),
    Dense(16, activation='relu'),
    Dropout(0.25),
    Dense(1, activation='sigmoid')
])
model.summary()
model.compile(loss='binary_crossentropy', optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.0008), metrics=['accuracy'])
ep = 0
counter = 0
# Custom early-stopping callback: stop training once both train and
# validation accuracy have reached 0.90 in 5 epochs
class ModelCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        global counter
        global ep
        curr_train_acc = round(logs['accuracy'], 2)
        curr_val_acc = round(logs['val_accuracy'], 2)
        ep += 1
        if curr_train_acc >= 0.90 and curr_val_acc >= 0.90:
            counter += 1
            if counter == 5:
                print("Training stopped at epoch {} with train accuracy: {} and validation accuracy: {}".format(epoch + 1, curr_train_acc, curr_val_acc))
                self.model.stop_training = True
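As an aside, Keras ships a built-in `EarlyStopping` callback that could replace a custom one like this. A common configuration (a sketch, not what this notebook uses) monitors the validation loss and restores the weights of the best epoch:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 5 consecutive epochs,
# then roll back to the weights of the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss',
                           patience=5,
                           restore_best_weights=True)
# It would then be passed to model.fit(..., callbacks=[early_stop])
```

Monitoring `val_loss` rather than accuracy tends to catch overfitting earlier, since the validation loss often starts rising while accuracy still looks flat.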
%%time
history = model.fit(X_train_pad,
y_train,
epochs=25,
validation_data=(X_test_pad, y_test),
callbacks=[ModelCallback()],
verbose=2)
plt.plot(range(1,ep+1), history.history['accuracy'], label='Training')
plt.plot(range(1,ep+1), history.history['val_accuracy'], label='Validation')
plt.title("Train Accuracy - Validation Accuracy")
plt.legend()
plt.show()
plt.clf()
plt.plot(range(1,ep+1), history.history['loss'], label='Training')
plt.plot(range(1,ep+1), history.history['val_loss'], label='Validation')
plt.title("Train Loss - Validation Loss")
plt.legend()
plt.show()
The accuracy and loss plots above show that the model achieves high accuracy, over 90%, on both the training and validation data. However, the model is overfitted: the gap between training and validation loss is large, and the validation loss rises with every epoch.
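This visual diagnosis can also be expressed in code. Below is a minimal sketch (the `is_overfitting` helper and its thresholds are assumptions, not part of this notebook) that flags a run when the final validation loss dwarfs the training loss and has been rising for the last few epochs:

```python
def is_overfitting(train_loss, val_loss, gap_ratio=1.5, rising_epochs=3):
    """Heuristic overfitting check on per-epoch loss histories."""
    # Large gap: final validation loss far above final training loss
    large_gap = val_loss[-1] > gap_ratio * train_loss[-1]
    # Rising tail: validation loss increased over the last few epochs
    tail = val_loss[-(rising_epochs + 1):]
    rising = all(later > earlier for earlier, later in zip(tail, tail[1:]))
    return large_gap and rising

# Curves shaped like the plots above: training loss keeps falling
# while validation loss climbs
print(is_overfitting([0.6, 0.4, 0.2, 0.1, 0.05],
                     [0.55, 0.45, 0.50, 0.60, 0.70]))  # True
```

In a notebook, the two lists would come straight from `history.history['loss']` and `history.history['val_loss']`.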