Text Classification for Amazon Product Reviews

This notebook implements a machine learning model with an Embedding layer and an LSTM layer to classify the text of Amazon product reviews, i.e. whether a review counts as a good or a bad review. The model achieves high accuracy (> 90%) on both the training and validation sets, but it overfits.

In [ ]:
import zipfile
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import tensorflow
from nltk.corpus import stopwords
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from google.colab import files
from sklearn.model_selection import train_test_split
%matplotlib inline

File Download Preparation

In [ ]:
# Install Kaggle to download the dataset
!pip install -q kaggle
In [ ]:
# Upload the Kaggle JSON credentials here
files.upload()
In [ ]:
# Move kaggle.json into the ~/.kaggle directory and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
In [ ]:
# Download the dataset
!kaggle datasets download -d datafiniti/consumer-reviews-of-amazon-products
Downloading consumer-reviews-of-amazon-products.zip to /content
 62% 10.0M/16.3M [00:00<00:00, 44.9MB/s]
100% 16.3M/16.3M [00:00<00:00, 64.3MB/s]

Dataset Preparation

I used the Amazon Product Reviews dataset for this notebook. The fields analyzed are the review text and its rating.

In [ ]:
# Extract the zip archive
file_path = 'consumer-reviews-of-amazon-products.zip'
with zipfile.ZipFile(file_path, 'r') as zip_ref:
  zip_ref.extractall()
In [ ]:
# Load the CSV file
df = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')
df.head()
Out[ ]:
id dateAdded dateUpdated name asins brand categories primaryCategories imageURLs keys ... reviews.didPurchase reviews.doRecommend reviews.id reviews.numHelpful reviews.rating reviews.sourceURLs reviews.text reviews.title reviews.username sourceURLs
0 AVpgNzjwLJeJML43Kpxn 2015-10-30T08:59:32Z 2019-04-25T09:08:16Z AmazonBasics AAA Performance Alkaline Batterie... B00QWO9P0O,B00LH3DMUO Amazonbasics AA,AAA,Health,Electronics,Health & Household,C... Health & Beauty https://images-na.ssl-images-amazon.com/images... amazonbasics/hl002619,amazonbasicsaaaperforman... ... NaN NaN NaN NaN 3 https://www.amazon.com/product-reviews/B00QWO9... I order 3 of them and one of the item is bad q... ... 3 of them and one of the item is bad quali... Byger yang https://www.barcodable.com/upc/841710106442,ht...
1 AVpgNzjwLJeJML43Kpxn 2015-10-30T08:59:32Z 2019-04-25T09:08:16Z AmazonBasics AAA Performance Alkaline Batterie... B00QWO9P0O,B00LH3DMUO Amazonbasics AA,AAA,Health,Electronics,Health & Household,C... Health & Beauty https://images-na.ssl-images-amazon.com/images... amazonbasics/hl002619,amazonbasicsaaaperforman... ... NaN NaN NaN NaN 4 https://www.amazon.com/product-reviews/B00QWO9... Bulk is always the less expensive way to go fo... ... always the less expensive way to go for pr... ByMG https://www.barcodable.com/upc/841710106442,ht...
2 AVpgNzjwLJeJML43Kpxn 2015-10-30T08:59:32Z 2019-04-25T09:08:16Z AmazonBasics AAA Performance Alkaline Batterie... B00QWO9P0O,B00LH3DMUO Amazonbasics AA,AAA,Health,Electronics,Health & Household,C... Health & Beauty https://images-na.ssl-images-amazon.com/images... amazonbasics/hl002619,amazonbasicsaaaperforman... ... NaN NaN NaN NaN 5 https://www.amazon.com/product-reviews/B00QWO9... Well they are not Duracell but for the price i... ... are not Duracell but for the price i am ha... BySharon Lambert https://www.barcodable.com/upc/841710106442,ht...
3 AVpgNzjwLJeJML43Kpxn 2015-10-30T08:59:32Z 2019-04-25T09:08:16Z AmazonBasics AAA Performance Alkaline Batterie... B00QWO9P0O,B00LH3DMUO Amazonbasics AA,AAA,Health,Electronics,Health & Household,C... Health & Beauty https://images-na.ssl-images-amazon.com/images... amazonbasics/hl002619,amazonbasicsaaaperforman... ... NaN NaN NaN NaN 5 https://www.amazon.com/product-reviews/B00QWO9... Seem to work as well as name brand batteries a... ... as well as name brand batteries at a much ... Bymark sexson https://www.barcodable.com/upc/841710106442,ht...
4 AVpgNzjwLJeJML43Kpxn 2015-10-30T08:59:32Z 2019-04-25T09:08:16Z AmazonBasics AAA Performance Alkaline Batterie... B00QWO9P0O,B00LH3DMUO Amazonbasics AA,AAA,Health,Electronics,Health & Household,C... Health & Beauty https://images-na.ssl-images-amazon.com/images... amazonbasics/hl002619,amazonbasicsaaaperforman... ... NaN NaN NaN NaN 5 https://www.amazon.com/product-reviews/B00QWO9... These batteries are very long lasting the pric... ... batteries are very long lasting the price ... Bylinda https://www.barcodable.com/upc/841710106442,ht...

5 rows × 24 columns

Data Preprocessing

In [ ]:
# Keep only the review text and rating columns
data = df.loc[:, ['reviews.text', 'reviews.rating']]
data.head()
Out[ ]:
reviews.text reviews.rating
0 I order 3 of them and one of the item is bad q... 3
1 Bulk is always the less expensive way to go fo... 4
2 Well they are not Duracell but for the price i... 5
3 Seem to work as well as name brand batteries a... 5
4 These batteries are very long lasting the pric... 5
In [ ]:
df.groupby('reviews.rating')['reviews.text'].count().plot(kind='pie')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0f5f30fe90>

The chart shows that the data is highly imbalanced: reviews rated 5 far outnumber those rated 4 or lower, and even 4-star reviews far outnumber those rated 3 or below. Therefore, the ratings are first grouped into categories before rebalancing, under the assumption that:

  • Rating 1-2 is a 'bad rating'
  • Rating 3 is a 'neutral rating'
  • Rating 4-5 is a 'good rating'

Although there are 3 categories, this notebook will only classify 'good' and 'bad' ratings.

In [ ]:
# Map a numeric rating to a sentiment category
def rating_to_category(rate):
  if rate < 3:
    return 'bad'
  elif rate > 3:
    return 'good'

  return 'neutral'

data['reviews.stat'] = data['reviews.rating'].apply(rating_to_category)
In [ ]:
data.head()
Out[ ]:
reviews.text reviews.rating reviews.stat
0 I order 3 of them and one of the item is bad q... 3 neutral
1 Bulk is always the less expensive way to go fo... 4 good
2 Well they are not Duracell but for the price i... 5 good
3 Seem to work as well as name brand batteries a... 5 good
4 These batteries are very long lasting the pric... 5 good
In [ ]:
data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0f5f27db10>

The chart above shows that the classes are still quite imbalanced after categorization, though less so than before. Therefore, for training and evaluating the model, only 1600 samples of the 'good' class are kept, roughly matching the number of 'bad' samples.

In [ ]:
# Balance the classes by downsampling the 'good' reviews
bad_data = data[data['reviews.stat'] == 'bad']
# Sampling without replacement avoids duplicate rows leaking across the train/test split
good_data = data[data['reviews.stat'] == 'good'].sample(n=1600, replace=False)
final_data = pd.concat([good_data, bad_data], axis=0)
In [ ]:
final_data.groupby('reviews.stat')['reviews.text'].count().plot(kind='pie')
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0f5f24c890>

The chart above shows that the dataset is now roughly balanced between the 'good' and 'bad' classes.

In [ ]:
# Print the dataset info
# and the number of 'good' and 'bad' samples
print("Dataset info:\n")
print(final_data.info())
print()
print("Class distribution:\n", final_data.groupby('reviews.stat')['reviews.text'].count())
Dataset info:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3181 entries, 10856 to 28285
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   reviews.text        3181 non-null   object
 1   reviews.rating      3181 non-null   int64 
 2   reviews.stat        3181 non-null   object
 3   reviews.stat.label  3181 non-null   int64 
 4   text_excl_swords    3181 non-null   object
dtypes: int64(2), object(3)
memory usage: 149.1+ KB
None

Class distribution:
 reviews.stat
bad     1581
good    1600
Name: reviews.text, dtype: int64
In [ ]:
# Label encoding
# Convert the category strings to integers
final_data['reviews.stat.label'] = [0 if stat == 'bad' else 1 for stat in final_data['reviews.stat']]

Stop words are removed because they contribute little information to the model.

In [ ]:
# Stop word removal
nltk.download('stopwords')
stopwords_set_en = set(stopwords.words('english'))

def remove_stop_words(word_seq):
  # Drop stop words and rejoin the remaining words into a sentence
  return " ".join(word for word in word_seq if word not in stopwords_set_en)

final_data['text_excl_swords'] = final_data['reviews.text'].apply(lambda words: remove_stop_words(text_to_word_sequence(words)))
final_data.head()
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[ ]:
reviews.text reviews.rating reviews.stat reviews.stat.label text_excl_swords
10856 Great price. Work as expected. 5 good 1 great price work expected
9889 Super great deal. 5 good 1 super great deal
112 of course these are no energizer bunny batteri... 5 good 1 course energizer bunny batteries price amount ...
1418 Batteries work (and last) as expected for a gr... 5 good 1 batteries work last expected great price
24377 I love this solutions for my kids. I gave it a... 4 good 1 love solutions kids gave four star rating orde...
In [ ]:
# Separate features and labels
X = final_data['text_excl_swords'].values
y = final_data['reviews.stat.label'].values
In [ ]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [ ]:
# Tokenization
# Fit the tokenizer on the training data only, so no test-set vocabulary leaks into training
tokenizer = Tokenizer(num_words=5000, oov_token='x')
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad both splits to the same length so every input shares one shape
maxlen = max(len(seq) for seq in X_train_seq)
X_train_pad = pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, maxlen=maxlen)
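
As a quick sanity check on the preprocessing, the cell below (using a made-up review string) shows how a sentence is converted to word indices and padded; the exact indices depend on the vocabulary fitted above.

In [ ]:
# Sanity check with a made-up review: tokenize and pad a single sentence
# (the resulting word indices depend on the fitted vocabulary)
sample = ["great price works as expected"]
sample_seq = tokenizer.texts_to_sequences(sample)
print(sample_seq)
print(pad_sequences(sample_seq, maxlen=maxlen))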
In [ ]:
# Build the model
model = Sequential([
    Embedding(input_dim=5000, output_dim=32),
    LSTM(64),
    Dense(64, activation='relu'),
    Dropout(0.6),
    Dense(32, activation='relu'),
    Dropout(0.4),
    Dense(16, activation='relu'),
    Dropout(0.25),
    Dense(1, activation='sigmoid')
])
In [ ]:
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 32)          160000    
                                                                 
 lstm (LSTM)                 (None, 64)                24832     
                                                                 
 dense (Dense)               (None, 64)                4160      
                                                                 
 dropout (Dropout)           (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 16)                528       
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 191,617
Trainable params: 191,617
Non-trainable params: 0
_________________________________________________________________
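
The parameter counts follow directly from the layer sizes: the Embedding layer stores one 32-dimensional vector per word in the 5000-word vocabulary, and a Keras LSTM has four gates, each with an input kernel, a recurrent kernel, and a bias:

  • Embedding: 5000 × 32 = 160,000
  • LSTM: 4 × (64 × (32 + 64) + 64) = 24,832
  • First Dense layer: 64 × 64 + 64 = 4,160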
In [ ]:
model.compile(loss='binary_crossentropy', optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.0008), metrics=['accuracy'])
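
For reference, the binary cross-entropy loss minimized here, for a true label y ∈ {0, 1} and predicted probability ŷ from the sigmoid output unit, is L(y, ŷ) = −(y · log ŷ + (1 − y) · log(1 − ŷ)).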
In [ ]:
ep = 0
counter = 0

# Early-stopping callback:
# stop training once train and validation accuracy have both reached at least 0.90 five times
class ModelCallback(Callback):
  def on_epoch_end(self, epoch, logs=None):
    global counter
    global ep

    curr_train_acc = round(logs['accuracy'], 2)
    curr_val_acc = round(logs['val_accuracy'], 2)
    ep += 1
    
    if curr_train_acc >= 0.90 and curr_val_acc >= 0.90:
      counter += 1
      if counter == 5:
        print("Training stopped at epoch of {} with train accuracy: {} and validation accuracy: {}".format(epoch+1, curr_train_acc, curr_val_acc))
        self.model.stop_training = True
In [ ]:
%%time
history = model.fit(X_train_pad, 
                    y_train, 
                    epochs=25, 
                    validation_data=(X_test_pad, y_test), 
                    callbacks=[ModelCallback()],
                    verbose=2)
Epoch 1/25
80/80 - 17s - loss: 0.6882 - accuracy: 0.5330 - val_loss: 0.6424 - val_accuracy: 0.7143 - 17s/epoch - 207ms/step
Epoch 2/25
80/80 - 11s - loss: 0.5118 - accuracy: 0.7960 - val_loss: 0.3383 - val_accuracy: 0.8807 - 11s/epoch - 137ms/step
Epoch 3/25
80/80 - 11s - loss: 0.2431 - accuracy: 0.9214 - val_loss: 0.3390 - val_accuracy: 0.8713 - 11s/epoch - 137ms/step
Epoch 4/25
80/80 - 11s - loss: 0.1575 - accuracy: 0.9509 - val_loss: 0.3075 - val_accuracy: 0.8791 - 11s/epoch - 138ms/step
Epoch 5/25
80/80 - 11s - loss: 0.1180 - accuracy: 0.9623 - val_loss: 0.3314 - val_accuracy: 0.9058 - 11s/epoch - 144ms/step
Epoch 6/25
80/80 - 11s - loss: 0.0782 - accuracy: 0.9815 - val_loss: 0.3583 - val_accuracy: 0.9074 - 11s/epoch - 139ms/step
Epoch 7/25
80/80 - 11s - loss: 0.0530 - accuracy: 0.9862 - val_loss: 0.4882 - val_accuracy: 0.9027 - 11s/epoch - 138ms/step
Epoch 8/25
80/80 - 11s - loss: 0.0487 - accuracy: 0.9886 - val_loss: 0.4950 - val_accuracy: 0.8995 - 11s/epoch - 138ms/step
Epoch 9/25
Training stopped at epoch 9 with train accuracy: 0.99 and validation accuracy: 0.91
80/80 - 11s - loss: 0.0430 - accuracy: 0.9878 - val_loss: 0.5554 - val_accuracy: 0.9058 - 11s/epoch - 143ms/step
CPU times: user 2min 49s, sys: 9.91 s, total: 2min 59s
Wall time: 1min 45s
In [ ]:
plt.plot(range(1,ep+1), history.history['accuracy'], label='Training')
plt.plot(range(1,ep+1), history.history['val_accuracy'], label='Validation')
plt.title("Train Accuracy - Validation Accuracy")
plt.legend()
plt.show()
In [ ]:
plt.clf()

plt.plot(range(1,ep+1), history.history['loss'], label='Training')
plt.plot(range(1,ep+1), history.history['val_loss'], label='Validation')
plt.title("Train Loss - Validation Loss")
plt.legend()
plt.show()

The accuracy and loss plots above show that the model reaches high accuracy on both the training and validation data, i.e. more than 90%. However, the model overfits: the gap between training loss and validation loss is large, and the validation loss keeps rising as training continues.
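
One way this overfitting could be mitigated is to monitor validation loss instead of accuracy and roll back to the best epoch, for example with Keras's built-in EarlyStopping callback. A minimal sketch, assuming the model is rebuilt and recompiled before retraining (the patience value is an illustrative choice, not tuned):

In [ ]:
# Possible mitigation sketch: stop once validation loss stops improving
# and restore the weights from the best epoch (patience of 2 is illustrative)
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(X_train_pad,
                    y_train,
                    epochs=25,
                    validation_data=(X_test_pad, y_test),
                    callbacks=[early_stop],
                    verbose=2)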