Sentiment Analysis with Deep Learning through Keras

Georg Zhelev

Table of Contents

  1. Load and Describe Data
  2. Pre-Process and Split
  3. Tokenize and Pad
  4. Baseline Model and a Neural Network

1. Data Description

  • Amazon customer reviews (input) and star ratings (output)
  • industrial-scale dataset (3.6 million training reviews)

The data format is the following:

  • label, text (all on one line)
  • label 1 corresponds to 1- and 2-star reviews
  • label 2 corresponds to 4- and 5-star reviews
  • Most of the reviews are in English

1.1 Load and Decompress Files

Decompress the files and read them into a list with one item per line.

b'__label__2 Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^\n'
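
A minimal sketch of this step, assuming the files are bz2-compressed (the slides do not name the compression format). The helper name and the tiny stand-in file are hypothetical and only there so the example runs without the real dataset.

```python
import bz2
import tempfile
from pathlib import Path


def read_compressed_lines(path):
    """Return every line of a bz2-compressed file as a raw bytes object."""
    with bz2.open(path, "rb") as handle:
        return handle.readlines()


# Hypothetical stand-in for the real review file (assumption: bz2 format).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_reviews.txt.bz2"
    sample_path.write_bytes(bz2.compress(
        b"__label__2 Great title: loved it\n"
        b"__label__1 Poor title: broke quickly\n"
    ))
    raw_lines = read_compressed_lines(sample_path)

print(len(raw_lines))
print(raw_lines[0])
```

Each item is still a byte string that ends with a newline, like the raw review shown above.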

1.2 Decode and Extract

Decode the raw binary strings into strings that can be parsed, then extract the labels and the texts.

[1 1 0 ... 1 0 1]
'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'
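
One way to sketch the decoding and extraction, assuming UTF-8 input and the `__label__N text` layout shown in the raw line above. The function name is hypothetical, and plain lists are used here where the slides store labels in an array.

```python
def split_labels_and_texts(raw_lines):
    """Decode raw byte lines and separate the label from the review text."""
    labels, texts = [], []
    for raw in raw_lines:
        line = raw.decode("utf-8")
        # The label tag is everything before the first space.
        tag, _, text = line.partition(" ")
        # __label__2 (4 and 5 stars) becomes 1, __label__1 becomes 0.
        labels.append(1 if tag == "__label__2" else 0)
        texts.append(text.rstrip("\n"))
    return labels, texts


sample_lines = [
    b"__label__2 Great title: loved it\n",
    b"__label__1 Poor title: broke quickly\n",
]
labels, texts = split_labels_and_texts(sample_lines)
print(labels)
print(texts[0])
```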

2. View Prepared Data

Texts are saved in a list, while labels are saved in an array.

Length train: 49473
Length test: 10007

3. Pre-Process

  • Goal: build a small and efficient vocabulary
  • Stopwords only blow up the vocabulary
  • Keep only non-numerical values
  • Use the regular expressions module (re)
  • Match characters and substitute them with spaces
  1. Lowercase the text
  2. Remove non-word characters:
    • numbers and punctuation
  3. Remove non-English characters
['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^']


['stuning even for the non gamer  this sound track was beautiful  it paints the senery in your mind so well i would recomend it even to people who hate vid  game music  i have played the game chrono cross but out of all of the games i have ever played it has the best music  it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras  it would impress anyone who cares to listen    ']
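
The three steps can be sketched with the `re` module as below. The patterns and function names are assumptions, not the original notebook code. This version drops every digit, as the bullet list describes; the sample output on later slides still shows some digits, so the original pattern may differ.

```python
import re

# Punctuation and symbols (anything that is not a letter, digit, or "_").
NON_WORD = re.compile(r"\W")
# Whatever is left that is not a lowercase English letter or whitespace:
# digits, underscores, and non-English letters.
NOT_LETTER_OR_SPACE = re.compile(r"[^a-z\s]")


def normalize_text(text):
    """Lowercase, turn non-word characters into spaces, drop the rest."""
    lowered = text.lower()
    spaced = NON_WORD.sub(" ", lowered)
    return NOT_LETTER_OR_SPACE.sub("", spaced)


def normalize_texts(texts):
    return [normalize_text(text) for text in texts]


cleaned = normalize_texts(
    ["Stuning even for the non-gamer: This sound track was beautiful!"]
)
print(cleaned[0])
```

Substituting a space for punctuation (rather than deleting it) keeps words such as "non-gamer" from being glued together.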

3.1 Describe Data

1    25217
0    24256
dtype: int64

The classes are about equally distributed (1 is a positive review, 0 a negative one).
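
As a quick check of the balance claim, the counts above can be turned into shares. This is a small sketch using only the standard library; the function name is hypothetical and the label list is rebuilt from the two reported counts.

```python
from collections import Counter


def class_distribution(labels):
    """Map each label to (count, share of all labels)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}


# Label list rebuilt from the counts reported on this slide.
labels = [1] * 25217 + [0] * 24256
distribution = class_distribution(labels)
for label, (count, share) in sorted(distribution.items(), reverse=True):
    print(f"{label}: {count} reviews ({share:.1%})")
```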

4. Split Data

Length train texts: 39578
Length validation texts: 9895
Length test texts: 10007
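
The sizes above are consistent with holding out 20% of the 49473 training reviews for validation. The sketch below reproduces that split with the standard library; the function name, the seed, and the dummy data are assumptions, and the original notebook may well have used a library helper instead.

```python
import math
import random


def train_validation_split(texts, labels, validation_fraction=0.2, seed=0):
    """Shuffle indices once, then cut off a validation share."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(indices) * validation_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in val_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in val_idx],
    )


# Dummy data with the same size as the prepared training set.
texts = [f"review {i}" for i in range(49473)]
labels = [i % 2 for i in range(49473)]
train_texts, val_texts, train_labels, val_labels = train_validation_split(
    texts, labels
)
print(len(train_texts), len(val_texts))
```

Shuffling indices rather than the lists themselves keeps every text paired with its own label.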

5. Tokenize Text

  • Split texts into lists of tokens
  • Set max features (the 12,000 most common words)
  • Create the vocabulary from the training data only
  • The resulting sequences vary in length with each text

5.1 Encode training data sentences into sequences

  • Transform each text into a sequence of integers
  • Assign an integer to each word
  • The word index (a dictionary) can be used to verify the integer assigned to a word
CPU times: user 4.36 s, sys: 11.1 ms, total: 4.37 s
Wall time: 4.37 s
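
To make the idea concrete, here is a plain-Python sketch of a frequency-ranked word index and the encoding step. It is an illustration of the technique, not the Keras tokenizer itself; the function names are hypothetical, and keeping ranks up to and including `max_features` is a choice made for this sketch.

```python
from collections import Counter


def build_word_index(train_texts):
    """Rank words by frequency in the training texts; rank 1 is most common."""
    counts = Counter(word for text in train_texts for word in text.split())
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, max_features):
    """Replace each known, frequent word by its rank; drop everything else."""
    sequences = []
    for text in texts:
        sequence = []
        for word in text.split():
            rank = word_index.get(word)
            if rank is not None and rank <= max_features:
                sequence.append(rank)
        sequences.append(sequence)
    return sequences


train_texts = ["the music is great", "the book is dull", "the music"]
word_index = build_word_index(train_texts)
encoded = texts_to_sequences(
    ["the music is great", "the film"], word_index, max_features=3
)
print(word_index)
print(encoded)
```

Because the index is built from the training texts only, words that first appear in validation or test reviews ("film" here) are simply dropped.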

5.2 Show an encoded Sequence

First review: wonderful  inspiring music   so many artists struggle to put 10 songs on an album  of which maybe half could be considered decent  joseph arthur manages to create 1 for this album and there s not a loser in the bunch his songs are pure poetry surrounded by swirling layers of gorgeous music   sometimes simplistic folk  other times upbeat rock  but his lyrics carry each one with often times devastating results  in a good way   tales of love lost and struggles to love are the most common  but they never get tiring due to the diversity of the tracks for those who do love this album as much as i do  check out gavin degraw as well  his album chariot is arguably the best of 00  ebhp 

First encoded review: [235, 1992, 123, 29, 106, 1404, 2140, 5, 162, 240, 154, 20, 43, 104, 7, 91, 290, 374, 96, 27, 1598, 719, 2603, 3312, 2260, 5, 1275, 77, 12, 8, 104, 3, 52, 17, 16, 4, 4605, 10, 1, 1098, 54, 154, 25, 982, 2180, 7582, 53, 4939, 7, 2261, 123, 568, 3313, 3251, 79, 185, 3651, 447, 18, 54, 677, 1528, 272, 26, 19, 519, 185, 9023, 1222, 10, 4, 34, 99, 1844, 7, 78, 466, 3, 3217, 5, 78, 25, 1, 113, 1201, 18, 36, 118, 61, 8065, 771, 5, 1, 8066, 7, 1, 571, 12, 171, 65, 69, 78, 8, 104, 24, 73, 24, 2, 69, 589, 47, 24, 70, 54, 104, 9, 7347, 1, 82, 7, 310] 

Length before encoding (characters): 684
Length after encoding (tokens): 121
wonderful 235
inspiring 1992

5.3 Unique Tokens and Document Count

Found 64191 unique words.
Documents 39578

5.4 Word Index (Ordered by Frequency)

Word Index [('the', 1), ('i', 2), ('and', 3), ('a', 4), ('to', 5)]
[('revengeful', 64187), ('dices', 64188), ('laryngitis', 64189), ('guitarrist', 64190), ('punchless', 64191)]
[('mr', 501), ('working', 502), ('entire', 503), ('name', 504), ('totally', 505)]

5.5 Word Counts

Word Counts [('wonderful', 1724), ('inspiring', 132), ('music', 3601), ('so', 13166), ('many', 3935)]
[('revengeful', 1), ('dices', 1), ('laryngitis', 1), ('guitarrist', 1), ('punchless', 1)]
[('0', 3342), ('just', 10498), ('over', 3880), ('month', 620), ('now', 3644)]
164470

6. Padding with Keras

  • Text lengths are not uniform
  • A neural network requires inputs of uniform length
  • Select a maximum length
  • Pad shorter sequences with 0
  • Needed to use batches effectively
  • Here the maximum equals the length of the longest sequence
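
The padding idea can be sketched in plain Python as follows. Zeros are added at the front, matching the padded output shown on the following slides. The function name is hypothetical, and trimming over-long sequences from the front is an assumption of this sketch.

```python
def pad_sequences_pre(sequences, maxlen=None, value=0):
    """Left-pad every sequence with `value` up to a common length."""
    if maxlen is None:
        # Default: the length of the longest sequence.
        maxlen = max(len(sequence) for sequence in sequences)
    padded = []
    for sequence in sequences:
        # Keep the last `maxlen` tokens if the sequence is too long.
        trimmed = sequence[len(sequence) - maxlen:] if len(sequence) > maxlen else sequence
        padded.append([value] * (maxlen - len(trimmed)) + list(trimmed))
    return padded


sequences = [[4, 260, 1143], [7]]
padded = pad_sequences_pre(sequences)
shortened = pad_sequences_pre(sequences, maxlen=2)
print(padded)
print(shortened)
```

After padding, all sequences have the same length, so they can be stacked into one array and fed to the network in batches.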

6.1 Maximum and Minimum Length

241
3

6.2 Length Before Padding

[4, 260, 1143, 159, 63, 7, 1, 1842, 3, 25, 295, 106, 25, 85, 1143, 52, 25, 4, 177, 7, 2069, 60, 831, 621, 91, 9, 356, 18, 1, 108, 66, 21, 247, 111, 2, 102, 1, 465, 7, 8, 15, 9, 35, 1728, 89, 2111, 66, 21, 196, 117, 38, 28, 163, 30, 94, 25, 536, 265, 18, 12, 1] 

61

6.3 Length After Padding

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    4  260
 1143  159   63    7    1 1842    3   25  295  106   25   85 1143   52
   25    4  177    7 2069   60  831  621   91    9  356   18    1  108
   66   21  247  111    2  102    1  465    7    8   15    9   35 1728
   89 2111   66   21  196  117   38   28  163   30   94   25  536  265
   18   12    1] 

241
241

7. Neural Network

Input Parameters for the Neural Network

  • Embedding layer
  • Input dimension: size of the vocabulary
  • Output dimension: embedding size
  • Learned embedding
  • For the dense layer, include the length of the input sequences
  • Total vocabulary: 64192
  • Selected features: 12000
  • Length: 241
  • Shape of input: (39578, 241)
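from keras.models import Sequential
from keras import layers

MAX_FEATURES = 12000  # selected features (vocabulary size used by the embedding)
embedding_dim = 100   # embedding size, matching the model summary
maxlen = 241          # padded sequence length

model = Sequential()
# Learned embedding: one 100-dimensional vector per vocabulary index
model.add(layers.Embedding(MAX_FEATURES, embedding_dim, input_length=maxlen))
# 128 convolution filters sliding over windows of 5 tokens
model.add(layers.Conv1D(128, 5, activation='relu'))
# Keep the strongest response of each filter across the sequence
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(10, activation='relu'))
# Single sigmoid unit: probability of a positive review
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()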
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 241, 100)          1200000   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 237, 128)          64128     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1290      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
=================================================================
Total params: 1,265,429
Trainable params: 1,265,429
Non-trainable params: 0
_________________________________________________________________
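
The figures in the summary can be checked by hand. The short calculation below reproduces each parameter count and the convolution output length from the layer sizes; the variable names are chosen for this sketch.

```python
vocab_size = 12000      # selected features
embedding_dim = 100     # embedding size
maxlen = 241            # padded sequence length
filters, kernel_size = 128, 5

# Embedding: one vector of size embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Conv1D: each filter has kernel_size * embedding_dim weights plus one bias.
conv_params = kernel_size * embedding_dim * filters + filters
# Without padding, a window of 5 fits 241 - 5 + 1 times along the sequence.
conv_output_length = maxlen - kernel_size + 1

# Dense layers: inputs * units weights plus one bias per unit.
dense_hidden_params = filters * 10 + 10
dense_output_params = 10 * 1 + 1

total_params = (embedding_params + conv_params
                + dense_hidden_params + dense_output_params)
print(embedding_params, conv_params, dense_hidden_params, dense_output_params)
print(conv_output_length, total_params)
```

Almost all of the parameters sit in the embedding layer, which is why limiting the vocabulary to the selected features keeps the model small.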
%%time

history = model.fit(train_texts, train_labels,
                     epochs=3,
                     verbose=True,
                     validation_data=(val_texts, val_labels),
                     batch_size=512)

7.1 Accuracy Evaluation

Training Accuracy: 0.9063
Testing Accuracy:  0.9056
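
For reference, accuracy for a sigmoid output is the share of reviews whose thresholded probability matches the label. A minimal sketch, with made-up probabilities and a hypothetical function name:

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Share of examples where the thresholded probability equals the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Made-up sigmoid outputs for four reviews (illustration only).
example_probabilities = [0.9, 0.2, 0.6, 0.4]
example_labels = [1, 0, 0, 0]
accuracy = binary_accuracy(example_probabilities, example_labels)
print(f"Accuracy: {accuracy:.4f}")
```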

8. Architecture Simulation Study and Baseline Model

  • Dense without embedding: Acc: 0.50
  • Dense with embedding: Acc: 0.85
  • Added Conv layer: Acc: 0.90
  • Logistic regression (baseline): Acc: 0.90 (wait time)

Thank you for listening