sismetanin Aug 1 2019 at 13:35

Contextual Emotion Detection in Textual Conversations Using Neural Networks

VK corporate blog. Hubs: Python, Data Mining, Big Data, Machine learning

Talking to conversational agents is becoming a daily routine, and it is crucial for dialogue systems to generate responses that are as human-like as possible. One of the main aspects of this is providing emotionally aware responses to users. In this article, we describe a recurrent neural network architecture for emotion detection in textual conversations that participated in Task 3 “EmoContext” of SemEval-2019, an annual workshop on semantic evaluation. The task objective is to classify the emotion (happy, sad, angry, or others) in a 3-turn conversational data set.

The rest of the article is organized as follows. Section 1 gives a brief overview of the EmoContext task and the provided data. Sections 2 and 3 focus on text pre-processing and word embeddings, respectively. Section 4 describes the architecture of the LSTM model used in our submission. Section 5 presents the final performance of our system and points to the source code. The model is implemented in Python using the Keras library.

1. Training Data

SemEval-2019 Task 3 “EmoContext” focuses on contextual emotion detection in textual conversations. Given a textual user utterance along with 2 turns of context in a conversation, we must classify the emotion of the last user utterance (Turn-3) as “happy”, “sad”, “angry” or “others” (Table 1). There are only two conversation participants: an anonymous person (Turn-1 and Turn-3) and the AI-based chatbot Ruuh (Turn-2). For a detailed description, see (Chatterjee et al., 2019).

Table 1. Examples from the EmoContext dataset (Chatterjee et al., 2019)

User (Turn-1)	Conversational Agent (Turn-2)	User (Turn-3)	True Class
I just qualified for the Nabard internship	WOOT! That’s great news. Congratulations!	I started crying	Happy
How dare you to slap my child	If you spoil my car, I will do that to you too	Just try to do that once	Angry
I was hurt by u more	You didn’t mean it.	say u love me	Sad
I will do night.	Alright. Keep me in loop.	Not giving WhatsApp no.	Others

During the competition, we had access to 30160 human-labeled texts provided by the task organizers: about 5000 samples for each of the “angry”, “sad”, and “happy” classes and about 15000 for the “others” class (Table 2). The dev and test sets, also provided by the organizers, have a real-life distribution, in contrast to the train set: about 4-5% for each emotional class and the rest for the “others” class. The data was provided by Microsoft and can be found in the official LinkedIn group.

Table 2. Emotion class label distribution in datasets (Chatterjee et al., 2019).

Dataset	Happy	Sad	Angry	Others	Total
Train	14.07%	18.11%	18.26%	49.56%	30160
Dev	5.15%	4.54%	5.45%	84.86%	2755
Test	5.16%	4.54%	5.41%	84.90%	5509
Distant	33.33%	33.33%	33.33%	0%	900k

In addition to this data, we collected 900k English tweets to create a distant dataset of 300k tweets for each emotion. To form the distant dataset, we followed the strategy of Go et al. (2009): a tweet is labeled with an emotion if it contains emotion-related words such as ’#angry’, ’#annoyed’, ’#happy’, ’#sad’, ’#surprised’, etc. The list of query terms was based on the query terms of SemEval-2018 AIT DISC (Duppada et al., 2018).
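
The hashtag-based labeling can be sketched roughly as follows. The hashtag lists and the `label_tweet` helper are assumptions made for illustration, not the actual query terms used for the submission; in this sketch, tweets that match hashtags of more than one emotion are simply discarded.

```python
import re

# Hypothetical hashtag lists; the real query terms followed SemEval-2018 AIT DISC.
EMOTION_HASHTAGS = {
    "angry": {"#angry", "#annoyed"},
    "happy": {"#happy"},
    "sad": {"#sad"},
}

HASHTAG_RE = re.compile(r"#\w+")


def label_tweet(text):
    """Return the single emotion whose hashtags occur in the tweet, else None."""
    hashtags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    matched = [emotion for emotion, tags in EMOTION_HASHTAGS.items()
               if hashtags & tags]
    # Keep only unambiguous tweets (exactly one emotion matched).
    return matched[0] if len(matched) == 1 else None


tweets = [
    "Stuck in traffic again #annoyed",
    "Best day ever #happy",
    "#happy or #sad, who knows",
    "Just had lunch",
]
labels = [label_tweet(t) for t in tweets]
print(labels)  # ['angry', 'happy', None, None]
```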

The key performance metric of EmoContext is the micro-averaged F1 score over the three emotion classes, i.e. ‘sad’, ‘happy’, and ‘angry’.
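
As a minimal sketch of this metric, the function below pools true positives, false positives, and false negatives over the three emotion classes and leaves “others” out of the counts. The name `micro_f1` and the toy labels are illustrative assumptions; this is not the official evaluation script.

```python
def micro_f1(y_true, y_pred, emotion_classes=("happy", "sad", "angry")):
    """Micro-averaged F1 over the emotion classes; 'others' is excluded."""
    tp = fp = fn = 0
    for cls in emotion_classes:
        for truth, pred in zip(y_true, y_pred):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls and truth != cls:
                fp += 1
            elif pred != cls and truth == cls:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["happy", "sad", "others", "angry", "others"]
y_pred = ["happy", "others", "sad", "angry", "others"]
score = micro_f1(y_true, y_pred)
print(round(score, 4))  # 0.6667
```

Here two emotional samples are predicted correctly, one “sad” sample is missed, and one “others” sample is wrongly labeled “sad”, so precision and recall are both 2/3.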

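import io
import numpy as np


def preprocessData(dataFilePath, mode):
    """Read a tab-separated EmoContext file and pre-process its three turns.

    `tokenize` and `emotion2label` are defined in Section 2.
    """
    conversations = []
    labels = []
    with io.open(dataFilePath, encoding="utf8") as finput:
        finput.readline()  # skip the header line
        for line in finput:
            line = line.strip().split('\t')
            # Columns 1-3 hold the three conversation turns (column 0 is skipped)
            for i in range(1, 4):
                line[i] = tokenize(line[i])
            if mode == "train":
                # Column 4 holds the emotion label
                labels.append(emotion2label[line[4]])
            conv = line[1:4]
            conversations.append(conv)
    if mode == "train":
        return np.array(conversations), np.array(labels)
    else:
        return np.array(conversations)


# Labels are read from all three files, so each one is loaded in "train" mode
texts_train, labels_train = preprocessData('./starterkitdata/train.txt', mode="train")
texts_dev, labels_dev = preprocessData('./starterkitdata/dev.txt', mode="train")
texts_test, labels_test = preprocessData('./starterkitdata/test.txt', mode="train")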

2. Text Pre-Processing

Before any training stage, texts were pre-processed with the text processing tool Ekphrasis (Baziotis et al., 2017). This tool performs spell correction, word normalization, and segmentation, and allows specifying which tokens should be omitted, normalized, or annotated with special tags. We used the following techniques at the pre-processing stage.

URLs, emails, the date and time, usernames, percentage, currencies, and numbers were replaced with the corresponding tags.
Repeated, censored, elongated, and capitalized terms were annotated with the corresponding tags.
Elongated words were automatically corrected based on the built-in word statistics corpus.
Hashtags and contractions were unpacked (i.e. word segmentation was performed) based on the built-in word statistics corpus.
A manually created dictionary for replacing terms extracted from the text was used to reduce the variety of emotion expressions.

In addition, Ekphrasis provides a tokenizer that can identify most emojis, emoticons, and complicated expressions such as censored, emphasized, and elongated words, as well as dates, times, currencies, and acronyms.

Table 3. Text pre-processing examples.

Original Text	Pre-Processed Text
I FEEL YOU… I'm breaking into million pieces	<allcaps> i feel you </allcaps>. <repeated> i am breaking into million pieces
tired and I missed you too :‑(	tired and i missed you too <sad>
you should liiiiiiisten to this: www.youtube.com/watch?v=99myH1orbs4	you should listen <elongated> to this: <url>
My apartment takes care of it. My rent is around $650.	my apartment takes care of it. my rent is around <money> .
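
The idea behind a few of these steps can be imitated with plain regular expressions. This is only a toy sketch and not how Ekphrasis works internally; `toy_normalize` and its patterns are assumptions made for illustration and cover far fewer cases than the real tool.

```python
import re


def toy_normalize(text):
    """Very rough imitation of a few normalization and annotation steps."""
    # Replace URLs and dollar amounts with tags.
    text = re.sub(r"(?:https?://|www\.)\S+", "<url>", text)
    text = re.sub(r"\$\d+(?:\.\d+)?", "<money>", text)
    # Collapse a letter repeated 3+ times and mark the word as elongated.
    text = re.sub(r"\b([A-Za-z]*?)([A-Za-z])\2{2,}([A-Za-z]*)\b",
                  r"\1\2\3 <elongated>", text)
    return text.lower()


first = toy_normalize("you should liiiiiiisten to this: www.youtube.com/watch?v=99myH1orbs4")
second = toy_normalize("My rent is around $650.")
print(first)   # you should listen <elongated> to this: <url>
print(second)  # my rent is around <money>.
```

Unlike the real tokenizer, this sketch does not split punctuation into separate tokens, so the second output differs slightly from the corresponding row of Table 3.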

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import numpy as np

import re
import io

label2emotion = {0: "others", 1: "happy", 2: "sad", 3: "angry"}
emotion2label = {"others": 0, "happy": 1, "sad": 2, "angry": 3}

emoticons_additional = {
    '(^・^)': '<happy>', ':‑c': '<sad>', '=‑d': '<happy>', ":'‑)": '<happy>', ':‑d': '<laugh>',
    ':‑(': '<sad>', ';‑)': '<happy>', ':‑)': '<happy>', ':\\/': '<sad>', 'd=<': '<annoyed>',
    ':‑/': '<annoyed>', ';‑]': '<happy>', 'angru': 'angry', "d‑':": '<annoyed>',
    ":'‑(": '<sad>', ":‑[": '<annoyed>', 'x‑d': '<laugh>',
}

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # terms that will be annotated
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              'emphasis', 'censored'},
    fix_html=True,  # fix HTML tokens
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter",
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter",
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=True,  # Unpack contractions (can't -> can not)
    spell_correct_elong=True,  # spell correction for elongated words
    # select a tokenizer. You can use SocialTokenizer, or pass your own
    # the tokenizer, should take as input a string and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    # list of dictionaries, for replacing tokens extracted from the text,
    # with other expressions. You can pass more than one dictionaries.
    dicts=[emoticons, emoticons_additional]
)


def tokenize(text):
    text = " ".join(text_processor.pre_process_doc(text))
    return text

3. Word Embeddings

Word embeddings have become an essential part of any deep-learning approach to NLP. To determine the most suitable vectors for the emotion detection task, we tried Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Joulin et al., 2017) models as well as DataStories pre-trained word vectors (Baziotis et al., 2017). The key concept of Word2Vec is to place words that share common contexts in the training corpus close to each other in the vector space. Both Word2Vec and GloVe learn geometrical encodings of words from their co-occurrence information, but the former is a predictive model and the latter is a count-based model. In other words, while Word2Vec is trained to predict a target word from its context (CBOW architecture) or a context from a target word (Skip-gram architecture) by minimizing a loss function, GloVe computes word vectors by performing dimensionality reduction on the co-occurrence count matrix. FastText is very similar to Word2Vec except that it learns vectors for character n-grams, so it can also build vectors for out-of-vocabulary words.
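
To illustrate why character n-grams help with unseen words, the sketch below splits a word into n-grams after wrapping it in boundary markers, in the spirit of FastText. The helper `char_ngrams` is an illustrative assumption and not FastText's own code; the real model also hashes the n-grams and sums their vectors, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled, unseen word still shares most trigrams with the known word.
shared = set(char_ngrams("listen", 3, 3)) & set(char_ngrams("liiisten", 3, 3))
print(sorted(shared))
```

Because “liiisten” shares most of its trigrams with “listen”, a subword model can still assemble a reasonable vector for it.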

For all the techniques mentioned above, we used the default training parameters provided by the authors. We trained a simple LSTM model (dim = 64) on top of each of these embeddings and compared their effectiveness using cross-validation. According to the results, the DataStories pre-trained embeddings demonstrated the best average F1 score.

To enrich the selected word embeddings with the emotional polarity of the words, we performed a distant pre-training phase by fine-tuning the embeddings on the automatically labeled distant dataset. The importance of pre-training was demonstrated in (Deriu et al., 2017). We used the distant dataset to train a simple LSTM network to classify angry, sad, and happy tweets. The embedding layer was frozen for the first training epoch to avoid significant changes in the embedding weights, and then it was unfrozen for the next 5 epochs. After the training stage, the fine-tuned embeddings were saved for the further training phases and made publicly available.

def getEmbeddings(file):
    embeddingsIndex = {}
    dim = 0
    with io.open(file, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            embeddingVector = np.asarray(values[1:], dtype='float32')
            embeddingsIndex[word] = embeddingVector 
            dim = len(embeddingVector)
    return embeddingsIndex, dim


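def getEmbeddingMatrix(wordIndex, embeddings, dim):
    # Row 0 is reserved for padding; words without a vector keep a zero row
    embeddingMatrix = np.zeros((len(wordIndex) + 1, dim))
    for word, i in wordIndex.items():
        embeddingVector = embeddings.get(word)
        if embeddingVector is not None:
            embeddingMatrix[i] = embeddingVector
    return embeddingMatrix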


from keras.preprocessing.text import Tokenizer

embeddings, dim = getEmbeddings('emosense.300d.txt')
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts([' '.join(list(embeddings.keys()))])

wordIndex = tokenizer.word_index
print("Found %s unique tokens." % len(wordIndex))

embeddings_matrix = getEmbeddingMatrix(wordIndex, embeddings, dim)

4. Neural Network Architecture

Recurrent neural networks (RNNs) are a family of artificial neural networks specialized in processing sequential data. In contrast with traditional neural networks, RNNs deal with sequential data by sharing their internal weights across the steps of the sequence. For this purpose, the computation graph of RNNs includes cycles, which represent the influence of previous information on the present one. As an extension of RNNs, Long Short-Term Memory networks (LSTMs) were introduced in 1997 (Hochreiter and Schmidhuber, 1997). In LSTMs, recurrent cells are connected in a particular way to avoid vanishing and exploding gradient issues. Traditional LSTMs only preserve information from the past since they process the sequence in one direction only. Bidirectional LSTMs combine the outputs of two hidden LSTM layers moving in opposite directions, one forward through time and the other backward through time, which makes it possible to capture information from both past and future states simultaneously (Schuster and Paliwal, 1997).
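
To make the bidirectional idea concrete, here is a toy sketch that uses a scalar tanh recurrence instead of a real LSTM cell. The function names and the weight values are arbitrary assumptions chosen for illustration; the point is only that the backward pass lets the state at an early position depend on later inputs.

```python
import math


def rnn_states(xs, w, u):
    """Hidden states of a toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states


def bidirectional_states(xs):
    """Pair each position's forward state with its backward state."""
    forward = rnn_states(xs, w=0.5, u=0.8)
    # Run a second recurrence over the reversed sequence, then re-align it.
    backward = rnn_states(xs[::-1], w=0.3, u=0.6)[::-1]
    return list(zip(forward, backward))


a = bidirectional_states([1.0, 0.0, 0.0, -1.0])
b = bidirectional_states([1.0, 0.0, 0.0, 1.0])  # only the last input differs

# The forward state at position 0 ignores the future, the backward one does not.
print(a[0][0] == b[0][0], a[0][1] == b[0][1])  # True False
```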

Figure 1: A smaller version of the proposed architecture. The LSTM units for the first turn and for the third turn share weights.

A high-level overview of our approach is provided in Figure 1. The proposed neural network consists of an embedding unit and two bidirectional LSTM units (dim = 64). The first LSTM unit analyzes the utterances of the first user (i.e. the first and the third turns of the conversation), and the second analyzes the utterance of the second user (i.e. the second turn). These two units learn not only semantic and sentiment feature representations but also user-specific conversation features, which allows classifying emotions more accurately. At the first step, each utterance is converted to pre-trained word embeddings and fed into the corresponding bidirectional LSTM unit. Next, the three resulting feature maps are concatenated into a flattened feature vector and passed to a fully connected hidden layer (dim = 30), which analyzes interactions between the obtained vectors. Finally, these features pass through the output layer with the softmax activation function to predict the final class label. To reduce overfitting, regularization layers with Gaussian noise were added after the embedding layer, and dropout layers (Srivastava et al., 2014) were added at each LSTM unit (p = 0.2) and before the hidden fully connected layer (p = 0.1).
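
The layer sizes above can be checked with a little arithmetic. The sketch below counts trainable weights using the standard LSTM and dense-layer formulas; the embedding size of 300 is an assumption inferred from the `emosense.300d.txt` file name, and the frozen embedding layer is left out of the count.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights of a fully connected layer: weight matrix plus biases."""
    return input_dim * units + units


embedding_dim = 300  # assumption, inferred from the 'emosense.300d.txt' file name
lstm_dim, hidden_dim, num_classes = 64, 30, 4

bilstm = 2 * lstm_params(embedding_dim, lstm_dim)  # forward + backward direction
concat_dim = 3 * 2 * lstm_dim                      # three turns, two directions each
total = (2 * bilstm                                # two units; one is shared by turns 1 and 3
         + dense_params(concat_dim, hidden_dim)
         + dense_params(hidden_dim, num_classes))

print(concat_dim, bilstm, total)  # 384 186880 385434
```

Sharing one bidirectional unit between the first and third turns means the third turn adds no extra recurrent weights.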

from keras.layers import Input, Dense, Embedding, Concatenate, \
    Dropout, LSTM, Bidirectional, GaussianNoise
from keras.models import Model


def buildModel(embeddings_matrix, sequence_length, lstm_dim, hidden_layer_dim, num_classes, 
               noise=0.1, dropout_lstm=0.2, dropout=0.2):
    turn1_input = Input(shape=(sequence_length,), dtype='int32')
    turn2_input = Input(shape=(sequence_length,), dtype='int32')
    turn3_input = Input(shape=(sequence_length,), dtype='int32')
    embedding_dim = embeddings_matrix.shape[1]
    embeddingLayer = Embedding(embeddings_matrix.shape[0],
                                embedding_dim,
                                weights=[embeddings_matrix],
                                input_length=sequence_length,
                                trainable=False)
    
    turn1_branch = embeddingLayer(turn1_input)
    turn2_branch = embeddingLayer(turn2_input) 
    turn3_branch = embeddingLayer(turn3_input) 
    
    turn1_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn1_branch)
    turn2_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn2_branch)
    turn3_branch = GaussianNoise(noise, input_shape=(None, sequence_length, embedding_dim))(turn3_branch)

    # lstm1 is shared by the first user's turns (1 and 3); lstm2 handles the second turn
    lstm1 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))
    lstm2 = Bidirectional(LSTM(lstm_dim, dropout=dropout_lstm))
    
    turn1_branch = lstm1(turn1_branch)
    turn2_branch = lstm2(turn2_branch)
    turn3_branch = lstm1(turn3_branch)
    
    x = Concatenate(axis=-1)([turn1_branch, turn2_branch, turn3_branch])
    
    x = Dropout(dropout)(x)
    
    x = Dense(hidden_layer_dim, activation='relu')(x)
    
    output = Dense(num_classes, activation='softmax')(x)
    
    model = Model(inputs=[turn1_input, turn2_input, turn3_input], outputs=output)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    
    return model

# MAX_SEQUENCE_LENGTH (the padded length of each turn) must be defined beforehand
model = buildModel(embeddings_matrix, MAX_SEQUENCE_LENGTH, lstm_dim=64, hidden_layer_dim=30, num_classes=4)
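
The model expects three integer matrices, one per turn, each padded to the same length. The plain-Python sketch below mimics what the Keras `Tokenizer.texts_to_sequences` and `pad_sequences` defaults would produce (unknown tokens dropped, zeros added on the left). The helper `to_padded_ids`, the toy vocabulary, and the length of 5 are hypothetical and not taken from the original code.

```python
def to_padded_ids(text, word_index, max_len):
    """Map tokens to integer ids and left-pad with zeros to max_len.

    Unknown tokens are dropped; longer texts keep their last max_len tokens.
    """
    ids = [word_index[tok] for tok in text.split() if tok in word_index]
    ids = ids[-max_len:]
    return [0] * (max_len - len(ids)) + ids


word_index = {"i": 1, "am": 2, "happy": 3, "why": 4}   # toy vocabulary
max_len = 5                                             # hypothetical padded length
conversation = ["i am happy", "why", "i am so happy"]   # turns 1-3

turn1, turn2, turn3 = (to_padded_ids(t, word_index, max_len) for t in conversation)
print(turn1, turn2, turn3)
# [0, 0, 1, 2, 3] [0, 0, 0, 0, 4] [0, 0, 1, 2, 3]
```

Index 0 is used only for padding, which is why the embedding matrix has one more row than there are words in the vocabulary.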

5. Results

While searching for the optimal architecture, we experimented not only with the number of cells in layers, activation functions, and regularization parameters but also with the architecture of the neural network itself. Details about this phase can be found in the original paper.

The model described in the previous section demonstrated the best scores on the dev dataset, so it was used in the final evaluation stage of the competition. On the final test dataset, it achieved a micro-averaged F1 score of 72.59% for the emotional classes, while the maximum score among all participants was 79.59%. Still, this is well above the official baseline released by the task organizers, which was 58.68%.

The source code of the model and the word embeddings are available on GitHub.
The full version of the article and the task description paper can be found in the ACL Anthology.
The training dataset is located in the official competition group on LinkedIn.

Citation:

@inproceedings{smetanin-2019-emosense,
    title = "{E}mo{S}ense at {S}em{E}val-2019 Task 3: Bidirectional {LSTM} Network for Contextual Emotion Detection in Textual Conversations",
    author = "Smetanin, Sergey",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/S19-2034",
    pages = "210--214",
}
