Implementing ChatBots using Neural Machine Translation Techniques

Degree's Thesis
Telecommunications Sciences and Technologies

Author: Alvaro Nuez Ezquerra
Advisors: Marta R. Costa-Jussà and Carlos Segura Perales

Universitat Politècnica de Catalunya (UPC)
2017 - 2018

Abstract

Conversational agents or chatbots (short for chat robot) are a branch of Natural Language Processing (NLP) that has attracted a lot of interest lately due to the large number of applications in company services, such as customer support or automated FAQs, and in personal assistant services such as Siri or Cortana. There are three types: rule-based models, retrieval-based models and generative-based models. The difference between them is the freedom they have when generating an answer to a question. The chatbot models usually deployed in public services are rule-based or retrieval-based, given the need to guarantee quality and adequate answers to users. However, these models can only handle conversations aligned with their previously written answers and, therefore, the conversation can sound artificial if it goes off topic. Generative-based models handle open conversation better, which makes them a more generalizable approach. Promising results have been achieved with generative-based models by applying neural machine translation techniques with the recurrent encoder/decoder architecture. In this project, two generative models that constitute the state of the art in neural machine translation are implemented, compared and analyzed as chatbots. One model is based on recurrence with attention and the other is based exclusively on attention. Additionally, a model based exclusively on recurrence has been used as a reference. Experiments show that, as in translation, an architecture based only on attention mechanisms obtains better results than the recurrence-based models.

Acknowledgements

First of all, I want to thank my tutors Marta R. Costa-Jussà and Carlos Segura Perales for letting me be part of this project and for motivating me to keep investigating the fascinating world of AI. It has been a pleasure to work with you. I also want to express my gratitude to Carlos Escolano and Noe Casas for helping me with some of the problems I found during the project. I want to thank my parents for all the dedication and effort they have put into bringing me here; all of this is thanks to you. And thank you, Judith, for the immense patience you have had with me, listening to and supporting all my ideas day after day and being at my side whenever I needed it.

Revision history and approval record

Revision | Date       | Purpose
0        | 11/01/2018 | Document creation
1        | 20/01/2018 | Document revision
2        | 23/01/2018 | Document revision
3        | 25/01/2018 | Document approval

DOCUMENT DISTRIBUTION LIST

Name                  | e-mail
Alvaro Nuez Ezquerra  | alvaronuez.eis@gmail.com
Marta R. Costa-Jussà  | martaruizcostajussa@gmail.com
Carlos Segura Perales | carlos.seguraperales@telefonica.com

Written by: Alvaro Nuez Ezquerra (Project Author), 11/01/2018
Reviewed and approved by: Marta R. Costa-Jussà (Project Supervisor), 25/01/2018
Reviewed and approved by: Carlos Segura Perales (Project Supervisor), 25/01/2018

Contents

1 Introduction
  1.1 Statement of purpose and contributions
  1.2 Requirements and specifications
  1.3 Methods and procedures
  1.4 Work Plan
2 State of the art
  2.1 Natural Language Processing - NLP
  2.2 Machine Learning
  2.3 Biological Neuron
  2.4 Artificial Neural Networks
  2.5 Recurrent Neural Networks - RNN
  2.6 Encoder/Decoder architectures
3 Architectures
  3.1 RNN Encoder/Decoder architecture (Seq2Seq)
  3.2 Attention Mechanism
  3.3 Transformer architecture
4 Implementation
  4.1 Data bases
    4.1.1 Ubuntu Dialogue Corpus
    4.1.2 Open Subtitles Corpus
  4.2 Pre-Processing of the full Open Subtitles Corpus
  4.3 Parameters
    4.3.1 Baseline model

List of Tables

4.1 Percentage of covered data set for different maximum sentence lengths
4.2 Percentage of covered data set for different vocabulary sizes
5.1 Manual evaluation of the models trained with 5 million sentences
5.2 Manual evaluation of the models trained with 20 million sentences

Chapter 1. Introduction

A conversational agent or chatbot is a language recognition system able to maintain a conversation with a user through a question/answer protocol. The communication can take place over audio or text; in this project we focus on textual models.

The first chatbot models were rule-based, for instance ELIZA [13], PARRY¹ and A.L.I.C.E.². These models require a programmer to write rules and patterns beforehand for the analysis and decomposition of a sentence, and then create an answer by combining a template with keywords.

Thanks to the latest advances in machine learning, and more specifically in artificial neural networks, it is possible to create chatbot models that no longer require previously written rules. Instead, given a set of examples, the chatbot learns the patterns inherent in the samples. There are two different approaches, depending on the freedom they have when generating an answer: retrieval-based and generative-based.

Retrieval-based systems determine which answer, from a set of previously written answers, is the most appropriate given a sentence/question as input. These models are quite useful when the target domain is limited (e.g. a model trained only for sport or medicine conversations) and the chatbot is not allowed to commit grammatical or semantic errors during its service, for instance in FAQ³ and customer support services.
The problem is that they can barely handle unseen questions and become impractical in open domains (e.g. general knowledge).

Generative models, on the other hand, are trained to generate a response word by word. Nevertheless, not having rules implies that they have to learn to build sentences during training. For that reason, they are more complex and harder to train than retrieval-based systems. Generative models are usually prone to grammatical and semantic errors but, in return, they handle new data better and can answer with more natural sentences. What makes them an interesting approach is that they are a step towards what is known as Strong Artificial Intelligence⁴ (Strong AI): it is the system itself that analyzes, computes and builds an answer through autonomous learning, without any human intervention.

A great step forward for the area of generative-based chatbots was the implementation of a model using an encoder/decoder architecture with recurrent neural networks, known as sequence to sequence (Seq2Seq) [12], as used in translation [2] [9]. This project is motivated by the good results shown in that experiment, which set a new state of the art in the generative-based chatbot area.

¹ PARRY was a chatbot that simulated a person with schizophrenia. It was created by psychiatrist Kenneth Colby as a counterpart to the ELIZA model.
² Artificial Linguistic Internet Computer Entity (A.L.I.C.E.) is a chatbot created by Dr. Richard S. Wallace. The chatbot uses the Artificial Intelligence Markup Language (AIML) for the definition of patterns and rules.
³ Frequently Asked Questions services.
⁴ Strong Artificial Intelligence refers to systems that can perform any intellectual task that a human brain can.

1.1 Statement of purpose and contributions

The main goal of this project is to apply two encoder/decoder architectures from the state of the art in translation to generative-based conversational agents (chatbots) and to determine, through an experiment, which one performs better in general-topic conversations.

The main contribution of this project is the first implementation of a generative-based chatbot using the Transformer architecture. Results show that this architecture outperforms the state of the art in generative-based chatbot models. Additionally, some improvements have been applied to the basic model implemented by Google Brain resident Etienne Pot⁶ by adding a bidirectional encoder, an attention mechanism and a beam search algorithm to improve the quality of the answers. This improved version has also been shown to outperform the basic model.

1.2 Requirements and specifications

This project has been developed entirely in Python 3.5.3, one of the main languages used in machine learning nowadays, using the open-source machine learning library TensorFlow⁵ for both the implementation and the training of the models. All the software has been run on a cluster of 8 servers from the TSC department of the UPC, each with two Intel Xeon E5-2670 v3 2.3 GHz 12-core processors and a total of 16 NVIDIA GTX Titan X GPUs. Each GPU has 12 GB of memory and 3072 CUDA cores.

1.3 Methods and procedures

The project's main idea was originally proposed by my supervisors and builds on previous work [12] [11]. The baseline of this project is the model used in [12], which is a generative-based chatbot implemented with the neural machine translation architecture Seq2Seq.
For its implementation, a basic model published by Google Brain resident Etienne Pot on his GitHub page⁶ was used. After testing the model with different data sets, it was improved by adding an attention mechanism that allows the model to focus on the most relevant characteristics of the input sentence during the decoding phase. Additionally, as in [1], a bidirectional encoder and a beam search algorithm at the decoder were incorporated into the model to improve the quality of the answers.

Finally, the main contribution uses an architecture recently proposed by the Google research team [11], which consists of an encoder/decoder architecture based exclusively on attention mechanisms, without recurrent neural networks. The motivation for using this architecture is that it has been shown to outperform the state of the art in translation systems. It has been built using a TensorFlow library called Tensor2Tensor⁷.

⁵ https://www.tensorflow.org/
⁶ Seq2Seq chatbot model code available at https://github.com/Conchylicultor/DeepQA
⁷ https://github.com/tensorflow/tensor2tensor

2.2 Machine Learning

There are two types of learning algorithms: supervised and unsupervised. The difference lies in whether, during the training process, the training samples are labeled with information about the class they belong to or, conversely, there is no additional information and it is the system that must determine which class they belong to. These algorithms have become popular because they reduce the human intervention needed to define rules or patterns for the systems, letting the systems extract that information themselves. They have changed many areas, such as image processing, audio and speech processing, translation and conversational systems, among others. Most of them were created a long time ago, but their true potential has only been realized thanks to recent improvements in computing capacity and the availability of big data bases.

2.3 Biological Neuron

The learning algorithm on which this project focuses is inspired by the biological neurons of the brain. Neurons are a type of cell of the nervous system composed of a cell body called the soma, several input signal branches called dendrites and a single output signal branch called the axon (Figure 2.1).

Figure 2.1: Structure of a neuron cell

Axons split at their extremities into different sub-branches called telodendria. The connection between the telodendria of one neuron and the dendrites (or directly the soma) of another is performed by the synaptic terminal, a small structure that contains the neurotransmitter molecules responsible for the exchange of the nervous signal (also known as synapses). Neurons emit electrical impulses along the axon if the amount of electrical excitation received by the dendrites exceeds a threshold.

2.4 Artificial Neural Networks

A neural network is a type of machine learning algorithm inspired by the behaviour of biological neurons in the brain. It consists of a group of basic units called artificial neurons (AN) or perceptrons (see Figure 2.2) which are connected among themselves, composing a complex network. They compute an output from input data by decomposing it into different representations in order to identify different characteristics.

The first AN model was proposed by neurophysiologist Warren McCulloch and mathematician Walter Pitts in 1943 [4]. The proposed model is a simple mathematical approximation of the operation of a biological neuron, capable of computing basic operations such as the identity function, AND, OR and NOR (a toy sketch of such a threshold unit is shown below).
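As an illustration of this kind of binary threshold unit, the following Python sketch shows how a weighted sum compared against a threshold realizes AND and OR. The weights and thresholds are hand-picked for the example and are not part of the original model description.

```python
import numpy as np

def threshold_unit(inputs, weights, threshold):
    """Binary threshold neuron: fires (1) iff the weighted input sum reaches the threshold."""
    return int(np.dot(inputs, weights) >= threshold)

# Hand-picked weights and thresholds (illustrative only): with unit weights,
# AND needs both inputs active, OR needs at least one.
for a in (0, 1):
    for b in (0, 1):
        and_out = threshold_unit([a, b], weights=[1, 1], threshold=2)
        or_out = threshold_unit([a, b], weights=[1, 1], threshold=1)
        print(f"a={a} b={b}  AND={and_out}  OR={or_out}")
```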
Many other models have been proposed since then, but the simplest AN architecture is the perceptron, proposed by Frank Rosenblatt in 1957 [4]. Whilst the AN model proposed by McCulloch and Pitts used binary values, the perceptron can operate with any numbers. The algorithm computes an activation function over the weighted sum of the input values. Additionally, in order to give one extra degree of freedom, a bias is added, as shown in the following equation:

output = f( Σᵢ xᵢ wᵢ + w₀ )    (2.1)

where wᵢ and w₀ are the weights and the bias respectively.

Figure 2.2: Structure of a perceptron

The optimal values of these parameters are computed using gradient descent techniques, starting from a labeled training data set². Gradient descent is an iterative optimization algorithm used to find the global minimum of a function.

Activation functions are continuous and differentiable non-linear functions. They are required to be smooth in order to allow learning with gradient descent techniques. The non-linearity is an important condition that ensures a non-linear discriminant expression at the output of the neural network; otherwise a multilayer network and a single-layer network perform alike. Output values are binary: if the weighted sum exceeds a threshold imposed by the activation function (originally the Heaviside step function), the output is activated; otherwise it is deactivated.

The perceptron operates as a linear discriminant, which means that every unit can linearly separate samples into two classes. It is possible to compute basic operations such as AND or OR, but functions such as XOR are non-linearly separable problems and therefore not implementable (Figure 2.3).

Figure 2.3: XOR operation is a non-linearly separable problem

The solution is the multilayer perceptron (MLP), a network composed of multiple layers of perceptrons (Figure 2.4). The basic structure of an MLP consists of an input layer where all data is fed to the network, one or more hidden layers for multiple representations of the data and characteristic identification, and finally an output layer. The output layer can use a different activation function depending on the nature of the task. For classification, the output layer uses a softmax function that represents, for each target class, a probability of success.

Figure 2.4: Structure of a multilayer perceptron

² In unsupervised learning, it is required to apply clustering techniques first, in order to label the data.

Chapter 3. Architectures

This chapter explains the different encoder/decoder architectures used in this project. First, the RNN architecture used in [12] is described. Second, an overview of the attention mechanism used by the other two models is provided. Finally, the most recent architecture, based exclusively on attention, is explained.

3.1 RNN Encoder/Decoder architecture (Seq2Seq)

A very successful implementation of the encoder/decoder architecture for NLP tasks (especially in neural machine translation) is the RNN Encoder/Decoder [2] [9], also known as Sequence to Sequence (Seq2Seq). The encoder and decoder are recurrent neural networks, which allows the model to be fed with variable-length input sentences. Given an input sentence, the encoder iteratively computes, for each word, a hidden state vector from that word and the previous hidden state of the RNN (a minimal sketch of this recurrence is given below).
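As a rough illustration of this recurrence, and not the LSTM cells actually used in the project, the NumPy sketch below runs a single-layer vanilla RNN encoder over a toy sequence of word embeddings. The dimensions and randomly initialized weight matrices are placeholders; the last hidden state plays the role of the context vector discussed next.

```python
import numpy as np

def rnn_encoder(embeddings, W_xh, W_hh, b_h):
    """Vanilla RNN encoder: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for x_t in embeddings:                 # one embedded word per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states), states[-1]    # all hidden states, plus the final (context) vector

# Toy dimensions (illustrative only): a 5-word sentence, 64-dim embeddings, 128-dim hidden state.
rng = np.random.default_rng(0)
sentence = rng.normal(size=(5, 64))
W_xh = rng.normal(scale=0.1, size=(128, 64))
W_hh = rng.normal(scale=0.1, size=(128, 128))
b_h = np.zeros(128)
hidden_states, context = rnn_encoder(sentence, W_xh, W_hh, b_h)
print(hidden_states.shape, context.shape)  # (5, 128) (128,)
```

With the attention mechanism of section 3.2, the decoder would combine all of the returned hidden states instead of using only the final one.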
Once the whole sentence has been analyzed, the relevant information of the input sentence is contained in the last hidden state of the RNN, known as the context or thought vector. The decoder then computes, word by word, an output in the original representation space, using the information contained in the context vector and the previously decoded words. The implementation of the architecture can vary depending on the type of RNN cell used (a genuine RNN cell, an LSTM cell or a GRU cell), the number of cells per layer or the number of hidden layers, among other parameters. Figure 3.1 shows a diagram of the sequence to sequence architecture.

Figure 3.1: Diagram of the sequence to sequence architecture

As the size of the sentence increases, a larger quantity of information needs to be encoded into a fixed-length vector, so some of it is lost in the encoding process, resulting in poor chatbot performance.

3.2 Attention Mechanism

A solution to the problem caused by the fixed-length nature of the context vector is to allow the decoder to "see" the most relevant words of the input sentence during the decoding process (Figure 3.2). This method is called the attention mechanism [1].

Figure 3.2: Diagram of the sequence to sequence architecture with a bidirectional encoder and attention applied in the decoding process

Instead of only using the last hidden state vector as the context vector, for each word to decode the decoder computes a context vector as a weighted sum of all the hidden state vectors of the encoder. Unlike the Seq2Seq model presented previously, the attention mechanism uses a bidirectional encoder that encodes the input sentence word by word into two hidden state vectors: one going from the first to the last word of the sequence (forward) and the other reversing the sequence, going from the last to the first word (backward). The final hidden state vector is the concatenation of the forward and backward hidden states (bidirectional hidden state vector). The bidirectional encoding allows more information about each word of the sentence to be encoded.

The weights are computed by an alignment model and normalized over all values to obtain a percentage of how relevant each word of the input sentence is in relation to the word being decoded.¹

¹ For a further technical explanation of how the weights are computed, see [1].

3.3 Transformer architecture

The Transformer architecture is an encoder/decoder model based entirely on attention mechanisms², recently proposed by the Google team [11] as a new state-of-the-art neural machine translation (NMT) architecture. Additionally, they proved that the model generalizes to other NLP tasks, such as English constituency parsing [6].

RNNs are intrinsically sequential, which is a problem when parallelizing RNN models such as Seq2Seq. This problem is solved in the Transformer architecture because it is based only on attention mechanisms and contains no RNNs. Moreover, it has been shown to require less training time than the RNN encoder/decoders.

There are three main stages in the encoder (see Figure 3.3). In the first one, the input words are projected into a vector representation space by an embedding matrix; then, given that there is no information about the order and position of the words in the input sentence³, a positional encoding is added to the embedded input vectors (a sketch of this encoding is shown below).
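The sinusoidal positional encoding described in [11] can be sketched as follows. This is a simplified NumPy illustration rather than the Tensor2Tensor implementation used in the project, and the sentence length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...), as in [11]."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage (dimensions are illustrative): a 24-word sentence with 512-dim embeddings.
embedded = np.random.default_rng(0).normal(size=(24, 512))
encoder_input = embedded + sinusoidal_positional_encoding(24, 512)
print(encoder_input.shape)  # (24, 512)
```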
The second stage is a multi-head attention block (self-attention in this first case) that linearly projects the input information into different representation spaces and performs attention over all of them. This method allows the model to identify different semantic, morphological and lexical characteristics of the input sequence and attend to them separately during the decoding process. Finally, a position-wise feed-forward network is used, which applies two linear transformations to each position separately.

The decoder has five stages, the first two only used in the training phase: an output embedding and positional encoding (similar to the one used in the encoder, but for the target sentences in the training phase), a masked multi-head attention (also self-attention), a multi-head attention, a feed-forward network and finally a softmax layer to compute the output probabilities. Given that during decoding we cannot know the future words, attention can only be applied to the previous ones. This is what the masked multi-head attention does: it is a multi-head attention block with a mask that restricts attention to past words only. For a deeper technical explanation of the architecture, see [11].

² Unlike previous encoder/decoder models such as Seq2Seq, which use RNNs.
³ In RNN encoder/decoder models, due to their sequential nature, no positional information is required.

4.1.2 Open Subtitles Corpus

The Open Subtitles Corpus is composed of a wide range of movie and TV series scripts translated into multiple languages. It is generally used by video platforms to show subtitles for their movies and TV series. For this project, a small version of the English corpus with 1,651,332 sentences was used for the first experiments and later, given the need for a bigger data set, the full English corpus, composed of 337,847,902 sentences.

4.2 Pre-Processing of the full Open Subtitles Corpus

Although the Ubuntu Dialogue Corpus and the small version of the Open Subtitles Corpus were already processed, the full Open Subtitles Corpus required a pre-processing step to clean the data set and adjust it to the experiments. The initial format of the full Open Subtitles Corpus was a set of XML files distributed in different directories corresponding to movies and TV series. For simplicity, all scripts were extracted and written into a single file, where each row corresponded to the dialogue of a single speaker. The final file contained 337,847,902 sentences, which was extremely large.

The corpus also had a lot of noise and required pre-processing to reduce it. First, using a Python script, the data set was cleaned of symbols such as ", *, -, # and musical note symbols. After this pre-processing the data set was reduced to 335,190,993 sentences. Nevertheless, it was still too large, so some statistics were computed (Table 4.1) in order to find the maximum sentence length that covers 99% of the data set.

Percentage of covered corpus | 50% | 75% | 90% | 95% | 99%
Maximum sentence length      |  4  |  7  |  11 |  15 |  24

Table 4.1: Percentage of covered data set for different maximum sentence lengths

Additionally, the maximum and minimum sentence lengths found were 8251 and 0 respectively. A post-analysis showed that some sentences were long sequences of random numbers and others were simply null. By limiting the data set to a maximum sentence length of 24 words, 99% of the corpus is covered while random sequences are avoided (noise is reduced).
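Coverage statistics like those in Table 4.1 can be obtained with a short script along the following lines. The file name, the whitespace tokenization and the use of percentiles are assumptions made for illustration, not details taken from the thesis.

```python
import numpy as np

def length_coverage(corpus_path, percentiles=(50, 75, 90, 95, 99)):
    """Sentence length needed to cover each percentage of a one-sentence-per-line corpus."""
    lengths = []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            lengths.append(len(line.split()))   # whitespace tokenization (assumption)
    lengths = np.array(lengths)
    return {p: int(np.percentile(lengths, p)) for p in percentiles}

# Hypothetical usage on the cleaned corpus file (the path is an assumption):
# print(length_coverage("opensubtitles_clean.txt"))
# Per Table 4.1, the output would look like {50: 4, 75: 7, 90: 11, 95: 15, 99: 24}.
```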
After limiting the length and discarding null sentences, the corpus was reduced to 331,595,588 sentences. Given that it is necessary to establish a vocabulary size as a parameter for the models, statistics of corpus coverage for different maximum vocabulary sizes were computed from the processed corpus (Table 4.2). Among the 331,595,588 sentences there were 2,188,717,613 words, of which 2,420,428 were different.

Percentage of covered corpus | 50% | 75% | 90%  | 95%  | 99%
Vocabulary size              |  40 | 281 | 2365 | 8049 | 72827

Table 4.2: Percentage of covered data set for different vocabulary sizes

Given these results, the chosen vocabulary size is 72,827 words, which ensures that 99% of the corpus is covered.

4.3 Parameters

4.3.1 Baseline model

In the experiments in [12], the architectures had 1024 unit cells for the Ubuntu model and 4096 for the OpenSubtitles one. Due to computational limitations, our model had to be simpler; for that reason we used a two-layer LSTM model with 512 unit cells per layer. Although this model could not reach the results of the mentioned experiments, it still gives quite good results. The initial code used a dense dimension of 32 for the embedding matrix but, given that the vocabulary size of the data set was quite large, it was doubled to 64.

For the training, ADAM [7] was used, an optimizer algorithm that has been shown to achieve good results in deep learning applications and has become a reference. The parameter values recommended for generic deep learning tasks in [7] were used: a learning rate of 0.2, β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹. Additionally, in order to avoid overfitting, a dropout mechanism⁵ was applied with a keep probability of 90%. Batches of 256 samples were used as input data. For better results, a beam search algorithm was added at the decoder with a beam size of 4 and a length penalty factor of 0.6.

4.3.2 Seq2Seq + Attention Mechanism model

The Seq2Seq + Attention Mechanism model (henceforth the +Attention model) uses the same parameters as the baseline model for the Seq2Seq architecture and the training. Due to the attention mechanism and the large vocabulary size, the +Attention models require a lot of memory. A solution is to use a sampled softmax loss function which, instead of training with all possible words, uses only a smaller random set, set to 512 in this project. Due to problems with TensorFlow's new version, it was necessary to modify some of the native functions of the RNN encoder with attention in order to give the model a bidirectional encoding.

⁵ The dropout mechanism applies a probability, to each unit in the neural network, of being dropped. The elimination of units forces the model to lose some specificity over the training set and therefore avoids overfitting.

4.3.3 Transformer model

The configuration parameters are the same as the ones used in [11], with a few simplifications. The model has 8 attention heads in the multi-head attention block and 6 hidden layers with 512 unit cells each. As in the baseline and +Attention models, a beam search algorithm is used in the decoding process with a beam size of 4 and a length penalty factor of 0.6 (a sketch of this length-penalized scoring is given below). For the training, the ADAM optimizer algorithm has been used with the parameters recommended for this model in [11]: a learning rate of 0.2, β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹. The model also uses a dropout mechanism with the same keep probability as the Seq2Seq and +Attention models. The batch size of the input data is 4096 samples.
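The length-penalized beam search mentioned in sections 4.3.1 and 4.3.3 can be sketched generically as follows. The normalization formula ((5 + length)/6)^alpha is the commonly used GNMT-style penalty, assumed here for illustration, and the scoring stub stands in for a trained decoder; none of this code is taken from the project's implementation.

```python
import numpy as np

def length_penalty(length, alpha=0.6):
    """Length normalization (assumption: the GNMT-style formula commonly used with beam search)."""
    return ((5.0 + length) / 6.0) ** alpha

def beam_search(step_log_probs, beam_size=4, max_len=24, eos_id=1, alpha=0.6):
    """Generic beam search; step_log_probs(prefix) returns next-token log probabilities."""
    beams = [([], 0.0)]                                    # (token ids, summed log prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_id:            # finished hypothesis, keep as-is
                candidates.append((tokens, score))
                continue
            log_probs = step_log_probs(tokens)
            for tok in np.argsort(log_probs)[-beam_size:]: # top-k next tokens
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        # rank by length-normalized score and keep the best beam_size hypotheses
        candidates.sort(key=lambda c: c[1] / length_penalty(len(c[0]), alpha), reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

# Toy decoder stub (random scores over a 10-token vocabulary), for illustration only.
rng = np.random.default_rng(0)
best = beam_search(lambda prefix: np.log(rng.dirichlet(np.ones(10))), beam_size=4, max_len=5)
print(best)
```

With a beam size of 1 this reduces to greedy decoding; larger beams trade decoding time for higher-scoring answers.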
+Attention
Human:   what is two plus two ?
Chatbot: Two hundred dollars.
Human:   what is your name ?
Chatbot: Rusesabagina
Human:   how old are you ?
Chatbot: Pardon me.
Human:   my name is david. what is my name?
Chatbot: Porter.

Although the answers are not quite correct, it is possible to notice that the +Attention model is able to identify when an answer needs to be a number, an age or even a name.

5.3 Transformer model using the small version of the Open Subtitles Corpus

After having implemented and tested the Seq2Seq and +Attention models, a Transformer model was implemented. After a first training, the model showed very poor performance, almost always answering "I don't know" or "I'm not sure". The model was then trained further, but the results improved very little. At that point it was decided to replace the data set which, as previously mentioned, is a small version of the Open Subtitles data set, with the full corpus. Better results were expected with the new corpus.

5.4 Final Experiments with full Open Subtitles Corpus

For the final experiments, the Baseline, +Attention and Transformer models were trained using the full English Open Subtitles Corpus. As explained in section 4.2, the pre-processed corpus has 331,595,588 sentences, which is too large for the training process. For that reason, the training was divided: the models were pre-trained with a small subset of the corpus and the number of sentences was increased in subsequent trainings. The pre-training was performed using the first 5 million sentences of the data set. All models were configured with a maximum vocabulary size of 72,827 words and a maximum sentence length of 24 words, as explained in section 4.2. Although the maximum vocabulary size changes for the sub data set, the models were configured with the values computed for the full pre-processed corpus in order to make it possible to train them again with a different subset. All models were trained using each sentence both as the question for the following sentence and as the answer to the previous one.

For the evaluation, a new criterion was established: ranking the answers of each model as good, neutral or bad. The neutral label is for answers which are neither completely correct nor incorrect. For instance, when a model does not know what to answer it says "I don't know" or "I'm not sure", which, depending on the context, can be valid answers.

Given that the evaluation depends on the subjectivity of the evaluator, two people were asked to rank the answers. From their rankings, an agreement is computed to measure how many sentences they ranked alike. Results are shown in Table 5.1. Given that the purpose of these chatbots is to model language and not to apply logic to their answers, the evaluators were asked to rank an answer positively if the type of answer matches the question even when the information is not correct (for instance, asking the model the result of an operation and getting a number, but not the correct one).

System      | Ranking | Evaluator 1 | Evaluator 2 | Agreement
RNN         | Good    | 46          | 49          | 43
+Attention  | Good    | 56          | 56          | 46
Transformer | Good    | 83          | 94          | 77
RNN         | Bad     | 140         | 145         | 136
+Attention  | Bad     | 62          | 98          | 78
Transformer | Bad     | 87          | 30          | 24
RNN         | Neutral | 14          | 5           | 3
+Attention  | Neutral | 62          | 45          | 40
Transformer | Neutral | 87          | 75          | 68

Table 5.1: Manual evaluation of the models trained with 5 million sentences

Some examples of the results on different styles of conversation follow: basic, philosophical, numerical and general knowledge.

Conversation 1: Basic
Context:     My name is Mary Johnson. What is my name?
RNN:         I'm sorry
+Attention:  Jinha.
Transformer: Mary.

Context:     What time is it ?
RNN:         I'm sorry
+Attention:  8:15
Transformer: It's after 10:00 .

Conversation 2: Philosophical
Context:     Why are you here ?
RNN:         I'm sorry
+Attention:  What do you want?
Transformer: I'm meeting someone called Natalie.

Context:     What is the purpose of this conversation ?
RNN:         I'm not a little
+Attention:  I'm not going to be a man.
Transformer: I don't know.

Conversation 3: Numerical
Context:     How old were you when you were in school ?
RNN:         I'm not sure.
+Attention:  Novalee
Transformer: Six years.

Context:     i have two apples, Paul gives me two oranges, how many apples do i have?
RNN:         No.
+Attention:  I haven't been here since i came to you.
Transformer: One.

Conversation 4: General knowledge
Context:     ronald reagan is ...
RNN:         I'm not a child.
+Attention:  Yes
Transformer: He's a friend of mine.

Context:     The Pacific ocean is ...
RNN:         I'm not gonna be a little more.
+Attention:  The ocean.
Transformer: If you want to give me a hand, I'll die.

The agreement between both evaluators shows that the Transformer model performs better than the recurrence-based ones.

After this evaluation, the models were trained again using another 15 million different sentences from the full corpus, for a total of 20 million training sentences. In this case, three evaluators were asked to rank the chatbot performances; results are shown in Table 5.2.

System      | Ranking | Evaluator 1 | Evaluator 2 | Evaluator 3 | Agreement
RNN         | Good    | 41          | 47          | 54          | 32
+Attention  | Good    | 61          | 51          | 71          | 43
Transformer | Good    | 74          | 57          | 70          | 51
RNN         | Bad     | 75          | 123         | 53          | 46
+Attention  | Bad     | 90          | 116         | 57          | 46
Transformer | Bad     | 25          | 110         | 25          | 20
RNN         | Neutral | 84          | 30          | 93          | 15
+Attention  | Neutral | 49          | 33          | 72          | 6
Transformer | Neutral | 101         | 33          | 105         | 18

Table 5.2: Manual evaluation of the models trained with 20 million sentences

After the second training, the Baseline and +Attention models improved significantly in comparison with the pre-trained models. According to the agreement between evaluators, the Transformer architecture is good or neutral in 34.5% of the questions, whilst the baseline plus attention is good or neutral in 24.5% of the questions. Finally, the baseline architecture obtained the worst rank.

Chapter 6. Conclusions and Further Research

Chatbots are systems with the ability to model natural language and establish a conversation with a user through a question/answer protocol. There are three types of approaches, depending on the freedom they have when answering: rule-based, retrieval-based and generative-based. The first two approaches are the most used nowadays due to their effectiveness at maintaining a closed-domain conversation. The generative-based models, on the other hand, arise as a powerful alternative in the sense that they handle open-topic conversation better. They are closely related to the idea of strong AI: no human intervention at the time of answering, everything is learned by the machine.

Promising results have been achieved in generative-based chatbot models by applying neural machine translation techniques with RNN encoder/decoder architectures. In this thesis, it has been shown that chatbot models based on encoder/decoder architectures using exclusively attention outperform RNN models.

It is important to mention that all the models shown in this project shape and mimic natural human language but do not apply any logic to their answers. That is why most of the answers are not coherent with each other and the final model lacks a "personality". Encoder/decoder architectures are a very interesting approach to solving this last problem.
A logical block could be added between the encoder and the decoder. Then, once an input sentence has been encoded, some reasoning would be applied to compute what is intended to be answered. Finally, all the information would be passed to the decoder, which models an answer in natural language describing what the logical block has reasoned.

As direct further research from this thesis, the author and his supervisors are organizing a hackathon competition at the 4 Years From Now Conference 2018 (Barcelona). The objective of this hackathon is to build a multilingual chatbot model based on a modified version of the Transformer architecture with an additional intermediate block that will allow the translation modelling part to be separated from the conversational one.

Chapter 7. Appendix

This appendix contains the paper [3], currently under review at the recognized international conference CICLING 2018.

Experimental research on encoder-decoder architectures with attention for chatbots

Marta R. Costa-jussà, Álvaro Nuez, and Carlos Segura*
TALP Research Center - Universitat Politècnica de Catalunya, Barcelona
* Telefònica I+D, Barcelona
marta.ruiz@upc.edu, alvaronuez.eis@gmail.com, carlos.seguraperales@telefonica.com

Abstract. Chatbots aim at automatically offering a conversation between a human and a computer. While there is a long track record of research in rule-based and information-retrieval-based approaches, the generation-based approach is quite recent and can be dramatically improved by adapting recent advances in close areas such as machine translation. In this paper, we offer an experimental view of how alternative encoder-decoder deep learning architectures perform in the context of chatbots. Our research concludes that a fully attention-based architecture is able to dramatically outperform the recurrent neural network baseline system.

Keywords: Chatbot, Encoder-Decoder, Attention Mechanisms

1 Introduction

A chatbot stands for the short version of chat plus robot and it is a computer program that conducts a human-machine conversation on any topic.

One of the very first chatbots was rule-based. It was proposed in 1966 by Joseph Weizenbaum's program ELIZA [11]. Input sentences were analyzed using several predefined decomposition rules, and after that key words were used to generate responses to them. The Artificial Intelligence Markup Language (AIML) is an evolution of these first rule-based chatbots. AIML follows the idea of defining written patterns and the corresponding templates which are responses to the patterns. Then, at inference time, if the robot identifies a pattern in a sentence from a user, the robot is able to reply using the corresponding template [9].

To reduce the amount of work that developing these patterns and templates requires, alternative chatbots, no longer rule-based but retrieval-based, were proposed. These systems use different dialogue databases to train an information retrieval system [2]. The big advantage of these retrieval-based systems is that their training requires little human dedication. However, these systems still rely on giving the most appropriate response from a set of sentences.

Thanks to the emergent deep learning techniques, novel generative-based approaches have arisen, offering chatbots that are capable, for the first time, of responding to non-predefined sentences. The first successful approach is based on the popular […]
[…] characters. We constructed a dataset consisting of pairs of consecutive utterances, using every sentence twice, as context and as target. Due to computing and memory constraints, we extracted a subset of the first 10 million sentences for training, using each sentence as context and as target. Therefore, we end up training with 20 million sentences for contexts and targets. Preprocessing of the database consisted of removing XML tags, limiting the sentence size and removing strange symbols (e.g. #). Details on the training and evaluation split are reported in Table 1.

Table 1. Size of the parallel corpora

Set        | Role           | Sentences  | Words       | Vocab
Training   | Context/Target | 20,000,000 | 131,708,317 | 349,603
Evaluation | Context        | 200        | 1,446       | 399

3.2 Parameters

In order to ensure that we cover 99% of the dataset, we have limited the vocabulary size to 72,827 words and the length of sentences to 24 words. All words that are used only once are discarded. All three models have a dense size of 64 for the embedding matrix. The recurrent architecture for both models (with and without attention) has 2 LSTM layers with 512 units. Additionally, the model with attention uses 512 softmax samples. For the training, we used ADAM with a learning rate of 0.002, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸; a batch size of 256 and a dropout rate of 0.1. The Transformer architecture has 8 attention heads and 6 hidden layers with 512 units. For the training we have used ADAM with a learning rate of 0.2, β₁ = 0.9, β₂ = 0.98 and ε = 10⁻⁹; a batch size of 4096 and dropout and attention dropout rates of 0.1. For better responses we have added to the three models a beam search with a beam size of 4 and an alpha of 0.6.

3.3 Evaluation

In order to compare our three conversational models, we asked three evaluators to rank chatbot responses as good, bad or neutral. Results are shown in Table 2, with the rankings from Evaluator 1, Evaluator 2 and Evaluator 3 and the agreement between them. The results show a clear improvement of the attention mechanisms over the baseline system. In fact, all evaluators agree that the Transformer architecture is good or neutral in 34.5% of the questions, whereas the baseline architecture is only good or neutral in 23.5% of the questions and the baseline plus attention is good or neutral in 24.5% of the questions.

Table 2. Manual Evaluation

System      | Ranking | Evaluator 1 | Evaluator 2 | Evaluator 3 | Agreement
RNN         | Good    | 41          | 47          | 54          | 32
+Attention  | Good    | 61          | 51          | 71          | 43
Transformer | Good    | 74          | 57          | 70          | 51
RNN         | Bad     | 75          | 123         | 53          | 46
+Attention  | Bad     | 90          | 116         | 57          | 46
Transformer | Bad     | 25          | 110         | 25          | 20
RNN         | Neutral | 84          | 30          | 93          | 15
+Attention  | Neutral | 49          | 33          | 72          | 6
Transformer | Neutral | 101         | 33          | 105         | 18

In addition to the results, we report several examples of the different styles of conversation present in the test set: basic, philosophical, numerical and general knowledge.

Conversation 1: Basic
Context:     what is your name ?
RNN:         Anouk.
+Attention:  Adriano faustino sotolongo.
Transformer: i don't know .

Context:     what time is it ?
RNN:         I do n't know.
+Attention:  6:30.
Transformer: i don't know.

Conversation 2: Philosophical
Context:     I think you are a machine
RNN:         I'm sorry
+Attention:  Vlanne:
Transformer: I'm not a machine .

Context:     Life is sad .
RNN:         I'm not a monster.
+Attention:  I do n't want to go to jail.
Transformer: It's not sad .

Context:     Is there a god ?
RNN:         No.
+Attention:  Surprisingly, yes.
Transformer: no .

Conversation 3: Personality
Context:     How old were you when you were able to speak ?
RNN:         Eighteen.
+Attention:  Seventeen.
Transformer: six months.
Conversation 4: General knowledge
Context:     ronald reagan is ...
RNN:         I'm not a good person.
+Attention:  And i'm not a spy..
Transformer: he's in the hospital.

Context:     Is five plus five equal to ten ?
RNN:         No, no.
+Attention:  That's exactly what i'm gon na do.
Transformer: yes .

4 Conclusions

Attention-based mechanisms are revolutionizing natural language, speech and image processing applications. In this paper, we implement a couple of recently proposed attention mechanisms in the chatbot application. Experiments trained on an open-domain database show that a fully attention-based architecture performs significantly better in a variety of contexts, including basic, philosophical, personality and general knowledge. Three evaluators agreed on rating the fully attention-based mechanism as either good or neutral 34.5% of the time, while the responses of the baseline system were only good or neutral 23.5% of the time. Taking advantage of this generic encoder-decoder architecture, as further research we plan to introduce further contexts while training the system, so as to allow the system to keep coherence in longer dialogues, and to train our system on multiple languages.

Acknowledgments. This study has been funded by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund, through the postdoctoral senior grant Ramón y Cajal and the contract TEC2015-69266-P (MINECO/FEDER, EU).

References

1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
2. Rafael E. Banchs and Haizhou Li. IRIS: a chat-oriented dialogue system based on the vector space model. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the System Demonstrations, July 10, 2012, Jeju Island, Korea, pages 37-42, 2012.