LASER for NLP tasks | Sentiment Analysis- Part II | Engati

Engati
Feb 10, 2020

The previous article (Part 1) covers the LASER concepts and its architecture.

Here's a step-by-step tutorial on using LASER for a multi-class classification task (sentiment analysis). LASER provides multilingual sentence embeddings, which can be used to train a model to carry out sentiment analysis.

There are six sections in this tutorial:

1. Dataset Preparation

2. Setup and installation

3. Classification Model Training

4. Inference

5. Analysis of results

6. Conclusion

Dataset Preparation

Clean your data to make sure there are no empty rows or NA values in your dataset. Split the entire dataset into training, dev, and test sets, where train.tsv and dev.tsv have labels and test.tsv does not. After cleaning, I had around 31k English sentences in my training dataset.
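As a rough sketch, the cleaning and splitting can be done with pandas and scikit-learn (the file and column names here are assumptions for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the raw data; 'text' and 'label' column names are assumed
    df = pd.read_csv('data.tsv', sep='\t')

    # Drop NA values and empty rows
    df = df.dropna(subset=['text', 'label'])
    df = df[df['text'].str.strip() != '']

    # Split into train / dev / test (80 / 10 / 10)
    train_df, rest_df = train_test_split(df, test_size=0.2, random_state=42)
    dev_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)

    train_df.to_csv('train.tsv', sep='\t', index=False)
    dev_df.to_csv('dev.tsv', sep='\t', index=False)
    test_df[['text']].to_csv('test.tsv', sep='\t', index=False)  # test set is written without labels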

Setup and installation

For setup, you can follow the Docker setup described in the LASER repository to get the sentence embeddings. Alternatively, there is a Python package named laserembeddings, a production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) for computing multilingual sentence embeddings. I used the laserembeddings package to get the embeddings.

Run the following commands to install the package and download its pretrained models (the download step is from the laserembeddings documentation):

    pip install laserembeddings
    python -m laserembeddings download-models

Calculate the embeddings for all the sentences in the dataset using the laserembeddings package and store all the embedding vectors (1024-dimensional) in a NumPy file.
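A minimal sketch of this step with the laserembeddings API (file names are assumptions):

    import numpy as np
    import pandas as pd
    from laserembeddings import Laser

    laser = Laser()

    # Embed every training sentence; returns an (N, 1024) array
    train_df = pd.read_csv('train.tsv', sep='\t')
    embeddings = laser.embed_sentences(train_df['text'].tolist(), lang='en')

    # Store the vectors so they can be reused without re-encoding
    np.save('train_embeddings.npy', embeddings)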

Classification Model Training

Since we already have the embeddings computed for all the sentences, LASER acts as the encoder in our model, providing the embeddings for the input sentences. Now we need to build a classifier network as the decoder, to classify a sentence as positive, negative, or neutral sentiment:

For building the model, do the following imports:

    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

For modeling, define a simple Sequential model (I will explain the parameters given below; the hidden layer sizes here are illustrative):

    model = Sequential([
        Dense(512, activation='relu', input_shape=(1024,)),  # LASER embeddings are 1024-dimensional
        Dense(128, activation='relu'),
        Dense(3, activation='softmax'),  # 3 classes: positive, negative, neutral
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

Here we have a Sequential model that works as the decoder. You can tweak the layers; the above is a very simplistic architecture. I also experimented with adding a GlobalAveragePooling layer, but the results were about the same. The '3' in the final Dense layer indicates we have 3 classes to predict. I used the adam optimizer and categorical_crossentropy loss, as this is a multi-class classification problem [positive, negative, neutral]. X1 is the embeddings for all the sentences in the training dataset and Y1 is the corresponding labels, while X2 is the embeddings for the validation dataset and Y2 is its labels. I ran it for around 7 epochs, for which val_accuracy came to around 92%. You can decide the number of epochs based on your data.
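The training call, with X1/Y1 and X2/Y2 as described above, looks roughly like this (the labels must be one-hot encoded to match categorical_crossentropy; the batch size is an assumption):

    # Y1 and Y2 are one-hot encoded, shape (N, 3)
    model.fit(X1, Y1,
              validation_data=(X2, Y2),
              epochs=7,
              batch_size=32)  # batch size assumed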

Finally, I trained the model using my English dataset(consisting of 31k sentences).

After you train the model by running the above steps, make sure you save it:

    model.save('laser_model.h5')  # file name assumed; reloaded as laser_model below

Inference

Now, with the saved model, I carried out the inference task by loading laser_model:

    import tensorflow as tf
    from tensorflow.keras.models import load_model

    model = load_model('laser_model.h5')

I ran inference on the test dataset using the loaded model via model.predict_classes, with an accuracy of around 90%. I then converted my test dataset into various languages (Hindi, German, French, Arabic, Tamil, Indonesian) using a Google Translate package, and after getting the embeddings with laserembeddings, I used the trained model to carry out the inferences.
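A sketch of the cross-lingual inference step (predict_classes is only available on Sequential models in older Keras versions; np.argmax over model.predict gives the same result):

    import numpy as np
    from laserembeddings import Laser

    laser = Laser()

    # hindi_sentences is the translated test set as a list of strings (assumed)
    # LASER maps all languages into one embedding space, so no retraining is needed
    hindi_embeddings = laser.embed_sentences(hindi_sentences, lang='hi')

    # Predict the sentiment class index for each sentence
    predictions = np.argmax(model.predict(hindi_embeddings), axis=1)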

Analysis of results

The accuracy score on various languages came as:

Hindi — 89.2%
German — 87%
French — 88%
Arabic — 87.7%
Tamil — 79%
Indonesian — 84%

Conclusion

In this article, we have learned to use LASER for a multi-class classification task. LASER can also be used for other Natural Language Processing tasks beyond classification; for example, I implemented multilingual FAQ support with it. With adequate data, the results are surprisingly good even with the simple architecture of the decoder model. With less data, however, it has some issues: I noticed this when I tried the same approach on data from another domain, which was comparatively smaller. So you should have enough data to achieve very good results. I am also exploring multilingual BERT to see if it outperforms LASER and will come back with the results. Thanks for reading, and have a great day ahead. Check out the article about "Sentence similarity, a tough NLP problem" on the Engati blog.

Originally published at https://blog.engati.com on February 10, 2020.
