Localization in Streamlit using spaCy and word2vec
- Introduction
Language localisation (or localization) is the process of adapting a product's translation to a specific country or region. It is the second phase of a larger process of product translation and cultural adaptation (for specific countries, regions, cultures or groups) to account for differences in distinct markets, a process known as internationalisation and localisation.
- What is the use of localization?
Assume you are a writer for a movie and you have created a script in Hindi. Since the script is in Hindi, it will have character names, product names, and locations based in India. Now suppose this movie was a blockbuster in India and producers from different parts of the world want to re-create it in their own countries: they will need the script, which is in Hindi, and a writer from Germany won't understand it. First the script has to be translated into German, which can easily be done nowadays using Google Translate, but the character names, product names, and locations will still be Indian. This is where localization comes into play: we can use this technique to convert these named entities from India to Germany. For example, the capital of India, Delhi, can be converted to Berlin, the capital of Germany.
- Problem statement
Given some data, we will try to achieve localization on it. Below are the steps involved:
* Accept the data (PDF/document/txt or raw text) from the user.
* Use spaCy to extract named entities (locations, names of people, organization names, etc.).
* Pass these named entities to word2vec and find localized words for these entities for whichever country we require.
* Replace the original entities in the accepted data with their localized counterparts.
In this notebook I have demonstrated the above steps using Streamlit, which allows us to create a user-friendly front end.
The spaCy library helps us find these named entities in the given data; the short sketch below shows the idea.
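A minimal spaCy NER sketch, assuming the small English model en_core_web_sm is installed (the sentence is just an illustrative example):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Max moved from Delhi to work at Infosys.")
for ent in doc.ents:
    # prints each entity with its label, e.g. Max PERSON, Delhi GPE, Infosys ORG
    print(ent.text, ent.label_)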
My team member Parvez Shaikh and I, Amar Sharma, created many such notebooks as part of the coursework for the "Master in Data Science Programme" at Suven, under the mentorship of Rocky Jagtiani.
- What are we trying to achieve?
Before starting the coding part, we will take a quick look at what we will achieve in the end.
* Understanding the above output:
The user inputs some data; here the user has entered "My name is Max" and selected India as the country, meaning they want to localise the data as per Indian norms. In the output we can see that "Max", a person's name common in the USA, has been converted to "Sanjay", a proper Indian name.
- Coding part
After installing the required packages, import the libraries.
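Something along these lines; the package names are the standard PyPI ones, and the choice of PDF-reading library is my assumption (any reader such as PyPDF2 would do):

pip install streamlit spacy gensim PyPDF2
python -m spacy download en_core_web_sm

import streamlit as st
import spacy
from gensim.models import KeyedVectors
import PyPDF2  # assumed choice for reading uploaded PDFs

nlp = spacy.load("en_core_web_sm")  # spaCy pipeline used for NER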
We have created a main function.
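A rough sketch of what such a main function could look like; the widget labels are my assumptions, and Localise_data and load_gensim are the helper functions described below:

def main():
    st.title("Localization with spaCy and word2vec")
    # Accept raw text from the user (file upload is handled similarly)
    text = st.text_area("Enter the text to localise")
    country = st.selectbox("Select country", ["India", "Germany"])
    if st.button("Process"):
        model = load_gensim()
        suggestions = Localise_data(text, country, model)
        st.write(suggestions)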
Now we will create a Localise_data function that returns the top 5 most similar words for each named entity (PERSON, GPE) that is identified.
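A minimal sketch of such a function, assuming the spaCy pipeline nlp and the gensim model loaded above, and assuming 'USA' as the source country whose vector is subtracted:

def Localise_data(text, country, model):
    suggestions = {}
    for ent in nlp(text).ents:
        if ent.label_ in ("PERSON", "GPE"):
            try:
                # analogy: entity - 'USA' + target country, top 5 neighbours
                similar = model.most_similar(
                    positive=[ent.text, country],
                    negative=["USA"],
                    topn=5,
                )
                suggestions[ent.text] = [word for word, _ in similar]
            except KeyError:
                # entity is not in the word-vector vocabulary
                suggestions[ent.text] = []
    return suggestions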
The model.most_similar function works by simple vector arithmetic:
* First it subtracts the word vector of 'USA' from the word vector of 'Frank'.
* Then the word vector of 'India' is added to the result.
* The word closest to the resulting vector is returned as the top match.
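In gensim this whole analogy is a single call (illustrative, not the author's exact code):

# vector('Frank') - vector('USA') + vector('India')
model.most_similar(positive=["Frank", "India"], negative=["USA"], topn=1)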
We have created a load_gensim function that returns the word-vector model. Since this model is about 1.6 GB, I have downloaded it and kept it in a local folder, and only a chunk of the model is read each time the function is called. You can download the model from here.
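A sketch of load_gensim under those assumptions; the limit argument is what restricts loading to a chunk of the full model, and the file name is the standard one for the Google News vectors (the chunk size of 500,000 is my assumption):

def load_gensim():
    # load only the first 500,000 vectors instead of the full 1.6 GB file
    return KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin",
        binary=True,
        limit=500000,
    )

In recent Streamlit versions, decorating this function with st.cache_resource would avoid re-reading the file on every interaction.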
Finally, we call the main function.
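The usual Python entry point does this:

if __name__ == "__main__":
    main()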
Once this much code is written, we can run our Streamlit app from the conda environment by executing the command below.
streamlit run path_of_python_file.py
After you have run the above command, your Streamlit app will start running on a local URL.
- Output
Open the local URL in your browser; it will display the user interface of your Streamlit application.
I will demonstrate by uploading a PDF and localising it.
- User interface
- Select the 'Upload From Server' option from the drop-down list.
- After selecting what manual replacements we need, we will again click on the 'Process' button.
Here, my PDF had only 2 sentences, and localisation worked well. But if we try to localise data that our model has not seen, we won't get proper results, since we have loaded only a chunk of the Google word-vector model.
- Conclusion
- By loading the entire Google word-vector model, we can improve our results.
- We can load data from different data sources, e.g. reading data directly from a webpage.
- We can improve our user interface to give the user more options for manipulating the uploaded data.
- Acknowledgements
I would like to humbly and sincerely thank my mentor Rocky Jagtiani. He is more of a friend to me than a mentor. The Python for Data Science course taught by him, and the various assignments we did and are still doing, are the best way to learn and build skills in the Data Science field.