Deep Learning (2019-2020): Project: John Samuel

Download

Download the data from https://zenodo.org/record/3271358.

View the data contents. As you can see in the output, there are four columns: timestamp, property, language and type.

Take an example line from this dataset:

2013-09-10T22:43:54Z,P856,en,label

This line records that an English label for property P856 was added for the first time at 2013-09-10T22:43:54Z.

Following is a description of each column.

timestamp: the time at which an action was made, for example, 2013-09-10T22:43:54Z
property: the Wikidata property identifier (its P-number), for example, P856
language: the language in which a label/description/alias was first translated
type: one of the following values: label, description, or alias
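As a quick check of the column layout described above, the following minimal sketch parses the example line with pandas. The in-memory CSV and the names `sample`, `row`, and `when` are illustrative assumptions, and the header row is assumed to use the four column names listed above.

```python
import io
import pandas as pd

# Hypothetical one-row CSV; header names are assumed to match the column list above.
sample = io.StringIO(
    "timestamp,property,language,type\n"
    "2013-09-10T22:43:54Z,P856,en,label\n"
)
row = pd.read_csv(sample).iloc[0]

# An ISO 8601 timestamp ending in "Z" is parsed as a timezone-aware value.
when = pd.to_datetime(row["timestamp"])
print(when.year, row["property"], row["language"], row["type"])
```

Parsing the timestamp into a datetime value, rather than keeping it as a string, makes it usable directly on a time axis later on.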

Creation of dataframe

import pandas as pd
dataframe = pd.read_csv("multilingual_wikidata_translation_flow.csv")
# remove duplicates
dataframe = dataframe.drop_duplicates()
# remove rows with missing values
dataframe = dataframe.dropna()
print(dataframe)

Get summary information about the dataframe.

dataframe.describe()

Let us now take a look at the translation of labels in different languages for a given property.

dataframe.loc[(dataframe["property"]=="P856") & (dataframe["type"]=="label")]

We will check the data available for a property like P856.

ptranslation = dataframe.loc[(dataframe["property"]=="P856")]
print(ptranslation)

You can also add additional conditions to filter out any information. The code below filters out the translation of labels of property P856. Please copy the code below and create new cells for testing with different properties and see the results for translation of descriptions and aliases. What are your first observations? Please note them as comments in your notebook.

ptranslation = dataframe.loc[(dataframe["property"]=="P856") & (dataframe["type"]=="label")]
print(ptranslation)

Visualization of translation

Our next goal is to plot the translation of a property and see whether we can observe any translation pattern. Please copy the code below and create new cells for testing with different properties and plot the results. In this plot, we plot the time on the x-axis and translation type on the y-axis.

import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
ptranslation = dataframe.loc[(dataframe["property"]=="P856")]
x = []
y = []
z = []
for i in range(0, len(ptranslation['timestamp'])):
    # parsing the date in string and converting it to datetime
    try:
        value = parser.parse(ptranslation['timestamp'].iloc[i])
        x.append(value)
    except Exception:
        continue
    y.append(ptranslation['type'].iloc[i])
    z.append(ptranslation['language'].iloc[i])
# creating a plot
plot.rcParams["figure.figsize"] = (25, 9)
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {'label': colors[0], 'description': colors[1], 'alias': colors[2]}
cs = [tcolors[i] for i in y]
fig, ax = plot.subplots()
ax.scatter(x, y, s=30, color=cs)
plot.xlabel("Time")
plot.ylabel("Translation type")
# annotating the points
for i, txt in enumerate(z):
    ax.annotate(txt, (x[i], y[i]))
fig.show()
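The loop above parses each timestamp one at a time and skips rows that fail to parse. As a possible alternative, the sketch below does the same thing in a vectorized way with `pd.to_datetime`. The toy dataframe `toy` and its deliberately malformed second row are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the second timestamp is deliberately malformed.
toy = pd.DataFrame({
    "timestamp": ["2013-09-10T22:43:54Z", "not-a-date", "2013-09-11T08:00:00Z"],
    "property": ["P856", "P856", "P856"],
    "language": ["en", "fr", "de"],
    "type": ["label", "label", "description"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["parsed"] = pd.to_datetime(toy["timestamp"], errors="coerce", utc=True)

# Dropping the NaT rows mirrors the "continue" branch of the loop.
clean = toy.dropna(subset=["parsed"])

x = list(clean["parsed"])
y = list(clean["type"])
z = list(clean["language"])
print(len(x), y, z)
```

The resulting `x`, `y`, and `z` lists can be passed to `ax.scatter` and `ax.annotate` in the same way as in the loop-based version.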

Now we plot another graph, this time with language on the y-axis and time on the x-axis.

import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
ptranslation = dataframe.loc[(dataframe["property"]=="P279")]
x = []
y = []
z = []
for i in range(0, len(ptranslation['timestamp'])):
    try:
        # parsing the date in string and converting it to datetime
        value = parser.parse(ptranslation['timestamp'].iloc[i])
        x.append(value)
    except Exception:
        continue
    y.append(ptranslation['type'].iloc[i])
    z.append(ptranslation['language'].iloc[i])
# creating a plot
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {'label': colors[0], 'description': colors[1], 'alias': colors[2]}
cs = [tcolors[i] for i in y]
plot.rcParams["figure.figsize"] = (25, 25)
fig, ax = plot.subplots()
ax.scatter(x, z, s=50, color=cs)
plot.xlabel("Time")
plot.ylabel("Language")
# saving the plot in a file
plot.savefig('translation.png')
fig.show()

The following three plots concern properties P31, P279, and P856, respectively. What do you observe about the values on the y-axis? Please note your observations as comments in the notebook. You can also see that several translations can even be made in a single edit: check the column of red lines in the first figure (P31).

In the code given below, we select some languages and see the order of translation. Copy the code below and plot the graphs for different properties. Please note your observations as comments on the notebook.

import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
# list of languages under consideration
langlist = ["fr", "es", "en", "de", "it", "pt", "ja"]
ptranslation = dataframe.loc[(dataframe["property"]=="P18") & (dataframe["language"].isin(langlist))]
print(ptranslation)
x = []
y = []
z = []
for i in range(0, len(ptranslation['timestamp'])):
    try:
        # parsing the date in string and converting it to datetime
        value = parser.parse(ptranslation['timestamp'].iloc[i])
        x.append(value)
    except Exception:
        continue
    y.append(ptranslation['type'].iloc[i])
    z.append(ptranslation['language'].iloc[i])
# creating a plot
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {'label': colors[0], 'description': colors[1], 'alias': colors[2]}
cs = [tcolors[i] for i in y]
plot.rcParams["figure.figsize"] = (10, 5)
fig, ax = plot.subplots()
ax.scatter(x, z, s=50, color=cs)
plot.savefig('translation.png')
fig.show()

So far, we have used plots to detect some translation patterns by hand. Now we use other algorithms to see whether groups of languages are always present together.

Before continuing, we need to install the library mlxtend.

!pip install mlxtend

Next we prepare the language dataset. We will focus on the translation of property labels. We create a list of lists: for every property in our dataset, the list of languages in which its label is available.

# prepare dataset of languages
languageorder = []
labeldataframe = dataframe.loc[(dataframe["type"] == "label")]
groups = labeldataframe.groupby("property")
for k, group in groups:
    # each group holds the label rows of one property, in dataframe order
    languageorder.append(list(group["language"]))
print(languageorder)

This will give us the following output.

[['en', 'it', 'fi', 'fr', 'de', 'zh-hans', 'ru', 'hu', 'he', 'nl', 'pt', 'pl', 'ca', 'cs', 'ilo', 'zh', 'nb', 'ko',..
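To see what this grouping step produces on a small scale, here is a minimal sketch on a toy dataframe. The rows in `toy` are invented for illustration. The explicit sort by timestamp is an assumption: it is only needed if the rows of the file are not already in chronological order.

```python
import pandas as pd

# Hypothetical rows, deliberately not in chronological order.
toy = pd.DataFrame(
    [
        ("2013-01-01T00:00:00Z", "P31", "en", "label"),
        ("2013-01-03T00:00:00Z", "P31", "fr", "label"),
        ("2013-01-02T00:00:00Z", "P31", "de", "label"),
        ("2013-01-02T00:00:00Z", "P31", "en", "description"),
        ("2013-02-01T00:00:00Z", "P856", "en", "label"),
        ("2013-02-05T00:00:00Z", "P856", "it", "label"),
    ],
    columns=["timestamp", "property", "language", "type"],
)

# Keep labels only; ISO 8601 strings in the same format sort chronologically.
labels = toy[toy["type"] == "label"].sort_values("timestamp")

# One list of languages per property, in order of first translation.
orders = labels.groupby("property")["language"].apply(list).tolist()
print(orders)
```

Each inner list is one "transaction" for the frequent-itemset step and one sequence for the prediction task at the end.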

Now we calculate the frequent itemsets using the Apriori algorithm. Uncomment the print statement to see the intermediate dataframe.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
te = TransactionEncoder()
te_ary = te.fit(languageorder).transform(languageorder)
# preparation of data frame
df = pd.DataFrame(te_ary, columns=te.columns_)
#print(df)
# use of apriori algorithm
frequent_itemsets = apriori(df, min_support=0.75, use_colnames=True)
print(frequent_itemsets)

The program uses a minimum support of 0.75. We see that English labels are present for all the properties (support 1.0), whereas French labels are available for only 93% of them. Among pairs of languages, English and Arabic appear together for 99.8% of the properties, and English and French for 93%. Further down, we see combinations of three, four, and five languages.

Please change the minimum support to other values between 0 and 1 and see how the output changes.

S.No.	support	itemsets
0	0.998424	(ar)
1	0.759414	(ca)
2	1.000000	(en)
3	0.932409	(fr)
4	0.993540	(nl)
5	0.985978	(uk)
6	0.758626	(ar, ca)
7	0.998424	(en, ar)
8	0.931621	(ar, fr)
9	0.992752	(nl, ar)
10	0.985663	(ar, uk)
11	0.759414	(en, ca)
12	0.759256	(nl, ca)
13	0.754215	(uk, ca)
14	0.932409	(en, fr)
15	0.993540	(nl, en)
16	0.985978	(en, uk)
17	0.931779	(nl, fr)
18	0.928155	(uk, fr)
19	0.985663	(nl, uk)
20	0.758626	(en, ar, ca)
21	0.758469	(nl, ar, ca)
22	0.753899	(ar, uk, ca)
23	0.931621	(en, ar, fr)
24	0.992752	(nl, en, ar)
25	0.985663	(en, ar, uk)
26	0.931149	(nl, ar, fr)
27	0.927997	(ar, uk, fr)
28	0.985347	(nl, ar, uk)
29	0.759256	(nl, en, ca)
30	0.754215	(en, uk, ca)
31	0.754057	(nl, uk, ca)
32	0.931779	(nl, en, fr)
33	0.928155	(en, uk, fr)
34	0.985663	(nl, en, uk)
35	0.927840	(nl, uk, fr)
36	0.758469	(nl, en, ar, ca)
37	0.753899	(en, ar, uk, ca)
38	0.753742	(nl, ar, uk, ca)
39	0.931149	(nl, en, ar, fr)
40	0.927997	(en, ar, uk, fr)
41	0.985347	(nl, en, ar, uk)
42	0.927682	(nl, ar, uk, fr)
43	0.754057	(nl, en, uk, ca)
44	0.927840	(nl, en, uk, fr)
45	0.753742	(ar, en, ca, nl, uk)
46	0.927682	(ar, en, fr, nl, uk)
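The support values in this table are fractions of properties whose label list contains every language of the itemset. The following sketch computes that fraction by hand on a few toy transactions. The helper `support` and the list `toy_transactions` are hypothetical names introduced for illustration and are not part of mlxtend.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    hits = sum(1 for transaction in transactions if wanted <= set(transaction))
    return hits / len(transactions)

# Hypothetical label languages for four properties.
toy_transactions = [
    ["en", "fr", "ar"],
    ["en", "ar"],
    ["en", "fr"],
    ["en"],
]

print(support(toy_transactions, ["en"]))
print(support(toy_transactions, ["fr"]))
print(support(toy_transactions, ["en", "ar"]))
print(support(toy_transactions, ["fr", "ar"]))
```

With a minimum support of 0.75, only the itemset containing `en` alone would be kept in this toy example, which shows why lowering the threshold makes the table grow.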

Repeat the above experiment for descriptions and aliases. What are your observations? Please note them in the notebook as comments.

Our next goal is to generate the association rules, i.e., rules of the form A -> C, where A is the antecedent and C is the consequent.

from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.95)

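It gives output of the following form. The rules listed here include itemsets whose support is below 0.75 (for example, (de) at 0.479193), so this table was evidently produced with a lower minimum support than the 0.75 used above.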

S.No.	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(de)	(en)	0.479193	0.988966	0.477932	0.997368	1.008496	0.004026	4.192938
1	(de)	(uk)	0.479193	0.960277	0.468789	0.978289	1.018757	0.008631	1.829646
2	(es)	(en)	0.389975	0.988966	0.388556	0.996362	1.007479	0.002884	3.033137
3	(fr)	(en)	0.713272	0.988966	0.708859	0.993812	1.004900	0.003457	1.783181
4	(it)	(en)	0.413934	0.988966	0.413146	0.998096	1.009232	0.003779	5.795082
5	(ru)	(en)	0.301387	0.988966	0.300441	0.996862	1.007984	0.002380	3.516183
6	(en)	(uk)	0.988966	0.960277	0.950189	0.960791	1.000534	0.000507	1.013087
7	(uk)	(en)	0.960277	0.988966	0.950189	0.989494	1.000534	0.000507	1.050303
8	(es)	(uk)	0.389975	0.960277	0.383039	0.982215	1.022845	0.008555	2.233492
9	(fr)	(uk)	0.713272	0.960277	0.689470	0.966630	1.006615	0.004531	1.190362
10	(it)	(uk)	0.413934	0.960277	0.405738	0.980198	1.020745	0.008246	2.005990
11	(ru)	(uk)	0.301387	0.960277	0.300126	0.995816	1.037009	0.010711	9.493695
12	(de, fr)	(en)	0.391393	0.988966	0.391078	0.999195	1.010343	0.004003	13.698770
13	(de, uk)	(en)	0.468789	0.988966	0.467686	0.997646	1.008777	0.004069	4.687894

Let's take a look at one of the association rules, (de, fr) -> (en). Its confidence is 0.999195: in this dataset, when the German and French translations are both available, the English translation is almost always available as well.
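The metrics in the table can be recomputed from the support values alone. As a worked check, the sketch below uses the standard definitions of confidence, lift, and leverage with the rounded figures from row 12 of the table, rule (de, fr) -> (en). The variable names are illustrative, and small differences in the last digit come from rounding in the printed table.

```python
# Values copied from row 12 of the table above: (de, fr) -> (en).
antecedent_support = 0.391393   # support of (de, fr)
consequent_support = 0.988966   # support of (en)
rule_support = 0.391078         # support of (de, fr, en)

# confidence(A -> C) = support(A and C) / support(A)
confidence = rule_support / antecedent_support

# lift(A -> C) = confidence(A -> C) / support(C)
lift = confidence / consequent_support

# leverage(A -> C) = support(A and C) - support(A) * support(C)
leverage = rule_support - antecedent_support * consequent_support

print(round(confidence, 6), round(lift, 6), round(leverage, 6))
```

A lift close to 1, as here, means that the antecedent adds little information: English labels are present for nearly every property whether or not German and French are.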

You can change the metric to 'lift' and test.

We will now add two columns that contain the lengths of the antecedents and consequents. As you may have noticed, the association rules are returned as a dataframe, so we can easily write filter conditions to select the rules whose antecedents and consequents both have a length greater than 1.

from mlxtend.frequent_patterns import association_rules
arules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.95)
arules["antecedent_len"] = arules["antecedents"].apply(lambda x: len(x))
arules["consequent_len"] = arules["consequents"].apply(lambda x: len(x))
arules.loc[(arules["antecedent_len"]>1) & (arules["consequent_len"]>1)]

S.No.	antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction	antecedent_len	consequent_len
32	(de, fr)	(en, uk)	0.391393	0.950189	0.383670	0.980266	1.031653	0.011772	2.524088	2	2
35	(fr, es)	(en, uk)	0.332755	0.950189	0.326765	0.981999	1.033477	0.010585	2.767124	2	2
38	(fr, it)	(en, uk)	0.345523	0.950189	0.338430	0.979471	1.030817	0.010117	2.426342	2	2

Repeat the above experiment for translation of descriptions and aliases. Please note down your observations as comments in your notebook.

Our next goal is to visualize the flow of translation. For this purpose, we install holoviews and bokeh.

!pip install holoviews bokeh --upgrade

First, we create groups for different translations.

languageorder = []
labeldataframe = dataframe.loc[(dataframe["type"] == "label")]
groups = labeldataframe.groupby("property")
for k, group in groups:
    languageorder.append(list(group["language"]))
# build (source, target) pairs from consecutive languages in each sequence
languagepairs = []
for lo in languageorder:
    for i in range(0, len(lo)-1):
        languagepairs.append([lo[i], lo[i+1], 1])
lpdataframe = pd.DataFrame(languagepairs)
lpdataframe.columns = ['source', 'target', 'count']
# count how often each (source, target) pair occurs
groups = lpdataframe.groupby(["source","target"]).count().reset_index()
print(groups)
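The pair-counting idea can be checked on a few hand-written sequences. The sketch below is an equivalent formulation using `collections.Counter` instead of a dataframe; `toy_orders` is invented for illustration.

```python
from collections import Counter

# Hypothetical translation orders for three properties.
toy_orders = [
    ["en", "fr", "de"],
    ["en", "fr", "it"],
    ["fr", "de"],
]

# zip(order, order[1:]) yields each language together with its successor.
pair_counts = Counter(
    (source, target)
    for order in toy_orders
    for source, target in zip(order, order[1:])
)
print(pair_counts.most_common())
```

Each count can be read as the weight of a directed edge from the source language to the target language in the translation flow.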

Next we visualize them using a circular layout. Note that the code below takes only the first 500 combinations of language pairs.

import networkx as nx
import matplotlib.pyplot as plot
G = nx.from_pandas_edgelist(groups[:500], 'source', 'target')
plot.rcParams["figure.figsize"] = (15,15)
pos = nx.circular_layout(G)
nx.draw(G, pos, with_labels=True)

Copy the above code and check the flow of translations for any languages of your choice (number of languages > 20).

Use Bokeh/Holoviews chord layout for visualizing the weights as well.

Prediction

Our final goal is to predict the next language(s) that will be translated, given a sequence of available translations.

For example, let's assume we have seen the following translation sequences.

[['en', 'it', 'fi', 'fr', 'de', 'nl', 'pt', 'pl', 'ca', 'cs'],
 ['en', 'it', 'fi', 'fr', 'de', 'ilo', 'zh', 'nb'],
 ['ru', 'hu', 'he', 'nl', 'pt', 'pl', 'ca', 'cs'],
 ... ]

Once your model has been trained using this data, it may be able to predict the next possible translation(s).

Example 1: If the user enters the following sequence,

['pt']

your model may return the following language

['pl']

Example 2: If the user enters the following sequence,

['it', 'fi']

your model may return the following language

['fr']

Example 3: If the user enters the following sequence,

['nl', 'pt', 'pl']

your model may return the following languages

['ca', 'cs']

Your goal in this exercise is to train a neural network model that can predict the next probable translation(s) of labels (or descriptions or aliases).

For creating this model, you must use the translation sequence of Wikidata properties seen above. You must split the translation sequences into training and test sequences. Please show the accuracy metrics of your model.
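One possible way to prepare such sequences for a neural network is sketched below. It is a starting point under stated assumptions, not a required design: the integer encoding, the left padding with 0, the two-thirds split, and the names `vocab` and `make_examples` are all illustrative choices. The three sequences are the ones from the example above.

```python
import random

# Toy sequences copied from the example above.
sequences = [
    ["en", "it", "fi", "fr", "de", "nl", "pt", "pl", "ca", "cs"],
    ["en", "it", "fi", "fr", "de", "ilo", "zh", "nb"],
    ["ru", "hu", "he", "nl", "pt", "pl", "ca", "cs"],
]

# Map each language code to an integer; 0 is reserved for padding.
languages = sorted({lang for seq in sequences for lang in seq})
vocab = {lang: index + 1 for index, lang in enumerate(languages)}

def make_examples(seq, max_len):
    """Turn one sequence into (left-padded prefix, next language) pairs."""
    examples = []
    for end in range(1, len(seq)):
        prefix = [vocab[lang] for lang in seq[:end]]
        padded = [0] * (max_len - len(prefix)) + prefix
        examples.append((padded, vocab[seq[end]]))
    return examples

# Split whole sequences, not individual pairs, so that no property
# contributes examples to both the training and the test set.
shuffled = sequences[:]
random.Random(0).shuffle(shuffled)
split = int(len(shuffled) * 2 / 3)
train_seqs, test_seqs = shuffled[:split], shuffled[split:]

max_len = max(len(seq) for seq in sequences) - 1
train = [ex for seq in train_seqs for ex in make_examples(seq, max_len)]
test = [ex for seq in test_seqs for ex in make_examples(seq, max_len)]
print(len(vocab), len(train), len(test))
```

Every prefix of a sequence becomes one training example whose target is the language that follows it, which matches the single-language predictions of Examples 1 and 2. Predicting several languages, as in Example 3, could be done by feeding each prediction back into the input.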

It is important that you distinguish among the translation sequences of labels, descriptions and aliases, i.e., the user may ask the next probable language(s) for any of the three.

Please comment your code and write down your observations. For example, what are the maximum lengths of the input and output sequences? What are the accuracy metrics? Why did you choose a given neural network model? ...

Hint: You may use a recurrent neural network with LSTM (long short-term memory) units.
