Practical 7#
Goals#
Understanding the translation pattern
Exercise 7.1#
Download#
Download the data from https://zenodo.org/record/3271358.
View the data contents (a quick way to do this from the notebook is sketched below). As you can see in the output, there are four columns: timestamp, property, language, and type.
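A minimal sketch for peeking at the file from the notebook, assuming the CSV has been downloaded into the working directory under the file name used in the rest of this practical, and that you are on a Unix-like system:
!head -n 5 multilingual_wikidata_translation_flow.csv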
Take an example line from this dataset:
2013-09-10T22:43:54Z,P856,en,label
This line records that the English label of property P856 was added for the first time at 2013-09-10T22:43:54Z.
Following is a description of each column.
timestamp: the time at which the action was made, for example 2013-09-10T22:43:54Z
property: the Wikidata property identifier (its P-number), for example P856
language: the language in which the label/description/alias was first added
type: one of the following values: label, description, or alias
Creation of dataframe#
import pandas as pd
dataframe = pd.read_csv("multilingual_wikidata_translation_flow.csv")
# remove duplicates
dataframe = dataframe.drop_duplicates()
# remove rows with missing values
dataframe = dataframe.dropna()
print(dataframe)
Get detailed information about the dataframe.
dataframe.describe()
Let us now take a look at the translation of labels in different languages for a given property.
dataframe.loc[(dataframe["property"] == "P856") & (dataframe["type"] == "label")]
We can also check all the data available for a given property, such as P856.
ptranslation = dataframe.loc[(dataframe["property"] == "P856")]
print(ptranslation)
You can also add further conditions to filter the data. The code below selects the translations of labels of property P856. Please copy the code below into new cells, test it with different properties, and look at the results for the translation of descriptions and aliases (as shown in the sketch after the code). What are your first observations? Please note them as comments in your notebook.
ptranslation = dataframe.loc[
(dataframe["property"] == "P856") & (dataframe["type"] == "label")
]
print(ptranslation)
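For instance, a variant cell for descriptions (a sketch of the suggested exercise, not part of the original code) could look like this:
# translations of descriptions of property P856
ptranslation = dataframe.loc[
    (dataframe["property"] == "P856") & (dataframe["type"] == "description")
]
print(ptranslation)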
Visualization of translation#
Our next goal is to plot the translation actions of a property and see whether we can observe any translation pattern. Please copy the code below into new cells, test it with different properties, and plot the results. In this plot, time is on the x-axis and translation type on the y-axis.
import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
ptranslation = dataframe.loc[(dataframe["property"] == "P856")]
x = []
y = []
z = []
for i in range(0, len(ptranslation["timestamp"])):
# parsing the date in string and converting it to datetime
try:
value = parser.parse(ptranslation["timestamp"].iloc[i])
x.append(value)
except Exception:
continue
y.append(ptranslation["type"].iloc[i])
z.append(ptranslation["language"].iloc[i])
# creating a plot
plot.rcParams["figure.figsize"] = (25, 9)
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {"label": colors[0], "description": colors[1], "alias": colors[2]}
cs = [tcolors[i] for i in y]
fig, ax = plot.subplots()
ax.scatter(x, y, s=30, color=cs)
plot.xlabel("Time")
plot.ylabel("Translation type")
# annotating the points
for i, txt in enumerate(z):
ax.annotate(txt, (x[i], y[i]))
plot.show()
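To test several properties without copying the whole cell each time, the plot can be wrapped in a small helper function. The following is a minimal sketch of the code above; the function name plot_translation_types is ours, not part of the practical.
import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np

def plot_translation_types(dataframe, prop):
    # select all translation actions of the given property
    ptranslation = dataframe.loc[dataframe["property"] == prop]
    x, y, z = [], [], []
    for i in range(len(ptranslation)):
        try:
            # skip rows whose timestamp cannot be parsed
            x.append(parser.parse(ptranslation["timestamp"].iloc[i]))
        except Exception:
            continue
        y.append(ptranslation["type"].iloc[i])
        z.append(ptranslation["language"].iloc[i])
    colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
    tcolors = {"label": colors[0], "description": colors[1], "alias": colors[2]}
    plot.rcParams["figure.figsize"] = (25, 9)
    fig, ax = plot.subplots()
    ax.scatter(x, y, s=30, color=[tcolors[t] for t in y])
    plot.xlabel("Time")
    plot.ylabel("Translation type")
    # annotate each point with its language code
    for i, txt in enumerate(z):
        ax.annotate(txt, (x[i], y[i]))
    plot.show()

plot_translation_types(dataframe, "P31")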
Now we plot another graph, this time with language on the y-axis and time on the x-axis.
import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
ptranslation = dataframe.loc[(dataframe["property"] == "P279")]
x = []
y = []
z = []
for i in range(0, len(ptranslation["timestamp"])):
try:
# parsing the date in string and converting it to datetime
value = parser.parse(ptranslation["timestamp"].iloc[i])
x.append(value)
except Exception:
continue
y.append(ptranslation["type"].iloc[i])
z.append(ptranslation["language"].iloc[i])
# creating a plot
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {"label": colors[0], "description": colors[1], "alias": colors[2]}
cs = [tcolors[i] for i in y]
plot.rcParams["figure.figsize"] = (25, 25)
fig, ax = plot.subplots()
ax.scatter(x, z, s=50, color=cs)
plot.xlabel("Time")
plot.ylabel("Language")
# saving the plot in a file
plot.savefig("translation.png")
plot.show()
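As a side note, the row-by-row parsing loop can be replaced by pandas' vectorized datetime conversion. A sketch of this equivalent alternative (errors="coerce" turns unparseable timestamps into NaT, which we then drop):
ptranslation = dataframe.loc[dataframe["property"] == "P279"]
parsed = pd.to_datetime(ptranslation["timestamp"], errors="coerce")
# keep only the rows with a valid timestamp
valid = parsed.notna()
x = list(parsed[valid])
y = list(ptranslation["type"][valid])
z = list(ptranslation["language"][valid])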
Check the plots for the properties P31, P279, and P856. What are your observations on the values on the y-axis? Please note them as comments in the notebook. You can also see that some translations are made in a single edit: look for a vertical column of red markers in the first figure (P31).
In the code below, we select some languages and examine the order of translation. Copy the code and plot the graphs for different properties. Please note your observations as comments in the notebook.
import matplotlib
import matplotlib.pyplot as plot
from dateutil import parser
import numpy as np
# list of languages under consideration
langlist = ["fr", "es", "en", "de", "it", "pt", "ja"]
ptranslation = dataframe.loc[
(dataframe["property"] == "P18") & (dataframe["language"].isin(langlist))
]
print(ptranslation)
x = []
y = []
z = []
for i in range(0, len(ptranslation["timestamp"])):
try:
# parsing the date in string and converting it to datetime
value = parser.parse(ptranslation["timestamp"].iloc[i])
x.append(value)
except Exception:
continue
y.append(ptranslation["type"].iloc[i])
z.append(ptranslation["language"].iloc[i])
# creating a plot
colors = matplotlib.cm.rainbow(np.linspace(0, 1, 3))
tcolors = {"label": colors[0], "description": colors[1], "alias": colors[2]}
cs = [tcolors[i] for i in y]
plot.rcParams["figure.figsize"] = (10, 5)
fig, ax = plot.subplots()
ax.scatter(x, z, s=50, color=cs)
plot.savefig("translation.png")
plot.show()
Exercise 7.2#
We used plotting techniques to manually detect some patterns of translation. Now we use other algorithms to see whether groups of languages are always present together.
Before continuing, we need to install the library mlxtend.
!pip install mlxtend
Next, we prepare the language dataset. We will focus on the translation of property labels and create a list of lists: for each property in our dataset, the list of languages in which its label is available.
# prepare dataset of languages
languageorder = []
labeldataframe = dataframe.loc[(dataframe["type"] == "label")]
groups = labeldataframe.groupby(["property"])
for k, group in groups:
    languageorder.append(list(group["language"]))
print(languageorder)
This will give us the following output.
[['en', 'it', 'fi', 'fr', 'de', 'zh-hans', 'ru', 'hu', 'he', 'nl', 'pt', 'pl', 'ca', 'cs', 'ilo', 'zh', 'nb', 'ko',..
Now we calculate the frequent itemsets using the Apriori algorithm. The first print statement lets you inspect the intermediate one-hot encoded dataframe.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
te = TransactionEncoder()
te_ary = te.fit(languageorder).transform(languageorder)
# preparation of data frame
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)
# use of apriori algorithm
frequent_itemsets = apriori(df, min_support=0.75, use_colnames=True)
print(frequent_itemsets)
The program uses a minimum support of 0.75. We see that English labels are present for all the properties, whereas French labels are only available for 93% of them. Considering pairs of languages, English and French labels appear together for 93% of the properties, and English and Arabic for 99.8%. Further down the output, we see combinations of three, four, five, and more languages.
Please change the minimum support to other values between 0 and 1 and observe the output.
S.No.    support    itemsets
-------  ---------  ------------------------
0 0.998424 (ar)
1 0.759414 (ca)
2 1.000000 (en)
3 0.932409 (fr)
4 0.993540 (nl)
5 0.985978 (uk)
6 0.758626 (ar, ca)
7 0.998424 (en, ar)
8 0.931621 (ar, fr)
9 0.992752 (nl, ar)
10 0.985663 (ar, uk)
11 0.759414 (en, ca)
12 0.759256 (nl, ca)
13 0.754215 (uk, ca)
14 0.932409 (en, fr)
15 0.993540 (nl, en)
16 0.985978 (en, uk)
17 0.931779 (nl, fr)
18 0.928155 (uk, fr)
19 0.985663 (nl, uk)
20 0.758626 (en, ar, ca)
21 0.758469 (nl, ar, ca)
22 0.753899 (ar, uk, ca)
23 0.931621 (en, ar, fr)
24 0.992752 (nl, en, ar)
25 0.985663 (en, ar, uk)
26 0.931149 (nl, ar, fr)
27 0.927997 (ar, uk, fr)
28 0.985347 (nl, ar, uk)
29 0.759256 (nl, en, ca)
30 0.754215 (en, uk, ca)
31 0.754057 (nl, uk, ca)
32 0.931779 (nl, en, fr)
33 0.928155 (en, uk, fr)
34 0.985663 (nl, en, uk)
35 0.927840 (nl, uk, fr)
36 0.758469 (nl, en, ar, ca)
37 0.753899 (en, ar, uk, ca)
38 0.753742 (nl, ar, uk, ca)
39 0.931149 (nl, en, ar, fr)
40 0.927997 (en, ar, uk, fr)
41 0.985347 (nl, en, ar, uk)
42 0.927682 (nl, ar, uk, fr)
43 0.754057 (nl, en, uk, ca)
44 0.927840 (nl, en, uk, fr)
45 0.753742 (ar, en, ca, nl, uk)
46 0.927682 (ar, en, fr, nl, uk)
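To see concretely what these support values mean, you can recompute one by hand: the support of an itemset is the fraction of transactions (here, properties) that contain all its items. A short sketch using the one-hot encoded dataframe df from above:
# support of the single itemset (en)
print(df["en"].mean())
# support of the pair (en, fr)
print((df["en"] & df["fr"]).mean())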
Repeat the above experiment for descriptions and aliases. What are your observations? Please note them in the notebook as comments.
Exercise 7.3#
Our next goal is to generate the association rules, i.e., rules of the form A -> C, where A is the antecedent and C is the consequent.
from mlxtend.frequent_patterns import association_rules
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.95)
It will give us the following output.
S.No. antecedents consequents antecedent support consequent support support confidence lift leverage conviction
------- ------------- ------------- -------------------- -------------------- ---------- ------------ ---------- ---------- ------------
0 (de) (en) 0.479193 0.988966 0.477932 0.997368 1.008496 0.004026 4.192938
1 (de) (uk) 0.479193 0.960277 0.468789 0.978289 1.018757 0.008631 1.829646
2 (es) (en) 0.389975 0.988966 0.388556 0.996362 1.007479 0.002884 3.033137
3 (fr) (en) 0.713272 0.988966 0.708859 0.993812 1.004900 0.003457 1.783181
4 (it) (en) 0.413934 0.988966 0.413146 0.998096 1.009232 0.003779 5.795082
5 (ru) (en) 0.301387 0.988966 0.300441 0.996862 1.007984 0.002380 3.516183
6 (en) (uk) 0.988966 0.960277 0.950189 0.960791 1.000534 0.000507 1.013087
7 (uk) (en) 0.960277 0.988966 0.950189 0.989494 1.000534 0.000507 1.050303
8 (es) (uk) 0.389975 0.960277 0.383039 0.982215 1.022845 0.008555 2.233492
9 (fr) (uk) 0.713272 0.960277 0.689470 0.966630 1.006615 0.004531 1.190362
10 (it) (uk) 0.413934 0.960277 0.405738 0.980198 1.020745 0.008246 2.005990
11 (ru) (uk) 0.301387 0.960277 0.300126 0.995816 1.037009 0.010711 9.493695
12 (de, fr) (en) 0.391393 0.988966 0.391078 0.999195 1.010343 0.004003 13.698770
13 (de, uk) (en) 0.468789 0.988966 0.467686 0.997646 1.008777 0.004069 4.687894
Let’s take a look at one of the association rules, (de, fr) -> (en), which has a confidence score of 0.999195. This rule states that when the French and German translations are available, the English translation is almost always available as well.
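You can verify this score by hand, since confidence(A -> C) = support(A ∪ C) / support(A). A short sketch using the one-hot dataframe df from Exercise 7.2:
# properties that have both German and French labels (the antecedent)
a = df["de"] & df["fr"]
# fraction of those that also have an English label
print((a & df["en"]).sum() / a.sum())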
You can change the metric to ‘lift’ and test.
We will now add two columns containing the lengths of the antecedents and consequents. As you may have noticed, the association rules are returned as a dataframe, so we can easily write filter conditions to get rules whose antecedents and consequents have a length of more than 1.
from mlxtend.frequent_patterns import association_rules
arules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.95)
arules["antecedent_len"] = arules["antecedents"].apply(lambda x: len(x))
arules["consequent_len"] = arules["consequents"].apply(lambda x: len(x))
arules.loc[(arules["antecedent_len"] > 1) & (arules["consequent_len"] > 1)]
S.No. antecedents consequents antecedent support consequent support support confidence lift leverage conviction antecedent_len consequent_len
------- ------------- ------------- -------------------- -------------------- ---------- ------------ ---------- ---------- ------------ ----------------- -----------------
32 (de, fr) (en, uk) 0.391393 0.950189 0.383670 0.980266 1.031653 0.011772 2.524088 2 2
35 (fr, es) (en, uk) 0.332755 0.950189 0.326765 0.981999 1.033477 0.010585 2.767124 2 2
38 (fr, it) (en, uk) 0.345523 0.950189 0.338430 0.979471 1.030817 0.010117 2.426342 2 2
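Since the antecedents and consequents columns contain frozensets, you can also filter on their contents. For example, a sketch selecting the rules whose consequent includes English:
# rules whose consequent contains the English label
print(arules.loc[arules["consequents"].apply(lambda c: "en" in c)])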
Repeat the above experiment for translation of descriptions and aliases. Please note down your observations as comments in your notebook.
Exercise 7.4#
Our next goal is to visualize the flow of translation. For this purpose, we install networkx.
!pip install networkx
First, we extract the consecutive language pairs from the translation order of each property's labels and count how often each pair occurs.
languageorder = []
labeldataframe = dataframe.loc[(dataframe["type"] == "label")]
groups = labeldataframe.groupby(["property"])
for k, group in groups:
    languageorder.append(list(group["language"]))
languagepairs = []
for lo in languageorder:
for i in range(0, len(lo) - 1):
languagepairs.append([lo[i], lo[i + 1], 1])
lpdataframe = pd.DataFrame(languagepairs)
lpdataframe.columns = ["source", "target", "count"]
groups = lpdataframe.groupby(["source", "target"]).count().reset_index()
print(groups)
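Before plotting, it can help to inspect the most frequent pairs; a quick sketch sorting the counts:
# ten most frequent consecutive language pairs
print(groups.sort_values("count", ascending=False).head(10))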
Next, we visualize them using a circular layout. Note that in the code below we have only taken the first 500 combinations of language pairs.
import networkx as nx
import matplotlib.pyplot as plot
G = nx.from_pandas_edgelist(groups[:500], "source", "target")
plot.rcParams["figure.figsize"] = (15, 15)
pos = nx.circular_layout(G)
nx.draw(G, pos, with_labels=True)
Copy the above code and check the flow of translations for languages of your choice (use more than 20 languages).
Use a Bokeh/HoloViews chord layout to visualize the edge weights as well; a sketch follows.
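A minimal sketch with HoloViews on the Bokeh backend (the renaming maps our count column onto the value dimension that hv.Chord uses for edge weights; the option names are indicative and may need adjusting to your HoloViews version):
!pip install holoviews bokeh
import holoviews as hv
hv.extension("bokeh")
# hv.Chord expects an edge table: source, target and a numeric weight column
edges = groups[:500].rename(columns={"count": "value"})
# label the nodes with the language codes inferred from the edge list
hv.Chord(edges).opts(labels="index", node_size=4)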