Programming Session 2#

Goals#

  • Parsing and working with CSV, TSV and JSON files

  • Querying external data sources

  • Data analyses

Exercises#

  1. Parsing and reading CSV/TSV files

  2. Parsing and reading JSON files

  3. Querying external data sources (Query endpoints and API)

  4. Performing classical data analyses

Screenshots#

Screenshots have been provided so that you can check and compare whether the output you have is the same as what we want from the step (or exercise) in question.

Exercise 1 ★#

We first install the necessary packages. In Jupyter notebooks, you can use the ! command to run shell commands directly from a notebook cell. To install Python packages such as NumPy and pandas, you can use the following command:

!pip3 install numpy pandas matplotlib sparqlwrapper
Requirement already satisfied: numpy in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.0.0)
Requirement already satisfied: pandas in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: matplotlib in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (3.9.1)
Requirement already satisfied: sparqlwrapper in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.0.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: contourpy>=1.0.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (4.53.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: rdflib>=6.1.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from sparqlwrapper) (7.0.0)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: isodate<0.7.0,>=0.6.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from rdflib>=6.1.1->sparqlwrapper) (0.6.1)

To import the NumPy, pandas, Matplotlib and SPARQLWrapper libraries in a Jupyter notebook, use the following Python import statements:

import numpy as np
import pandas as pd
import matplotlib as plot
import SPARQLWrapper as sw

The __version__ attribute is a common way to access the version information of a Python package.

print(np.__version__)
print(pd.__version__)
print(plot.__version__)
print(sw.__version__)
2.0.0
2.2.2
3.9.1
2.0.0

Exercise 2 ★#

Most of the time, we work with CSV (comma-separated values) files for data analysis. A CSV file consists of one or more lines, and each line has one or more values separated by commas. One can consider every line as a row and every value in a row as a column value. The first row is sometimes used to describe the column names.

Copy the file pl.csv to your current working directory (where you are running Jupyter: TP1) and use the following code to parse the CSV file. Note the column names and datatypes (U100, i4), where U100 corresponds to a 100-character Unicode string and i4 corresponds to a 32-bit signed integer.

Please check the complete list of dtypes (Ref: (https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)).

import numpy as np

dataset = np.loadtxt(
    "../data/pl.csv",  # Change this value to the path of your CSV file
    dtype={"names": ("name", "year"), "formats": ("U100", "i4")},
    skiprows=1,  # skip the first row, since it's the header
    delimiter=",",  # the separator is comma since it is a CSV file
    encoding="UTF-8",  # UTF-8 encoding
)
print(dataset)
[('ENIAC coding system', 1943) ('ENIAC Short Code', 1946)
 ('Von Neumann and Goldstine graphing system', 1946)
 ('ARC Assembly', 1947) ('Plankalkül', 1948) ('CPC Coding scheme', 1948)
 ('Curry notation system', 1948) ('Short Code', 1949)
 ('assembly language', 1949) ('Short Code', 1950) ('G-code', 1950)
 ('Birkbeck Assembler', 1950) ('Superplan', 1951) ('ALGAE', 1951)
 ('Intermediate Programming Language', 1951)
 ('Regional Assembly Language', 1951)
 ('Boehm unnamed coding system', 1951) ('Klammerausdrücke', 1951)
 ('OMNIBAC Symbolic Assembler', 1951) ('Stanislaus', 1951)
 ('Whirlwind assembler', 1951) ('Rochester assembler', 1951)
 ('Sort Merge Generator', 1951) ('autocode', 1952) ('A-0 System', 1952)
 ('Editing Generator', 1952) ('COMPOOL', 1952) ('Speedcoding', 1953)
 ('READ/PRINT', 1953) ('Fortran', 1954) ('ARITH-MATIC', 1954)
 ('autocode', 1954) ('Laning and Zierler system', 1954)
 ('MATH-MATIC', 1954) ('MATRIX MATH', 1954) ('FLOW-MATIC', 1955)
 ('PACT', 1955) ('BACAIC', 1955) ('Freiburger Code', 1955)
 ('Sequentielle Formelübersetzung', 1955) ('Internal Translator', 1955)
 ('PRINT', 1955) ('Information Processing Language', 1956)
 ('FORTRAN for the IBM 704', 1956) ('Fortran', 1957) ('COMTRAN', 1957)
 ('GEORGE', 1957) ('UNICODE', 1957) ('ALGOL 58', 1958) ('Lisp', 1958)
 ('ALGOL', 1958) ('FORTRAN III', 1958) ('FORTRAN II', 1958)
 ('JOVIAL', 1959) ('TRAC', 1959) ('COBOL', 1959) ('MAD', 1959)
 ('Lisp', 1959) ('FACT', 1959) ('COBOL', 1960) ('ALCOR', 1960)
 ('ALGOL 60', 1960) ('FORTRAN IV', 1961) ('TECO', 1962) ('SNOBOL', 1962)
 ('APL', 1962) ('Simula', 1962) ('CORC', 1962) ('CPL', 1963)
 ('JOSS', 1963) ('P′′', 1964) ('Report Program Generator', 1964)
 ('TRAC', 1964) ('BASIC', 1964) ('COWSEL', 1964) ('MARK IV', 1964)
 ('MIMIC', 1964) ('PL/I', 1964) ('Speakeasy', 1964) ('IBM RPG II', 1965)
 ('Atlas Autocode', 1965) ('TELCOMP', 1965) ('Euler', 1966)
 ('ISWIM', 1966) ('ALGOL W', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1966)
 ('JOSS', 1966) ('Coral 66', 1966) ('BCPL', 1966) ('APL', 1966)
 ('FORTRAN 66', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1967)
 ('Hop', 1967) ('XPL', 1967) ('Interlisp', 1967) ('BCPL', 1967)
 ('Simula', 1967) ('Space Programming Language', 1967) ('PILOT', 1968)
 ('ALGOL 68', 1968)]
  • np.loadtxt: This function loads data from a text file, with each row in the text file being converted to a NumPy array element.

  • "../data/pl.csv": The path to the CSV file. You need to change this to the actual path where your CSV file is located.

  • dtype={"names": ("name", "year"), "formats": ("U100", "i4")}: Specifies the data type for the resulting NumPy array. Here, dtype is defined as a dictionary with:

    • names: A tuple specifying the names of the columns.

    • formats: A tuple specifying the data type of each column. "U100" stands for a Unicode string of maximum length 100, and "i4" stands for a 4-byte integer.

  • skiprows=1: This parameter skips the first row of the CSV file, which is assumed to be the header.

  • delimiter=",": Specifies the delimiter that separates the values in the CSV file. Here, it is a comma.

  • encoding="UTF-8": Specifies the encoding of the file. UTF-8 is a common encoding for text files.

Important Notes:#

  1. File Path: Ensure the file path ("../data/pl.csv") is correct relative to your Jupyter notebook or script. You might need to adjust it depending on where your CSV file is located.

  2. CSV File Structure: The CSV file should have a header row that matches the column names specified in the dtype parameter, and the data should follow in subsequent rows.

  3. Column Data Types: Ensure that the data types ("U100" for strings and "i4" for integers) match the actual data in your CSV file.

CSV support in numpy (Ref: (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)) is different from Python’s default CSV reader (Ref: (https://docs.python.org/3.9/library/csv.html)) because it can associate data types (Ref: (https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)) with the columns. Before continuing, take a close look at numpy.loadtxt (Ref: (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)).
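
For comparison, here is a minimal sketch of reading the same file with Python’s built-in csv module; every value comes back as a string, so the year needs an explicit int() conversion:

import csv

with open("../data/pl.csv", encoding="UTF-8") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    rows = [(name, int(year)) for name, year in reader]
print(rows[:3])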

Copy the file pl.tsv to your current working directory and use the following code to parse the TSV file.

import numpy as np

dataset = np.loadtxt(
    "../data/pl.tsv",  # Change this value to the path of your CSV file
    dtype={"names": ("name", "year"), "formats": ("U100", "i4")},
    skiprows=1,
    delimiter="\t",
    encoding="UTF-8",
)
print(dataset)
[('ENIAC coding system', 1943) ('ENIAC Short Code', 1946)
 ('Von Neumann and Goldstine graphing system', 1946)
 ('ARC Assembly', 1947) ('Plankalkül', 1948) ('CPC Coding scheme', 1948)
 ('Curry notation system', 1948) ('Short Code', 1949)
 ('assembly language', 1949) ('Short Code', 1950)
 ('Birkbeck Assembler', 1950) ('G-code', 1950) ('Superplan', 1951)
 ('ALGAE', 1951) ('Intermediate Programming Language', 1951)
 ('Regional Assembly Language', 1951)
 ('Boehm unnamed coding system', 1951) ('Klammerausdrücke', 1951)
 ('OMNIBAC Symbolic Assembler', 1951) ('Stanislaus', 1951)
 ('Whirlwind assembler', 1951) ('Rochester assembler', 1951)
 ('Sort Merge Generator', 1951) ('Editing Generator', 1952)
 ('COMPOOL', 1952) ('autocode', 1952) ('A-0 System', 1952)
 ('READ/PRINT', 1953) ('Speedcoding', 1953) ('Fortran', 1954)
 ('MATRIX MATH', 1954) ('autocode', 1954) ('ARITH-MATIC', 1954)
 ('Laning and Zierler system', 1954) ('MATH-MATIC', 1954)
 ('FLOW-MATIC', 1955) ('BACAIC', 1955) ('Freiburger Code', 1955)
 ('Sequentielle Formelübersetzung', 1955) ('Internal Translator', 1955)
 ('PRINT', 1955) ('PACT', 1955) ('Information Processing Language', 1956)
 ('FORTRAN for the IBM 704', 1956) ('Fortran', 1957) ('UNICODE', 1957)
 ('COMTRAN', 1957) ('GEORGE', 1957) ('ALGOL 58', 1958) ('Lisp', 1958)
 ('FORTRAN III', 1958) ('FORTRAN II', 1958) ('ALGOL', 1958)
 ('JOVIAL', 1959) ('TRAC', 1959) ('COBOL', 1959) ('Lisp', 1959)
 ('MAD', 1959) ('FACT', 1959) ('COBOL', 1960) ('COBOL', 1960)
 ('ALGOL 60', 1960) ('ALCOR', 1960) ('FORTRAN IV', 1961) ('SNOBOL', 1962)
 ('TECO', 1962) ('Simula', 1962) ('APL', 1962) ('CORC', 1962)
 ('CPL', 1963) ('JOSS', 1963) ('P′′', 1964)
 ('Report Program Generator', 1964) ('TRAC', 1964) ('BASIC', 1964)
 ('COWSEL', 1964) ('PL/I', 1964) ('MARK IV', 1964) ('MIMIC', 1964)
 ('Speakeasy', 1964) ('IBM RPG II', 1965) ('Atlas Autocode', 1965)
 ('TELCOMP', 1965) ('Euler', 1966) ('ISWIM', 1966) ('ALGOL W', 1966)
 ('JOSS', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1966)
 ('BCPL', 1966) ('Coral 66', 1966) ('FORTRAN 66', 1966) ('APL', 1966)
 ('Hop', 1967)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1967)
 ('Space Programming Language', 1967) ('BCPL', 1967) ('Interlisp', 1967)
 ('Simula', 1967) ('XPL', 1967) ('PILOT', 1968)]

Note the changes in the above code compared to the previous one. A TSV file is a tab-separated values file, i.e., the column values are separated by a tab character (\t).

The parameters are the same as in the CSV example above, with two changes:

  • "../data/pl.tsv": The path to the TSV file. You need to change this to the actual path where your TSV file is located.

  • delimiter="\t": The values in a TSV file are separated by a tab character instead of a comma.

To count the number of rows in your dataset, you can use len():

# Print the number of elements in the dataset
print(len(dataset))
100

You can also display the value of variables and expressions in a notebook without print(); Jupyter automatically renders the result of the last expression in a cell:

len(dataset)
100

Exercise 3 ★★#

Many external data sources provide their data in JSON format. Our next exercise is to parse JSON files. Copy the file pl.json to your current working directory and use the following code to parse the JSON file. In this exercise, we use the pandas Python package (Ref: (https://pandas.pydata.org/pandas-docs/stable/)) to parse the JSON file and obtain a pandas DataFrame (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)). Try using methods like transpose (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose)) and count (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count)); an example follows the first DataFrame display below.

Before continuing this exercise, please practice working with Pandas. Take a look at 10 minutes to pandas (Ref: (https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)).

from pandas import json_normalize
import pandas as pd
import json

# Load JSON data from a file
data = json.load(open("../data/pl.json"))

# Convert JSON data to a pandas DataFrame
dataframe = json_normalize(data)

print(dataframe)
                                languageLabel  year
0                         ENIAC coding system  1943
1                            ENIAC Short Code  1946
2   Von Neumann and Goldstine graphing system  1946
3                                ARC Assembly  1947
4                                  Plankalkül  1948
..                                        ...   ...
95                                       BCPL  1967
96                                  Interlisp  1967
97                                     Simula  1967
98                                        XPL  1967
99                                      PILOT  1968

[100 rows x 2 columns]

This code uses Python’s pandas library to work with JSON data.

  1. Imports:

    • from pandas import json_normalize: Imports the json_normalize function from the pandas library. This function is used to flatten JSON data into a DataFrame.

    • import pandas as pd: Imports the pandas library with the alias pd, which is a common convention.

    • import json: Imports Python’s built-in json module for handling JSON data.

  2. Loading JSON Data:

    • data = json.load(open("../data/pl.json")): Opens a JSON file located at "../data/pl.json" and loads its contents into the data variable. The json.load() function reads the JSON file and parses it into Python objects (here, a list of dictionaries).

  3. Normalizing JSON into a DataFrame:

    • dataframe = json_normalize(data): Uses the json_normalize() function from pandas to convert the JSON data (data) into a flat table-like structure called a DataFrame. This step is particularly useful for analyzing structured JSON data where nested objects or arrays need to be represented in tabular form.

And the output without print()

dataframe
                                 languageLabel  year
0                          ENIAC coding system  1943
1                             ENIAC Short Code  1946
2    Von Neumann and Goldstine graphing system  1946
3                                 ARC Assembly  1947
4                                   Plankalkül  1948
..                                         ...   ...
95                                        BCPL  1967
96                                   Interlisp  1967
97                                      Simula  1967
98                                         XPL  1967
99                                       PILOT  1968

100 rows × 2 columns
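
As mentioned at the start of this exercise, you can also try methods like transpose() and count() on the DataFrame:

# Swap rows and columns
print(dataframe.transpose())

# Count the non-null values in each column
print(dataframe.count())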

To display the values of the column year:

# Select the "year" column from the dataframe
dataframe["year"]
0     1943
1     1946
2     1946
3     1947
4     1948
      ... 
95    1967
96    1967
97    1967
98    1967
99    1968
Name: year, Length: 100, dtype: object

To display the values of the column languageLabel:

# Select the "languageLabel" column from the dataframe
dataframe["languageLabel"]
0                           ENIAC coding system
1                              ENIAC Short Code
2     Von Neumann and Goldstine graphing system
3                                  ARC Assembly
4                                    Plankalkül
                        ...                    
95                                         BCPL
96                                    Interlisp
97                                       Simula
98                                          XPL
99                                        PILOT
Name: languageLabel, Length: 100, dtype: object

Get some important information like count, min and max using the method describe(). By default, it focuses on numerical columns and calculates statistics like the mean, standard deviation, quartiles (25th and 75th percentiles), minimum and maximum values. Since both columns here contain strings (dtype object), describe() instead reports the count of values, the number of unique values, the most frequent value (top) and its frequency (freq).

# Get some descriptive summary of the dataframe
dataframe.describe()
       languageLabel  year
count            100   100
unique            88    24
top            COBOL  1951
freq               3    11

The dataframe.describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution. Here’s what each part typically includes:

  1. Count: Number of non-null observations (rows) in each column.

  2. Mean: Average value of each numerical column.

  3. Std (Standard Deviation): Measures the dispersion or spread of the values in each numerical column.

  4. Min: The smallest value in each numerical column.

  5. 25% (Percentile): The value below which 25% of the observations fall (1st quartile).

  6. 50% (Median): The median or 50th percentile.

  7. 75% (Percentile): The value below which 75% of the observations fall (3rd quartile).

  8. Max: The largest value in each numerical column.

When you call describe() on a pandas DataFrame (dataframe in this case), it analyzes only the numerical columns by default and provides the statistical summary for each of these columns. Non-numerical columns are ignored unless specifically included using additional parameters, as shown below. (When the DataFrame contains no numerical columns at all, as here, the non-numerical columns are described instead.)

This method is useful for quickly getting an overview of your dataset’s numeric values, identifying outliers, understanding the distribution of your data, and more.
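
For example, once a DataFrame has numerical columns, you can still include the non-numerical ones in the summary by passing the include parameter:

# Describe all columns, whatever their data type
dataframe.describe(include="all")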

To display the unique values of the column year:

# Get the unique years from the "year" column
dataframe["year"].unique()
array(['1943', '1946', '1947', '1948', '1949', '1950', '1951', '1952',
       '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960',
       '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968'],
      dtype=object)

Note that .unique() returns values in their order of first appearance, not in sorted order.
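
If you need them sorted, sort explicitly:

# Unique years in ascending order
sorted(dataframe["year"].unique())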

To display the unique values of the column languageLabel:

# Get the unique languages from the "languageLabel" column
dataframe["languageLabel"].unique()
array(['ENIAC coding system', 'ENIAC Short Code',
       'Von Neumann and Goldstine graphing system', 'ARC Assembly',
       'Plankalkül', 'CPC Coding scheme', 'Curry notation system',
       'Short Code', 'assembly language', 'Birkbeck Assembler', 'G-code',
       'Superplan', 'ALGAE', 'Intermediate Programming Language',
       'Regional Assembly Language', 'Boehm unnamed coding system',
       'Klammerausdrücke', 'OMNIBAC Symbolic Assembler', 'Stanislaus',
       'Whirlwind assembler', 'Rochester assembler',
       'Sort Merge Generator', 'Editing Generator', 'COMPOOL', 'autocode',
       'A-0 System', 'READ/PRINT', 'Speedcoding', 'Fortran',
       'MATRIX MATH', 'ARITH-MATIC', 'Laning and Zierler system',
       'MATH-MATIC', 'FLOW-MATIC', 'BACAIC', 'Freiburger Code',
       'Sequentielle Formelübersetzung', 'Internal Translator', 'PRINT',
       'PACT', 'Information Processing Language',
       'FORTRAN for the IBM 704', 'UNICODE', 'COMTRAN', 'GEORGE',
       'ALGOL 58', 'Lisp', 'FORTRAN III', 'FORTRAN II', 'ALGOL', 'JOVIAL',
       'TRAC', 'COBOL', 'MAD', 'FACT', 'ALGOL 60', 'ALCOR', 'FORTRAN IV',
       'SNOBOL', 'TECO', 'Simula', 'APL', 'CORC', 'CPL', 'JOSS', 'P′′',
       'Report Program Generator', 'BASIC', 'COWSEL', 'PL/I', 'MARK IV',
       'MIMIC', 'Speakeasy', 'IBM RPG II', 'Atlas Autocode', 'TELCOMP',
       'Euler', 'ISWIM', 'ALGOL W',
       'Massachusetts General Hospital Utility Multi-Programming System',
       'BCPL', 'Coral 66', 'FORTRAN 66', 'Hop',
       'Space Programming Language', 'Interlisp', 'XPL', 'PILOT'],
      dtype=object)

To sort the values by one or more columns:

dataframe.sort_values(["year"])
                                        languageLabel  year
0                                 ENIAC coding system  1943
1                                    ENIAC Short Code  1946
2           Von Neumann and Goldstine graphing system  1946
3                                        ARC Assembly  1947
4                                          Plankalkül  1948
..                                                ...   ...
98                                                XPL  1967
93  Massachusetts General Hospital Utility Multi-P...  1967
92                                                Hop  1967
94                         Space Programming Language  1967
99                                              PILOT  1968

100 rows × 2 columns

.sort_values(["year"]) sorts the DataFrame by the values in the column named “year”. You can specify one or more columns to sort by. By default, sorting is done in ascending order.

dataframe.sort_values(["year", "languageLabel"])
                                        languageLabel  year
0                                 ENIAC coding system  1943
1                                    ENIAC Short Code  1946
2           Von Neumann and Goldstine graphing system  1946
3                                        ARC Assembly  1947
5                                   CPC Coding scheme  1948
..                                                ...   ...
93  Massachusetts General Hospital Utility Multi-P...  1967
97                                             Simula  1967
94                         Space Programming Language  1967
98                                                XPL  1967
99                                              PILOT  1968

100 rows × 2 columns

dataframe.sort_values(["languageLabel"])
          languageLabel  year
26           A-0 System  1952
62                ALCOR  1960
13                ALGAE  1951
52                ALGOL  1958
48             ALGOL 58  1958
..                  ...   ...
20  Whirlwind assembler  1951
98                  XPL  1967
8     assembly language  1949
31             autocode  1954
25             autocode  1952

100 rows × 2 columns

To obtain the data types of the columns:

# Get the data types of each column in the dataframe
dataframe.dtypes
languageLabel    object
year             object
dtype: object

The dataframe.dtypes attribute in pandas returns the data types of each column in the DataFrame dataframe. It provides information about how pandas has interpreted the data from the JSON (or any other source) based on its initial reading and normalization process. Here’s what you’ll typically see (a conversion example follows this list):

  • Numeric Types: Integers (int64), Floating-point numbers (float64).

  • Object Type: Typically strings (object), but can also include other Python objects.

  • DateTime Types: If applicable, pandas will recognize and label date/time data appropriately (datetime64).

  • Categorical Types: If pandas identifies a column with a small number of unique values, it may assign it a category dtype, which can optimize memory usage and speed up certain operations.
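
Since this DataFrame was built from JSON strings, even the year column has dtype object here. If you want to treat it numerically (for describe(), sorting, etc.), convert it explicitly; a minimal sketch:

# Convert the "year" column from strings to 64-bit integers
dataframe = dataframe.astype(dtype={"year": "int64"})
print(dataframe.dtypes)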

Exercise 4 ★★#

In this exercise, we will take a look at how to download data from external data sources using special query interfaces. For example, the data above was obtained from the Wikidata Query Service (https://query.wikidata.org/) interface. See the screenshot given below.

Given below is the code to read data from an external data source. Use this URL: (https://query.wikidata.org/sparql?query=SELECT%20%3FlanguageLabel%20(YEAR(%3Finception)%20as%20%3Fyear)%0AWHERE%0A%7B%0A%20%20%23instances%20of%20programming%20language%0A%20%20%3Flanguage%20wdt%3AP31%20wd%3AQ9143%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP571%20%3Finception%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlanguageLabel.%0A%20%20FILTER(lang(%3FlanguageLabel)%20%3D%20%22en%22)%0A%7D%0AORDER%20BY%20%3Fyear%0ALIMIT%20100&format=json). A sketch showing how to build this URL from a readable query string follows the outputs below.

import urllib.request
import json
import pandas as pd

url = "https://query.wikidata.org/sparql?query=SELECT%20%3FlanguageLabel%20(YEAR(%3Finception)%20as%20%3Fyear)%0AWHERE%0A%7B%0A%20%20%23instances%20of%20programming%20language%0A%20%20%3Flanguage%20wdt%3AP31%20wd%3AQ9143%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP571%20%3Finception%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlanguageLabel.%0A%20%20FILTER(lang(%3FlanguageLabel)%20%3D%20%22en%22)%0A%7D%0AORDER%20BY%20%3Fyear%0ALIMIT%20100&format=json"
# Send the HTTP request and parse the JSON response
response = urllib.request.urlopen(url)
responsedata = json.loads(response.read().decode("utf-8"))

array = []

# Each binding in the response holds one value for ?year and ?languageLabel
for data in responsedata["results"]["bindings"]:
    array.append([data["year"]["value"], data["languageLabel"]["value"]])

dataframe = pd.DataFrame(array, columns=["year", "languageLabel"])
# Store year as a 32-bit integer and languageLabel as a Unicode string
dataframe = dataframe.astype(dtype={"year": "<i4", "languageLabel": "<U200"})
print(dataframe)
    year                              languageLabel
0   1942                                 Plankalkül
1   1943                        ENIAC coding system
2   1946                           ENIAC Short Code
3   1946  Von Neumann and Goldstine graphing system
4   1948                          CPC Coding scheme
..   ...                                        ...
95  1967                                      ABSYS
96  1967                 Space Programming Language
97  1968                                      Refal
98  1968                                      PILOT
99  1968                                     MAPPER

[100 rows x 2 columns]
dataframe
    year                              languageLabel
0   1942                                 Plankalkül
1   1943                        ENIAC coding system
2   1946                           ENIAC Short Code
3   1946  Von Neumann and Goldstine graphing system
4   1948                          CPC Coding scheme
..   ...                                        ...
95  1967                                      ABSYS
96  1967                 Space Programming Language
97  1968                                      Refal
98  1968                                      PILOT
99  1968                                     MAPPER

100 rows × 2 columns

# Get some descriptive summary of the dataframe
dataframe.describe()
              year
count   100.000000
mean   1958.300000
std       6.500194
min    1942.000000
25%    1952.750000
50%    1959.000000
75%    1964.000000
max    1968.000000
# Get some descriptive summary of the year column
dataframe["year"].describe()
count     100.000000
mean     1958.300000
std         6.500194
min      1942.000000
25%      1952.750000
50%      1959.000000
75%      1964.000000
max      1968.000000
Name: year, dtype: float64
# Get some descriptive summary of the languageLabel column
dataframe["languageLabel"].describe()
count          100
unique          99
top       autocode
freq             2
Name: languageLabel, dtype: object
# Get the data types of each column in the dataframe
dataframe.dtypes
year              int32
languageLabel    object
dtype: object
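
Rather than hand-encoding the query into the URL, you can build the same URL from a readable SPARQL string. A minimal sketch using Python’s standard urllib.parse (the query text below is the decoded form of the URL used above):

import urllib.parse

sparql_query = """SELECT ?languageLabel (YEAR(?inception) as ?year)
WHERE
{
  #instances of programming language
  ?language wdt:P31 wd:Q9143;
            wdt:P571 ?inception;
            rdfs:label ?languageLabel.
  FILTER(lang(?languageLabel) = "en")
}
ORDER BY ?year
LIMIT 100"""

# urlencode escapes the parameters into a query string equivalent to the URL above
params = urllib.parse.urlencode({"query": sparql_query, "format": "json"})
url = "https://query.wikidata.org/sparql?" + params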

Exercise 5 ★★★#

This exercise covers some basic data analysis. Continuing with the code in Exercise 4, let’s count the number of programming languages released in each year.

grouped = dataframe.groupby("year").count()
grouped
      languageLabel
year
1942              1
1943              1
1946              2
1948              2
1949              1
1950              3
1951             11
1952              4
1953              2
1954              4
1955              8
1956              1
1957              6
1958              2
1959              4
1960              7
1961              1
1962              5
1963              3
1964             11
1965              5
1966              7
1967              6
1968              3

The above code creates a new DataFrame named grouped by grouping the original DataFrame dataframe on the values of the column “year” and counting the occurrences in each group.

You can also use multiple aggregate functions using agg(); another example follows the output below.

grouped = dataframe.groupby("year").agg(["count"])
grouped
     languageLabel
             count
year
1942             1
1943             1
1946             2
1948             2
1949             1
1950             3
1951            11
1952             4
1953             2
1954             4
1955             8
1956             1
1957             6
1958             2
1959             4
1960             7
1961             1
1962             5
1963             3
1964            11
1965             5
1966             7
1967             6
1968             3
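
Since agg() accepts a list, you can compute several aggregates at once; for instance, the count together with the alphabetically first and last language of each year (min and max compare strings lexicographically):

# Count plus lexicographic min and max per year
dataframe.groupby("year").agg(["count", "min", "max"])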

So far, we have worked with tables having two columns. Now we focus on tables with three columns (programming language, year, paradigm). Copy the file plparadigm.json to your working directory and test the following program.

from pandas import json_normalize
import pandas as pd
import json

# Load JSON data from a file
jsondata = json.load(open("../data/plparadigm.json"))

# Create an empty list to store temporary data
array = []

# Loop through each item in the JSON data
for data in jsondata:
    # Extract specific values from each item and append them to the list
    # Assuming "year", "languageLabel", and "paradigmLabel" are keys in the JSON data
    array.append([data["year"], data["languageLabel"], data["paradigmLabel"]])

# Create a pandas DataFrame from the list with named columns
dataframe = pd.DataFrame(array, columns=["year", "languageLabel", "paradigmLabel"])

# Set data types for the columns in the DataFrame
# This ensures efficient storage and avoids potential errors during operations
dataframe = dataframe.astype(
    dtype={"year": "int64", "languageLabel": "<U200", "paradigmLabel": "<U200"}
)

# Group the DataFrame by "year" and "paradigmLabel" columns
grouped = dataframe.groupby(["year", "paradigmLabel"]).agg(["count"])
grouped
                                            languageLabel
                                                    count
year paradigmLabel
1948 procedural programming                             1
1949 non-structured programming                         1
     procedural programming                             1
1950 procedural programming                             1
1953 procedural programming                             1
...                                                   ...
1970 imperative programming                             1
     procedural programming                             1
     structured programming                             1
1971 functional programming                             1
     knowledge representation and reasoning             1
73 rows × 1 columns

Now test the following program. Compare the difference in output.

grouped = dataframe.groupby(["paradigmLabel", "year"]).agg(["count"])
grouped
                            languageLabel
                                    count
paradigmLabel          year
array programming      1954             1
                       1957             1
                       1962             1
                       1966             1
assembly language      1956             1
...                     ...           ...
structured programming 1957             1
                       1958             1
                       1962             1
                       1966             1
                       1970             1
73 rows × 1 columns

Your next goal is to get the population information of different countries (limited to 10000 rows). Run the following query on the Wikidata Query Service and download the JSON file.

SELECT DISTINCT ?countryLabel (YEAR(?date) as ?year) ?population
WHERE {
 ?country wdt:P31 wd:Q6256; #Country 
   p:P1082 ?populationStatement;
  rdfs:label ?countryLabel. #Label
 ?populationStatement ps:P1082 ?population; #population
  pq:P585 ?date. #period in time
 FILTER(lang(?countryLabel)="en") #Label in English
}
ORDER by ?countryLabel ?year
LIMIT 10000

Now, compute and display the following information (using various operations available in the pandas library (Ref: (https://pandas.pydata.org/pandas-docs/stable/10min.html))); a sketch of one possible approach follows the list:

  1. The population of countries in alphabetical order of their names and ascending order of year.

  2. The latest available population of every country

  3. The country with the lowest and highest population (considering the latest population)
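
A minimal sketch of one possible approach, assuming the downloaded result is saved as ../data/countries.json (a hypothetical path) and, like plparadigm.json above, is a flat JSON array of objects with countryLabel, year and population keys:

import json
import pandas as pd

jsondata = json.load(open("../data/countries.json"))  # hypothetical path
array = [[d["countryLabel"], d["year"], d["population"]] for d in jsondata]
countries = pd.DataFrame(array, columns=["country", "year", "population"])
# Assumes whole-number population values
countries = countries.astype(dtype={"country": "<U200", "year": "int64", "population": "int64"})

# 1. Population in alphabetical order of country and ascending year
countries = countries.sort_values(["country", "year"])

# 2. Latest available population of every country (last row of each sorted group)
latest = countries.groupby("country").tail(1)

# 3. Countries with the lowest and highest latest population
print(latest.sort_values("population").head(1))  # lowest
print(latest.sort_values("population").tail(1))  # highest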

Your next goal is to run the following query to get information related to scientific articles published after 2010 (limited to 5000 rows). Run the following query on the Wikidata Query Service and download the JSON file. It gives you the following information related to each scientific article: title, main subject and publication year.

SELECT ?title ?subjectLabel ?year
{
  ?article wdt:P31 wd:Q13442814; #scientific article
           wdt:P1476 ?title; #title of the article
           wdt:P921 ?subject; #main subject
           wdt:P577 ?date. #publication date
  ?subject rdfs:label ?subjectLabel.
  BIND(YEAR(?date) as ?year).
  #published after 2010
  FILTER(lang(?title)="en" &&
     lang(?subjectLabel)="en" && ?year>2010)
}
LIMIT 5000

Now, compute and display the following information (using various operations available in the pandas library (Ref: (https://pandas.pydata.org/pandas-docs/stable/10min.html))); a sketch follows the hint and note below:

  1. The number of articles published on different subjects every year.

  2. Top subject of interest to the scientific community every year (based on the above query results).

  3. Top 10 subjects of interest to the scientific community (based on the above query results) since 2010.

Hint: Take a look at the pandas functions groupby, reset_index, head, tail, sort_values and count.

Note: If you get time-out errors, please change the LIMIT to some lower values (1000, 2000, 5000).
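
Again a minimal sketch, assuming the result is saved as ../data/articles.json (a hypothetical path) as a flat JSON array with title, subjectLabel and year keys:

import json
import pandas as pd

jsondata = json.load(open("../data/articles.json"))  # hypothetical path
articles = pd.DataFrame(jsondata, columns=["title", "subjectLabel", "year"])
articles = articles.astype(dtype={"year": "int64"})

# 1. Number of articles per subject and year
counts = articles.groupby(["year", "subjectLabel"]).count()

# 2. Top subject per year: count, sort, then keep the last row of each year
per_year = articles.groupby(["year", "subjectLabel"]).size().reset_index(name="count")
top_per_year = per_year.sort_values(["year", "count"]).groupby("year").tail(1)

# 3. Top 10 subjects overall since 2010
top10 = articles.groupby("subjectLabel").size().sort_values(ascending=False).head(10)
print(top_per_year)
print(top10)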

Exercise 6 ★★★#

In our final exercise, we will query Wikidata and obtain the URLs of the images. Then we will download these images.

import sys

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

# Get cities
query = """SELECT DISTINCT ?bigcity ?bigcityLabel ?country ?countryLabel ?image {
  ?bigcity wdt:P31 wd:Q1549591;
               wdt:P17 ?country;
               wdt:P18 ?image.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10"""


def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (
        sys.version_info[0],
        sys.version_info[1],
    )
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


array = []
results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    array.append(
        (
            result["bigcityLabel"]["value"],
            result["countryLabel"]["value"],
            result["image"]["value"],
        )
    )
dataframe = pd.DataFrame(array, columns=["city", "country", "image"])
dataframe = dataframe.astype(
    dtype={"city": "<U200", "country": "<U200", "image": "<U200"}
)
dataframe
          city        country                                              image
0     Solingen        Germany  http://commons.wikimedia.org/wiki/Special:File...
1    Paderborn        Germany  http://commons.wikimedia.org/wiki/Special:File...
2      Cottbus        Germany  http://commons.wikimedia.org/wiki/Special:File...
3       Bissau  Guinea-Bissau  http://commons.wikimedia.org/wiki/Special:File...
4  Ouagadougou   Burkina Faso  http://commons.wikimedia.org/wiki/Special:File...
5     Belgrade         Serbia  http://commons.wikimedia.org/wiki/Special:File...
6        Neuss        Germany  http://commons.wikimedia.org/wiki/Special:File...
7        Moers        Germany  http://commons.wikimedia.org/wiki/Special:File...
8    Remscheid        Germany  http://commons.wikimedia.org/wiki/Special:File...
9       Muscat           Oman  http://commons.wikimedia.org/wiki/Special:File...

We will now download these images.

import requests
import shutil
import os


def download_image(url):
    # Wikimedia servers expect a User-Agent header; requests may otherwise be refused
    headers = {"User-Agent": "Mozilla/5.0"}
    request = requests.get(url, allow_redirects=True, headers=headers, stream=True)
    if request.status_code == 200:
        # Save the raw response body under the last component of the URL
        with open(os.path.basename(url), "wb") as image:
            request.raw.decode_content = True
            shutil.copyfileobj(request.raw, image)
    return request.status_code
dataframe.image.apply(download_image)
0    200
1    200
2    200
3    200
4    200
5    200
6    200
7    200
8    200
9    200
Name: image, dtype: int64
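
Note that these image URLs typically end in a percent-encoded file name, so os.path.basename(url) may save files with names containing sequences like %20. If you prefer readable file names, you can decode them first; a small variation (urllib.parse.unquote is from the standard library):

import os
import urllib.parse

def image_filename(url):
    # Decode percent-escapes (e.g. %20 -> space) in the last URL component
    return urllib.parse.unquote(os.path.basename(url))

You would then open image_filename(url) instead of os.path.basename(url) inside download_image.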

Modify the above code and download images related to any subject of your choice (cities, historical monuments, flowers, buildings, etc.).