Programming Session 2#

Goals#

  • Parsing and working with CSV, TSV and JSON files

  • Querying external data sources

  • Data analyses

Exercises#

  1. Parsing and reading CSV/TSV files

  2. Parsing and reading JSON files

  3. Querying external data sources (Query endpoints and API)

  4. Performing classical data analyses

Screenshots#

Screenshots have been provided so that you can check and compare whether the output you have is the same as what we want from the step (or exercise) in question.

Exercise 1 ★#

We first install the necessary packages. In Jupyter notebooks, you can use the ! command to run shell commands directly from a notebook cell. To install Python packages such as NumPy and pandas, you can use the following command:

!pip3 install numpy pandas matplotlib sparqlwrapper
Requirement already satisfied: numpy in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.0.0)
Requirement already satisfied: pandas in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.2.2)
Requirement already satisfied: matplotlib in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (3.9.1)
Requirement already satisfied: sparqlwrapper in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (2.0.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from pandas) (2024.1)
Requirement already satisfied: contourpy>=1.0.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (4.53.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: rdflib>=6.1.1 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from sparqlwrapper) (7.0.0)
Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: isodate<0.7.0,>=0.6.0 in /opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages (from rdflib>=6.1.1->sparqlwrapper) (0.6.1)

To import the NumPy, pandas, Matplotlib and SPARQLWrapper libraries in a Jupyter notebook, use the following Python import statements:

import numpy as np
import pandas as pd
import matplotlib as plot
import SPARQLWrapper as sw

The __version__ attribute is a common way to access the version information of a Python package.

print(np.__version__)
print(pd.__version__)
print(plot.__version__)
print(sw.__version__)
2.0.0
2.2.2
3.9.1
2.0.0

Exercise 2 ★#

Most of the time, we work with CSV (comma-separated values) files for data analysis. A CSV file consists of one or more lines, and each line has one or more values separated by commas. One can consider every line as a row and every value in a row as a column value. The first row is sometimes used to describe the column names.

Copy the file pl.csv to your current working directory (where you are running Jupyter: TP1) and use the following code to parse the CSV file. Note the column names and datatypes (U100, i4), where U100 corresponds to a 100-character Unicode string and i4 corresponds to a 32-bit signed integer.

Please check the complete list of dtypes (Ref: (https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)).

import numpy as np

dataset = np.loadtxt(
    "../data/pl.csv",  # Change this value to the path of your CSV file
    dtype={"names": ("name", "year"), "formats": ("U100", "i4")},
    skiprows=1,  # skip the first row, since it's the header
    delimiter=",",  # the separator is comma since it is a CSV file
    encoding="UTF-8",  # UTF-8 encoding
)
print(dataset)
[('ENIAC coding system', 1943) ('ENIAC Short Code', 1946)
 ('Von Neumann and Goldstine graphing system', 1946)
 ('ARC Assembly', 1947) ('Plankalkül', 1948) ('CPC Coding scheme', 1948)
 ('Curry notation system', 1948) ('Short Code', 1949)
 ('assembly language', 1949) ('Short Code', 1950) ('G-code', 1950)
 ('Birkbeck Assembler', 1950) ('Superplan', 1951) ('ALGAE', 1951)
 ('Intermediate Programming Language', 1951)
 ('Regional Assembly Language', 1951)
 ('Boehm unnamed coding system', 1951) ('Klammerausdrücke', 1951)
 ('OMNIBAC Symbolic Assembler', 1951) ('Stanislaus', 1951)
 ('Whirlwind assembler', 1951) ('Rochester assembler', 1951)
 ('Sort Merge Generator', 1951) ('autocode', 1952) ('A-0 System', 1952)
 ('Editing Generator', 1952) ('COMPOOL', 1952) ('Speedcoding', 1953)
 ('READ/PRINT', 1953) ('Fortran', 1954) ('ARITH-MATIC', 1954)
 ('autocode', 1954) ('Laning and Zierler system', 1954)
 ('MATH-MATIC', 1954) ('MATRIX MATH', 1954) ('FLOW-MATIC', 1955)
 ('PACT', 1955) ('BACAIC', 1955) ('Freiburger Code', 1955)
 ('Sequentielle Formelübersetzung', 1955) ('Internal Translator', 1955)
 ('PRINT', 1955) ('Information Processing Language', 1956)
 ('FORTRAN for the IBM 704', 1956) ('Fortran', 1957) ('COMTRAN', 1957)
 ('GEORGE', 1957) ('UNICODE', 1957) ('ALGOL 58', 1958) ('Lisp', 1958)
 ('ALGOL', 1958) ('FORTRAN III', 1958) ('FORTRAN II', 1958)
 ('JOVIAL', 1959) ('TRAC', 1959) ('COBOL', 1959) ('MAD', 1959)
 ('Lisp', 1959) ('FACT', 1959) ('COBOL', 1960) ('ALCOR', 1960)
 ('ALGOL 60', 1960) ('FORTRAN IV', 1961) ('TECO', 1962) ('SNOBOL', 1962)
 ('APL', 1962) ('Simula', 1962) ('CORC', 1962) ('CPL', 1963)
 ('JOSS', 1963) ('P′′', 1964) ('Report Program Generator', 1964)
 ('TRAC', 1964) ('BASIC', 1964) ('COWSEL', 1964) ('MARK IV', 1964)
 ('MIMIC', 1964) ('PL/I', 1964) ('Speakeasy', 1964) ('IBM RPG II', 1965)
 ('Atlas Autocode', 1965) ('TELCOMP', 1965) ('Euler', 1966)
 ('ISWIM', 1966) ('ALGOL W', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1966)
 ('JOSS', 1966) ('Coral 66', 1966) ('BCPL', 1966) ('APL', 1966)
 ('FORTRAN 66', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1967)
 ('Hop', 1967) ('XPL', 1967) ('Interlisp', 1967) ('BCPL', 1967)
 ('Simula', 1967) ('Space Programming Language', 1967) ('PILOT', 1968)
 ('ALGOL 68', 1968)]
  • np.loadtxt: This function loads data from a text file, with each row in the text file being converted to a NumPy array element.

  • "../data/pl.csv": The path to the CSV file. You need to change this to the actual path where your CSV file is located.

  • dtype={"names": ("name", "year"), "formats": ("U100", "i4")}: Specifies the data type for the resulting NumPy array. Here, dtype is defined as a dictionary with:

    • names: A tuple specifying the names of the columns.

    • formats: A tuple specifying the data type of each column. "U100" stands for a Unicode string of maximum length 100, and "i4" stands for a 4-byte integer.

  • skiprows=1: This parameter skips the first row of the CSV file, which is assumed to be the header.

  • delimiter=",": Specifies the delimiter that separates the values in the CSV file. Here, it is a comma.

  • encoding="UTF-8": Specifies the encoding of the file. UTF-8 is a common encoding for text files.

Important Notes:#

  1. File Path: Ensure the file path ("../data/pl.csv") is correct relative to your Jupyter notebook or script. You might need to adjust it depending on where your CSV file is located.

  2. CSV File Structure: The CSV file should have a header row that matches the column names specified in the dtype parameter, and the data should follow in subsequent rows.

  3. Column Data Types: Ensure that the data types ("U100" for strings and "i4" for integers) match the actual data in your CSV file.

CSV support in numpy (Ref: (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)) is different from Python’s default CSV reader (Ref: (https://docs.python.org/3.9/library/csv.html)) because it can associate data types (Ref: (https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)) with the columns. Before continuing, take a close look at numpy.loadtxt (Ref: (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)).
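
For comparison, here is a minimal sketch of reading the same file with Python’s built-in csv module; every value comes back as a string, so the year needs an explicit int() conversion:

import csv

with open("../data/pl.csv", encoding="UTF-8") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    rows = [(name, int(year)) for name, year in reader]
print(rows[:3])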

Copy the file pl.tsv to your current working directory and use the following code to parse the TSV file.

import numpy as np

dataset = np.loadtxt(
    "../data/pl.tsv",  # Change this value to the path of your CSV file
    dtype={"names": ("name", "year"), "formats": ("U100", "i4")},
    skiprows=1,
    delimiter="\t",
    encoding="UTF-8",
)
print(dataset)
[('ENIAC coding system', 1943) ('ENIAC Short Code', 1946)
 ('Von Neumann and Goldstine graphing system', 1946)
 ('ARC Assembly', 1947) ('Plankalkül', 1948) ('CPC Coding scheme', 1948)
 ('Curry notation system', 1948) ('Short Code', 1949)
 ('assembly language', 1949) ('Short Code', 1950)
 ('Birkbeck Assembler', 1950) ('G-code', 1950) ('Superplan', 1951)
 ('ALGAE', 1951) ('Intermediate Programming Language', 1951)
 ('Regional Assembly Language', 1951)
 ('Boehm unnamed coding system', 1951) ('Klammerausdrücke', 1951)
 ('OMNIBAC Symbolic Assembler', 1951) ('Stanislaus', 1951)
 ('Whirlwind assembler', 1951) ('Rochester assembler', 1951)
 ('Sort Merge Generator', 1951) ('Editing Generator', 1952)
 ('COMPOOL', 1952) ('autocode', 1952) ('A-0 System', 1952)
 ('READ/PRINT', 1953) ('Speedcoding', 1953) ('Fortran', 1954)
 ('MATRIX MATH', 1954) ('autocode', 1954) ('ARITH-MATIC', 1954)
 ('Laning and Zierler system', 1954) ('MATH-MATIC', 1954)
 ('FLOW-MATIC', 1955) ('BACAIC', 1955) ('Freiburger Code', 1955)
 ('Sequentielle Formelübersetzung', 1955) ('Internal Translator', 1955)
 ('PRINT', 1955) ('PACT', 1955) ('Information Processing Language', 1956)
 ('FORTRAN for the IBM 704', 1956) ('Fortran', 1957) ('UNICODE', 1957)
 ('COMTRAN', 1957) ('GEORGE', 1957) ('ALGOL 58', 1958) ('Lisp', 1958)
 ('FORTRAN III', 1958) ('FORTRAN II', 1958) ('ALGOL', 1958)
 ('JOVIAL', 1959) ('TRAC', 1959) ('COBOL', 1959) ('Lisp', 1959)
 ('MAD', 1959) ('FACT', 1959) ('COBOL', 1960) ('COBOL', 1960)
 ('ALGOL 60', 1960) ('ALCOR', 1960) ('FORTRAN IV', 1961) ('SNOBOL', 1962)
 ('TECO', 1962) ('Simula', 1962) ('APL', 1962) ('CORC', 1962)
 ('CPL', 1963) ('JOSS', 1963) ('P′′', 1964)
 ('Report Program Generator', 1964) ('TRAC', 1964) ('BASIC', 1964)
 ('COWSEL', 1964) ('PL/I', 1964) ('MARK IV', 1964) ('MIMIC', 1964)
 ('Speakeasy', 1964) ('IBM RPG II', 1965) ('Atlas Autocode', 1965)
 ('TELCOMP', 1965) ('Euler', 1966) ('ISWIM', 1966) ('ALGOL W', 1966)
 ('JOSS', 1966)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1966)
 ('BCPL', 1966) ('Coral 66', 1966) ('FORTRAN 66', 1966) ('APL', 1966)
 ('Hop', 1967)
 ('Massachusetts General Hospital Utility Multi-Programming System', 1967)
 ('Space Programming Language', 1967) ('BCPL', 1967) ('Interlisp', 1967)
 ('Simula', 1967) ('XPL', 1967) ('PILOT', 1968)]

Note the changes in the above code compared to the previous one. A TSV file is a tab-separated values file, i.e., the column values are separated by a tab character (\t).

The parameters are the same as in the CSV example above, with two changes:

  • "../data/pl.tsv": The path to the TSV file. You need to change this to the actual path where your TSV file is located.

  • delimiter="\t": The values in a TSV file are separated by a tab character instead of a comma.

To count the number of rows in your dataset, you can use len():

# Print the number of elements in the dataset
print(len(dataset))
100

You can also display the value of variables and expressions in a notebook without print(); Jupyter automatically renders the result of the last expression in a cell:

len(dataset)
100

Exercise 3 ★★#

Many external data sources provide their data in JSON format. Our next exercise is to parse JSON files. Copy the file pl.json to your current working directory and use the following code to parse the JSON file. In this exercise, we use the pandas Python package (Ref: (https://pandas.pydata.org/pandas-docs/stable/)) to parse the JSON file and obtain a pandas DataFrame (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)). Try using methods like transpose (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose)) and count (Ref: (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count)); an example follows the first DataFrame display below.

Before continuing this exercise, please practice working with Pandas. Take a look at 10 minutes to pandas (Ref: (https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)).

from pandas import json_normalize
import pandas as pd
import json

# Load JSON data from a file
data = json.load(open("../data/pl.json"))

# Convert JSON data to a pandas DataFrame
dataframe = json_normalize(data)

print(dataframe)
                                languageLabel  year
0                         ENIAC coding system  1943
1                            ENIAC Short Code  1946
2   Von Neumann and Goldstine graphing system  1946
3                                ARC Assembly  1947
4                                  Plankalkül  1948
..                                        ...   ...
95                                       BCPL  1967
96                                  Interlisp  1967
97                                     Simula  1967
98                                        XPL  1967
99                                      PILOT  1968

[100 rows x 2 columns]

This code uses Python’s pandas library to work with JSON data.

  1. Imports:

    • from pandas import json_normalize: Imports the json_normalize function from the pandas library. This function is used to flatten JSON data into a DataFrame.

    • import pandas as pd: Imports the pandas library with the alias pd, which is a common convention.

    • import json: Imports Python’s built-in json module for handling JSON data.

  2. Loading JSON Data:

    • data = json.load(open("../data/pl.json")): Opens a JSON file located at "../data/pl.json" and loads its contents into the data variable. The json.load() function reads the JSON file and parses it into Python objects (here, a list of dictionaries).

  3. Normalizing JSON into a DataFrame:

    • dataframe = json_normalize(data): Uses the json_normalize() function from pandas to convert the JSON data (data) into a flat table-like structure called a DataFrame. This step is particularly useful for analyzing structured JSON data where nested objects or arrays need to be represented in tabular form.

And the output without print()

dataframe
                                 languageLabel  year
0                          ENIAC coding system  1943
1                             ENIAC Short Code  1946
2    Von Neumann and Goldstine graphing system  1946
3                                 ARC Assembly  1947
4                                   Plankalkül  1948
..                                         ...   ...
95                                        BCPL  1967
96                                   Interlisp  1967
97                                      Simula  1967
98                                         XPL  1967
99                                       PILOT  1968

100 rows × 2 columns
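
As mentioned at the start of this exercise, you can also try methods like transpose() and count() on the DataFrame:

# Swap rows and columns
print(dataframe.transpose())

# Count the non-null values in each column
print(dataframe.count())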

To display the values of the column year:

# Select the "year" column from the dataframe
dataframe["year"]
0     1943
1     1946
2     1946
3     1947
4     1948
      ... 
95    1967
96    1967
97    1967
98    1967
99    1968
Name: year, Length: 100, dtype: object

To display the values of the column languageLabel:

# Select the "languageLabel" column from the dataframe
dataframe["languageLabel"]
0                           ENIAC coding system
1                              ENIAC Short Code
2     Von Neumann and Goldstine graphing system
3                                  ARC Assembly
4                                    Plankalkül
                        ...                    
95                                         BCPL
96                                    Interlisp
97                                       Simula
98                                          XPL
99                                        PILOT
Name: languageLabel, Length: 100, dtype: object

Get some important information like count, min and max using the method describe(). By default, it focuses on numerical columns and calculates statistics like the mean, standard deviation, quartiles (25th and 75th percentiles), minimum and maximum values. Since both columns here contain strings (dtype object), describe() instead reports the count of values, the number of unique values, the most frequent value (top) and its frequency (freq).

# Get some descriptive summary of the dataframe
dataframe.describe()
       languageLabel  year
count            100   100
unique            88    24
top            COBOL  1951
freq               3    11

The dataframe.describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution. Here’s what each part typically includes:

  1. Count: Number of non-null observations (rows) in each column.

  2. Mean: Average value of each numerical column.

  3. Std (Standard Deviation): Measures the dispersion or spread of the values in each numerical column.

  4. Min: The smallest value in each numerical column.

  5. 25% (Percentile): The value below which 25% of the observations fall (1st quartile).

  6. 50% (Median): The median or 50th percentile.

  7. 75% (Percentile): The value below which 75% of the observations fall (3rd quartile).

  8. Max: The largest value in each numerical column.

When you call describe() on a pandas DataFrame (dataframe in this case), it analyzes only the numerical columns by default and provides the statistical summary for each of these columns. Non-numerical columns are ignored unless specifically included using additional parameters, as shown below. (When the DataFrame contains no numerical columns at all, as here, the non-numerical columns are described instead.)

This method is useful for quickly getting an overview of your dataset’s numeric values, identifying outliers, understanding the distribution of your data, and more.
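
For example, once a DataFrame has numerical columns, you can still include the non-numerical ones in the summary by passing the include parameter:

# Describe all columns, whatever their data type
dataframe.describe(include="all")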

To display the unique values of the column year:

# Get the unique years from the "year" column
dataframe["year"].unique()
array(['1943', '1946', '1947', '1948', '1949', '1950', '1951', '1952',
       '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960',
       '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968'],
      dtype=object)

Note that .unique() returns values in their order of first appearance, not in sorted order.
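
If you need them sorted, sort explicitly:

# Unique years in ascending order
sorted(dataframe["year"].unique())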

To display the unique values of the column languageLabel:

# Get the unique languages from the "languageLabel" column
dataframe["languageLabel"].unique()
array(['ENIAC coding system', 'ENIAC Short Code',
       'Von Neumann and Goldstine graphing system', 'ARC Assembly',
       'Plankalkül', 'CPC Coding scheme', 'Curry notation system',
       'Short Code', 'assembly language', 'Birkbeck Assembler', 'G-code',
       'Superplan', 'ALGAE', 'Intermediate Programming Language',
       'Regional Assembly Language', 'Boehm unnamed coding system',
       'Klammerausdrücke', 'OMNIBAC Symbolic Assembler', 'Stanislaus',
       'Whirlwind assembler', 'Rochester assembler',
       'Sort Merge Generator', 'Editing Generator', 'COMPOOL', 'autocode',
       'A-0 System', 'READ/PRINT', 'Speedcoding', 'Fortran',
       'MATRIX MATH', 'ARITH-MATIC', 'Laning and Zierler system',
       'MATH-MATIC', 'FLOW-MATIC', 'BACAIC', 'Freiburger Code',
       'Sequentielle Formelübersetzung', 'Internal Translator', 'PRINT',
       'PACT', 'Information Processing Language',
       'FORTRAN for the IBM 704', 'UNICODE', 'COMTRAN', 'GEORGE',
       'ALGOL 58', 'Lisp', 'FORTRAN III', 'FORTRAN II', 'ALGOL', 'JOVIAL',
       'TRAC', 'COBOL', 'MAD', 'FACT', 'ALGOL 60', 'ALCOR', 'FORTRAN IV',
       'SNOBOL', 'TECO', 'Simula', 'APL', 'CORC', 'CPL', 'JOSS', 'P′′',
       'Report Program Generator', 'BASIC', 'COWSEL', 'PL/I', 'MARK IV',
       'MIMIC', 'Speakeasy', 'IBM RPG II', 'Atlas Autocode', 'TELCOMP',
       'Euler', 'ISWIM', 'ALGOL W',
       'Massachusetts General Hospital Utility Multi-Programming System',
       'BCPL', 'Coral 66', 'FORTRAN 66', 'Hop',
       'Space Programming Language', 'Interlisp', 'XPL', 'PILOT'],
      dtype=object)

To sort the values by one or more columns:

dataframe.sort_values(["year"])
                                        languageLabel  year
0                                 ENIAC coding system  1943
1                                    ENIAC Short Code  1946
2           Von Neumann and Goldstine graphing system  1946
3                                        ARC Assembly  1947
4                                          Plankalkül  1948
..                                                ...   ...
98                                                XPL  1967
93  Massachusetts General Hospital Utility Multi-P...  1967
92                                                Hop  1967
94                         Space Programming Language  1967
99                                              PILOT  1968

100 rows × 2 columns

.sort_values(["year"]) sorts the DataFrame by the values in the column named “year”. You can specify one or more columns to sort by. By default, sorting is done in ascending order.

dataframe.sort_values(["year", "languageLabel"])
                                        languageLabel  year
0                                 ENIAC coding system  1943
1                                    ENIAC Short Code  1946
2           Von Neumann and Goldstine graphing system  1946
3                                        ARC Assembly  1947
5                                   CPC Coding scheme  1948
..                                                ...   ...
93  Massachusetts General Hospital Utility Multi-P...  1967
97                                             Simula  1967
94                         Space Programming Language  1967
98                                                XPL  1967
99                                              PILOT  1968

100 rows × 2 columns

dataframe.sort_values(["languageLabel"])
          languageLabel  year
26           A-0 System  1952
62                ALCOR  1960
13                ALGAE  1951
52                ALGOL  1958
48             ALGOL 58  1958
..                  ...   ...
20  Whirlwind assembler  1951
98                  XPL  1967
8     assembly language  1949
31             autocode  1954
25             autocode  1952

100 rows × 2 columns

To obtain the data types of the columns:

# Get the data types of each column in the dataframe
dataframe.dtypes
languageLabel    object
year             object
dtype: object

The dataframe.dtypes attribute in pandas returns the data types of each column in the DataFrame dataframe. It provides information about how pandas has interpreted the data from the JSON (or any other source) based on its initial reading and normalization process. Here’s what you’ll typically see (a conversion example follows this list):

  • Numeric Types: Integers (int64), Floating-point numbers (float64).

  • Object Type: Typically strings (object), but can also include other Python objects.

  • DateTime Types: If applicable, pandas will recognize and label date/time data appropriately (datetime64).

  • Categorical Types: If pandas identifies a column with a small number of unique values, it may assign it a category dtype, which can optimize memory usage and speed up certain operations.
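
Since this DataFrame was built from JSON strings, even the year column has dtype object here. If you want to treat it numerically (for describe(), sorting, etc.), convert it explicitly; a minimal sketch:

# Convert the "year" column from strings to 64-bit integers
dataframe = dataframe.astype(dtype={"year": "int64"})
print(dataframe.dtypes)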

Exercise 4 ★★#

In this exercise, we will take a look at how to download data from external data sources using special query interfaces. For example, the data above was obtained from the Wikidata Query Service (https://query.wikidata.org/) interface. See the screenshot given below.

Given below is the code to read data from an external data source. Use this URL: (https://query.wikidata.org/sparql?query=SELECT%20%3FlanguageLabel%20(YEAR(%3Finception)%20as%20%3Fyear)%0AWHERE%0A%7B%0A%20%20%23instances%20of%20programming%20language%0A%20%20%3Flanguage%20wdt%3AP31%20wd%3AQ9143%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP571%20%3Finception%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlanguageLabel.%0A%20%20FILTER(lang(%3FlanguageLabel)%20%3D%20%22en%22)%0A%7D%0AORDER%20BY%20%3Fyear%0ALIMIT%20100&format=json). A sketch showing how to build this URL from a readable query string follows the outputs below.

import urllib.request
import json
import pandas as pd

url = "https://query.wikidata.org/sparql?query=SELECT%20%3FlanguageLabel%20(YEAR(%3Finception)%20as%20%3Fyear)%0AWHERE%0A%7B%0A%20%20%23instances%20of%20programming%20language%0A%20%20%3Flanguage%20wdt%3AP31%20wd%3AQ9143%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20wdt%3AP571%20%3Finception%3B%0A%20%20%20%20%20%20%20%20%20%20%20%20rdfs%3Alabel%20%3FlanguageLabel.%0A%20%20FILTER(lang(%3FlanguageLabel)%20%3D%20%22en%22)%0A%7D%0AORDER%20BY%20%3Fyear%0ALIMIT%20100&format=json"
# Send the HTTP request and parse the JSON response
response = urllib.request.urlopen(url)
responsedata = json.loads(response.read().decode("utf-8"))

array = []

# Each binding in the response holds one value for ?year and ?languageLabel
for data in responsedata["results"]["bindings"]:
    array.append([data["year"]["value"], data["languageLabel"]["value"]])

dataframe = pd.DataFrame(array, columns=["year", "languageLabel"])
# Store year as a 32-bit integer and languageLabel as a Unicode string
dataframe = dataframe.astype(dtype={"year": "<i4", "languageLabel": "<U200"})
print(dataframe)
    year                              languageLabel
0   1942                                 Plankalkül
1   1943                        ENIAC coding system
2   1946                           ENIAC Short Code
3   1946  Von Neumann and Goldstine graphing system
4   1948                          CPC Coding scheme
..   ...                                        ...
95  1967                                      ABSYS
96  1967                 Space Programming Language
97  1968                                      Refal
98  1968                                      PILOT
99  1968                                     MAPPER

[100 rows x 2 columns]
dataframe
    year                              languageLabel
0   1942                                 Plankalkül
1   1943                        ENIAC coding system
2   1946                           ENIAC Short Code
3   1946  Von Neumann and Goldstine graphing system
4   1948                          CPC Coding scheme
..   ...                                        ...
95  1967                                      ABSYS
96  1967                 Space Programming Language
97  1968                                      Refal
98  1968                                      PILOT
99  1968                                     MAPPER

100 rows × 2 columns

# Get some descriptive summary of the dataframe
dataframe.describe()
              year
count   100.000000
mean   1958.300000
std       6.500194
min    1942.000000
25%    1952.750000
50%    1959.000000
75%    1964.000000
max    1968.000000
# Get some descriptive summary of the year column
dataframe["year"].describe()
count     100.000000
mean     1958.300000
std         6.500194
min      1942.000000
25%      1952.750000
50%      1959.000000
75%      1964.000000
max      1968.000000
Name: year, dtype: float64
# Get some descriptive summary of the languageLabel column
dataframe["languageLabel"].describe()
count          100
unique          99
top       autocode
freq             2
Name: languageLabel, dtype: object
# Get the data types of each column in the dataframe
dataframe.dtypes
year              int32
languageLabel    object
dtype: object
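
Rather than hand-encoding the query into the URL, you can build the same URL from a readable SPARQL string. A minimal sketch using Python’s standard urllib.parse (the query text below is the decoded form of the URL used above):

import urllib.parse

sparql_query = """SELECT ?languageLabel (YEAR(?inception) as ?year)
WHERE
{
  #instances of programming language
  ?language wdt:P31 wd:Q9143;
            wdt:P571 ?inception;
            rdfs:label ?languageLabel.
  FILTER(lang(?languageLabel) = "en")
}
ORDER BY ?year
LIMIT 100"""

# urlencode escapes the parameters into a query string equivalent to the URL above
params = urllib.parse.urlencode({"query": sparql_query, "format": "json"})
url = "https://query.wikidata.org/sparql?" + params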

Exercise 5 ★★★#

This exercise covers some basic data analysis. Continuing with the code in Exercise 4, let’s count the number of programming languages released in each year.

grouped = dataframe.groupby("year").count()
grouped
      languageLabel
year
1942              1
1943              1
1946              2
1948              2
1949              1
1950              3
1951             11
1952              4
1953              2
1954              4
1955              8
1956              1
1957              6
1958              2
1959              4
1960              7
1961              1
1962              5
1963              3
1964             11
1965              5
1966              7
1967              6
1968              3

The above code creates a new DataFrame named grouped by grouping the original DataFrame dataframe on the values of the column “year” and counting the occurrences in each group.

You can also use multiple aggregate functions using agg(); another example follows the output below.

grouped = dataframe.groupby("year").agg(["count"])
grouped
     languageLabel
             count
year
1942             1
1943             1
1946             2
1948             2
1949             1
1950             3
1951            11
1952             4
1953             2
1954             4
1955             8
1956             1
1957             6
1958             2
1959             4
1960             7
1961             1
1962             5
1963             3
1964            11
1965             5
1966             7
1967             6
1968             3
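
Since agg() accepts a list, you can compute several aggregates at once; for instance, the count together with the alphabetically first and last language of each year (min and max compare strings lexicographically):

# Count plus lexicographic min and max per year
dataframe.groupby("year").agg(["count", "min", "max"])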

So far, we have worked with tables having two columns. Now we focus on tables with three columns (programming language, year, paradigm). Copy the file plparadigm.json to your working directory and test the following program.

from pandas import json_normalize
import pandas as pd
import json

# Load JSON data from a file
jsondata = json.load(open("../data/plparadigm.json"))

# Create an empty list to store temporary data
array = []

# Loop through each item in the JSON data
for data in jsondata:
    # Extract specific values from each item and append them to the list
    # Assuming "year", "languageLabel", and "paradigmLabel" are keys in the JSON data
    array.append([data["year"], data["languageLabel"], data["paradigmLabel"]])

# Create a pandas DataFrame from the list with named columns
dataframe = pd.DataFrame(array, columns=["year", "languageLabel", "paradigmLabel"])

# Set data types for the columns in the DataFrame
# This ensures efficient storage and avoids potential errors during operations
dataframe = dataframe.astype(
    dtype={"year": "int64", "languageLabel": "<U200", "paradigmLabel": "<U200"}
)

# Group the DataFrame by "year" and "paradigmLabel" columns
grouped = dataframe.groupby(["year", "paradigmLabel"]).agg(["count"])
grouped
                                            languageLabel
                                                    count
year paradigmLabel
1948 procedural programming                             1
1949 non-structured programming                         1
     procedural programming                             1
1950 procedural programming                             1
1953 procedural programming                             1
...                                                   ...
1970 imperative programming                             1
     procedural programming                             1
     structured programming                             1
1971 functional programming                             1
     knowledge representation and reasoning             1
73 rows × 1 columns

Now test the following program. Compare the difference in output.

grouped = dataframe.groupby(["paradigmLabel", "year"]).agg(["count"])
grouped
                            languageLabel
                                    count
paradigmLabel          year
array programming      1954             1
                       1957             1
                       1962             1
                       1966             1
assembly language      1956             1
...                     ...           ...
structured programming 1957             1
                       1958             1
                       1962             1
                       1966             1
                       1970             1
73 rows × 1 columns

Your next goal is to get the population information of different countries (limited to 10000 rows). Run the following query on the Wikidata Query Service and download the JSON file.

SELECT DISTINCT ?countryLabel (YEAR(?date) as ?year) ?population
WHERE {
 ?country wdt:P31 wd:Q6256; #Country 
   p:P1082 ?populationStatement;
  rdfs:label ?countryLabel. #Label
 ?populationStatement ps:P1082 ?population; #population
  pq:P585 ?date. #period in time
 FILTER(lang(?countryLabel)="en") #Label in English
}
ORDER by ?countryLabel ?year
LIMIT 10000

Now, compute and display the following information (using various operations available in the pandas library (Ref: (https://pandas.pydata.org/pandas-docs/stable/10min.html))); a sketch of one possible approach follows the list:

  1. The population of countries in alphabetical order of their names and ascending order of year.

  2. The latest available population of every country

  3. The country with the lowest and highest population (considering the latest population)
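
A minimal sketch of one possible approach, assuming the downloaded result is saved as ../data/countries.json (a hypothetical path) and, like plparadigm.json above, is a flat JSON array of objects with countryLabel, year and population keys:

import json
import pandas as pd

jsondata = json.load(open("../data/countries.json"))  # hypothetical path
array = [[d["countryLabel"], d["year"], d["population"]] for d in jsondata]
countries = pd.DataFrame(array, columns=["country", "year", "population"])
# Assumes whole-number population values
countries = countries.astype(dtype={"country": "<U200", "year": "int64", "population": "int64"})

# 1. Population in alphabetical order of country and ascending year
countries = countries.sort_values(["country", "year"])

# 2. Latest available population of every country (last row of each sorted group)
latest = countries.groupby("country").tail(1)

# 3. Countries with the lowest and highest latest population
print(latest.sort_values("population").head(1))  # lowest
print(latest.sort_values("population").tail(1))  # highest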

Your next goal is to run the following query to get information related to scientific articles published after 2010 (limited to 5000 rows). Run the following query on the Wikidata Query Service and download the JSON file. It gives you the following information related to each scientific article: title, main subject and publication year.

SELECT ?title ?subjectLabel ?year
{
  ?article wdt:P31 wd:Q13442814; #scientific article
           wdt:P1476 ?title; #title of the article
           wdt:P921 ?subject; #main subject
           wdt:P577 ?date. #publication date
  ?subject rdfs:label ?subjectLabel.
  BIND(YEAR(?date) as ?year).
  #published after 2010
  FILTER(lang(?title)="en" &&
     lang(?subjectLabel)="en" && ?year>2010)
}
LIMIT 5000

Now, compute and display the following information (using various operations available in the pandas library (Ref: (https://pandas.pydata.org/pandas-docs/stable/10min.html))); a sketch follows the hint and note below:

  1. The number of articles published on different subjects every year.

  2. Top subject of interest to the scientific community every year (based on the above query results).

  3. Top 10 subjects of interest to the scientific community (based on the above query results) since 2010.

Hint: Take a look at the pandas functions groupby, reset_index, head, tail, sort_values and count.

Note: If you get time-out errors, please change the LIMIT to some lower values (1000, 2000, 5000).
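
Again a minimal sketch, assuming the result is saved as ../data/articles.json (a hypothetical path) as a flat JSON array with title, subjectLabel and year keys:

import json
import pandas as pd

jsondata = json.load(open("../data/articles.json"))  # hypothetical path
articles = pd.DataFrame(jsondata, columns=["title", "subjectLabel", "year"])
articles = articles.astype(dtype={"year": "int64"})

# 1. Number of articles per subject and year
counts = articles.groupby(["year", "subjectLabel"]).count()

# 2. Top subject per year: count, sort, then keep the last row of each year
per_year = articles.groupby(["year", "subjectLabel"]).size().reset_index(name="count")
top_per_year = per_year.sort_values(["year", "count"]).groupby("year").tail(1)

# 3. Top 10 subjects overall since 2010
top10 = articles.groupby("subjectLabel").size().sort_values(ascending=False).head(10)
print(top_per_year)
print(top10)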

Exercise 6 ★★★#

In our final exercise, we will query Wikidata and obtain the URLs of the images. Then we will download these images.

import sys

import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

# Get cities
query = """SELECT DISTINCT ?bigcity ?bigcityLabel ?country ?countryLabel ?image {
  ?bigcity wdt:P31 wd:Q1549591;
               wdt:P17 ?country;
               wdt:P18 ?image.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10"""


def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (
        sys.version_info[0],
        sys.version_info[1],
    )
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()


array = []
results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    array.append(
        (
            result["bigcityLabel"]["value"],
            result["countryLabel"]["value"],
            result["image"]["value"],
        )
    )
dataframe = pd.DataFrame(array, columns=["city", "country", "image"])
dataframe = dataframe.astype(
    dtype={"city": "<U200", "country": "<U200", "image": "<U200"}
)
dataframe
          city        country                                              image
0     Solingen        Germany  http://commons.wikimedia.org/wiki/Special:File...
1    Paderborn        Germany  http://commons.wikimedia.org/wiki/Special:File...
2      Cottbus        Germany  http://commons.wikimedia.org/wiki/Special:File...
3       Bissau  Guinea-Bissau  http://commons.wikimedia.org/wiki/Special:File...
4  Ouagadougou   Burkina Faso  http://commons.wikimedia.org/wiki/Special:File...
5     Belgrade         Serbia  http://commons.wikimedia.org/wiki/Special:File...
6        Neuss        Germany  http://commons.wikimedia.org/wiki/Special:File...
7        Moers        Germany  http://commons.wikimedia.org/wiki/Special:File...
8    Remscheid        Germany  http://commons.wikimedia.org/wiki/Special:File...
9       Muscat           Oman  http://commons.wikimedia.org/wiki/Special:File...

We will now download these images.

import requests
import shutil
import os


def download_image(url):
    # Wikimedia servers expect a User-Agent header; requests may otherwise be refused
    headers = {"User-Agent": "Mozilla/5.0"}
    request = requests.get(url, allow_redirects=True, headers=headers, stream=True)
    if request.status_code == 200:
        # Save the raw response body under the last component of the URL
        with open(os.path.basename(url), "wb") as image:
            request.raw.decode_content = True
            shutil.copyfileobj(request.raw, image)
    return request.status_code
dataframe.image.apply(download_image)
0    200
1    200
2    200
3    200
4    200
5    200
6    200
7    200
8    200
9    200
Name: image, dtype: int64
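
Note that these image URLs typically end in a percent-encoded file name, so os.path.basename(url) may save files with names containing sequences like %20. If you prefer readable file names, you can decode them first; a small variation (urllib.parse.unquote is from the standard library):

import os
import urllib.parse

def image_filename(url):
    # Decode percent-escapes (e.g. %20 -> space) in the last URL component
    return urllib.parse.unquote(os.path.basename(url))

You would then open image_filename(url) instead of os.path.basename(url) inside download_image.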

Modify the above code and download images related to any subject of your choice (cities, historical monuments, flowers, buildings, etc.).