Practical Work 1#
Academic year: 2024-2025#
Goals#
Reinforce concepts using tools like NumPy and scikit-learn.
Apply propositional logic and test it on image datasets.
Explore predicate logic and validate it on image datasets.
Understand text analysis techniques, including stemming, lemmatization, and morphological analysis.
Exercise 1.0 [★]#
Work through the Python/Jupyter notebook recap and familiarize yourself with the different methods of libraries such as NumPy and scikit-learn.
Exercise 1.1 [★]#
The first exercise involves testing propositional logic. The CSV file (image_data.csv) contains attributes such as color, shape, size, texture, and classification, obtained after an image analysis process. Define the propositions given below and test logical expressions based on these attributes to evaluate relationships within the data.
Step 1: Load and Inspect the CSV File#
Read the CSV file into a pandas DataFrame.
Print the first few rows of the DataFrame to understand the structure.
import pandas as pd
# Load the CSV file
df = pd.read_csv('../../data/image_data.csv')
# Inspect the data
print(df.head())
Step 2: Define Propositions Based on the Data#
Create boolean propositions based on the columns in the CSV file.
P: Checks if the color is blue.
Q: Checks if the shape is a circle.
R: Checks if the classification is animal.
S: Checks if the size is large (define a threshold for ‘large’).
T: Checks if the texture is rough.
U: Checks if the classification is vehicle.
V: Checks if the classification is building.
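A minimal sketch of these propositions as pandas boolean Series is shown below; the column names and the size threshold of 100 for ‘large’ are assumptions to be adapted to the actual CSV.
# Propositions as boolean Series (column names and the threshold of 100 are assumptions)
P = df['color'] == 'blue'               # P: the color is blue
Q = df['shape'] == 'circle'             # Q: the shape is a circle
R = df['classification'] == 'animal'    # R: the classification is animal
S = df['size'] >= 100                   # S: the size is large (chosen threshold)
T = df['texture'] == 'rough'            # T: the texture is rough
U = df['classification'] == 'vehicle'   # U: the classification is vehicle
V = df['classification'] == 'building'  # V: the classification is building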
Step 3: Define the Logical Expressions#
Create logical expressions based on the propositions defined in Step 2.
expr1: If the color is blue and the shape is a circle, then the classification is animal.
expr2: If the size is large and the texture is rough, then the classification is vehicle.
expr3: If the color is blue, the shape is a circle, and the size is large, then the classification is building.
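One possible encoding, assuming the boolean Series defined in Step 2: a material implication A → B can be written element-wise as ~A | B.
# Material implication A -> B written as (~A | B)
expr1 = ~(P & Q) | R      # blue and circle        => animal
expr2 = ~(S & T) | U      # large and rough        => vehicle
expr3 = ~(P & Q & S) | V  # blue, circle and large => building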
Step 4: Test the Expressions#
Test the logical expressions on each row of the DataFrame.
Output the results of the expressions for each row.
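A short sketch for testing and printing the expressions row by row, assuming the Series defined above:
# Evaluate the expressions for each row and print the results
results = pd.DataFrame({'expr1': expr1, 'expr2': expr2, 'expr3': expr3})
for index, row in results.iterrows():
    print(f"Row {index}: expr1={row['expr1']}, expr2={row['expr2']}, expr3={row['expr3']}")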
Step 5: Add a Compound Proposition with Negation and Disjunction#
Create a new complex expression that tests the following:
If the object is not blue or has a smooth texture, then it is not classified as an object.
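A possible encoding, reusing P from Step 2 and assuming the texture and classification columns:
# (not blue OR smooth texture) => not classified as 'object'
smooth = df['texture'] == 'smooth'
not_object = df['classification'] != 'object'
expr4 = ~(~P | smooth) | not_object
print(expr4)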
Step 6: Count Satisfying Rows for Each Expression#
Count the number of rows where each logical expression is True and compare the frequencies of satisfied propositions.
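Counting can be done with sum() on the boolean Series, for example (expr4 is the compound expression from Step 5):
# Count the rows satisfying each expression and compare the frequencies
for name, expr in [('expr1', expr1), ('expr2', expr2), ('expr3', expr3), ('expr4', expr4)]:
    print(f"{name}: {expr.sum()} of {len(df)} rows satisfied")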
Exercise 1.2 [★]#
Step 1: Install and Import Z3#
Install the Z3 solver library (z3-solver).
Import Z3 and familiarize yourself with its basic functions.
!pip install z3-solver
from z3 import *
Step 2: Define the Attributes as First-Order Logic Variables#
Define variables for each column (e.g., color, shape, size).
Specify the possible values (e.g., color can be blue, red, green, etc.).
Define constraints for attributes such as size being an integer and other attributes being strings.
# Declare Z3 variables
Color = String('color')
Shape = String('shape')
Size = Int('size')
Texture = String('texture')
Classification = String('classification')
Step 3: Encode Logical Propositions in FOL#
Encode the provided propositions using Z3’s logic (See exercise 1.1).
Example: If the object is blue and circular, then it is classified as an animal (Implies(And(Color == "blue", Shape == "circle"), Classification == "animal")).
# Define constraints
valid_colors = Or(Color == "blue", Color == "red", Color == "green", Color == "yellow", Color == "purple")
valid_shapes = Or(Shape == "circle", Shape == "square", Shape == "triangle", Shape == "rectangle", Shape == "ellipse")
valid_size = Size >= 100 # Size constraint
valid_textures = Or(Texture == "polka dot", Texture == "smooth", Texture == "patterned", Texture == "rough")
valid_classifications = Or(Classification == "animal", Classification == "plant", Classification == "object", Classification == "vehicle", Classification == "building")
# Add these constraints to the solver
solver = Solver()
solver.add(valid_colors, valid_shapes, valid_size, valid_textures, valid_classifications)
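The propositions from exercise 1.1 can then be encoded and added to the same solver; a minimal sketch, reusing Size >= 100 as the threshold for ‘large’:
# Encode the propositions of exercise 1.1 in first-order logic
expr1 = Implies(And(Color == "blue", Shape == "circle"), Classification == "animal")
expr2 = Implies(And(Size >= 100, Texture == "rough"), Classification == "vehicle")
expr3 = Implies(And(Color == "blue", Shape == "circle", Size >= 100), Classification == "building")
solver.add(expr1, expr2, expr3)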
Step 4: Solve for Satisfiability#
Use the Z3 solver to check whether the propositions are satisfiable.
Print the results.
# Check if the solver finds a solution that satisfies the constraints
if solver.check() == sat:
    print("The propositions are satisfiable.")
    model = solver.model()
    print(model)
else:
    print("The propositions are not satisfiable.")
Step 5: Add Additional Constraints#
Add a constraint that restricts certain combinations, such as “if the object is green, it cannot be circular.”
Add another constraint where “polka dot objects cannot be vehicles.”
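Both restrictions can be expressed as implications, for example:
# If the object is green, it cannot be circular
no_green_circle = Implies(Color == "green", Shape != "circle")
# Polka dot objects cannot be vehicles
no_polka_dot_vehicle = Implies(Texture == "polka dot", Classification != "vehicle")
solver.add(no_green_circle, no_polka_dot_vehicle)
# Re-check satisfiability with the additional constraints
print(solver.check())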
Exercise 1.3 [★★]#
Read the CSV file image_data.csv and define attributes as Z3 variables based on the file’s data. Encode logical propositions and constraints using first-order logic for each row, then solve for satisfiability.
import pandas as pd
from z3 import *
# Load CSV data
df = pd.read_csv('../../data/image_data.csv')
# Initialize Z3 solver
solver = Solver()
Step 2: Define Z3 Variables Dynamically from CSV Data#
For each row in the CSV, define the attributes as Z3 variables and ensure the types are consistent.
# Define Z3 variables for each attribute dynamically for each row
for index, row in df.iterrows():
    color = String(f'color_{index}')
    shape = String(f'shape_{index}')
    size = Int(f'size_{index}')
    texture = String(f'texture_{index}')
    classification = String(f'classification_{index}')
    # Bind each row's variables to the values observed in the CSV
    solver.add(And(color == row['color'], shape == row['shape'], size == int(row['size']), texture == row['texture'], classification == row['classification']))
Step 3: Encode Propositions in FOL#
Write logical propositions for each row, like “if an object is blue and circular, then it is classified as an animal.”
Use exercise 1.1 and add additional propositions
# Example FOL for each row
for index, row in df.iterrows():
    expr1 = Implies(And(String(f'color_{index}') == "blue", String(f'shape_{index}') == "circle"), String(f'classification_{index}') == "animal")
    solver.add(expr1)
Step 4: Solve for Satisfiability#
Check whether the logical propositions for the CSV data are satisfiable.
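A minimal check, mirroring exercise 1.2:
# Check whether the constraints built from the CSV rows are satisfiable
if solver.check() == sat:
    print("The propositions are satisfiable for the CSV data.")
    print(solver.model())
else:
    print("The propositions are not satisfiable for the CSV data.")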
Step 5: Add Additional Constraints#
Include new constraints (e.g., “green objects cannot be circular”) and check the result again.
# Green objects cannot be circular, applied to every row
for index in range(len(df)):
    no_green_circle = Implies(String(f'color_{index}') == "green", String(f'shape_{index}') != "circle")
    solver.add(no_green_circle)
Step 6: Introduce an Unsatisfiable Constraint#
Add a conflicting constraint that forces an object to have two mutually exclusive attributes (e.g., being both blue and red).
Check for satisfiability and show that the model becomes unsatisfiable.
# Add a conflicting constraint: Object must be both blue and red (which is impossible)
for index in range(len(df)):
    conflicting_constraint = And(String(f'color_{index}') == "blue", String(f'color_{index}') == "red")
    solver.add(conflicting_constraint)
# Check satisfiability after adding the conflicting constraint
if solver.check() == sat:
    print("The propositions are still satisfiable.")
else:
    print("The model is now unsatisfiable due to conflicting constraints.")
Exercise 1.4 [★★]#
Download the Wikipedia page https://fr.wikipedia.org/wiki/Paris and save it as an HTML file. Analyze the page by extracting and counting words, links, images, numbers, dates, proper nouns, and structured data from tables, while differentiating between sections and paragraphs. This involves downloading the HTML, parsing it, and systematically identifying relevant content. Write a program to implement these tasks:
Download HTML: Fetch and save the Wikipedia page as an HTML file.
Load Content: Read and parse the HTML file for analysis.
Word Analysis: Count word occurrences in the text.
Extract Links: Identify and categorize internal and external links.
Image Extraction: Locate images and gather their URLs and sizes.
Number and Date Extraction: Identify numbers, dates, and geographical coordinates.
Proper Nouns: Extract names of people and places.
Table Data: Locate and extract data from tables.
Section Differentiation: Identify sections and paragraphs in the content.
Analysis of Wikipedia Page: Paris#
In this notebook, you will extract and analyze various elements from the Wikipedia page of Paris.
Step 1: Download the HTML Page#
First, download the HTML content of the specified Wikipedia page and save it as an HTML file. We use the requests library to handle the HTTP request. Remember to check the response status to confirm that the page was downloaded successfully.
import requests
# URL of the Wikipedia page
url = "https://fr.wikipedia.org/wiki/Paris"
# Send a GET request to the URL
response = requests.get(url)
response.raise_for_status()  # Confirm that the page was downloaded successfully
# Save the content as an HTML file
with open("paris.html", "w", encoding='utf-8') as file:
    file.write(response.text)
print("HTML page downloaded and saved as paris.html")
Step 2: Load the HTML Content#
Load the downloaded HTML file for further analysis.
Comment: Parsing the HTML is crucial for extracting data. Make sure to use a library like BeautifulSoup that can navigate the HTML structure effectively.
Familiarize yourself with the BeautifulSoup methods to find elements in the HTML, such as find() and find_all().
from bs4 import BeautifulSoup
# Load the HTML file
with open("paris.html", "r", encoding='utf-8') as file:
    html_content = file.read()
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
print("HTML content loaded.")
Step 3: Extract and Analyze Words#
Count the occurrences of each word in the page.
Comment: Consider normalizing the text by converting it to lowercase to avoid counting the same word in different cases separately. We use regular expressions to effectively filter out non-word characters when splitting the text into words.
from collections import Counter
import re
# Extract text from the HTML content
text = soup.get_text()
# Clean and split text into words
words = re.findall(r'\w+', text.lower())
word_count = Counter(words)
# Display the 10 most common words
print(word_count.most_common(10))
Step 4: Extract Links#
Identify all internal and external links from the page.
Comment: Understanding the difference between internal and external links is important for categorization.
Hint: Check the href attribute of the anchor (<a>) tags to determine the type of link.
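A possible sketch that categorizes links by their href prefix (hrefs starting with /wiki/ are treated as internal and those starting with http as external; anchors and other prefixes are ignored here):
# Collect and categorize links from the anchor tags
internal_links, external_links = [], []
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.startswith('/wiki/'):
        internal_links.append(href)
    elif href.startswith('http'):
        external_links.append(href)
print(f"Internal links: {len(internal_links)}, external links: {len(external_links)}")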
Step 5: Extract Images and Their Sizes#
Identify all images on the page and get their sizes.
Comment: Be aware that images may not always be stored in the same format. Ensure you construct the correct URLs for them.
Hint: You may need to check the attributes of the <img> tags to get additional information, such as the size of the images if available.
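A sketch that gathers image URLs and, when declared, their width and height attributes:
# Gather image URLs and their declared sizes when available
images = []
for img in soup.find_all('img'):
    src = img.get('src', '')
    if src.startswith('//'):
        src = 'https:' + src  # Wikipedia often uses protocol-relative URLs
    images.append({'url': src, 'width': img.get('width'), 'height': img.get('height')})
print(f"Found {len(images)} images")
print(images[:5])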
Step 6: Extract Numbers, Dates, and Geographical Coordinates#
Identify numbers, dates, and geographical coordinates from the text.
Comment: Different formats for dates and numbers can complicate extraction. Consider the various ways these can appear on the page.
Hint: Use regular expressions tailored for specific patterns (e.g., date formats or geographic coordinates) to accurately identify them.
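A starting point with simple regular expressions; the patterns below only cover a few common formats (plain numbers, French ‘day month year’ dates, and degree-style coordinates) and will need refinement:
# Simple patterns covering only a few common formats
numbers = re.findall(r'\b\d+(?:[.,]\d+)?\b', text)
dates = re.findall(r'\b\d{1,2}(?:er)?\s+(?:janvier|février|mars|avril|mai|juin|juillet|août|septembre|octobre|novembre|décembre)\s+\d{4}\b', text)
coordinates = re.findall(r"\d{1,3}°\s?\d{1,2}[′']\s?(?:\d{1,2}[″\"]\s?)?[NSEO]", text)
print(len(numbers), "numbers,", len(dates), "dates,", len(coordinates), "coordinates")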
Step 7: Identify Proper Nouns#
Extract proper nouns from the text.
Comment: Proper nouns can include names of people, places, and organizations. Identifying them correctly can enhance your data analysis.
Hint: Use Natural Language Processing (NLP) techniques, such as named entity recognition, to automate the identification of proper nouns.
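One possible approach, assuming spaCy and its French model fr_core_news_sm are installed (they are also used in exercise 1.5); only part of the text is processed here to keep the run short:
import spacy

# Run French named entity recognition on the beginning of the page text
nlp = spacy.load('fr_core_news_sm')
doc = nlp(text[:100000])
people = {ent.text for ent in doc.ents if ent.label_ == 'PER'}
places = {ent.text for ent in doc.ents if ent.label_ == 'LOC'}
print("People (sample):", list(people)[:10])
print("Places (sample):", list(places)[:10])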
Step 8: Extract Structured Data (Tables)#
Identify and extract data from tables present in the HTML.
Comment: Tables often contain organized data that can be useful for analysis. Make sure to capture both header and data cells.
Hint: Familiarize yourself with the structure of HTML tables, including how to navigate rows (<tr>) and cells (<td> and <th>).
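A generic sketch that walks every table and collects the cell text row by row:
# Extract each table as a list of rows, keeping header and data cells
tables = []
for table in soup.find_all('table'):
    rows = []
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    tables.append(rows)
print(f"Found {len(tables)} tables")
print(tables[0][:3] if tables else "No tables found")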
Step 9: Differentiate Sections and Paragraphs#
Identify and separate sections and paragraphs in the content.
Comment: Sections help in understanding the organization of the content. Recognizing different heading levels can aid in content navigation.
Hint: Use appropriate tags (<h1>, <h2>, etc.) to differentiate between sections and ensure you capture their associated content, like paragraphs.
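One way to pair each heading with the paragraphs that follow it, by walking the document in order:
# Group paragraphs under the most recent heading
sections = {}
current = "Introduction"
for element in soup.find_all(['h1', 'h2', 'h3', 'p']):
    if element.name in ('h1', 'h2', 'h3'):
        current = element.get_text(strip=True)
        sections.setdefault(current, [])
    else:
        sections.setdefault(current, []).append(element.get_text(strip=True))
print(f"{len(sections)} sections found")
for title in list(sections)[:5]:
    print(title, "-", len(sections[title]), "paragraph(s)")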
Exercise 1.5 [★★★]#
Analyze the text from the downloaded Wikipedia page by applying stemming, n-gram extraction, PoS tagging, lemmatization, morphological analysis, named entity recognition, and word embedding using Word2Vec models. Compare the results from NLTK, spaCy, and Gensim to evaluate their effectiveness in text analysis tasks.
Prerequisites#
Make sure you have the required libraries installed. You can install them using pip if you haven’t already:
!pip install nltk spacy gensim wordcloud seaborn
!python -m spacy download fr_core_news_sm  # For French language processing
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('maxent_ne_chunker')
nltk.download('words')
Step 1: Load the Wikipedia Page#
Start by loading the HTML file you saved earlier and extracting the text.
from bs4 import BeautifulSoup
# Load the HTML file
with open("paris.html", "r", encoding='utf-8') as file:
    html_content = file.read()
# Parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()
Step 2: Apply Stemming Algorithms#
Use the Porter and Snowball stemmers from NLTK to stem the words from the text.
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
from collections import Counter
import re
# Tokenize and clean the text
words = re.findall(r'\w+', text.lower())
# Initialize stemmers (the page is in French; Snowball supports French, while Porter is English-only)
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("french")
# Apply stemming
porter_stems = [porter_stemmer.stem(word) for word in words]
snowball_stems = [snowball_stemmer.stem(word) for word in words]
# Count unique stems
porter_stem_count = Counter(porter_stems)
snowball_stem_count = Counter(snowball_stems)
# Display the most common stems and count of unique stems
print("Most common Porter stems:", porter_stem_count.most_common(10))
print("Unique Porter stems count:", len(porter_stem_count))
print("Most common Snowball stems:", snowball_stem_count.most_common(10))
print("Unique Snowball stems count:", len(snowball_stem_count))
Step 3: Extract N-grams#
Generate and display the most common n-grams (1-grams to 5-grams) from the text.
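A sketch using nltk.ngrams on the word list from Step 2:
from nltk import ngrams

# Most common n-grams for n = 1 to 5, built from the word list of Step 2
for n in range(1, 6):
    ngram_counts = Counter(ngrams(words, n))
    print(f"Most common {n}-grams:", ngram_counts.most_common(5))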
Step 4: Part-of-Speech (PoS) Tagging#
Use NLTK or spaCy to perform PoS tagging on the text.
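A sketch with spaCy's French model (assuming fr_core_news_sm is installed); only the beginning of the text is processed to keep the runtime reasonable:
import spacy

# PoS tagging with spaCy's French model on the beginning of the text
nlp = spacy.load('fr_core_news_sm')
doc = nlp(text[:20000])
pos_tags = [(token.text, token.pos_) for token in doc if not token.is_space]
print(pos_tags[:20])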
Step 5: Lemmatization#
Apply lemmatization using NLTK or spaCy.
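With spaCy, the lemma is available directly on each token; a sketch reusing the doc from the PoS tagging step:
# Lemmas from the spaCy doc computed in the previous step
lemmas = [(token.text, token.lemma_) for token in doc if token.is_alpha]
print(lemmas[:20])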
Step 6: Morphological Analysis#
Use spaCy to perform morphological analysis on the text.
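spaCy exposes morphological features (gender, number, tense, ...) through token.morph; a sketch reusing the same doc:
# Morphological features for each alphabetic token
morphology = [(token.text, str(token.morph)) for token in doc if token.is_alpha]
print(morphology[:20])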
Step 7: Named Entity Recognition (NER)#
Use spaCy to identify named entities in the text.
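Named entities are available on doc.ents; a quick listing and count by label (the French model uses PER, LOC, ORG, MISC):
# Named entities and their labels
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities[:20])
print(Counter(label for _, label in entities))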
Step 8: Frequency Distribution of Words#
Visualize the frequency distribution of words using Matplotlib.
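A simple bar chart of the most frequent words from Step 2, as a sketch:
import matplotlib.pyplot as plt

# Bar chart of the 20 most frequent words
common = Counter(words).most_common(20)
labels, counts = zip(*common)
plt.figure(figsize=(12, 4))
plt.bar(labels, counts)
plt.xticks(rotation=45)
plt.title("Most frequent words")
plt.tight_layout()
plt.show()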
Step 9: Create a Word Cloud#
Generate a word cloud to visualize the most frequent words.
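The wordcloud package installed in the prerequisites can build one directly from the word frequencies:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word cloud built from the word frequencies
wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(Counter(words))
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()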
Step 10: Visualization of Named Entities#
Visualize the named entities recognized in the text using Matplotlib.
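For example, a bar chart of entity label counts, reusing the entities list from Step 7:
# Bar chart of named entity label counts
label_counts = Counter(label for _, label in entities)
plt.figure(figsize=(8, 4))
plt.bar(list(label_counts.keys()), list(label_counts.values()))
plt.title("Named entity labels")
plt.show()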
Step 11: Visualization of Most Common Nouns#
Visualize the most common nouns in the text, which can provide insights into the main subjects discussed.
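A sketch that filters the spaCy tokens for nouns and plots the most common ones, reusing the doc from the PoS tagging step:
# Most common nouns according to the spaCy PoS tags
nouns = [token.text.lower() for token in doc if token.pos_ == 'NOUN']
noun_counts = Counter(nouns).most_common(15)
labels, counts = zip(*noun_counts)
plt.figure(figsize=(10, 4))
plt.bar(labels, counts)
plt.xticks(rotation=45)
plt.title("Most common nouns")
plt.tight_layout()
plt.show()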