Creative Commons License

Questions: Second session

Question 1.a

What are the different ways to acquire data for the purpose of data analysis? (1 point)

Question 1.b

Online surveys are one way to obtain feedback on projects and products. However, we still see people asking questions in shopping malls and sometimes conducting door-to-door surveys. Why do you think that manual or face-to-face surveys are still important? (1 point)

Question 2.a

What are ACID constraints? Which of these constraints were relaxed by NoSQL data stores and why? (1 point)

Question 2.b

Before downloading and using data from a website, what do you need to consider? How did you address these considerations in your project? (1 point)

Question 3

You have been asked to build an image recommendation system for your project. Give an overview of your system, detailing the various steps, algorithms and the architecture. Compare your work with the data lifecycle. Which of its steps did you use and which did you miss? (1 point)

Question 4

Data cleaning is a major step before doing data analysis. Why? What are the different types of errors in the data? How do you deal with them? (1 point)

Question 5.a

What are the differences between classification and clustering algorithms? (1 point)

Question 5.b

How do you evaluate and compare the efficiency of classifiers? (1 point)

Question 5.c

Consider a CSV file containing the following columns: Country, City, Year, and Population, i.e., it contains the population of each city (of a country) as recorded every year since 1900. Your goal is to write a Python program using pandas that can read this CSV file and perform the following: (1 point)

  1. Find the city with the minimum population in the year 2010
  2. For every country, compute the average population of the cities in the year 2010

(1.5 points)
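
One possible way to approach Question 5.c is sketched below. It is not a model answer: the sample rows are invented purely so that the sketch runs on its own, and the helper names city_with_min_population and average_population_by_country are hypothetical. With a real file, the io.StringIO call would be replaced by the path to the CSV.

```python
import io

import pandas as pd

# Invented sample rows standing in for the real CSV file (assumption).
SAMPLE_CSV = """Country,City,Year,Population
France,Lyon,2010,480000
France,Paris,2010,2240000
India,Mumbai,2010,12400000
India,Pune,2010,3100000
France,Lyon,2011,490000
"""


def city_with_min_population(df, year):
    """Return the row of the city with the smallest population in `year`."""
    in_year = df[df["Year"] == year]
    # idxmin gives the index label of the smallest value (first one on ties).
    return in_year.loc[in_year["Population"].idxmin()]


def average_population_by_country(df, year):
    """Return the mean city population per country for `year`."""
    in_year = df[df["Year"] == year]
    return in_year.groupby("Country")["Population"].mean()


df = pd.read_csv(io.StringIO(SAMPLE_CSV))  # or pd.read_csv("<path to file>")
smallest = city_with_min_population(df, 2010)
averages = average_population_by_country(df, 2010)

print(smallest["City"], smallest["Population"])
print(averages)
```

Filtering on the year before aggregating keeps both steps independent of the other years stored in the file.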

Question 6

Consider a CSV file containing the following columns: PhotographId, City, Year, and ViewCount. It contains detailed information about photographs on a photography website. PhotographId is the unique identifier of an image, City is the city where the photograph was taken, Year is the year in which the photograph was taken, and ViewCount is the number of times the photograph was viewed on this website. Your goal is to write a Python program (preferably using the pandas library) that can read this CSV file and perform the following:

  1. Find the most viewed and the least viewed photographs
  2. Find the cities with the greatest and the smallest number of photographs
  3. Find the year with the highest number of photographs
  4. For every city, calculate the average number of views for photographs in the year 2018

(2 points)
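
The four tasks of Question 6 could be sketched with pandas as follows. This is only one way to do it: the sample rows and photograph identifiers are invented so that the sketch is self-contained, and ties are resolved by taking the first match, which the question does not specify.

```python
import io

import pandas as pd

# Invented sample rows standing in for the real CSV file (assumption).
SAMPLE_CSV = """PhotographId,City,Year,ViewCount
p1,Lyon,2018,120
p2,Lyon,2018,80
p3,Paris,2018,300
p4,Paris,2017,50
p5,Paris,2018,10
p6,Nice,2017,70
"""

photos = pd.read_csv(io.StringIO(SAMPLE_CSV))  # or pd.read_csv("<path to file>")

# 1. Most and least viewed photographs (first match on ties).
most_viewed = photos.loc[photos["ViewCount"].idxmax(), "PhotographId"]
least_viewed = photos.loc[photos["ViewCount"].idxmin(), "PhotographId"]

# 2. Cities with the greatest and the smallest number of photographs.
photos_per_city = photos["City"].value_counts()
city_most = photos_per_city.idxmax()
city_fewest = photos_per_city.idxmin()

# 3. Year with the highest number of photographs.
year_most = photos["Year"].value_counts().idxmax()

# 4. Average number of views per city, restricted to the year 2018.
avg_views_2018 = (
    photos[photos["Year"] == 2018].groupby("City")["ViewCount"].mean()
)

print(most_viewed, least_viewed)
print(city_most, city_fewest)
print(year_most)
print(avg_views_2018)
```

Cities without any photograph from 2018 do not appear in the last result, because the rows are filtered before grouping.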

Question 7.a

What is an artificial neural network? (1 point)

Question 7.b

Why do you think reinforcement learning is relevant for indoor and outdoor navigation by robots? (1 point)

Question 8

An annotation website asked 10 users to describe a picture using 5 hashtags. The table below details each user's use of hashtags to describe this one picture. The table consists of 5 columns and 10 rows. Each row corresponds to one user. Each column corresponds to one hashtag, and the column values are 0 or 1. If a value is 0, the user did not use the hashtag; if the value is 1, the user used the hashtag. Find all possible association rules from this table. What do you conclude about this picture? (1.5 points)

User  #Architecture  #Nature  #Paris  #StreetArt  #Fractals
U1    1              0        0       1           0
U2    1              1        1       1           1
U3    1              0        0       1           0
U4    1              1        1       1           1
U5    0              1        0       0           1
U6    1              0        1       1           0
U7    0              0        0       0           0
U8    0              0        0       0           0
U9    0              1        1       1           1
U10   1              0        0       1           0
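
A brute-force way to enumerate association rules from the table above is sketched below in plain Python. The question does not fix any thresholds, so the minimum support (0.3) and minimum confidence (0.8) used in the call are assumptions, and the function names support and association_rules are hypothetical. Each user's row is encoded as a string of five bits in the column order of the table.

```python
from itertools import combinations

HASHTAGS = ["#Architecture", "#Nature", "#Paris", "#StreetArt", "#Fractals"]

# One bit string per user, in the column order of HASHTAGS.
ROWS = {
    "U1": "10010", "U2": "11111", "U3": "10010", "U4": "11111",
    "U5": "01001", "U6": "10110", "U7": "00000", "U8": "00000",
    "U9": "01111", "U10": "10010",
}

# Each user becomes the set of hashtags that user actually applied.
transactions = [
    frozenset(tag for tag, bit in zip(HASHTAGS, bits) if bit == "1")
    for bits in ROWS.values()
]


def support(itemset):
    """Fraction of users whose hashtags include every tag in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def association_rules(min_support, min_confidence):
    """List (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for size in range(2, len(HASHTAGS) + 1):
        for items in combinations(HASHTAGS, size):
            itemset = frozenset(items)
            s = support(itemset)
            if s == 0 or s < min_support:
                continue
            # Try every non-empty proper subset as the antecedent.
            for k in range(1, size):
                for lhs in combinations(items, k):
                    antecedent = frozenset(lhs)
                    confidence = s / support(antecedent)
                    if confidence >= min_confidence:
                        rules.append(
                            (antecedent, itemset - antecedent, s, confidence)
                        )
    return rules


# Thresholds are assumptions; the question leaves them open.
rules = association_rules(min_support=0.3, min_confidence=0.8)
for antecedent, consequent, s, c in rules:
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={s:.2f}", f"confidence={c:.2f}")
```

Enumerating every itemset is only practical because there are five hashtags; for larger vocabularies an algorithm such as Apriori, which prunes infrequent itemsets early, would be the usual choice.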