Analysis of Bookmarks and Revisiting the Forgotten Topics: John Samuel

Our thoughts and interests evolve, and our reading sometimes reflects this evolution. A major part of our reading has now moved to the internet, and the way we read the news there has changed dramatically over the past few years. From reading news on particular news outlets to reading headlines on social media, we can glimpse how rapidly users' reading habits have changed. However, we still use some form of bookmarking¹ to save certain articles for future reading and reference. These bookmarks³, whether in browsers or on social media websites, contain valuable information that can be used to improve our future reading and to remind us of forgotten topics of past interest. A reading profile can be built by analyzing the articles read in the past and the tags that readers associated with them. This article explores some of the key information that can be obtained from bookmarked and tagged articles, especially for revisiting forgotten topics.

Analysis of Bookmarks

1. Identifying the Data Sources

There are primarily two data sources for building a user reading profile: bookmarks from internet browsers and bookmarks from social media. In the past, readers relied mostly on blogs, news websites, and news aggregators⁶, which restricted them to a small number of websites. Bookmarking web services and social media, however, have exposed them to a much larger number of websites, thanks to the possibility of sharing. Social media help users discover relevant content from other users interested in similar topics. Users open these articles in their browsers and may bookmark them there. Some social media companies also display a snippet of the article (including an image and a short text), which gives readers a glimpse of the content without clicking through. Many social media websites offer their own ways of bookmarking such information. The following actions may not look like explicit bookmarks, since bookmarks are usually considered personal, whereas this information may be public or visible to a group of users (depending on the settings):

Likes and favorites
Retweets and reposts

Internet browsers usually store bookmarks in HTML, XML, or JSON, but the data format is not standardized across browsers. Browsers let you export or import bookmarks in HTML format. In some cases, such as Firefox, user tags are not present in the exported HTML file. One option here, in the case of Firefox, is to use bookmark backups, which provide a lot of information that is usually absent from the exported files, such as the logo URL of the bookmarked sites, the time of bookmarking, and user-generated tags. Social media websites provide application programming interfaces (APIs) so that developers can access some of the above information.
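As a rough illustration of reading a bookmark backup, the sketch below walks a JSON tree and collects one record per bookmark. It assumes an uncompressed JSON backup whose nodes carry `uri`, `title`, `dateAdded` (microseconds since the Unix epoch), a comma-separated `tags` string, and nested `children`; these field names are assumptions about the backup layout and should be checked against an actual file. The function name `walk_bookmarks` and the sample data are hypothetical.

```python
import json
from datetime import datetime, timezone


def walk_bookmarks(node, folders=()):
    """Yield one dict per bookmark found under a backup tree node."""
    if "uri" in node:  # assumed: only real bookmarks carry a "uri" field
        tags = [t.strip() for t in node.get("tags", "").split(",") if t.strip()]
        added = node.get("dateAdded")  # assumed: microseconds since the Unix epoch
        yield {
            "url": node["uri"],
            "title": node.get("title", ""),
            "added": (
                datetime.fromtimestamp(added / 1_000_000, tz=timezone.utc)
                if added
                else None
            ),
            "tags": tags,
            "folder": "/".join(folders),
        }
    # Containers pass their own title down as part of the folder path.
    child_folders = folders + (node["title"],) if node.get("title") else folders
    for child in node.get("children", []):
        yield from walk_bookmarks(child, child_folders)


# Hypothetical sample mimicking the assumed layout.
sample = json.loads("""
{"title": "", "children": [
  {"title": "menu", "children": [
    {"title": "Example", "uri": "https://example.org/a",
     "dateAdded": 1577836800000000, "tags": "python, data"}
  ]}
]}
""")
bookmarks = list(walk_bookmarks(sample))
```

With a real file, the tree would come from `json.load` on the opened backup instead of the inline sample.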

2. Data Analysis

The next obvious question is: what data are available for analysis⁷, and what do we want to achieve with this analysis? The following data are available from the exported bookmark HTML and bookmark backup JSON files from browsers:

Source URL
- Article title
- Article content (usually for feeds, after fetching the content)
- Website URL
- Website title
- Website description (often available)
Timestamp
- Bookmarked time
- Time of last modification
User-generated content
- Categories/Folders⁴
- Tags or hashtags⁵
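Some of the fields listed above (URL, title, bookmarked time, folder) can be pulled out of an exported HTML file with Python's standard `html.parser` module. The sketch below assumes the common export layout in which each bookmark is an `<A HREF=... ADD_DATE=...>` element, each folder name is an `<H3>` element, and the folder's contents sit in the `<DL>` that follows it. The class name `BookmarkHTMLParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class BookmarkHTMLParser(HTMLParser):
    """Collect bookmarks (url, title, add_date, folder) from an exported file."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self.folders = []          # stack of currently open folder names
        self._current = None       # bookmark whose title is being read
        self._in_folder_title = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)        # HTMLParser lowercases tag and attribute names
        if tag == "a" and "href" in attrs:
            self._current = {
                "url": attrs["href"],
                "add_date": attrs.get("add_date"),  # assumed: seconds since epoch
                "folder": "/".join(self.folders),
            }
            self._text = []
        elif tag == "h3":
            self._in_folder_title = True
            self._text = []

    def handle_data(self, data):
        if self._current is not None or self._in_folder_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = "".join(self._text).strip()
            self.bookmarks.append(self._current)
            self._current = None
        elif tag == "h3" and self._in_folder_title:
            self.folders.append("".join(self._text).strip())
            self._in_folder_title = False
        elif tag == "dl" and self.folders:
            self.folders.pop()     # closing a list closes the innermost folder


# Hypothetical sample in the assumed export layout.
sample = """<DL><p>
<DT><H3>Science</H3>
<DL><p>
<DT><A HREF="https://example.org/a" ADD_DATE="1577836800">First</A>
</DL><p>
<DT><A HREF="https://example.org/b" ADD_DATE="1580515200">Second</A>
</DL><p>"""

parser = BookmarkHTMLParser()
parser.feed(sample)
```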

In addition to the above, social media companies provide further information based on their own analysis. The following information is also available to readers:

Number of users who bookmarked the article
Number of times an article was shared
The users who shared the article
...

This is a treasure trove of information. For example, because timestamps are available, one interesting question is which topics interested me during specific periods of time and when a particular topic first caught my attention. Tags and categories (or folders) may also reveal the user's classification style, that is, how they group different articles under a category or sub-category. Internet browsers usually allow only one category (or subcategory) for an article. Tags may therefore play an important role in classification, since an article can have more than one tag.
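The two time-based questions mentioned here, when a topic first appeared and which topics dominated a given period, could be answered along the following lines. This is only a sketch: the record layout (`added` as a date, `tags` as a list), the function names, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical bookmark records.
bookmarks = [
    {"added": date(2019, 3, 2), "tags": ["python", "data"]},
    {"added": date(2019, 3, 20), "tags": ["python"]},
    {"added": date(2020, 7, 5), "tags": ["privacy", "data"]},
]


def first_seen(bookmarks):
    """Map each tag to the earliest date on which it was used."""
    first = {}
    for b in bookmarks:
        for tag in b["tags"]:
            if tag not in first or b["added"] < first[tag]:
                first[tag] = b["added"]
    return first


def tags_by_year(bookmarks):
    """Count tag usage separately for each calendar year."""
    per_year = defaultdict(Counter)
    for b in bookmarks:
        per_year[b["added"].year].update(b["tags"])
    return per_year


first = first_seen(bookmarks)
yearly = tags_by_year(bookmarks)
```

Grouping by month or week instead of year only requires changing the key used in `tags_by_year`.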

Now that we have seen the data to analyze, we can look at how to draw interesting insights from them. A non-exhaustive list of questions is given below:

Bookmarks
- Total number of bookmarks
- The average number of bookmarks in a day/week/month/year
- Maximum number of bookmarks made on a day/week/month/year
- Minimum number of bookmarks made on a day/week/month/year
Websites
- Total number of unique websites
- The average number of bookmarks per website
- The website with the maximum number of bookmarks or commonly referred website(s)
- The website with the least number of bookmarks or the least referred website(s)
Tags/Categories
- Total number of unique tags/categories
- Tags/categories with the maximum number of bookmarks or the most common tags
- Tags/categories with the minimum number of bookmarks or the least used tags
- Tags/categories with the maximum number of websites or the most common tags
- Tags/categories with the minimum number of websites or the least common tags
Analytics Information
- Tags of interest
- Users interested in similar topics
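Several of the questions listed above reduce to counting and simple statistics. The following sketch computes a few of them with the standard library; the record layout and sample data are hypothetical, and the daily average is taken only over days on which at least one bookmark was made, which is one possible reading of "average per day".

```python
from collections import Counter
from datetime import date
from statistics import mean
from urllib.parse import urlparse

# Hypothetical bookmark records.
bookmarks = [
    {"url": "https://a.example/1", "added": date(2020, 1, 1), "tags": ["python", "data"]},
    {"url": "https://a.example/2", "added": date(2020, 1, 1), "tags": ["python"]},
    {"url": "https://b.example/x", "added": date(2020, 1, 3), "tags": ["privacy"]},
]

per_day = Counter(b["added"] for b in bookmarks)
per_site = Counter(urlparse(b["url"]).netloc for b in bookmarks)
per_tag = Counter(tag for b in bookmarks for tag in b["tags"])

summary = {
    "total_bookmarks": len(bookmarks),
    "avg_per_active_day": mean(per_day.values()),  # days with no bookmarks are ignored
    "max_per_day": max(per_day.values()),
    "unique_sites": len(per_site),
    "avg_per_site": mean(per_site.values()),
    "most_bookmarked_site": per_site.most_common(1)[0][0],
    "unique_tags": len(per_tag),
    "least_used_tag": min(per_tag, key=per_tag.get),  # ties: one of them is returned
}
```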

All the above questions are based on aggregation functions such as count, minimum, maximum, and average, which are common in data analysis in almost every domain. This information is sufficient to build a basic reading profile. But care must be taken that the resulting profile does not ignore the least commonly used bookmarks or websites. In other words, data analysis must also help in surfacing the forgotten topics, bookmarks, and websites. To revisit the forgotten topics, one needs to focus on the least frequented websites, the least used tags, the least used categories, etc. Additionally, the focus must be on past bookmarks and not on recent ones.
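One possible way to turn this into a concrete rule is to select tags that were used only a few times and have not been used for a long while. The sketch below does this; the thresholds `max_uses` and `min_age_days`, the function name, and the sample data are illustrative assumptions, not values taken from any particular system.

```python
from datetime import date


def forgotten_tags(bookmarks, today, max_uses=2, min_age_days=365):
    """Return rarely used tags whose most recent use is at least min_age_days old."""
    uses, last_used = {}, {}
    for b in bookmarks:
        for tag in b["tags"]:
            uses[tag] = uses.get(tag, 0) + 1
            last_used[tag] = max(last_used.get(tag, b["added"]), b["added"])
    candidates = [
        t for t in uses
        if uses[t] <= max_uses and (today - last_used[t]).days >= min_age_days
    ]
    # Oldest first: the longer a topic has been untouched, the earlier it is listed.
    return sorted(candidates, key=lambda t: last_used[t])


# Hypothetical bookmark records.
bookmarks = [
    {"added": date(2015, 5, 1), "tags": ["astronomy"]},
    {"added": date(2016, 2, 10), "tags": ["astronomy", "typography"]},
    {"added": date(2023, 1, 5), "tags": ["python"]},
    {"added": date(2023, 6, 1), "tags": ["python"]},
    {"added": date(2023, 9, 9), "tags": ["python"]},
]

forgotten = forgotten_tags(bookmarks, today=date(2024, 1, 1))
```

The same selection could be applied to websites or folders by swapping the tag list for the site name or folder path.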

Unfortunately, most of our current recommendation systems⁸,⁹ focus on aspects like recency. Hence recent posts and articles have a prominent place in the search results, and certain past topics of interest never appear in them.

3. Data Visualization

What better way to analyze than to visualize! Visualization techniques help not only to find interesting patterns but also to make striking features stand out. As time passes, bookmark-related information accumulates. As the amount of data grows, it is usually no longer feasible to analyze it with tables, since there is more than can be taken in at a single glance. Many visualization libraries can produce a word cloud² from the bookmarked articles, which helps in understanding the main topics of interest. Others can be used to visualize the pattern of website visits on a timeline. This idea can be extended to understand the (recurrent) pattern of bookmarks concerning a particular topic. The interesting results thus obtained can be fed back to the system in the form of new user-generated tags.
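Before choosing a visualization library, the underlying data for both views can be prepared with the standard library alone. The sketch below computes word frequencies from article titles (the kind of input a word-cloud library typically consumes) and renders a crude text timeline of bookmark counts per period. The stop-word list is deliberately tiny, and the function names and sample data are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def word_frequencies(titles):
    """Count meaningful lowercase words across all titles."""
    words = (w for t in titles for w in re.findall(r"[a-z]+", t.lower()))
    return Counter(w for w in words if w not in STOPWORDS and len(w) > 2)


def text_timeline(counts_by_period):
    """Render one bar of '#' characters per period, in chronological order."""
    return "\n".join(
        f"{period} {'#' * n} {n}" for period, n in sorted(counts_by_period.items())
    )


# Hypothetical titles and monthly counts.
titles = ["Data analysis with Python", "The Python data model", "Privacy on the web"]
freq = word_frequencies(titles)
timeline = text_timeline({"2020-02": 1, "2020-01": 3})
```

Restricting the timeline input to bookmarks carrying a single tag gives the per-topic recurrence pattern mentioned above.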

Building a Reading Profile

A reading profile built from the analysis of past bookmarks can be used to recommend articles. But a mere recommendation isn't enough: the justification for a recommendation must also be presented. The user can verify this contextual information, and any corrections can be fed back to the system to improve the recommendation mechanism. Most current recommendation systems are opaque about their functioning. Content-based filtering systems, however, focus on individual users and not on results related to multiple users. This is extremely relevant in the case of bookmark analysis, especially when users are interested in quantifying themselves.
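A very simple content-based recommender that also explains itself might look like the sketch below. It treats tag counts from past bookmarks as the profile, scores each candidate by the profile weight of its tags, and attaches the matching tags as the justification. The scoring rule, function names, and sample data are illustrative assumptions rather than an established method.

```python
from collections import Counter


def build_profile(bookmarks):
    """Tag counts over past bookmarks act as the reading profile."""
    return Counter(tag for b in bookmarks for tag in b["tags"])


def recommend(profile, candidates, top_n=3):
    """Rank candidates by tag overlap with the profile and explain each score."""
    results = []
    for article in candidates:
        matched = {t: profile[t] for t in article["tags"] if t in profile}
        score = sum(matched.values())
        if score:
            ordered = sorted(matched.items(), key=lambda kv: -kv[1])
            reason = "matches your tags: " + ", ".join(f"{t} ({n})" for t, n in ordered)
            results.append({"title": article["title"], "score": score, "reason": reason})
    return sorted(results, key=lambda r: -r["score"])[:top_n]


# Hypothetical past bookmarks and candidate articles.
past = [{"tags": ["python", "data"]}, {"tags": ["python"]}, {"tags": ["privacy"]}]
candidates = [
    {"title": "Pandas tips", "tags": ["python", "data"]},
    {"title": "Cooking basics", "tags": ["food"]},
    {"title": "GDPR explained", "tags": ["privacy"]},
]

recs = recommend(build_profile(past), candidates)
```

Because each result carries its `reason`, a user can see why an article was suggested and correct the profile, for instance by removing a tag that no longer reflects their interests.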

Another important aspect of reading is the surprise factor. Users should receive a certain number of surprising articles that may help them leave their filter bubble. And as discussed above, past topics of interest and the least visited sites may offer such insights.
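One way to reserve room for surprise is to fill a fixed share of the recommendation slots from a separate pool, for example articles on forgotten topics or from rarely visited sites. The sketch below is a minimal illustration; the 20% share, the function name, and the sample lists are arbitrary assumptions.

```python
import random


def mix_in_surprises(ranked, surprises, total=5, surprise_share=0.2, seed=None):
    """Fill most slots from `ranked` and the rest with randomly chosen `surprises`."""
    rng = random.Random(seed)
    n_surprise = min(len(surprises), round(total * surprise_share))
    picks = rng.sample(surprises, n_surprise)
    return ranked[: total - n_surprise] + picks


# Hypothetical inputs: profile-ranked articles and a pool of off-profile ones.
ranked = ["r1", "r2", "r3", "r4", "r5", "r6"]
surprises = ["old-astronomy-post", "typography-essay", "rare-site-article"]

reading_list = mix_in_surprises(ranked, surprises, total=5, surprise_share=0.2, seed=42)
```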

Conclusion

The overall goal of bookmark analysis, especially for individual users, may not be to understand reading behavior but to save the time spent skipping through irrelevant or redundant articles. Yet some users may wish to understand the time they spend on the internet, their reading habits, their topics of interest, the number of bookmarks they made during a period, etc. Others may use it to understand and revisit forgotten topics. By clustering articles by theme and grouping similar ones, relevant and non-redundant articles can be shown to users. But the element of surprise (e.g., a new technological advancement) should not be missing from the filtering. Information overload is a major challenge, but there are several ways to reduce it, and focusing only on recency should not be one of them.

References

1. Bookmark
2. Tag (Word) Cloud
3. Bookmark (World Wide Web)
4. Categorization
5. Tag (metadata)
6. News Aggregator
7. Data Analysis
8. Pazzani, Michael J., and Daniel Billsus. “Content-Based Recommendation Systems.” The Adaptive Web: Methods and Strategies of Web Personalization, edited by Peter Brusilovsky et al., Springer, 2007, pp. 325–41.
9. Ricci, Francesco, et al. “Introduction to Recommender Systems Handbook.” Recommender Systems Handbook, edited by Francesco Ricci et al., Springer US, 2011, pp. 1–35.