Building a Personalized Feed Aggregator: John Samuel

Feed aggregators are helpful for getting the latest news from various blogs and news websites. But subscribing to a large number of feeds wastes time, especially when it requires skimming through a multitude of irrelevant and redundant articles. Many current desktop feed aggregators have no option to filter articles based on a past reading profile or to group similar articles from multiple sites. A reading profile can be built by analysing the articles read in the past and their associated tags (e.g., important, read-later). Articles, in turn, can be grouped by considering their titles, their source website, etc. So what I am looking for is a feed aggregator that can analyse my bookmarked and tagged articles to automatically pick out relevant articles and group together similar articles from subscribed feeds.

Analysis of Bookmarks and Feeds

1. Identifying the Data Sources

There are primarily two data sources for building a user reading profile: bookmarks from internet browsers and subscriptions from feed aggregators. Both usually store bookmarks and feeds in HTML, XML or JSON, but the data format is not standardized across browsers and feed aggregators. Feed aggregators usually support importing and exporting subscriptions in OPML (Outline Processor Markup Language)¹ format. Browsers let you export or import bookmarks in HTML format. In some cases, as with Firefox, user tags are not present in the exported HTML file. One option for Firefox is to use its bookmark backups, which provide a lot of information usually absent from the exported files, such as the logo URL of the bookmarked sites, the time of bookmarking and user-generated tags.
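As a rough sketch of reading both sources, the Python below extracts feeds from an OPML document and flattens a bookmark backup tree. The function names and sample data are hypothetical. The backup keys used here ("uri", "title", "dateAdded", a comma-separated "tags" string and "children") are assumptions about the Firefox JSON layout and should be checked against an actual backup file.

```python
import xml.etree.ElementTree as ET


def feeds_from_opml(opml_text):
    """Return (title, feed URL, site URL) for every feed outline in an OPML document."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        feed_url = outline.get("xmlUrl")
        if feed_url:  # outlines without xmlUrl are folders, not feeds
            title = outline.get("title") or outline.get("text") or ""
            feeds.append((title, feed_url, outline.get("htmlUrl")))
    return feeds


def bookmarks_from_backup(node, folders=()):
    """Walk a bookmark backup tree (assumed layout) and yield flat bookmark dicts."""
    if "uri" in node:
        raw_tags = node.get("tags", "")
        yield {
            "url": node["uri"],
            "title": node.get("title", ""),
            "added": node.get("dateAdded"),
            "tags": [t.strip() for t in raw_tags.split(",") if t.strip()],
            "folders": list(folders),
        }
    # A titled node with children is treated as a folder for its descendants.
    child_folders = folders + (node["title"],) if node.get("title") else folders
    for child in node.get("children", []):
        yield from bookmarks_from_backup(child, child_folders)


SAMPLE_OPML = """<opml version="2.0"><body>
<outline text="Tech">
<outline text="Example Blog" type="rss"
         xmlUrl="https://example.org/feed.xml" htmlUrl="https://example.org/"/>
</outline></body></opml>"""

SAMPLE_BACKUP = {
    "title": "",
    "children": [{
        "title": "News",
        "children": [{
            "uri": "https://example.org/a",
            "title": "A",
            "dateAdded": 1,
            "tags": "important, read-later",
        }],
    }],
}

if __name__ == "__main__":
    print(feeds_from_opml(SAMPLE_OPML))
    print(list(bookmarks_from_backup(SAMPLE_BACKUP)))
```

In practice the backup would first be loaded with `json.load` from the file on disk. The sample dictionary stands in for that step.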

2. Data Analysis

So the next obvious question is: what data are available for analysis, and what do we want to achieve from this analysis? The following data are available from the exported bookmark HTML, the bookmark backup JSON files and the OPML file of subscribed feeds:

Source URL
- Article title
- Article content (usually for feeds, after fetching the content)
- Website URL
- Website title
- Website description (often available)
Timestamp
- Bookmarked time
- Time of last modification
User-generated content
- Categories/Folders
- Tags
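One possible way to hold these fields in memory is a small record type. The sketch below is only an illustration: the class name `Bookmark`, its field names and the helper `from_microseconds` are hypothetical, and the helper assumes timestamps stored as microseconds since the Unix epoch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class Bookmark:
    # Source
    url: str
    title: str = ""
    content: Optional[str] = None           # usually only for fetched feed items
    site_title: Optional[str] = None
    site_description: Optional[str] = None  # often, but not always, available
    # Timestamps
    added: Optional[datetime] = None
    modified: Optional[datetime] = None
    # User-generated content
    folders: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

    @property
    def site_url(self):
        """Website URL derived from the article URL."""
        parts = urlsplit(self.url)
        return f"{parts.scheme}://{parts.netloc}/"


def from_microseconds(value):
    """Convert microseconds since the Unix epoch (assumed unit) to a UTC datetime."""
    return datetime.fromtimestamp(value / 1_000_000, tz=timezone.utc)


if __name__ == "__main__":
    b = Bookmark("https://example.org/posts/1", title="A post", tags=["important"])
    print(b.site_url, from_microseconds(0))
```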

This is a treasure trove of information. For example, given the timestamps, one interesting question is which topics interested me during specific periods and when a particular topic first caught my attention. Tags and categories (or folders) may also give an idea of how I classify information. We know which data to analyse, but not yet everything that can be inferred from them. A non-exhaustive list of questions is given below:

Bookmarks
- Total Number of Bookmarks
- Average Number of Bookmarks in a Day
- Maximum Number of Bookmarks made on a Day
- Minimum Number of Bookmarks made on a Day
Websites
- Total Number of Unique websites
- Average Number of Bookmarks per website
- Website with Maximum Number of Bookmarks or Commonly referred website(s)
- Website with Least Number of Bookmarks or the Least referred website(s)
Tags/Categories
- Total Number of Unique Tags/Categories
- Tags/Categories with Maximum number of Bookmarks or the Most Common Tags
- Tags/Categories with Minimum number of Bookmarks or the Least used Tags
- Tags/Categories with Maximum number of Websites or the Most Common Tags
- Tags/Categories with Minimum number of Websites or the Least Common Tags

All of the above questions are based on aggregation functions such as count, minimum, maximum and average, which are commonly used in analysis in almost every domain. This information is sufficient to build a basic reading profile. But care must be taken that the profile does not ignore the least commonly used bookmarks or websites. In other words, data analysis must also help to bring out forgotten topics, bookmarks and websites.
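The questions above can be answered with a few counters. The following sketch assumes each bookmark is a dictionary with the hypothetical keys `url`, `day` and `tags`. Note that the daily average here is taken over days with at least one bookmark, which is one of several reasonable definitions. The least common sites and tags are returned alongside the most common ones so that rarely used entries are not lost.

```python
from collections import Counter
from datetime import date
from urllib.parse import urlsplit


def extremes(counter):
    """Return (keys with the highest count, keys with the lowest count)."""
    if not counter:
        return [], []
    high, low = max(counter.values()), min(counter.values())
    most = sorted(k for k, v in counter.items() if v == high)
    least = sorted(k for k, v in counter.items() if v == low)
    return most, least


def summarize(bookmarks):
    """Aggregate bookmarks given as dicts with 'url', 'day' (date) and 'tags' (list)."""
    bookmarks = list(bookmarks)
    per_day = Counter(b["day"] for b in bookmarks)
    per_site = Counter(urlsplit(b["url"]).netloc for b in bookmarks)
    per_tag = Counter(tag for b in bookmarks for tag in b["tags"])
    top_sites, rare_sites = extremes(per_site)
    top_tags, rare_tags = extremes(per_tag)
    total = len(bookmarks)
    return {
        "total": total,
        "avg_per_active_day": total / len(per_day) if per_day else 0.0,
        "max_per_day": max(per_day.values(), default=0),
        "min_per_day": min(per_day.values(), default=0),
        "unique_sites": len(per_site),
        "avg_per_site": total / len(per_site) if per_site else 0.0,
        "most_common_sites": top_sites,
        "least_common_sites": rare_sites,
        "unique_tags": len(per_tag),
        "most_common_tags": top_tags,
        "least_common_tags": rare_tags,
    }


SAMPLE = [
    {"url": "https://a.example/1", "day": date(2024, 1, 1), "tags": ["python", "important"]},
    {"url": "https://a.example/2", "day": date(2024, 1, 1), "tags": ["python"]},
    {"url": "https://b.example/1", "day": date(2024, 1, 2), "tags": ["rust"]},
]

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value)
```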

3. Data Visualization

What better way to analyse than to visualize! Visualization techniques help not only to find interesting patterns but also to make striking features stand out. As time passes, bookmark-related information accumulates. When the amount of data increases, it is usually no longer feasible to analyse it with tables: there are more data than can be taken in at a single glance. A large number of visualization libraries are available that can produce a word cloud² of the bookmarked articles, which helps to understand the main topics of interest. Others can be used to visualize the pattern of website visits on a timeline. This idea can be extended to understand the (recurrent) pattern of bookmarks concerning a particular topic. The interesting results thus obtained can be fed back to the system in the form of new user-generated tags.
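Before choosing a visualization library, the underlying numbers can be prepared with the standard library alone. The sketch below computes the word frequencies that a word cloud would render and a per-month bookmark count for a timeline, printed as text bars. The stop-word list, function names and sample data are illustrative assumptions, not part of any particular library.

```python
import re
from collections import Counter
from datetime import date

# A deliberately tiny stop-word list; a real one would be much longer.
STOPWORDS = {"and", "for", "the", "with", "from", "into"}


def word_frequencies(titles, top=10):
    """Count words in article titles; the result is the input for a word cloud."""
    words = (w for title in titles for w in re.findall(r"[a-z0-9]+", title.lower()))
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(top)


def monthly_timeline(days):
    """Return [((year, month), count), ...] in chronological order."""
    counts = Counter((d.year, d.month) for d in days)
    return sorted(counts.items())


def text_bars(timeline):
    """Render the timeline as one line of '#' marks per month."""
    return "\n".join(f"{y}-{m:02d} {'#' * n} {n}" for (y, m), n in timeline)


if __name__ == "__main__":
    titles = ["Intro to Python", "Python and Rust", "The Rust book"]
    days = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 3, 1)]
    print(word_frequencies(titles, top=2))
    print(text_bars(monthly_timeline(days)))
```

Restricting `days` to the bookmarks that carry a given tag gives the per-topic timeline mentioned above.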

Building a Reading Profile

1. Recommendation

Such a reading profile, built from the analysis of past bookmarks and feeds, can be used to recommend articles. But a mere recommendation isn't enough. The justification for a recommendation must also be presented. The user can verify this contextual information, and any corrections can be fed back to the system to improve the recommendation mechanism.
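A minimal sketch of a recommendation that carries its own justification is given below. It assumes a very simple profile, namely term counts taken from the titles and tags of past articles, and scores new articles by term overlap. This is one illustrative scheme among many, and all names and the threshold are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase words of more than two characters."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 2]


def build_profile(read_articles):
    """Count terms from the titles and tags of articles read in the past."""
    profile = Counter()
    for article in read_articles:
        profile.update(tokens(article["title"]))
        profile.update(tag.lower() for tag in article["tags"])
    return profile


def recommend(profile, candidates, threshold=2):
    """Return (score, title, reason) for candidates whose score reaches the threshold."""
    results = []
    for candidate in candidates:
        matched = {w: profile[w] for w in set(tokens(candidate["title"])) if w in profile}
        score = sum(matched.values())
        if score >= threshold:
            reason = "matches " + ", ".join(
                f"{w} (seen {n}x)" for w, n in sorted(matched.items())
            )
            results.append((score, candidate["title"], reason))
    return sorted(results, key=lambda r: (-r[0], r[1]))


def apply_feedback(profile, term, relevant):
    """Feed a user correction back into the profile by adjusting one term."""
    profile[term] += 1 if relevant else -1


if __name__ == "__main__":
    profile = build_profile([
        {"title": "Python packaging tips", "tags": ["python", "important"]},
        {"title": "Async Python", "tags": ["python"]},
    ])
    for item in recommend(profile, [{"title": "Python typing guide"},
                                    {"title": "Gardening basics"}]):
        print(item)
```

The `reason` string is what lets the reader check why an article was suggested. `apply_feedback` is the simplest possible correction hook.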

A Personalized Feed Aggregator

Conclusion

The overall goal is not to understand reading behaviour but to save the time spent skimming through irrelevant or redundant articles. It is therefore very important to cluster articles by theme and to group similar articles. But the element of surprise (e.g., a new technological advancement) should not be lost in the filtering.
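Grouping similar articles can start from something as simple as title overlap. The sketch below uses Jaccard similarity between title word sets and a greedy single pass, with each group represented by its first title. Both the measure and the 0.5 threshold are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re


def title_tokens(title):
    """Set of lowercase words of more than two characters."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if len(w) > 2}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 for empty sets)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_similar(titles, threshold=0.5):
    """Greedily assign each title to the first group whose representative is similar."""
    groups = []  # list of (representative token set, member titles)
    for title in titles:
        toks = title_tokens(title)
        for representative, members in groups:
            if jaccard(toks, representative) >= threshold:
                members.append(title)
                break
        else:
            groups.append((toks, [title]))
    return [members for _, members in groups]


if __name__ == "__main__":
    for group in group_similar([
        "Python 3.13 released",
        "Python 3.13 is released today",
        "New Rust compiler benchmarks",
    ]):
        print(group)
```

Groups with a single member that also share nothing with the reading profile are candidates for the "surprise" articles that the filtering should not hide.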

Information overload is a major challenge, but there are several ways to reduce it.
