When I started blogging, one of my goals was to make the articles machine-readable. Years have passed, and I am still far from fully achieving it. No doubt there are technical limits to full machine readability, but a useful degree of it can still be achieved. Structured data has several advantages, the main one being that relevant information becomes easier to find. I have been blogging for quite a long time, and the number of articles and notes that I have documented grows every month. A simple keyword search sometimes does not give me the information that I am looking for.

I have been using a limited amount of structured data in the form of RDFa [1, 2] since the beginning, marking up the different sections of an article with types such as WebPage, BreadcrumbList, and ListItem from Schema.org [3]. However, this was quite limited. More data can be added: for example, the author name, the dates of creation, publication, and last modification, and the title of the article. The following information can be obtained for any published article without much effort.

	 
   {
      "@context": "http://schema.org",
      "@type": "BlogPosting",
      "mainEntityOfPage": {
         "@type": "WebPage",
         "@id": "https://johnsamuel.info"
      },
      "articleSection": "blog",
      "name": "Integrating Linked Data",
      "headline": "Integrating Linked Data",
      "description": "Article by John Samuel",
      "inLanguage": "en",
      "author": "John Samuel",
      "datePublished": "2020-05-03 19:04:28",
      "dateModified": "2020-05-03 19:04:28",
      "dateCreated": "2020-05-03 19:04:28",
      "url": "https://johnsamuel.info/en/programming/linkeddata-integration.html",
      "keywords": ["Blog"]
   }
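
A document like the one above can be produced from a handful of values with Python's standard json module. What follows is only a minimal sketch, not the script used for this blog: the function name build_blog_posting and its parameters are hypothetical, the fixed values are copied from the example, and the dates are passed in by hand, although in practice they would come from version control.

```python
import json
from datetime import datetime


def build_blog_posting(title, url, site_url, author,
                       created, published, modified, keywords):
    """Return a Schema.org BlogPosting description as a plain dictionary."""
    # The timestamp layout mirrors the example above. Schema.org defines its
    # DateTime type in terms of ISO 8601, so datetime.isoformat() may be a
    # better choice for new data.
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "@context": "http://schema.org",
        "@type": "BlogPosting",
        "mainEntityOfPage": {"@type": "WebPage", "@id": site_url},
        "articleSection": "blog",
        "name": title,
        "headline": title,
        "description": f"Article by {author}",
        "inLanguage": "en",
        "author": author,
        "datePublished": published.strftime(fmt),
        "dateModified": modified.strftime(fmt),
        "dateCreated": created.strftime(fmt),
        "url": url,
        "keywords": list(keywords),
    }


timestamp = datetime(2020, 5, 3, 19, 4, 28)
posting = build_blog_posting(
    title="Integrating Linked Data",
    url="https://johnsamuel.info/en/programming/linkeddata-integration.html",
    site_url="https://johnsamuel.info",
    author="John Samuel",
    created=timestamp,
    published=timestamp,
    modified=timestamp,
    keywords=["Blog"],
)

# Serialize the dictionary; the result is ordinary JSON text.
json_ld = json.dumps(posting, indent=3)
print(json_ld)
```

Keeping the metadata as a dictionary until the last step makes it easy to add or drop properties before serializing.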
	 
	

As you may have observed, I am using JSON-LD [4, 5] (JSON for Linked Data) for this purpose, mainly because this code is easy to generate. The above information can be embedded in a page using the following script tag.

	 
      <script type="application/ld+json">
         ....
      </script>
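
Wrapping the JSON-LD in this script tag and placing it in a page can also be automated. The sketch below is one possible way to do it with plain string handling from the standard library. The helper name embed_json_ld is hypothetical, and inserting the tag just before the closing head tag is an assumption of this sketch, not a requirement of JSON-LD.

```python
import json
import re


def embed_json_ld(html, data):
    """Insert `data` as a JSON-LD script element just before </head>."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so that a string value such as "</script>" cannot close
    # the script element early; "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    script = f'<script type="application/ld+json">\n{payload}\n</script>\n'

    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> tag found")
    return html[:match.start()] + script + html[match.start():]


page = (
    "<html><head><title>Integrating Linked Data</title></head>"
    "<body></body></html>"
)
annotated = embed_json_ld(
    page,
    {
        "@context": "http://schema.org",
        "@type": "BlogPosting",
        "headline": "Integrating Linked Data",
    },
)
print(annotated)
```

For pages that are already parsed with an HTML library, adding the element through that library's tree API would be the more natural choice.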
	 
	

From a programming point of view, the main challenge was to represent this information correctly. Example snippets can easily be found on Schema.org [3], and other authors have documented their choice of properties [6]. As I use a version control system, I can easily obtain the creation date and the last modification date of an article. With an HTML parser like BeautifulSoup, I obtain the title of the article. I used the following Python libraries to generate and support JSON-LD on this blog, as in the above example for a given blog posting.

  1. extruct: for extracting metadata (RDFa, JSON-LD, Microdata) from a web page. It can also be used to extract and verify the newly added JSON-LD and RDFa information.
  2. argparse: for parsing command-line arguments, especially to work with one or more files and to support options for extracting or adding metadata.
  3. w3lib: for obtaining the base URL.
  4. pygit2: for obtaining the creation date and the last modification date of a blog post.
  5. bs4 (BeautifulSoup): for parsing an HTML file and obtaining information like the title of the article.
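
To give an idea of how these pieces fit together, here is a rough sketch of the command-line interface and of reading a page's title and JSON-LD back out. It is not the actual script: to stay self-contained it uses the standard library's html.parser as a stand-in for BeautifulSoup and extruct, it leaves out the version control and base URL steps, and the names MetadataParser, build_arg_parser, read_metadata, and the options --extract and --add are all hypothetical.

```python
import argparse
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect the <title> text and any JSON-LD blocks from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.json_ld = []
        self._target = None  # "title" or "json-ld" while inside such an element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._target, self._buffer = "title", []
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._target, self._buffer = "json-ld", []

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag == "title" and self._target == "title":
            if not self.title:  # keep only the first title found
                self.title = text
            self._target = None
        elif tag == "script" and self._target == "json-ld":
            self.json_ld.append(json.loads(text))
            self._target = None


def read_metadata(html):
    """Return (title, list of JSON-LD documents) for an HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.json_ld


def build_arg_parser():
    """One or more files, plus a required choice between two modes."""
    parser = argparse.ArgumentParser(
        description="Extract or add metadata for blog articles."
    )
    parser.add_argument("files", nargs="+", help="HTML files to process")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--extract", action="store_true",
                      help="print the metadata already in each file")
    mode.add_argument("--add", action="store_true",
                      help="add JSON-LD metadata to each file")
    return parser


# Parse an explicit argument list instead of sys.argv for this demonstration.
args = build_arg_parser().parse_args(["--extract", "post.html"])
print(args.files, args.extract, args.add)

sample = (
    "<html><head><title>Integrating Linked Data</title>"
    '<script type="application/ld+json">{"@type": "BlogPosting"}</script>'
    "</head><body></body></html>"
)
print(read_metadata(sample))
```

Reading the JSON-LD back after writing it is a cheap way to confirm that the embedded block is still valid JSON.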

My next goal is to go beyond annotating the metadata of an article and to annotate its actual content.

References

  1. RDFa
  2. RDFa 1.1 Primer-Third Edition
  3. Schema.org
  4. JSON-LD
  5. JSON for Linking Data
  6. Converting JSON-LD schema.org RDF to other vocabularies