Programmers can no longer assume that only humans will consume the data produced by software applications. More and more tools and applications are becoming interdependent, which makes the use of structured content increasingly important. However, the definition of structured content differs from person to person. This article explores these variations of use, considering structure, style and semantics.

Structured Content

What is structured content? There may be multiple answers, depending on the audience.

Human Readability

Let's start with a use case where our target audience is human readers. Here, the goal of structured content [1] is to place content in a predictable manner: the reader knows where to find the headings, menu, menu items, sidebars, footer, main content, etc. A site with a uniform site-wide experience, or a magazine with a uniform writing style, ensures that the reader can find the desired section in a predictable place. The concept of structured content is not limited to the online space; it can also be found in the physical world. Those of us who still read print media know the regular placement of page numbers, magazine name, issue number, year, etc.

Ever wondered how the authors of these articles, especially freelancers and independent journalists, manage to write so easily for multiple different websites? If we ask the creators of such articles how they decide where to place a given section, they may respond that they use an article or book/magazine template provided by the site owners or publishers. Most of the time, they have to focus only on the article content and not on the magazine's style. There are several ways to ensure a uniform reader or creator experience. Providing templates for slides, documents, articles, blogs, etc. is one commonly used approach. Template users know where to place images, sections, footnotes, references, etc. Templates ensure uniform backgrounds, font styles, colors and sizes for titles and for the headings of sections and subsections across the different articles on a given website. These skeletons of structured content let readers and content creators focus on their work, without being distracted by, or having to manage, the style of a document, blog post or newspaper article.

Partial Machine Readability

It's not just humans who read our blog articles and web pages. Web crawlers navigate our websites and robots 'read' our content to index it and to support search-engine queries. Machine-readability has therefore also become a focus for template and website builders, who must let these robots understand and distinguish between sidebar and article content, a distinction that is visually simple for humans. If we now turn our attention to developers, the term 'structured content' may mean something else.

For most of them, structured content could mean JSON (or, some years ago, XML) that can be used to create templates for storing and exchanging data in a manner that machines can easily 'read'. Machine-readability means that machines can comprehend what is present in a file or on a web page. That said, structured content in the developer space does not simply mean the use of JSON or XML. Several other data formats, such as CSV, HTML, YAML and Markdown, exist to give data a proper structure. Some of these formats, especially HTML, DocBook, etc., also come with a way to guarantee a homogeneous representation of the complete data. Nevertheless, the use of such data formats is only a partial step towards machine-readability. Machines may still not understand the content: what those strings represent, whether they correspond to country names, personal names, dates, etc.
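
To illustrate, here is a minimal Python sketch with made-up values: the JSON record below parses without any problem, yet the format alone tells a machine nothing about what the strings mean.

```python
import json

# A well-formed JSON record: any JSON parser can read its structure,
# but nothing in the format says what the strings stand for.
record = '{"name": "Jordan", "place": "Georgia", "date": "05/06/07"}'

data = json.loads(record)            # parsing succeeds: the structure is clear
print(data["name"], data["place"])   # the meaning of each field is still unknown
# Is "Jordan" a person or a country? Is "Georgia" a US state or a country?
# Is "05/06/07" 5 June 2007, 6 May 2007 or something else entirely?
```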

This partial machine-readability nevertheless has advantages over plain unstructured data (e.g., a paragraph of text). Programs can communicate with each other easily: the output of one program can be used as input to another. Personally, while programming, I use multiple tools and applications. Thanks to structured content, I can easily export the data of one application and later import it into another program for further analysis and visualization. On the command line, this is usually done by piping. My experiments with the command line, though, show some limitations. Many current command-line tools output a tabular format, which is useful for human comprehension but not for other programs. Tabular formatting usually trims part of the content in every column to fit the output on the user's screen, so piping the output of one program into another often requires some text pre-processing to detect and handle such issues. Going forward, I believe developers must assume that the data produced by their programs will be used by other applications. Structured output can also be validated and verified easily to check whether errors are present. Hence, tools and applications must have the option to export and import data in the form of structured content.
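
As a rough sketch of what that option could look like (the flag and the records below are hypothetical), a tool might print a human-readable table by default and emit lossless JSON when asked:

```python
import json
import sys

# Hypothetical records that an application might want to report.
results = [
    {"file": "report-01.csv", "rows": 10423, "status": "ok"},
    {"file": "report-02.csv", "rows": 9871,  "status": "truncated"},
]

if "--json" in sys.argv:
    # Machine-readable export: nothing is trimmed, easy to pipe onwards.
    json.dump(results, sys.stdout, indent=2)
    print()
else:
    # Human-readable table: columns are padded or cut to fit the screen.
    for r in results:
        print(f"{r['file'][:12]:<12} {r['rows']:>8} {r['status']:>10}")
```

Another program can then consume the JSON export directly through a pipe, without scraping a truncated table.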

Humans and other programs can use different data formatting and stylesheets for visualization. Take, for example, changing the look of an HTML page with different CSS stylesheets. Separation of structure and style is therefore very important. Programmers cannot assume that only humans will consume the data, using the same applications or visualization tools. As a developer, it's quite usual to start by printing simple messages to the screen and using tabular formats for human consumption and visualization. But as the software becomes stable and more users start using it, it is equally important to provide interfaces that expose structured content and keep any styling information separate from the main content.
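
Here is a small Python sketch of that separation, with made-up content and two interchangeable templates standing in for stylesheets: the data itself carries no presentation.

```python
from string import Template

# The structured content: only the data, no styling information.
article = {"title": "Structured Content", "author": "A. Writer",
           "body": "Machines read our data too."}

# Two interchangeable 'stylesheets' for the very same content.
plain = Template("$title\nby $author\n\n$body")
html = Template("<article><h1>$title</h1>"
                "<p class='byline'>$author</p><p>$body</p></article>")

print(plain.substitute(article))
print(html.substitute(article))
```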

What should be a common standard for structuring the data? The answer depends on your use case and programming expertise. Standards like XML, JSON and, to a certain extent, HTML are both human-readable and machine-readable. Data using these standards can be parsed by many existing tools and libraries, and most programming languages have libraries for formatting (indenting), validating, pre-processing and transforming them.
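
For instance, Python's standard library alone can parse, validate (by attempting to parse), re-indent and transform both JSON and XML; the snippet below is a minimal sketch with invented sample data.

```python
import json
import xml.etree.ElementTree as ET

raw_json = '{"country": "India", "capital": "New Delhi"}'
raw_xml = "<country><name>India</name><capital>New Delhi</capital></country>"

try:
    data = json.loads(raw_json)               # parse (and thereby validate)
    print(json.dumps(data, indent=2))          # re-format with indentation
    root = ET.fromstring(raw_xml)              # parse the XML variant
    print(root.find("capital").text)           # transform: extract one field
except (json.JSONDecodeError, ET.ParseError) as err:
    print(f"Invalid input: {err}")             # malformed input is caught here
```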

Full Machine Readability

Nonetheless, as mentioned above, these formats in their classical forms are not fully comprehensible by machines. That's why many application developers now refer to them as semi-structured data formats. They can provide basic structure and homogeneity to user data, but machines may not be able to distinguish between the different sections of the data or the purpose and contents of each data snippet. Thus, there is a need to give meaning, that is, to add semantics, to the content.

Multiple Templates and Data Interoperability

We may be reading one or thousands of blogs, using numerous different styles and templates, yet we are still capable of distinguishing the different sections of these web pages: we find the article content and filter out any non-relevant content from our immediate attention. But what about machines? With millions of different templates and stylesheets, can machines filter out the main content from all the non-relevant content? Do machines understand when they see an address, a telephone number, a name, etc.? The last several years have seen a growing effort to aid machines in understanding text.

Take, for example, HTML5, which introduced several tags like section, nav, aside, header and footer, with which developers can tell machines the purpose of each element on a web page. Before HTML5, programmers made extensive use of div elements for these purposes and, with the help of CSS stylesheets, humans could still distinguish the different elements. With the newly introduced tags, it is now easier for machines as well to find the menu items, navigation links, etc.
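
To give an idea of what this buys a program, here is a small sketch using the third-party beautifulsoup4 library on a made-up page: the semantic tags let the code separate the article text from the navigation and other chrome.

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

page = """
<body>
  <nav><a href="/">Home</a> <a href="/blog">Blog</a></nav>
  <article><h1>Structured Content</h1><p>Machines read our pages too.</p></article>
  <aside>Advertisement</aside>
  <footer>Contact | Imprint</footer>
</body>
"""

soup = BeautifulSoup(page, "html.parser")

# The semantic tags make the role of each element explicit.
print(soup.article.get_text(" ", strip=True))       # only the main content
print([a["href"] for a in soup.select("nav a")])    # only the navigation links
```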

Gamechanger

However, such standardization efforts may not be enough to completely understand the content; further standardization was needed [2]. Schema.org [3] helped to create a vocabulary for describing the concepts of different domains as well as the inter- and intra-domain relationships among them. Examples include vocabularies for describing businesses, persons, creative works, etc. By making use of this vocabulary, website owners can easily describe an address in a way that can be understood not only by humans but also by machines.
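
For example, an address can be marked up with Schema.org's PostalAddress type. The values below are invented, but the type and property names come from the Schema.org vocabulary; such JSON-LD is typically embedded in a page inside a script tag of type application/ld+json.

```python
import json

# An address described with the Schema.org vocabulary (JSON-LD).
address = {
    "@context": "https://schema.org",
    "@type": "PostalAddress",
    "streetAddress": "10 Example Street",
    "addressLocality": "Lyon",
    "postalCode": "69001",
    "addressCountry": "FR",
}

# A crawler reading this knows it has found an address,
# not just a bag of unrelated strings.
print(json.dumps(address, indent=2))
```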

Other works like DBpedia [4] and Wikidata [5], on the one hand, extract structured data from Wikipedia templates, especially the infoboxes, and store it in a database; on the other hand, they can also be used to give semantics to our own data. This extracted data now plays an important role in building an open, multilingual, free, collaborative, linked knowledge base across domains. The different concepts on Wikidata and DBpedia are linked to the concepts described in other (multilingual) knowledge bases. Hence, annotating and linking the items in our data to the associated concepts on these websites can also help towards the complete machine-readability of our content. Using such linked structured data will be useful for building interoperable data solutions.
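
As a final sketch, a data item can simply carry links to the knowledge-base entries it refers to. The identifiers below should point to Berlin on Wikidata and DBpedia, but treat them as illustrative.

```python
import json

# A record whose ambiguous string is linked to the concept it stands for.
record = {
    "place": {
        "label": "Berlin",
        "wikidata": "https://www.wikidata.org/entity/Q64",
        "dbpedia": "https://dbpedia.org/resource/Berlin",
    }
}

# Any program that understands these identifiers can fetch multilingual
# labels, coordinates or related concepts instead of guessing from the string.
print(json.dumps(record, indent=2))
```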

Conclusion

Should applications be driven by structured content from the very beginning? Considering our current software development practices and lack of tools, it may seem difficult to build solutions that produce and consume machine-readable structured content. Yet using such content in our applications will, on the one hand, let us build interoperable data solutions and, on the other hand, ensure maximum software reuse.

References

  1. Structured Content
  2. Schema.org: Evolution of Structured Data on the Web
  3. Schema.org
  4. DBpedia
  5. Wikidata