This article is part of a series on Knowledge Representation and Semantic Web.

Overview

Shape Expressions (ShEx) is a concise, formal language for describing and validating the structure of RDF (Resource Description Framework) graphs, the data model that underpins Wikidata. With ShEx, contributors can define schemas, called entity schemas on Wikidata, that check data for consistent patterns and quality standards. This article introduces ShEx for Wikidata, with an emphasis on how the CSV-like tabular format of ShExStatements simplifies schema authoring for users without deep technical knowledge.

Why Use ShEx in Wikidata?

Simplifying ShEx with ShExStatements

ShExStatements is a tool that enables users to create ShEx schemas using straightforward, tabular CSV files rather than traditional ShEx syntax. This makes authoring and maintaining entity schemas accessible to Wikidata users with diverse backgrounds.

CSV Format Structure

A typical ShExStatements CSV includes the following columns:

  1. Node name (e.g., @language)
  2. Property name (e.g., wdt:P31)
  3. Value(s) (e.g., wd:Q34770 or . for any value)
  4. Cardinality (+ for one or more, * for zero or more, ? for zero or one, empty for exactly one, or m or m,n for custom ranges)
  5. Comment (starts with #)

The CSV may start with prefix definitions, where only the first two columns are used: the prefix name and its IRI.
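
To make the column layout concrete, here is a minimal Python sketch that splits one row into its fields. It is not the ShExStatements implementation: the names Prefix, Statement, and parse_row are hypothetical, and the "|" delimiter is an assumption exposed as a parameter. The sketch also assumes that a row whose first cell does not start with "@" is a prefix definition, and that comments never contain the delimiter.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Prefix:
    """A prefix definition row: prefix name and IRI."""
    name: str
    iri: str


@dataclass
class Statement:
    """A constraint row: node, property, value, optional cardinality and comment."""
    node: str
    prop: str
    value: str
    cardinality: str = ""
    comment: str = ""


def parse_row(line: str, delim: str = "|") -> Union[Prefix, Statement]:
    """Split one row into cells and classify it (hypothetical helper)."""
    cells = [cell.strip() for cell in line.split(delim)]
    if not cells[0].startswith("@"):
        # Assumption: rows that do not name a node are prefix definitions.
        return Prefix(cells[0], cells[1])
    node, prop, value = cells[:3]
    cardinality, comment = "", ""
    for cell in cells[3:]:
        if cell.startswith("#"):
            comment = cell          # comments start with '#'
        elif cell:
            cardinality = cell      # +, *, ?, m, or m,n
    return Statement(node, prop, value, cardinality, comment)


if __name__ == "__main__":
    print(parse_row("wd|<http://www.wikidata.org/entity/>"))
    print(parse_row("@language|wdt:P17|.|+|# spoken in country"))
```

Because the comment is recognized by its leading "#", the sketch accepts rows with or without an empty cardinality cell before the comment.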

Example: Language Schema in CSV and ShExC

Below is a direct example from the ShExStatements documentation for modeling a language entity in Wikidata, first in the CSV-like format, then the generated ShEx Compact syntax (ShExC):

Node | Property | Value | Cardinality | Comment
wd | <http://www.wikidata.org/entity/>
wdt | <http://www.wikidata.org/prop/direct/>
xsd | <http://www.w3.org/2001/XMLSchema#>
@language | wdt:P31 | wd:Q34770 | | # instance of a language
@language | wdt:P1705 | LITERAL | | # native name
@language | wdt:P17 | . | + | # spoken in country
@language | wdt:P2989 | . | + | # grammatical cases
@language | wdt:P282 | . | + | # writing system
@language | wdt:P1098 | . | + | # speakers
@language | wdt:P1999 | . | * | # UNESCO language status
@language | wdt:P2341 | . | + | # indigenous to

The above CSV can be converted using ShExStatements into the following ShExC:

    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    start = @<language>
    <language> {
      wdt:P31 [ wd:Q34770 ] ;        # instance of a language
      wdt:P1705 LITERAL ;            # native name
      wdt:P17 .+ ;                   # spoken in country
      wdt:P2989 .+ ;                 # grammatical cases
      wdt:P282 .+ ;                  # writing system
      wdt:P1098 .+ ;                 # speakers
      wdt:P1999 .* ;                 # UNESCO language status
      wdt:P2341 .+ ;                 # indigenous to
    }
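
The mapping from rows to ShExC can be illustrated with a short Python sketch. This is a simplified illustration, not the code ShExStatements actually runs: the function names are hypothetical, and only the three value forms used in the example are handled (the "." wildcard, a node-kind keyword such as LITERAL, and a single item placed in a value set). Custom ranges are assumed to map to the standard ShExC {m} and {m,n} notation, and at least one statement is assumed to be present.

```python
NODE_KINDS = {"LITERAL", "IRI", "BNODE", "NONLITERAL"}  # ShExC node-kind keywords


def cardinality_suffix(card: str) -> str:
    """Translate a cardinality cell into a ShExC suffix."""
    card = card.strip()
    if card in ("", "+", "*", "?"):
        return card                 # empty means exactly one
    return "{" + card + "}"         # m -> {m}, m,n -> {m,n}


def render_value(value: str) -> str:
    """Render the value cell (simplified: wildcard, node kind, or one item)."""
    if value == "." or value in NODE_KINDS:
        return value
    return "[ " + value + " ]"      # single item as a value set


def to_shexc(prefixes, statements) -> str:
    """Build ShExC text from (name, iri) pairs and 5-tuples of row cells."""
    lines = [f"PREFIX {name}: {iri}" for name, iri in prefixes]
    lines.append("")
    shapes = {}                     # shape name -> list of constraints, in input order
    for node, prop, value, card, comment in statements:
        shapes.setdefault(node.lstrip("@"), []).append((prop, value, card, comment))
    first = next(iter(shapes))      # assumption: the first node is the start shape
    lines.append(f"start = @<{first}>")
    for name, constraints in shapes.items():
        lines.append(f"<{name}> {{")
        for prop, value, card, comment in constraints:
            constraint = f"  {prop} {render_value(value)}{cardinality_suffix(card)} ;"
            lines.append(f"{constraint} {comment}".rstrip())
        lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_shexc(
        [("wd", "<http://www.wikidata.org/entity/>"),
         ("wdt", "<http://www.wikidata.org/prop/direct/>")],
        [("@language", "wdt:P31", "wd:Q34770", "", "# instance of a language"),
         ("@language", "wdt:P17", ".", "+", "# spoken in country")],
    ))
```

Grouping constraints by node name keeps the sketch usable for files that define more than one shape, although the example above uses only @language.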
                    

Other Examples

More example CSV files for other Wikidata entity types, such as TV series and footballers, are available in the ShExStatements GitHub repository. The process remains the same: edit a simple CSV file, then use the tool to generate the corresponding ShExC.

Development and Tooling

Challenges and Ongoing Efforts

Conclusion

ShExStatements enables Wikidata contributors to author, maintain, and apply data validation schemas with minimal technical hurdles. Its CSV-based approach lowers the barrier for high-quality data modeling, promoting robust, interoperable, and well-documented Wikidata content.

References

  1. ShExStatements Documentation: Official Docs
  2. Wikidata Schema Portal: Guidance and tutorials for entity schemas
  3. ShEx Primer and Specification: Learn more about Shape Expressions as a language
  4. GitHub Examples: Practical CSV and ShExC examples for various domains in Wikidata
  5. ShExStatements: Simplifying Shape Expressions for Wikidata