Overview
Shape Expressions (ShEx) is a concise, formal language for describing and validating the structure of RDF (Resource Description Framework) graphs, the data model in which Wikidata exposes its content. With ShEx, contributors can define schemas, called entity schemas on Wikidata, that describe the expected shape of items and let tools check data against those patterns. This article introduces ShEx for Wikidata, with an emphasis on how ShExStatements, which accepts CSV-like tabular input, simplifies schema authoring for users without deep technical knowledge.
Why Use ShEx in Wikidata?
- Improved Data Quality: ShEx validation catches missing or inconsistent statements, ensuring reliable, well-structured data.
- Better Documentation: ShEx schemas serve as living documentation of data models, making complex Wikidata structures more transparent to current and new contributors.
- Enhanced Interoperability: Adherence to formal schemas ensures compatibility with other RDF and Linked Data systems, facilitating data reuse on the semantic web.
- Automated Error Detection: Schemas allow tools and editors to flag structural issues automatically, improving both human and bot contributions.
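To make "catching missing or inconsistent statements" concrete, here is a minimal sketch of the kind of check a schema enables. It is not a ShEx engine: the function name `check_item`, the dictionary layout, and the sample item data are all hypothetical, and only value sets and counts are checked.

```python
def check_item(statements, constraints):
    """Return a list of problems; an empty list means the item conforms.

    statements:  property -> list of values found on the item
    constraints: property -> (allowed values or None, min count, max count or None)
    Hypothetical layout, chosen only for this illustration.
    """
    problems = []
    for prop, (allowed, low, high) in constraints.items():
        values = statements.get(prop, [])
        # Too few statements for this property (a "missing" statement).
        if len(values) < low:
            problems.append(f"{prop}: expected at least {low} value(s), found {len(values)}")
        # Too many statements for this property.
        if high is not None and len(values) > high:
            problems.append(f"{prop}: expected at most {high} value(s), found {len(values)}")
        # Values outside the allowed set (an "inconsistent" statement).
        if allowed is not None:
            for value in values:
                if value not in allowed:
                    problems.append(f"{prop}: unexpected value {value}")
    return problems


# Two constraints in the spirit of a language schema:
# exactly one "instance of" pointing at wd:Q34770, and at least one country.
language_constraints = {
    "wdt:P31": ({"wd:Q34770"}, 1, 1),
    "wdt:P17": (None, 1, None),
}

# Made-up item data, used only to exercise the check.
for problem in check_item({"wdt:P31": ["wd:Q5"]}, language_constraints):
    print(problem)
```

A real validator works on the RDF graph and covers far more of the language (nested shapes, datatypes, closed shapes), but the outcome is the same kind of report: which expected statements are absent or out of range.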
Simplifying ShEx with ShExStatements
ShExStatements is a tool that enables users to create ShEx schemas using straightforward, tabular CSV files rather than traditional ShEx syntax. This makes authoring and maintaining entity schemas accessible to Wikidata users with diverse backgrounds.
- Tabular Input: Users enter schema constraints in CSV files, with each row representing a statement or prefix.
- Accessible Tools: ShExStatements can be used via the command line or web interfaces, and supports common spreadsheet formats.
- Wide Adoption: Used to simplify schema creation for various data models, from languages to TV series and more.
CSV Format Structure
A typical ShExStatements CSV includes the following columns:
- Node name (e.g., @language)
- Property name (e.g., wdt:P31)
- Value(s) (e.g., wd:Q34770 or . for any value)
- Cardinality (+ for one or more, * for zero or more, ? for zero or one, m for exactly m, or m,n for between m and n; leave empty for exactly one)
- Comment (starts with #)
The CSV may start with prefix definitions, which use only the first two columns: the prefix label and its namespace IRI.
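The column layout above can be sketched as a small parser. This is an illustration only, not the ShExStatements parser: the names `Prefix`, `Statement`, `parse_rows`, and `cardinality_bounds` are hypothetical, and the sketch assumes `|` as the delimiter so that a range such as `2,4` stays in one cell.

```python
import csv
import io
from typing import NamedTuple


class Prefix(NamedTuple):
    label: str
    iri: str


class Statement(NamedTuple):
    node: str
    prop: str
    value: str
    cardinality: str  # "" means exactly one
    comment: str


def parse_rows(text, delimiter="|"):
    """Split tabular text into prefix rows and statement rows."""
    prefixes, statements = [], []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        cells = [cell.strip() for cell in row]
        # Drop trailing empty cells so short rows are handled uniformly.
        while cells and cells[-1] == "":
            cells.pop()
        if not cells:
            continue
        if not cells[0].startswith("@"):
            # Prefix row: label and namespace IRI.
            if len(cells) >= 2:
                prefixes.append(Prefix(cells[0], cells[1]))
            continue
        if len(cells) < 3:
            continue  # incomplete statement row; skipped in this sketch
        node, prop, value, *rest = cells
        cardinality, comment = "", ""
        for cell in rest:
            # A cell starting with "#" is the comment; anything else is cardinality.
            if cell.startswith("#"):
                comment = cell
            else:
                cardinality = cell
        statements.append(Statement(node, prop, value, cardinality, comment))
    return prefixes, statements


def cardinality_bounds(cardinality):
    """Map a cardinality cell to (min, max); max is None when unbounded."""
    symbols = {"": (1, 1), "+": (1, None), "*": (0, None), "?": (0, 1)}
    if cardinality in symbols:
        return symbols[cardinality]
    numbers = [int(part) for part in cardinality.split(",")]
    return (numbers[0], numbers[0]) if len(numbers) == 1 else (numbers[0], numbers[1])
```

Treating any row whose first cell does not start with `@` as a prefix row is a simplification that matches the examples in this article; a production parser would validate rows more strictly.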
Example: Language Schema in CSV and ShExC
Below is a direct example from the ShExStatements documentation for modeling a language entity in Wikidata, first in the CSV-like format, then the generated ShEx Compact syntax (ShExC):
| Node | Property | Value | Cardinality | Comment |
|---|---|---|---|---|
| wd | <http://www.wikidata.org/entity/> | | | |
| wdt | <http://www.wikidata.org/prop/direct/> | | | |
| xsd | <http://www.w3.org/2001/XMLSchema#> | | | |
| @language | wdt:P31 | wd:Q34770 | | # instance of a language |
| @language | wdt:P1705 | LITERAL | | # native name |
| @language | wdt:P17 | . | + | # spoken in country |
| @language | wdt:P2989 | . | + | # grammatical cases |
| @language | wdt:P282 | . | + | # writing system |
| @language | wdt:P1098 | . | + | # speakers |
| @language | wdt:P1999 | . | * | # UNESCO language status |
| @language | wdt:P2341 | . | + | # indigenous to |
The above CSV can be converted using ShExStatements into the following ShExC:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
start = @<language>
<language> {
  wdt:P31 [ wd:Q34770 ] ; # instance of a language
  wdt:P1705 LITERAL ; # native name
  wdt:P17 .+ ; # spoken in country
  wdt:P2989 .+ ; # grammatical cases
  wdt:P282 .+ ; # writing system
  wdt:P1098 .+ ; # speakers
  wdt:P1999 .* ; # UNESCO language status
  wdt:P2341 .+ ; # indigenous to
}
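The mapping from rows to ShExC is mechanical enough to sketch in a few lines. The code below is a simplified illustration, not the ShExStatements implementation: `to_shexc` and its helpers are hypothetical names, rows are passed as plain tuples, and datatypes such as `xsd:string` are not treated specially.

```python
NODE_KINDS = {"LITERAL", "IRI", "BNODE", "NONLITERAL"}


def render_value(value):
    """Render the value cell of a statement row as ShExC."""
    if value == "." or value in NODE_KINDS:
        return value  # wildcard or node kind, written bare
    if value.startswith("@"):
        return f"@<{value[1:]}>"  # reference to another shape
    # Anything else is treated as a value set; commas separate alternatives.
    members = " ".join(part.strip() for part in value.split(","))
    return f"[ {members} ]"


def render_cardinality(cardinality):
    """Render the cardinality cell; numeric ranges become {m} or {m,n}."""
    if cardinality in ("", "+", "*", "?"):
        return cardinality
    return "{" + cardinality + "}"


def to_shexc(prefixes, statements):
    """Render (label, iri) pairs and statement tuples as ShExC text.

    statements: (node, property, value, cardinality, comment) tuples.
    The first node encountered is used as the start shape.
    """
    lines = [f"PREFIX {label}: {iri}" for label, iri in prefixes]
    shapes = {}  # shape name -> constraints, in first-seen order
    for node, prop, value, cardinality, comment in statements:
        shapes.setdefault(node.lstrip("@"), []).append((prop, value, cardinality, comment))
    if shapes:
        lines.append(f"start = @<{next(iter(shapes))}>")
    for name, constraints in shapes.items():
        lines.append(f"<{name}> {{")
        for prop, value, cardinality, comment in constraints:
            line = f"  {prop} {render_value(value)}{render_cardinality(cardinality)} ;"
            if comment:
                line += f" {comment}"
            lines.append(line)
        lines.append("}")
    return "\n".join(lines)


print(to_shexc(
    [("wd", "<http://www.wikidata.org/entity/>"),
     ("wdt", "<http://www.wikidata.org/prop/direct/>")],
    [("@language", "wdt:P31", "wd:Q34770", "", "# instance of a language"),
     ("@language", "wdt:P17", ".", "+", "# spoken in country")],
))
```

Keeping value rendering and cardinality rendering in separate helpers mirrors the CSV columns, so each column's rule can be read and changed on its own.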
Other Examples
More example CSV files for other Wikidata entity types are available in the ShExStatements GitHub repository, such as schemas for TV series, footballers, and more. The process remains the same: edit a simple CSV, and use the tool to generate robust ShEx code.
Development and Tooling
- Python Implementation: ShExStatements is primarily built in Python for both CLI and API usage.
- Integration with Spreadsheets: Supports .csv, .xls, .xlsx, and .ods file formats for schema definitions.
Challenges and Ongoing Efforts
- Schema Adoption: There is a gap between available schemas and the breadth of Wikidata’s domains. Community engagement and tool improvements aim to close this gap.
- Coverage and References: Ensuring comprehensive schemas, including references and sources, is a continuous task as Wikidata’s structure evolves.
- Ease of Use: Striking the right balance between simplicity and expressiveness in schema authoring tools is a priority for the community.
Conclusion
ShExStatements enables Wikidata contributors to author, maintain, and apply data validation schemas with minimal technical hurdles. Its CSV-based approach lowers the barrier for high-quality data modeling, promoting robust, interoperable, and well-documented Wikidata content.
References
- ShExStatements Documentation: Official Docs
- Wikidata Schema Portal: Guidance and tutorials for entity schemas
- ShEx Primer and Specification: Learn more about Shape Expressions as a language
- GitHub Examples: Practical CSV and ShExC examples for various domains in Wikidata
- ShExStatements: Simplifying Shape Expressions for Wikidata