Why and How I Developed the Wikidata Multilinguality Calculator: John Samuel

Wikidata is a remarkable, free, and open knowledge base that connects structured data from across the world. One of its standout features is its multilingual nature, allowing users to contribute and access data in their native languages. But despite this powerful capability, many Wikidata pages remain a patchwork of languages, where labels and descriptions lack translations, and cryptic Q-identifiers and P-identifiers stand in for meaningful words. This inconsistent multilingual experience is not just frustrating; it holds back Wikidata’s potential as a truly global platform.

Wikidata’s strength lies in letting users translate a label once and then share that knowledge across every page that uses it. But if the properties themselves aren’t translated, this advantage fades away. This is where the idea for mlscores, the Wikidata Multilingual Calculator, was born. I wanted to create a tool that helps bridge these gaps, making it easier for contributors to identify where translation efforts are most needed and to unlock Wikidata’s full potential.

The Need for a Solution

The goal of the Wikidata Multilingual Calculator is straightforward: to empower contributors by highlighting where translations are missing in Wikidata items. The tool calculates a multilingual score, revealing how complete the translations are for a given item. This score helps users prioritize their translation efforts, focusing on the properties that need the most attention and ensuring a more uniform multilingual experience across the platform.

With the gaps pinpointed, contributors, whether they’re beginners or seasoned pros, can target their efforts more effectively, knowing precisely which languages or properties require the most work. This level of focus is crucial for enhancing both the quality and accessibility of the data on Wikidata.

How It Works

Building the calculator involved a mix of SPARQL, Python, and a touch of creativity. Here’s a breakdown of its workflow:

Data Retrieval: The tool sends SPARQL queries to Wikidata to gather properties and values linked to a specified item.
Data Processing: Once retrieved, the data is analyzed to detect missing translations.
Score Calculation: For each language, a multilingual score is computed as the percentage of property labels and property value labels that are available in that language, giving a clear indication of where the gaps lie.
User-Friendly Display: The results are shown in an intuitive format using the Rich library in Python, highlighting which properties and values lack translations and need attention.

This streamlined approach makes it easy for contributors to visualize the state of multilingualism for any Wikidata item, guiding them toward where their efforts will have the most impact.
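
To make the retrieval and scoring steps more concrete, here is a minimal sketch of how they could look. It is not the actual mlscores code: the function names, the exact SPARQL query, and the User-Agent string are assumptions made for illustration, and the sketch only covers property labels, not the labels of property values. The only external piece it relies on is the public Wikidata Query Service endpoint and the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_property_label_query(item_id, languages):
    """Build a SPARQL query listing the properties used on an item,
    with their labels in the requested languages (hypothetical helper)."""
    lang_list = ", ".join(f'"{lang}"' for lang in languages)
    return f"""
SELECT DISTINCT ?property ?label WHERE {{
  wd:{item_id} ?claim ?statement .
  ?property wikibase:claim ?claim .
  OPTIONAL {{
    ?property rdfs:label ?label .
    FILTER(LANG(?label) IN ({lang_list}))
  }}
}}
"""


def run_query(query):
    """Send the query to the endpoint and return the result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # The User-Agent value is a placeholder; use one that identifies your tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "mlscores-example/0.1 (illustrative sketch)"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


def labels_by_property(bindings):
    """Map each property ID to the set of languages that have a label."""
    found = {}
    for row in bindings:
        property_id = row["property"]["value"].rsplit("/", 1)[-1]
        languages = found.setdefault(property_id, set())
        if "label" in row:  # absent when the OPTIONAL block did not match
            languages.add(row["label"]["xml:lang"])
    return found


def language_percentages(found, languages):
    """Percentage of properties that have a label, per language."""
    total = len(found)
    if total == 0:
        return {lang: 0.0 for lang in languages}
    return {
        lang: 100.0 * sum(lang in langs for langs in found.values()) / total
        for lang in languages
    }


if __name__ == "__main__":
    wanted = ["en", "fr", "es", "pt"]
    rows = run_query(build_property_label_query("Q5", wanted))
    for lang, score in language_percentages(labels_by_property(rows), wanted).items():
        print(f"{lang}: {score:.2f}%")
```

Keeping the query builder, the HTTP call, and the scoring in separate functions means the scoring logic can be checked against hand-written bindings without touching the network.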

Example Use Case

Let’s consider the item human (Q5). Running the multilingual calculator on this item will reveal how complete its translations are in different languages, like English, French, Spanish, and Portuguese. The output might look like this:

    Combined Language
Percentages for property
label and property value
         labels

┏━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Language ┃ Percentage ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━┩
│ en       │     99.40% │
│ fr       │     96.39% │
│ es       │     92.17% │
│ pt       │     71.08% │
└──────────┴────────────┘

This table shows at a glance how complete the translations are for each language. With the --missing option, the tool even lists specific properties that lack translations:

Properties missing translation

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Languages ┃ Items                                                      ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ fr        │ P11955, P12596                                             │
│ es        │ P5806, P9495, P11955, P10376, P7314, P12596, P8785         │
│ pt        │ P4527, P8419, P5247, P12596, P8785, P4212, P7807, P8814,   │
│           │ P7329, P3222, P8168, P1256, P7497, P6839, P8895, P7775,    │
│           │ ... and more ...                                           │
└───────────┴────────────────────────────────────────────────────────────┘

These insights can be a game-changer for translators, helping them focus their efforts on the areas that will make the most difference.
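
A table like the one above could be produced along the following lines. This is a sketch under stated assumptions rather than the tool's real implementation: the helper names and the input shape (a mapping from property ID to the set of languages that have a label) are hypothetical, and the sample data is made up for the demonstration. The Rich calls used (`Table`, `add_column`, `add_row`, `Console().print`) are part of the library's public API.

```python
def missing_by_language(labels_by_property, languages):
    """Return, for each language, the property IDs that have no label in it,
    sorted by their numeric part."""
    missing = {}
    for lang in languages:
        property_ids = [
            pid for pid, langs in labels_by_property.items() if lang not in langs
        ]
        missing[lang] = sorted(property_ids, key=lambda pid: int(pid[1:]))
    return missing


def print_missing_table(missing):
    """Render the missing properties as a Rich table."""
    # Rich is imported here so the helper above stays usable without it.
    from rich.console import Console
    from rich.table import Table

    table = Table(title="Properties missing translation")
    table.add_column("Languages")
    table.add_column("Items")
    for lang, property_ids in missing.items():
        if property_ids:  # skip languages with nothing missing
            table.add_row(lang, ", ".join(property_ids))
    Console().print(table)


if __name__ == "__main__":
    # Made-up sample: which languages each property has a label in.
    sample = {
        "P31": {"en", "fr", "es"},
        "P11955": {"en"},
        "P12596": {"en"},
    }
    print_missing_table(missing_by_language(sample, ["en", "fr", "es"]))
```

Separating the gap computation from the rendering keeps the display layer swappable, which would also help if the results are later shown in a web interface instead of a terminal.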

Future Plans

The Wikidata Multilingual Calculator is only the beginning of a broader vision to enrich Wikidata’s multilingual experience. Here’s what’s on the horizon:

Optimized Queries: Fine-tuning SPARQL queries for faster data retrieval, making the tool even more efficient, especially when handling large datasets.
Web Interface: Developing a user-friendly web-based version of the calculator to make it accessible to a wider audience.
Wikidata Integration: Creating a JavaScript gadget that could be embedded directly into Wikidata pages, enabling contributors to check multilingual scores without ever leaving the platform.
Enhanced Features: Adding translation suggestions and potentially allowing direct contributions to translations right from the calculator itself.

Conclusion

mlscores, the Wikidata Multilingual Calculator, is a practical tool designed to enhance the multilingual capabilities of Wikidata by highlighting gaps in translations and guiding contributors on where to focus their efforts. With plans to optimize its queries, develop a web interface, and integrate directly with Wikidata, the calculator aims to streamline the translation process and make it more accessible to all users. Whether you’re a newcomer looking to make your first contribution or a seasoned editor seeking to close translation gaps, this tool offers a clear path to making Wikidata’s data more inclusive and accessible.