Hash Functions - SPARQLqueries

Hash Function Reference

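SPARQL 1.1 provides five hash functions. Each takes a plain string (a simple literal or `xsd:string`), hashes its UTF-8 encoding, and returns the digest as a fixed-length hexadecimal string. Language-tagged literals such as labels should be wrapped in `STR()` first.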

Function	Description	Output Length
`MD5(str)`	MD5 hash (128-bit)	32 hex characters
`SHA1(str)`	SHA-1 hash (160-bit)	40 hex characters
`SHA256(str)`	SHA-256 hash (256-bit)	64 hex characters
`SHA384(str)`	SHA-384 hash (384-bit)	96 hex characters
`SHA512(str)`	SHA-512 hash (512-bit)	128 hex characters

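Note: Hash functions are deterministic, one-way transformations. The same input always produces the same hash, and there is no practical way to compute the original input from a hash. Short or guessable inputs can still be recovered by hashing candidate values and comparing the results.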
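The output lengths in the table can be cross-checked outside SPARQL. The following is a minimal sketch using Python's standard `hashlib` module; it is not part of any SPARQL engine, and the sample input "abc" and the names `ALGORITHMS` and `hex_length` are illustrative assumptions.

```python
import hashlib

# Map each SPARQL hash function to the hashlib algorithm of the same name.
ALGORITHMS = {
    "MD5": "md5",
    "SHA1": "sha1",
    "SHA256": "sha256",
    "SHA384": "sha384",
    "SHA512": "sha512",
}

def hex_length(algorithm: str, text: str = "abc") -> int:
    """Return the number of hex characters in the digest of `text`."""
    digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
    return len(digest)

lengths = {name: hex_length(algo) for name, algo in ALGORITHMS.items()}
for name, length in lengths.items():
    # Each hex character encodes 4 bits of the digest.
    print(f"{name}: {length} hex characters ({length * 4} bits)")
```

The length depends only on the algorithm, never on the input, which is why the table can state a single figure per function.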

Basic Hash Generation

Generate hashes from string values.

MD5 Hash of Labels

Create MD5 hashes from entity labels:

MD5 Hash of Country Names

SELECT ?country ?countryLabel ?hash WHERE {
  ?country wdt:P31 wd:Q6256 .  # country
  ?country rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(MD5(STR(?label)) AS ?hash)  # STR() drops the language tag; hash functions expect a plain string
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?countryLabel
LIMIT 20
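Because SPARQL 1.1 defines these digests over the UTF-8 encoding of the string, a hash returned by the query can be verified locally. This sketch assumes Python's `hashlib`; the helper name `sparql_md5` is hypothetical and only mimics the function's documented behaviour.

```python
import hashlib

def sparql_md5(text: str) -> str:
    """Mimic SPARQL's MD5(): hash the UTF-8 bytes and return hex digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# "abc" is the example input used in the SPARQL 1.1 specification.
print(sparql_md5("abc"))  # 900150983cd24fb0d6963f7d28e17f72
```

To check a result row, pass the label text to the helper and compare the output with the `?hash` column.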

SHA256 for Stronger Hashing

Use SHA256 when collision resistance matters; practical collision attacks are known for both MD5 and SHA-1:

SHA256 Hash of City Names

SELECT ?city ?cityLabel ?sha256Hash WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .  # city
  ?city wdt:P17 wd:Q142 .            # France
  ?city rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(SHA256(STR(?label)) AS ?sha256Hash)  # STR() drops the language tag before hashing
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20

Comparing Hash Algorithms

See the different outputs from various hash algorithms.

Multiple Hash Types

Generate different hash types for the same input:

Compare Hash Algorithms

SELECT ?person ?personLabel ?md5 ?sha1 ?sha256 WHERE {
  ?person wdt:P106 wd:Q170790 .   # mathematician
  ?person wdt:P166 wd:Q28835 .   # Fields Medal winner
  ?person rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(STR(?label) AS ?plain)    # drop the language tag once, then hash the plain string
  BIND(MD5(?plain) AS ?md5)
  BIND(SHA1(?plain) AS ?sha1)
  BIND(SHA256(?plain) AS ?sha256)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10

Creating Unique Identifiers

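Use hashes to derive compact, deterministic identifiers from combined data. Distinct inputs can in principle collide, so treat the result as unique only with high probability.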

Composite Key Hashing

Create a unique hash from multiple fields:

Hash of Name + Birth Year

SELECT ?person ?personLabel ?birthYear ?compositeHash WHERE {
  ?person wdt:P106 wd:Q36180 .      # writer
  ?person wdt:P27 wd:Q145 .        # UK
  ?person wdt:P569 ?birthDate .
  ?person rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(YEAR(?birthDate) AS ?birthYear)
  BIND(CONCAT(?label, "|", STR(?birthYear)) AS ?composite)
  BIND(MD5(?composite) AS ?compositeHash)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?birthYear
LIMIT 20
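The "|" separator in the CONCAT call is doing real work: without it, different field combinations can produce the same combined string and therefore the same hash. A minimal Python sketch of the effect, using hypothetical field values and a hypothetical helper `composite_hash`:

```python
import hashlib

def composite_hash(fields, separator="|"):
    """Join the fields with a separator and return the MD5 hex digest."""
    combined = separator.join(fields)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()

# Hypothetical field pairs that differ only in where the boundary falls.
a = ("Ann Lee", "1900")
b = ("Ann Lee1", "900")

same_without_sep = composite_hash(a, "") == composite_hash(b, "")
same_with_sep = composite_hash(a) == composite_hash(b)
print(same_without_sep, same_with_sep)  # True False
```

The separator should be a character that cannot occur inside the fields themselves; otherwise the same ambiguity returns.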

Hash as Short Identifier

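Use a substring of the hash as a short ID. Truncating to 8 hex characters keeps only 32 bits, so collisions become likely once a dataset reaches tens of thousands of items: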

Short Hash IDs (8 characters)

SELECT ?painting ?paintingLabel ?shortId WHERE {
  ?painting wdt:P31 wd:Q3305213 .   # painting
  ?painting wdt:P170 wd:Q5582 .    # by Van Gogh
  ?painting rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(SUBSTR(MD5(STR(?label)), 1, 8) AS ?shortId)  # STR() drops the language tag before hashing
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
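How short an ID can safely be depends on how many items share the ID space. The sketch below uses the standard birthday-bound approximation, assuming hash output is uniformly distributed; the function name `collision_probability` is hypothetical.

```python
import math

def collision_probability(items: int, hex_chars: int) -> float:
    """Approximate chance that at least two of `items` IDs collide."""
    space = 16 ** hex_chars  # number of distinct IDs of this length
    return 1.0 - math.exp(-items * (items - 1) / (2.0 * space))

for n in (1_000, 10_000, 77_163, 1_000_000):
    print(f"{n:>9} items, 8 hex chars: {collision_probability(n, 8):.4f}")
```

With 8 hex characters the estimate reaches roughly one half at about 77,000 items. Each extra hex character multiplies the ID space by 16.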

Hashing URIs

Create hashes from Wikidata entity URIs.

Hash Entity URIs

Generate a hash from the entity's IRI:

Hash of Entity URI

SELECT ?element ?elementLabel ?symbol ?uriHash WHERE {
  ?element wdt:P31 wd:Q11344 .    # chemical element
  ?element wdt:P246 ?symbol .      # element symbol
  BIND(MD5(STR(?element)) AS ?uriHash)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?symbol
LIMIT 30

Data Verification

Use hashes to verify data consistency and detect changes.

Detecting Duplicate Labels

Find entities whose labels hash to the same value, ignoring case:

Actors with Same Hash (Same Name)

SELECT ?hash (COUNT(?person) AS ?count) (SAMPLE(?label) AS ?sampleName) WHERE {
  ?person wdt:P106 wd:Q33999 .   # actor
  ?person wdt:P27 wd:Q30 .       # USA
  ?person rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  BIND(MD5(LCASE(STR(?label))) AS ?hash)  # STR() drops the language tag before lowercasing and hashing
}
GROUP BY ?hash
HAVING (COUNT(?person) > 1)
ORDER BY DESC(?count)
LIMIT 20
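The GROUP BY / HAVING pattern above amounts to bucketing labels by the hash of their lowercase form and keeping the buckets with more than one member. A client-side sketch of the same idea in Python follows; the sample labels are hypothetical, and Python's `lower()` may not match an engine's `LCASE` for every Unicode character.

```python
import hashlib
from collections import defaultdict

def find_duplicate_labels(labels):
    """Group labels by the MD5 of their lowercase form; keep groups > 1."""
    groups = defaultdict(list)
    for label in labels:
        key = hashlib.md5(label.lower().encode("utf-8")).hexdigest()
        groups[key].append(label)
    return {key: names for key, names in groups.items() if len(names) > 1}

# Hypothetical sample labels, not query results.
sample = ["John Smith", "john smith", "Jane Doe", "JOHN SMITH"]
duplicates = find_duplicate_labels(sample)
print(duplicates)
```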

Content Fingerprint

Create a fingerprint for entity data:

Data Fingerprint from Multiple Fields

SELECT ?book ?bookLabel ?authorLabel ?year ?fingerprint WHERE {
  ?book wdt:P31 wd:Q7725634 .     # literary work
  ?book wdt:P50 ?author .          # author
  ?book wdt:P577 ?pubDate .        # publication date
  ?book rdfs:label ?bookName .
  ?author rdfs:label ?authorName .
  FILTER(LANG(?bookName) = "en")
  FILTER(LANG(?authorName) = "en")
  BIND(YEAR(?pubDate) AS ?year)
  BIND(CONCAT(?bookName, "|", ?authorName, "|", STR(?year)) AS ?combined)
  BIND(SHA256(?combined) AS ?fingerprint)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
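A fingerprint becomes useful for change detection once it is stored and compared against a later run. The sketch below mirrors the query's "title|author|year" layout in Python; the record values and the helper name `fingerprint` are hypothetical.

```python
import hashlib

def fingerprint(title: str, author: str, year: int) -> str:
    """SHA-256 over 'title|author|year', mirroring the CONCAT in the query."""
    combined = "|".join([title, author, str(year)])
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

# Hypothetical snapshots of one record taken at two different times.
stored = fingerprint("Example Title", "Example Author", 1999)
unchanged = fingerprint("Example Title", "Example Author", 1999)
edited = fingerprint("Example Title", "Example Author", 2000)

print("unchanged:", stored == unchanged)  # unchanged: True
print("edited:", stored == edited)        # edited: False
```

Only the fields included in the fingerprint are covered; a change to any other property of the entity goes unnoticed.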

Anonymization with Hashing

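Use hashes to create pseudonymous identifiers. An unsalted hash of a public identifier is not anonymous: anyone can hash the candidate identifiers and match the results.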

Pseudonymous IDs

Replace identifiable information with hashes:

Pseudonymized Person Records

SELECT ?pseudoId ?birthDecade ?countryLabel WHERE {
  ?person wdt:P106 wd:Q901 .      # scientist
  ?person wdt:P569 ?birthDate .
  ?person wdt:P27 ?country .
  BIND(FLOOR(YEAR(?birthDate) / 10) * 10 AS ?birthDecade)
  BIND(SUBSTR(SHA256(STR(?person)), 1, 12) AS ?pseudoId)
  FILTER(?birthDecade >= 1900)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ?birthDecade
LIMIT 30
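How strong is a pseudonym built from a plain hash of a public URI? The sketch below shows that it can be reversed by hashing a list of candidate URIs, and contrasts it with a keyed hash (HMAC-SHA256). HMAC is not one of the SPARQL hash functions and would have to be applied outside the query. The candidate range, the key, and the function names are all hypothetical.

```python
import hashlib
import hmac

def plain_pseudo_id(uri: str) -> str:
    """Same scheme as the query: first 12 hex characters of SHA-256."""
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()[:12]

def keyed_pseudo_id(uri: str, secret_key: bytes) -> str:
    """Keyed variant (HMAC-SHA256); cannot be recomputed without the key."""
    mac = hmac.new(secret_key, uri.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:12]

# Hypothetical entity URIs; any public list of candidates works the same way.
candidates = [f"http://www.wikidata.org/entity/Q{n}" for n in range(1, 1001)]
target = plain_pseudo_id("http://www.wikidata.org/entity/Q42")

# Dictionary lookup: hash every candidate and match against the pseudo ID.
lookup = {plain_pseudo_id(uri): uri for uri in candidates}
recovered = lookup.get(target)
print(recovered)  # http://www.wikidata.org/entity/Q42

secret = b"hypothetical-secret-key"  # placeholder; keep real keys private
keyed = keyed_pseudo_id("http://www.wikidata.org/entity/Q42", secret)
print(keyed != target)
```

The keyed variant stays stable for anyone holding the key, so records can still be joined, while an outsider cannot rebuild the lookup table.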

Hash-Based Grouping

Use hash prefixes to distribute data into buckets.

Partition by Hash Prefix

Group items by the first character of their hash:

Count Items by Hash Bucket

SELECT ?bucket (COUNT(?city) AS ?count) WHERE {
  ?city wdt:P31/wdt:P279* wd:Q515 .  # city
  ?city wdt:P17 wd:Q183 .            # Germany
  BIND(SUBSTR(MD5(STR(?city)), 1, 1) AS ?bucket)
}
GROUP BY ?bucket
ORDER BY ?bucket
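With a well-mixed hash, the 16 buckets should end up with roughly equal counts. The following Python sketch illustrates this on synthetic URIs, which stand in for real query results; the helper name `bucket_of` is hypothetical.

```python
import hashlib
from collections import Counter

def bucket_of(uri: str) -> str:
    """First hex character of the MD5 digest: one of 16 buckets."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()[0]

# Synthetic URIs stand in for query results (assumption for illustration).
uris = [f"http://www.wikidata.org/entity/Q{n}" for n in range(1, 16001)]
counts = Counter(bucket_of(uri) for uri in uris)

for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

Taking two hex characters instead of one would give 256 buckets, and so on in powers of 16.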

Sample by Hash

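Use a hash to get a reproducible pseudo-random sample. Keeping one of the 16 possible first hex characters selects about 1/16 (roughly 6%) of the items:

Reproducible ~6% Sample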

SELECT ?museum ?museumLabel ?countryLabel WHERE {
  ?museum wdt:P31/wdt:P279* wd:Q33506 .  # museum
  ?museum wdt:P17 ?country .
  BIND(SUBSTR(MD5(STR(?museum)), 1, 1) AS ?bucket)
  FILTER(?bucket = "a")  # ~6% sample (1/16 hex values)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 50
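Filtering on a single hex character only allows sample sizes in steps of 1/16. For an arbitrary percentage, one option is to interpret more of the digest as a number, sketched here client-side in Python under the assumption that the URIs have already been retrieved. The helper `in_sample` and the synthetic URIs are hypothetical.

```python
import hashlib

def in_sample(uri: str, percent: int) -> bool:
    """Keep `uri` if its hash falls in the lowest `percent` of 100 slots."""
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100 < percent

uris = [f"http://www.wikidata.org/entity/Q{n}" for n in range(1, 10001)]
first_run = [uri for uri in uris if in_sample(uri, 10)]
second_run = [uri for uri in uris if in_sample(uri, 10)]

# The sample size is close to 10%, and repeated runs select the same items.
print(len(first_run), first_run == second_run)
```

Because membership depends only on the URI, an item stays in or out of the sample no matter when or in what order the data is fetched.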

Use Cases Summary

Use Case	Recommended Hash	Why
Quick lookups	MD5	Fast, sufficient for non-security uses
Unique identifiers	SHA256	Lower collision probability
Data fingerprints	SHA256	Good balance of security and length
Partitioning	MD5	Fast and evenly distributed
Maximum security	SHA512	Highest cryptographic strength