
Qwanqwa (qq): Language Metadata

A unified language metadata toolkit for NLP: identifiers, scripts, speakers, geographic data, and traversable relationships across ~27,256 languoids.

Name: Qwanqwa is a phonetic spelling of 'ቋንቋ', which means 'language' in Amharic; qq is short to type.

Features

  • Identifiers: BCP-47, ISO 639-1, ISO 639-3, ISO 639-2B, ISO 639-2T, ISO 639-5, Glottocode, Wikidata ID, Wikipedia ID, NLLB-style codes[^1]
  • Geographic information: countries, subdivisions, and regions, traversable to and from languoids
  • Speaker information: Population counts, UNESCO endangerment status
  • Writing systems: ISO 15924 script codes with canonical/historical metadata
  • Multilingual names: Language names in 500+ languages
  • Relationships: Traversable graph of language families, scripts, and geographic regions
  • Phylogenetic data: Language family trees from Glottolog

Languoids

In qq, language-like entities are referred to as Languoids: this includes dialects, macro-languages, and language families, not just individual languages. Not all languoids have coverage for all features.

Installation

pip install qwanqwa
# or from git
uv add git+https://github.com/WPoelman/qwanqwa
# or
pip install git+https://github.com/WPoelman/qwanqwa

Quick Start

from qq import Database, IdType

# Load the pre-compiled database
db = Database.load()

# Get a language by BCP-47 code (default)
dutch = db.get("nl")
print(dutch.name)          # "Dutch"
print(dutch.iso_639_3)     # "nld"
print(dutch.speaker_count) # 24085200

# Also works with ISO 639-3, Glottocode, etc.
dutch2 = db.get("nld", id_type=IdType.ISO_639_3)
dutch3 = db.get("dutc1256", id_type=IdType.GLOTTOCODE)
dutch4 = db.guess("dut") # guessing works too
# These all resolve to the same languoid
assert dutch.id == dutch2.id == dutch3.id == dutch4.id

# Search by name
results = db.search("Chinese")
for lang in results:
    print(f"{lang.name} ({lang.glottocode})")

Important: qq makes a strict distinction between None (don't know) and False (it is not the case). When checking boolean attributes, prefer explicit checks over truthiness: use if script.is_canonical is None: rather than if not script.is_canonical:.
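To make the three cases concrete, here is a minimal, library-free sketch of the check pattern; the tri-state value stands in for an attribute like script.is_canonical, and describe_canonical is a hypothetical helper, not part of qq:

```python
def describe_canonical(is_canonical):
    """Classify a tri-state boolean following qq's convention:
    None means 'unknown', False means 'known not to be the case'."""
    if is_canonical is None:
        return "unknown"        # no data: do not treat as False
    if is_canonical:
        return "canonical"
    return "not canonical"

# Truthiness (`if not is_canonical:`) would lump None and False together;
# the explicit `is None` check keeps them apart.
print(describe_canonical(None))   # "unknown"
print(describe_canonical(False))  # "not canonical"
```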

Traversal

Languoids, scripts, and geographic regions are all part of the same graph, which can be traversed:

dutch = db.get("nl")

# Language family navigation (Glottolog tree)
dutch.parent             # Global Dutch
dutch.parent.parent      # Modern Dutch
dutch.family_tree        # [Global Dutch, Modern Dutch, ..., West Germanic, Germanic, Indo-European]
dutch.siblings           # [Afrikaansic, Javindo, Petjo]
dutch.children           # [North Hollandish, Central Northern Dutch, ...]
dutch.descendants()      # All descendants (recursive)

# Writing systems
dutch.scripts            # [Script(Latin, code=Latn)]
dutch.script_codes       # ["Latn"]
dutch.canonical_scripts  # scripts marked canonical in LinguaMeta

# Geographic regions
dutch.regions            # [Aruba, Belgium, ..., Netherlands, Suriname, ...]
dutch.country_codes      # ["AW", "BE", "BQ", "CW", "NL", "SR", "SX"]

# Reverse traversal to script
latin = dutch.scripts[0]
latin.languoids          # All languages using Latin script

# Cross-domain queries
dutch.languoids_with_same_script   # other languages sharing any script
dutch.languoids_in_same_region     # other languages in the same regions
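Because family_tree exposes the full ancestor chain (nearest ancestor first, root last, as shown above), small graph utilities are easy to build on top of it. The helper below is a sketch, not part of the qq API; it assumes only the documented family_tree attribute:

```python
def nearest_shared_family(a, b):
    """Return the closest ancestor two languoids share, or None.

    Walks a's family_tree (ordered nearest ancestor -> root) and
    returns the first entry also found in b's family_tree. Assumes
    ancestors compare by equality; hypothetical helper, not qq API.
    """
    ancestors_of_b = set(b.family_tree)
    for ancestor in a.family_tree:  # nearest ancestor first
        if ancestor in ancestors_of_b:
            return ancestor
    return None
```

With real languoids, one would expect something like nearest_shared_family(db.get("nl"), db.get("de")) to land in the West Germanic branch, given the tree shown above.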

Identifiers and Conversion

from qq import IdType

# Automatic detection
lang = db.guess("nld")   # tries all identifier types

# Explicit conversion
db.convert("nl", IdType.BCP_47, IdType.ISO_639_3)    # "nld"
db.convert("nld", IdType.ISO_639_3, IdType.GLOTTOCODE) # "dutc1256"

# Convert when you don't know or care what the source standard is;
# just specify the target. Useful for normalizing multiple standards to one.
db.convert("nl", IdType.ISO_639_3)    # "nld"
db.convert("dutc1256", IdType.ISO_639_3) # "nld"

# NLLB-style codes
dutch.nllb_codes()              # ["nld_Latn"]
dutch.nllb_codes(use_bcp_47=True) # ["nl_Latn"]
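The target-only form of convert lends itself to batch normalization. The wrapper below is hypothetical (not part of qq) and assumes only the documented db.convert signature, where omitting the source type triggers auto-detection:

```python
def normalize_codes(db, codes, target):
    """Map identifiers from mixed standards onto one target standard.

    `db` is expected to behave like qq's Database: its convert method
    auto-detects the source standard when only a target IdType is
    given. This wrapper itself is illustrative, not part of the API.
    """
    return [db.convert(code, target) for code in codes]

# With a loaded database and target IdType.ISO_639_3, this would be
# expected to turn ["nl", "dutc1256", "nld"] into ["nld", "nld", "nld"].
```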

Multilingual Names

# Name of Dutch in French
dutch.name_in("fr")    # "néerlandais"
dutch.name_in(french)  # also accepts a Languoid object

# Native name
dutch.endonym  # "Nederlands"

Command Line Interface

# Look up a language
qq get nl
qq get nld --type ISO_639_3

# Search by name
qq search Dutch

# Database statistics and validation
qq validate

# Rebuild the database from sources
qq rebuild

# Check source status
qq status

# Update sources (only needed if you want to rebuild the database,
# not necessary in normal use)
qq update

Examples

See the examples/ directory for runnable scripts covering:

  • 01_basic_usage.py: Loading and accessing attributes
  • 02_identifiers.py: Working with identifier types and retired codes
  • 03_conversion.py: Converting between identifiers
  • 04_traversal.py: Language family navigation
  • 05_search.py: Searching and filtering
  • 06_names.py: Multilingual name data
  • 07_geographic.py: Geographic regions and countries
  • 08_relations.py: Relationship graph traversal
  • 09_advanced_queries.py: Complex queries and statistics
  • 10_linking_datasets.py: Joining datasets that use different identifier systems
  • 11_normalizing_datasets.py: Normalizing mixed identifier codes to a single standard

Case studies

The case-studies/ directory contains runnable analyses that use qq:

  • hugginface-audit/: Scans all multilingual datasets on the HuggingFace Hub and classifies every language tag as valid, deprecated, a misused country code, or unknown. qq resolves 99.2% of the 8,189 codes; the rest are deprecated, misused country codes, or HuggingFace-specific tags.
  • linking-datasets/: Links four lexical datasets (Concepticon, WordNet, Etymon, Phonotacticon) that each use a different identifier standard. qq resolves these four to a shared canonical ID: 102 languages are covered by all four.
  • latex-tables/: Generates a LaTeX table of language metadata (identifiers, scripts, speaker counts, families) for an imaginary 30-language NLP benchmark.
  • identifier-coverage/: Visualizes which combinations of identifier standards (Glottocode, ISO 639-3, ISO 639-1, Wikidata) cover which languoids as an UpSet plot.

Sources

This project builds on the work of many people. See docs/sources.md for the full list. All sources are available under Creative Commons BY or BY-SA licenses.

License

CC BY-SA 4.0

[^1]: NLLB-style codes combine an ISO 639-3 (or BCP-47) language tag with an ISO 15924 script tag (e.g., nld_Latn). The script part is derived from Glotscript, excluding Braille.
