Detect language support for font binaries
Project description
Hyperglot – a database and tools for detecting language support in fonts
Hyperglot is an open research project dedicated to documenting how the world’s languages are written. By mapping orthographies and their requirements, it supports inclusive, multilingual type design and equitable access to high-quality typography for underserved communities. Hyperglot currently covers 783 languages, representing approximately 7.3 billion speakers, and is developed as open source by Rosetta Type/Research in collaboration with a global community of contributors and licensed under the Apache 2.0 license.
Hyperglot is available as:
- the Hyperglot web apps,
- the command-line tool: hyperglot,
- the Python package: import hyperglot (see examples for basic usage).
📖 Learn more about Hyperglot
🙋 Read the FAQ
💰 Sponsor via GitHub or directly via Hyperglot sponsorship. Any and all contributions are much appreciated! 🙏
Data validity & contributing
Hyperglot is a work in progress and provided AS IS. The validity of language data varies and continues to improve. Each language includes a validity label (todo, draft, preliminary, verified) to help you assess the data.
Mapping all the world’s languages is a huge task—we need help from native speakers and language users! If you notice an error or see that a language is missing, please get in touch (via email or Issues). We welcome contributions and will credit your input.
The data structure is documented in a separate README file along with guidelines for contributing.
Core concepts
The following concepts are essential to understanding how Hyperglot works.
A language can be written in one or more scripts. Each such writing system is represented in Hyperglot as an orthography. Most languages have a single primary orthography; however, some use multiple orthographies either independently (for example, in different regions) or concurrently (such as Serbian or Japanese).
In the database, an orthography contains the following character sets:
- base – the required, essential characters,
- aux – non-essential, recommended characters,
- marks – combining marks,
- punctuation,
- numerals, and
- currency.
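To make the shape of an orthography record concrete, here is a minimal sketch in Python. The `Orthography` class, its `required` helper, and the sample characters are hypothetical illustrations, not Hyperglot's own classes or data; only the six character-set names come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Orthography:
    """Hypothetical stand-in for one orthography record (not Hyperglot's class)."""

    script: str
    base: set = field(default_factory=set)
    aux: set = field(default_factory=set)
    marks: set = field(default_factory=set)
    punctuation: set = field(default_factory=set)
    numerals: set = field(default_factory=set)
    currency: set = field(default_factory=set)

    def required(self, check=("base",)):
        """Return the union of the character sets selected for checking."""
        chars = set()
        for name in check:
            chars |= getattr(self, name)
        return chars


# Illustrative sample data, not taken from the Hyperglot database.
# "\u010d" is c with caron, "\u010f" is d with caron, "\u030c" is the combining caron.
czech_like = Orthography(
    script="Latin",
    base=set("abcd") | {"\u010d", "\u010f"},
    aux={"q"},
    marks={"\u030c"},
    numerals=set("0123456789"),
)

print(sorted(czech_like.required()))
print(sorted(czech_like.required(("base", "aux"))))
```

Selecting which sets to include mirrors the idea that a font can be checked against the base set alone or against a broader combination.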
A script, however, is more than a collection of characters. It also defines how characters interact when combined. This behavior is known as shaping and, in digital fonts, is implemented using OpenType features.
Read the detailed description of the database structure
Language support detection process
To detect language support in a font, Hyperglot performs the following checks:
- Required characters are present. Which characters count as required depends on the filters for language and orthography status and for data validity, and on the character sets selected for checking.
- Precomposed character combinations are handled by the font. For character combinations that have a unique code point in Unicode, one of the following must hold (depending on the setting):
  - The encoded, precomposed character combinations are present.
  - Base characters and mark characters from these combinations are present independently.
  - Both of the above.
- Shaping behavior is correctly handled by the font, where applicable:
  - Required mark-positioning instructions are present.
  - Required alternates for joining behavior (for example, in Arabic) are present.
  - Conjunct syllable construction in Brahmi-derived scripts is supported. (Currently supported only for Hindi/Devanagari.)
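The character-level part of these checks can be sketched with the standard library alone. This is not Hyperglot's implementation: the function name `missing_characters`, the `mode` parameter, and its three values are assumptions made for illustration, and the shaping checks are left out entirely. The sketch uses Unicode canonical decomposition (NFD) to split a precomposed character into its base and mark components.

```python
import unicodedata


def missing_characters(required, font_chars, mode="precomposed"):
    """Return the characters a font lacks for a set of required characters.

    required   -- iterable of single characters an orthography needs
    font_chars -- set of characters the font encodes
    mode       -- "precomposed": the encoded combination itself must be present
                  "decomposed":  only its base and mark components must be present
                  "both":        the combination and its components must be present
    """
    missing = set()
    for ch in required:
        parts = unicodedata.normalize("NFD", ch)
        is_combination = len(parts) > 1
        # Plain characters are always checked directly.
        if not is_combination or mode in ("precomposed", "both"):
            if ch not in font_chars:
                missing.add(ch)
        if is_combination and mode in ("decomposed", "both"):
            missing.update(p for p in parts if p not in font_chars)
    return missing


# Sample font that has "c" and the combining caron, but no precomposed c with caron.
font_chars = {"a", "c", "\u030c"}
required = {"a", "\u010d"}

print(missing_characters(required, font_chars, "precomposed"))
print(missing_characters(required, font_chars, "decomposed"))
```

In this sample the font fails the precomposed check but passes the decomposed one, which is the distinction the setting controls.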
Additional design-related notes are provided for the user’s discretion when assessing design quality. Hyperglot does not assess the font design in any way.
Command-line tools
Installation
You will need to have Python 3 installed. Install via pip:
pip install hyperglot
Besides the main hyperglot command used for font inspection, the package also includes:
- hyperglot-report – explore missing language support (see below).
- hyperglot-data – review language data stored in the database.
- hyperglot-validate, hyperglot-save, and hyperglot-export – manage and process data when contributing.
Basic usage
Use:
hyperglot path/to/font.otf
to output a list of supported languages (and other data) for a font. Use:
hyperglot path/to/font.otf path/to/anotherfont.otf …
to check several fonts at once, or their combined coverage (with -m union).
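One plausible reading of combined coverage is that the fonts' character sets are merged before languages are checked, so a language can be covered by several fonts together even when no single font covers it. The sketch below illustrates that reading only; it is an assumption about the behavior, and the function names and sample data are hypothetical.

```python
def supported_languages(font_chars, languages):
    """Return the codes of all languages whose required characters are covered."""
    return {code for code, required in languages.items() if required <= font_chars}


def combined_coverage(fonts, languages):
    """Merge the character sets of several fonts, then check languages once."""
    merged = set().union(*fonts.values())
    return supported_languages(merged, languages)


# Illustrative sample data, not taken from the Hyperglot database.
languages = {"eng": set("abc"), "ces": set("abc") | {"\u010d"}}
fonts = {"a.otf": set("abc"), "b.otf": {"\u010d"}}

per_font = {path: supported_languages(chars, languages) for path, chars in fonts.items()}
together = combined_coverage(fonts, languages)

print(per_font)
print(sorted(together))
```

Here neither sample font covers "ces" alone, but the merged character set does.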
Advanced options
- -c, --check: Specify which character sets to check against. Options are 'base, auxiliary, punctuation, numerals, currency, all', or a comma-separated combination of these. (Default: 'base')
- --validity: Filter languages by data validity level. Options are 'todo, draft, preliminary, verified'. (Default: 'preliminary')
- -s, --status: Specify which languages to consider when checking support. Options are 'living, historical, constructed, all', or a comma-separated combination of these. (Default: 'living,constructed')
- -o, --orthography: Which orthographies to consider when checking support for a language. Options are 'primary, secondary, historical, transliteration, all', or a comma-separated combination of these. (Default: 'primary')
- -d, --decomposed: For precomposed character combinations, require only the individual component characters. By default, precomposed character combinations are also required when they have a unique code point in Unicode. (Default: False)
- -m, --marks: Require that a font include all combining marks used by a language’s orthography. By default, only marks that are not part of precomposed character combinations are required. (Default: False)
- --sort: Specify the sort order. Use "speakers" to sort by number of speakers. (Default: "alphabetic")
- --sort-dir: Specify the sort direction. Use "desc" for descending order. (Default: "asc" for ascending order)
- -y, --output: Specify a file path to write the output to, in YAML format. For a single input font, the output is a subset of the Hyperglot database containing the languages and orthographies supported by the font. When multiple fonts are provided, the YAML file contains a top-level key for each font. If the -m option is provided, the output includes the specific intersection or union result.
- -t, --shaping-threshold: Set the frequency threshold for complex-script shaping checks. A font passes when it renders correctly for combinations at or above this threshold. Frequencies range from 1.0 (most frequent combinations) to 0.0 (rarest combinations). (Default: 0.01)
- --no-shaping: Disable shaping checks (mark attachment, joining behavior, and conjunct shaping). (Default: shaping checks enabled)
- -v, --verbose: Enable verbose logging.
- -V, --version: Print the Hyperglot version number.
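The effect of the shaping threshold can be illustrated with a small sketch. It assumes each combination carries a relative frequency between 0.0 and 1.0, as described above; the function `passes_shaping`, its arguments, and the sample combinations are hypothetical and do not reflect how Hyperglot stores or evaluates shaping data.

```python
def passes_shaping(combinations, renders_ok, threshold=0.01):
    """Return True if every combination at or above the threshold renders correctly.

    combinations -- dict mapping a combination to its relative frequency (0.0 to 1.0)
    renders_ok   -- set of combinations the font is known to shape correctly
    threshold    -- combinations rarer than this are ignored
    """
    return all(
        combo in renders_ok
        for combo, freq in combinations.items()
        if freq >= threshold
    )


# Illustrative sample data: the font fails only on the rarest combination.
combinations = {"common": 0.4, "occasional": 0.05, "rare": 0.001}
renders_ok = {"common", "occasional"}

print(passes_shaping(combinations, renders_ok))                 # default threshold 0.01
print(passes_shaping(combinations, renders_ok, threshold=0.0))  # every combination counts
```

Lowering the threshold makes the check stricter, because rarer combinations then have to render correctly as well.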
Explore missing language support
The hyperglot-report command reports missing characters and shaping support. A common use case is identifying languages that could be supported with minimal additional work in a given font. The command accepts the same options as hyperglot, plus the following:
- --report-missing: Report languages missing n or fewer characters. If n is 0, all languages with any number of missing characters are reported. (Default: 0)
- --report-marks: Report languages missing n or fewer mark-attachment sequences. If n is 0, all languages with any number of missing mark-attachment sequences are reported. (Default: 0)
- --report-joining: Report languages missing n or fewer joining sequences. If n is 0, all languages with any number of missing joining sequences are reported. (Default: 0)
- --report-all: Set or override all other --report-* options.
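The "n or fewer" rule, with 0 meaning no limit, can be sketched as a simple filter. The function `report_missing` and the sample data are hypothetical, and the sketch assumes that fully supported languages are left out of the report; it is not the command's actual implementation.

```python
def report_missing(missing_by_language, n=0):
    """Select languages that are missing n or fewer characters.

    missing_by_language -- dict mapping a language code to its set of missing characters
    n                   -- upper limit on missing characters; 0 means no limit
    """
    report = {}
    for code, missing in missing_by_language.items():
        if not missing:
            continue  # assumption: fully supported languages are not reported
        if n == 0 or len(missing) <= n:
            report[code] = missing
    return report


# Illustrative sample data, not real font results.
missing_by_language = {
    "eng": set(),
    "ces": {"\u010d"},
    "vie": {"\u1ea1", "\u1ea3", "\u1ea5", "\u1ea7", "\u1ea9"},
}

print(sorted(report_missing(missing_by_language, n=2)))
print(sorted(report_missing(missing_by_language)))
```

With a small n, the report narrows to languages that are only a few characters away from being supported.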
Roadmap
- 🪶 Change licence to Apache 2
- 💰 Invite sponsorship and funding #174
- 🤖 Basic analysis of shaping support provided by the font (GPOS and GSUB): check whether character combinations are affected by font OpenType features, enabling scalable support for complex combinations (e.g., Arabic, Hindi/Devanagari). #176
- ➡️ Export in a format suitable for submission to Unicode CLDR
- 🌍 Database web app: add links to other resources per language
- 📚 Improve language data, sources, and validity for languages with fewer authoritative references #157
- 🌍 Add data for more African languages and scripts, e.g., N'Ko #195
- 🇮🇳 Add more shaping checks for Brahmi-derived scripts #176
- 🇧🇷 Add data for indigenous Brazilian languages (Rafael Dietzch and students)
- 🇺🇳 Secure funding to expand language coverage
Other
A comparison of Hyperglot and the Unicode CLDR (this may be outdated).
Notes
- Fonts included in the repository for testing purposes are licensed under their respective licenses.
- Data included in the other directory is replicated from various public domain and open source origins for comparison and aggregation (mostly present in historic commits of this repository).