Multilingual and diachronic function-word datasets, with modular composition
functionwordsets
Comprehensive multilingual function-word datasets with a simple Python API
Overview
functionwordsets ships ready-to-use function-word lists for several languages and time periods.
Each dataset is a tiny Python module located in functionwordsets/datasets/ and is loaded on demand through a minimal API.
Supported out of the box:
| ID | Language / period | Entries* |
|---|---|---|
| fr_21c | French – 21st century | 688 |
| en_21c | English – 21st century | 390 |
| sp_21c | Spanish – 21st century | 481 |
| it_21c | Italian – 21st century | 495 |
| nl_21c | Dutch – 21st century | 287 |
| gr_5cbc | Ancient Greek – 5th-4th c. BCE | 264 |
| oc_13c | Old Occitan – 12th-13th c. | 360 |
| la_1cbc | Classical Latin – 1st c. BCE | 353 |
*Number of distinct word-forms in the union of all categories.
You can also add or fork your own datasets: just drop an <id>.py file into functionwordsets/datasets/, following the layout described under Dataset layout.
💡 Supported grammatical categories
(summary unchanged – see below for details)
Installation
pip install functionwordsets # from PyPI
# or, from a cloned repo
pip install -e .
Python ≥ 3.8, zero runtime dependencies, small pure-Python wheel (35.8 kB for release 1.2.1).
Quick start
import functionwordsets as fw
# List available datasets
print(fw.available_ids()) # ['fr_21c', 'en_21c', …]
# Load one set (defaults to fr_21c)
fr = fw.load() # same as fw.load('fr_21c')
print(fr.name, len(fr.all)) # French – 21st century 688
# Membership test
if 'ne' in fr.all:
    ...
# Build a custom stop-set: only articles + prepositions
stops = fr.subset(['articles', 'prepositions'])
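As one possible use of such a set, the sketch below computes relative function-word frequencies for a short text, a common stylometric feature. It deliberately does not import functionwordsets so that it runs anywhere: `STAND_IN_WORDS` is a tiny hand-picked stand-in for a container such as `fr.all` (assumed here to behave like a set of lowercase word-forms), and `tokenize` and `function_word_profile` are hypothetical helper names, not part of the package.

```python
import re
from collections import Counter

# Stand-in for a loaded set such as fr.all; hand-picked for illustration,
# NOT the real dataset shipped with the package.
STAND_IN_WORDS = {"le", "la", "de", "ne", "pas", "et"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (crude tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def function_word_profile(text, function_words):
    """Return {word: relative frequency} for the function words found in text."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(t for t in tokens if t in function_words)
    return {word: n / len(tokens) for word, n in counts.items()}


profile = function_word_profile(
    "Le chat ne dort pas et le chien ne mange pas", STAND_IN_WORDS
)
for word, freq in sorted(profile.items()):
    print(f"{word}\t{freq:.3f}")
```

With the package installed, passing `fr.all` or the result of `fr.subset([...])` in place of `STAND_IN_WORDS` should work the same way, provided those objects support the `in` operator as the membership test above suggests.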
Command-line helpers
# List dataset IDs
fw-list
# Export every French function word to a text file
fw-export fr_21c -o fr.txt
# Export only conjunctions & negations from Spanish as JSON
fw-export sp_21c --include coord_conj subord_conj negations -o sp_stop.json
Dataset layout
Internally each dataset is defined as a small Python dictionary:
data = {
    "name": "English – 21st century",
    "language": "en",
    "period": "21c",
    "categories": {
        "articles": [...],
        "prepositions": [...],
        # …
    }
}
functionwordsets treats the object as read-only, so feel free to edit or extend it in your fork.
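To make the layout concrete, here is a minimal sketch of what a new dataset module could contain, and of how the union of its categories yields a count of distinct word-forms like the "Entries" column in the table above. The ID `xx_21c`, the word lists, and the helpers `all_forms` and `subset` are invented for illustration; the package's own loader may compute these differently.

```python
# Hypothetical contents of functionwordsets/datasets/xx_21c.py
data = {
    "name": "Example language – 21st century",
    "language": "xx",
    "period": "21c",
    "categories": {
        "articles": ["a", "an", "the"],
        "prepositions": ["in", "on", "to"],
        "particles": ["to", "up"],  # "to" also appears under prepositions
    },
}


def all_forms(dataset):
    """Union of every category: each distinct word-form is counted once."""
    return frozenset().union(*dataset["categories"].values())


def subset(dataset, keys):
    """Union of the selected categories only; an unknown key raises KeyError."""
    return frozenset().union(*(dataset["categories"][k] for k in keys))


print(len(all_forms(data)))
print(sorted(subset(data, ["articles", "prepositions"])))
```

Because a form may belong to several categories, the size of the union can be smaller than the sum of the category sizes, which is why the table counts distinct word-forms.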
Notes on auxiliary categories
Keys for auxiliary verbs follow the pattern aux_<lemma> (e.g. aux_être, aux_be, aux_ser). They vary by language; see each dataset file for the exact key.
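Since the auxiliary keys differ between datasets, code meant to work across languages may prefer to discover them by prefix rather than hard-code them. A small sketch, using a hand-written stand-in for a dataset's `categories` mapping: the word lists, the key `aux_avoir`, and the helper name `auxiliary_keys` are assumptions, not confirmed package contents.

```python
# Stand-in for the "categories" mapping of a French dataset (illustrative only).
categories = {
    "articles": ["le", "la", "les"],
    "aux_être": ["suis", "es", "est"],
    "aux_avoir": ["ai", "as", "a"],  # hypothetical key, check the dataset file
}


def auxiliary_keys(cats):
    """Return the category keys that follow the aux_<lemma> pattern, sorted."""
    return sorted(key for key in cats if key.startswith("aux_"))


# Collect every auxiliary form, whatever lemmas the dataset defines.
aux_forms = {form for key in auxiliary_keys(categories) for form in categories[key]}
print(auxiliary_keys(categories))
```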
Enjoy!