MiniSBD

MiniSBD is a free and open source Python library for fast sentence boundary detection. It runs inference on 8-bit quantized ONNX models, which keeps it fast and lightweight.

The only dependency is onnxruntime (or onnxruntime-gpu for GPU inference).

Installation

pip install -U minisbd

Usage

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
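
For longer documents it can be convenient to segment paragraph by paragraph. The sketch below assumes only what the usage above shows, namely that a detector exposes sentences(text) and yields strings. The names iter_paragraph_sentences, SentenceDetector and FakeDetector are hypothetical and introduced here so the example runs without downloading a model.

```python
import re
from typing import Iterable, Iterator, Protocol, Tuple


class SentenceDetector(Protocol):
    """Anything with a sentences(text) method, e.g. minisbd.SBDetect."""

    def sentences(self, text: str) -> Iterable[str]: ...


def iter_paragraph_sentences(
    detector: SentenceDetector, document: str
) -> Iterator[Tuple[int, str]]:
    """Yield (paragraph_index, sentence) pairs for a multi-paragraph document."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        for sentence in detector.sentences(paragraph):
            yield index, sentence


class FakeDetector:
    """Hypothetical stand-in that splits after '.', '!' or '?'. Not MiniSBD's model."""

    def sentences(self, text: str) -> Iterable[str]:
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


for index, sentence in iter_paragraph_sentences(FakeDetector(), "One. Two.\n\nThree!"):
    print(f"[{index}] {sentence}")
```

With MiniSBD installed, passing SBDetect("fr") in place of FakeDetector() should work the same way, as long as the assumption about sentences(text) holds.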

By default, models are downloaded from GitHub and stored in the user's ~/.cache/minisbd folder. You can change this location at runtime:

from minisbd import models
models.cache_dir = '/path/to/cache'
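
In deployments the cache location often comes from the environment. The following is a sketch under assumptions: MINISBD_CACHE_DIR is a hypothetical environment variable name (nothing here says MiniSBD reads it itself), and resolve_cache_dir is an illustrative helper. Only the models.cache_dir assignment comes from the text above.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_var: str = "MINISBD_CACHE_DIR") -> Path:
    """Return the directory named by env_var, or ~/.cache/minisbd if unset."""
    # MINISBD_CACHE_DIR is a made-up name for this example.
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "minisbd"


cache_dir = resolve_cache_dir()

try:
    from minisbd import models
except ImportError:
    # MiniSBD is not installed; nothing to configure.
    models = None

if models is not None:
    # Assumption: this should run before a detector is created,
    # so that any download lands in the new folder.
    models.cache_dir = str(cache_dir)
```

Keeping the lookup in a small function makes the default easy to test without touching the real cache.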

You can optionally specify a path to an ONNX model instead of having MiniSBD download the model for you:

from minisbd import SBDetect
detector = SBDetect("/path/to/model.onnx")
# ...

Language Support

from minisbd.models import list_models
print(list_models())

Language Code
Afrikaans af
Ancient Greek grc
Ancient Hebrew hbo
Arabic ar
Armenian hy
Basque eu
Belarusian be
Bulgarian bg
Buryat bxr
Catalan ca
Chinese (Simplified) zh-hans
Chinese (Traditional) zh-hant
Classical Chinese lzh
Coptic cop
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Erzya myv
Estonian et
Faroese fo
Finnish fi
French fr
Galician gl
German de
Gothic got
Greek el
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Italian it
Japanese ja
Kazakh kk
Korean ko
Kurmanji kmr
Kyrgyz ky
Latin la
Latvian lv
Ligurian lij
Lithuanian lt
Maghrebi Arabic French qaf
Maltese mt
Manx gv
Marathi mr
Naija pcm
North Sami sme
Norwegian nb
Norwegian Nynorsk nn
Old Church Slavonic cu
Old East Slavic orv
Old French fro
Persian fa
Polish pl
Pomak qpm
Portuguese pt
Romanian ro
Russian ru
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Tamil ta
Telugu te
Thai th
Turkish tr
Turkish German qtd
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Welsh cy
Western Armenian hyw
Wolof wo
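
When the language comes from user input or a language detector, it helps to check it against the supported codes first. This sketch assumes that list_models() returns an iterable of the language codes in the table (its exact return type is not stated here). choose_language is a hypothetical helper, and the short SUPPORTED set is a hand-copied sample of the table so the example runs offline.

```python
from typing import Iterable, Optional

# A few codes copied from the table above, used as an offline sample.
SUPPORTED = {"en", "fr", "de", "nb", "nn", "zh-hans", "zh-hant"}


def choose_language(
    requested: str, available: Iterable[str], fallback: Optional[str] = None
) -> str:
    """Return a supported code for `requested`, else `fallback`, else raise."""
    codes = {code.lower() for code in available}
    candidate = requested.strip().lower().replace("_", "-")
    if candidate in codes:
        return candidate
    # Try the bare language part of a regional tag such as "fr-CA".
    base = candidate.split("-", 1)[0]
    if base in codes:
        return base
    if fallback is not None:
        return fallback
    raise ValueError(f"No sentence model for language {requested!r}")


print(choose_language("fr-CA", SUPPORTED))
```

With MiniSBD installed, choose_language(user_lang, list_models()) could replace the sample set, provided the assumption about the return value holds.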

Converting Stanza Models

The extract.py script extracts existing Stanza models and converts them to ONNX. See the script's source code for details.

Credits

MiniSBD is a port of Stanza's tokenizer models to ONNX. The models are the same as those from Stanza, but have been converted to ONNX and quantized for faster inference and smaller size.
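
To make the term "quantized" concrete, here is a toy illustration of 8-bit affine quantization in plain Python. It is not the code used to produce MiniSBD's models, and the scheme applied by the actual conversion tooling may differ; the function names are invented for this example. It shows the trade-off: each value is stored in 8 bits instead of 32, at the cost of a small rounding error.

```python
from typing import List, Tuple


def quantize_uint8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_uint8(
    quantized: List[int], scale: float, zero_point: int
) -> List[float]:
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize_uint8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The reconstruction error stays within about one quantization step (scale), which is typically small relative to the weights themselves.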

License

AGPLv3

Some code was adapted from Stanza.
