Free and open source library for fast sentence boundary detection

MiniSBD

Free and open source Python library for fast sentence boundary detection. It uses 8-bit quantized ONNX models for inference, which keeps it fast and lightweight.

The only dependency is onnxruntime (or onnxruntime-gpu for GPU inference).

Installation

pip install -U minisbd

Usage

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
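
The example above hard-codes use_gpu=True. A minimal sketch of choosing that flag automatically is shown here, assuming that the presence of ONNX Runtime's CUDAExecutionProvider is a reasonable signal for GPU support. The helpers gpu_available and make_detector are hypothetical names for this illustration and are not part of MiniSBD; the library may select execution providers differently.

```python
def gpu_available(providers):
    """Return True if a CUDA execution provider is among `providers`."""
    return "CUDAExecutionProvider" in providers


def make_detector(lang):
    """Build a detector, enabling the GPU only when ONNX Runtime reports one."""
    # Imports are deferred so gpu_available() can be used on its own.
    import onnxruntime as ort
    from minisbd import SBDetect

    # get_available_providers() lists the execution providers in this build.
    use_gpu = gpu_available(ort.get_available_providers())
    return SBDetect(lang, use_gpu=use_gpu)
```

With this in place, make_detector("fr") could replace the direct SBDetect call on machines where GPU availability is not known in advance.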

By default, models are downloaded from GitHub and stored in the ~/.cache/minisbd folder in the user's home directory. You can change this location at runtime:

from minisbd import models
models.cache_dir = '/path/to/cache'
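
If the cache location should depend on the deployment environment, something like the following sketch might help. It assumes a hypothetical environment variable, MINISBD_CACHE_DIR, which MiniSBD itself does not read; the helpers resolve_cache_dir and configure_cache are likewise illustrative names. It also assumes the directory should be set before any detector is created.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the env override if set, else the default ~/.cache/minisbd."""
    env = os.environ if env is None else env
    # MINISBD_CACHE_DIR is a hypothetical variable name used only in this sketch.
    override = env.get("MINISBD_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "minisbd"


def configure_cache():
    """Create the chosen directory and point MiniSBD at it."""
    # Deferred import so resolve_cache_dir() works without MiniSBD installed.
    from minisbd import models

    cache_dir = resolve_cache_dir()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assigning a string mirrors the example above; other types are untested here.
    models.cache_dir = str(cache_dir)
    return cache_dir
```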

You can optionally specify a path to an ONNX model instead of having MiniSBD download the model for you:

from minisbd import SBDetect
detector = SBDetect("/path/to/model.onnx")
# ...
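
Since SBDetect accepts either a language code or a model path, the two can be combined, for example to prefer a bundled model and fall back to the download. The sketch below assumes a hypothetical local layout in which models are stored as <lang>.onnx inside one directory; model_source and load_detector are illustrative helpers, not MiniSBD functions.

```python
from pathlib import Path


def model_source(lang, model_dir):
    """Return a local .onnx path for `lang` if it exists, else the language code."""
    # The <lang>.onnx naming is a hypothetical convention for this sketch.
    candidate = Path(model_dir) / f"{lang}.onnx"
    return str(candidate) if candidate.is_file() else lang


def load_detector(lang, model_dir):
    """Create a detector from a local model when available."""
    from minisbd import SBDetect  # deferred import

    return SBDetect(model_source(lang, model_dir))
```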

Language Support

from minisbd.models import list_models
print(list_models())

Language Code
Afrikaans af
Ancient Greek grc
Ancient Hebrew hbo
Arabic ar
Armenian hy
Basque eu
Belarusian be
Bulgarian bg
Buryat bxr
Catalan ca
Chinese (Simplified) zh-hans
Chinese (Traditional) zh-hant
Classical Chinese lzh
Coptic cop
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Erzya myv
Estonian et
Faroese fo
Finnish fi
French fr
Galician gl
German de
Gothic got
Greek el
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Italian it
Japanese ja
Kazakh kk
Korean ko
Kurmanji kmr
Kyrgyz ky
Latin la
Latvian lv
Ligurian lij
Lithuanian lt
Maghrebi Arabic French qaf
Maltese mt
Manx gv
Marathi mr
Naija pcm
North Sami sme
Norwegian nb
Norwegian Nynorsk nn
Old Church Slavonic cu
Old East Slavic orv
Old French fro
Persian fa
Polish pl
Pomak qpm
Portuguese pt
Romanian ro
Russian ru
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Tamil ta
Telugu te
Thai th
Turkish tr
Turkish German qtd
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Welsh cy
Western Armenian hyw
Wolof wo
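
Before constructing a detector for a user-supplied code, it may be worth checking the code against the available models. The sketch below assumes that iterating over the result of list_models() yields language codes (for instance a list of codes, or a dict keyed by code); the return type is not documented here, so treat that as an assumption. is_supported and check_language are hypothetical helper names.

```python
def is_supported(code, available):
    """Case-insensitive check that `code` appears in `available`."""
    wanted = code.strip().lower()
    return any(str(item).lower() == wanted for item in available)


def check_language(code):
    """Check `code` against the models MiniSBD reports."""
    from minisbd.models import list_models  # deferred import

    # Assumption: iterating list_models() yields language codes.
    return is_supported(code, list_models())
```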

Converting Stanza Models

The extract.py script can be used to extract existing Stanza models and convert them to ONNX. See the script's source code for details.

Credits

MiniSBD is a port of Stanza's tokenizer models to ONNX. The models are the same as those from Stanza, but have been converted to ONNX and quantized for faster inference and smaller size.
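
To give a rough intuition for why 8-bit quantization shrinks a model, here is a generic sketch of symmetric linear quantization in plain Python. It is not the procedure used to produce the MiniSBD models, and the actual ONNX quantization scheme may differ; the function names and sample weights are made up for illustration. Each value is stored as an integer in [-127, 127] plus one shared scale, so a weight takes 1 byte instead of the 4 bytes of a float32.

```python
def quantize(values):
    """Map floats to signed 8-bit integers using one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # illustrative values only
quantized, scale = quantize(weights)
restored = dequantize(quantized, scale)
# Each restored value differs from the original by at most half a scale step.
```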

License

AGPLv3

Some code was adapted from Stanza.
