Skip to main content

pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box across many languages.

Project description

PySBD logo

pySBD: Python Sentence Boundary Disambiguation (SBD)

Python package codecov License PyPi GitHub

pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.

This project is a direct port of ruby gem - Pragmatic Segmenter which provides rule-based sentence boundary detection.

pysbd_code

Highlights

'PySBD: Pragmatic Sentence Boundary Disambiguation' a short research paper got accepted into 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.

Research Paper:

https://arxiv.org/abs/2010.09657

Recorded Talk:

pysbd_talk

Poster:

name

Install

Python

pip install pysbd

Usage

  • Currently pySBD supports 22 languages.
import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']
import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')

# explicitly adding component to pipeline
# (recommended - makes it more readable to tell what's going on)
nlp.add_pipe(PySBDFactory(nlp))

# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)

doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]

Contributing

If you want to contribute new feature/language support or found a text that is incorrectly segmented using pySBD, then please head to CONTRIBUTING.md to know more and follow these steps.

  1. Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Citation

If you use pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.

@inproceedings{sadvilkar-neumann-2020-pysbd,
    title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
    author = "Sadvilkar, Nipun  and
      Neumann, Mark",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
    pages = "110--114",
    abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}

Credit

This project wouldn't be possible without the great work done by Pragmatic Segmenter team.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pysbd_yurts-0.3.4.tar.gz (53.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pysbd_yurts-0.3.4-py3-none-any.whl (61.4 kB view details)

Uploaded Python 3

File details

Details for the file pysbd_yurts-0.3.4.tar.gz.

File metadata

  • Download URL: pysbd_yurts-0.3.4.tar.gz
  • Upload date:
  • Size: 53.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pysbd_yurts-0.3.4.tar.gz
Algorithm Hash digest
SHA256 acd458a5243a5a03c72c9c9ccd04d8a48510cd7e4c47b7ee8d97c193f3a2173f
MD5 ee1a0373480c3757c4e768630d53d318
BLAKE2b-256 bf94f2b888c411011a2e7142195896c05cba50f4436a35052ba9582470252177

See more details on using hashes here.

File details

Details for the file pysbd_yurts-0.3.4-py3-none-any.whl.

File metadata

  • Download URL: pysbd_yurts-0.3.4-py3-none-any.whl
  • Upload date:
  • Size: 61.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.3

File hashes

Hashes for pysbd_yurts-0.3.4-py3-none-any.whl
Algorithm Hash digest
SHA256 851ac7de4c47e8361da2dce7dbb133490fb1fa2a4a9e156de3cb4c325454cea1
MD5 f938d18fcb100d39a557d38f9b03adcd
BLAKE2b-256 003e05c558cdb4a78a1a2ea9c6f54ee4efca512a4251dbaa4ec43432c4216bea

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page