cypunct

Cypunct is a Cython package to split Unicode strings based on a given frozenset of Unicode code points.

Project description

Cypunct is designed to solve the problem of quickly splitting a Unicode string based on a set of characters.

Cypunct is designed to work on Python 2.6, 2.7, and 3.3+. Because Cypunct is a Cython extension, it will (probably) only work in the CPython runtime.

For Python 2.6 and 2.7, Cypunct only runs if the CPython runtime was compiled with the flag --enable-unicode=ucs4. Cypunct raises an exception if your Python 2 runtime was not compiled with UCS-4.
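If you are unsure how your interpreter was built, the standard library can tell you. The snippet below is a minimal sketch and is not part of Cypunct; the helper name is_wide_unicode_build is made up for illustration. It relies only on sys.maxunicode, which is 0xFFFF on narrow (UCS-2) Python 2 builds and 0x10FFFF on UCS-4 builds and on Python 3.3+.

```python
import sys


def is_wide_unicode_build():
    """Return True when the interpreter stores full code points (UCS-4)."""
    # Narrow (UCS-2) Python 2 builds report 0xFFFF here.
    # UCS-4 builds, and every Python 3.3+ interpreter, report 0x10FFFF.
    return sys.maxunicode == 0x10FFFF


if __name__ == "__main__":
    if is_wide_unicode_build():
        print("Wide Unicode build detected.")
    else:
        print("Narrow Unicode build: rebuild Python 2 with --enable-unicode=ucs4.")
```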

Installation

Installation is easiest with pip. Just run:

pip install cypunct

Usage

Cypunct takes a Unicode string and a frozenset of delimiter characters, and splits the string based on that set. Every delimiter character should be a single Unicode code point – len(char) should be 1.

A simple example with a small frozenset is below.

>>> from cypunct import split
>>> split("James Mishra is the... best human ever, or so I think.", frozenset({' ', '.', ','}))
['James', 'Mishra', 'is', 'the', 'best', 'human', 'ever', 'or', 'so', 'I', 'think', '']
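To make the behavior in that output concrete (runs of delimiters collapse, and a trailing delimiter leaves an empty string at the end), here is a rough pure-Python stand-in built on re.split. It is only a sketch that reproduces the documented result for this input. It is not how Cypunct is implemented, the name split_on_any is hypothetical, and it may differ from Cypunct on inputs not shown here.

```python
import re


def split_on_any(text, delimiters):
    """Split text on runs of any character found in delimiters."""
    if not delimiters:
        return [text]
    # Escape each delimiter so characters such as '.' or ']' are literal
    # inside the character class; the trailing '+' collapses runs.
    char_class = "[" + "".join(re.escape(ch) for ch in sorted(delimiters)) + "]+"
    return re.split(char_class, text)


if __name__ == "__main__":
    sentence = "James Mishra is the... best human ever, or so I think."
    print(split_on_any(sentence, frozenset({" ", ".", ","})))
```

Building and running a regular expression like this becomes unwieldy once the delimiter set holds thousands of code points, which is the case Cypunct is aimed at.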

However, if you only need to split on whitespace characters, str.split() has much better performance. If you only need to split on one character, str.split(char) will also be much faster.

Cypunct really shines when you need to split on many possible characters, such as an entire Unicode character category.

The example below splits on all Unicode punctuation, and nothing else.

>>> from cypunct.unicode_classes import P
>>> split("James Mishra is the... best human ever, or so I think.", P)
['James Mishra is the', ' best human ever', ' or so I think', '']

The following Unicode classes are available as sets:

Category	Description
C	Other
Cc	Other, Control
Cf	Other, Format
Co	Other, Private Use
Cs	Other, Surrogate
L	Letter
Ll	Letter, Lowercase
Lm	Letter, Modifier
Lo	Letter, Other
Lt	Letter, Titlecase
Lu	Letter, Uppercase
M	Mark
Mc	Mark, Spacing Combining
Me	Mark, Enclosing
Mn	Mark, Nonspacing
N	Number
Nd	Number, Decimal Digit
Nl	Number, Letter
No	Number, Other
P	Punctuation
Pc	Punctuation, Connector
Pd	Punctuation, Dash
Pe	Punctuation, Close
Pf	Punctuation, Final Quote
Pi	Punctuation, Initial Quote
Po	Punctuation, Other
Ps	Punctuation, Open
S	Symbol
Sc	Symbol, Currency
Sk	Symbol, Modifier
Sm	Symbol, Math
So	Symbol, Other
Z	Separator
Zl	Separator, Line
Zp	Separator, Paragraph
Zs	Separator, Space
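For readers who want to see what such category sets contain, the sketch below builds comparable frozensets from the standard library's unicodedata module. It is an approximation and not Cypunct's own code: build_category_sets is a hypothetical helper, unicodedata follows whichever Unicode version your Python ships with, and skipping unassigned (Cn) code points is an assumption made here to keep the sets small.

```python
import sys
import unicodedata
from collections import defaultdict


def build_category_sets():
    """Map category codes such as 'P' or 'Po' to frozensets of characters."""
    buckets = defaultdict(set)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        category = unicodedata.category(ch)  # two-letter code, e.g. 'Po'
        if category == "Cn":
            continue  # assumption: leave unassigned code points out
        buckets[category].add(ch)      # minor class, e.g. 'Po'
        buckets[category[0]].add(ch)   # major class, e.g. 'P'
    return {name: frozenset(chars) for name, chars in buckets.items()}


if __name__ == "__main__":
    sets = build_category_sets()
    print(len(sets["P"]), "punctuation characters")
```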

cypunct.unicode_classes.COMMON_SEPARATORS is the union of the C, P, S, and Z frozensets. I have found it personally useful when splitting text for natural language processing applications.

If you don’t specify a frozenset for Cypunct to use, then Cypunct will default to COMMON_SEPARATORS.
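As a rough illustration of what splitting on the C, P, S, and Z classes means, the sketch below tests each character's major category with unicodedata instead of using a precomputed set. It is a stand-in under stated assumptions, not Cypunct's implementation: the names is_common_separator and split_common are hypothetical, and unicodedata also reports unassigned code points as class C, which may not match Cypunct's data.

```python
import unicodedata

# Major classes assumed to act as separators: Other, Punctuation, Symbol, Separator.
SEPARATOR_MAJOR_CLASSES = frozenset("CPSZ")


def is_common_separator(ch):
    """Return True if ch falls in one of the C, P, S, or Z major classes."""
    return unicodedata.category(ch)[0] in SEPARATOR_MAJOR_CLASSES


def split_common(text):
    """Split text on runs of separator characters."""
    tokens, current, in_run = [], [], False
    for ch in text:
        if is_common_separator(ch):
            if not in_run:
                # First separator of a run closes the current token.
                tokens.append("".join(current))
                current = []
                in_run = True
        else:
            current.append(ch)
            in_run = False
    tokens.append("".join(current))  # final token, empty after a trailing separator
    return tokens


if __name__ == "__main__":
    print(split_common("James Mishra is the... best human ever, or so I think."))
```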

Updating Unicode data

Currently, cypunct.unicode_classes is a Python module autogenerated from a UnicodeData.txt file. The autogeneration script exists in make_punctuation_file.py.

Most Cypunct users will not need to concern themselves with this, but this is important to know if you are experiencing Unicode bugs or want to contribute to Cypunct.

The current UnicodeData.txt is from ftp://ftp.unicode.org/Public/10.0.0/ucd/UnicodeData.txt.
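For contributors who have not worked with UnicodeData.txt before: each line is a semicolon-separated record whose first field is the code point in hexadecimal and whose third field is the general category. Large blocks are given as a pair of lines whose names end in ", First>" and ", Last>". The sketch below shows one way to turn such lines into category frozensets. It is not the contents of make_punctuation_file.py, and the function names and the SAMPLE lines are illustrative only.

```python
def parse_unicode_data(lines):
    """Yield (code_point, category) for every code point the lines describe."""
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code_point          # remember where the range opens
        elif name.endswith(", Last>"):
            for cp in range(range_start, code_point + 1):
                yield cp, category            # expand the whole range
            range_start = None
        else:
            yield code_point, category


def category_sets(lines):
    """Group characters by minor ('Po') and major ('P') category."""
    sets = {}
    for cp, category in parse_unicode_data(lines):
        ch = chr(cp)
        sets.setdefault(category, set()).add(ch)
        sets.setdefault(category[0], set()).add(ch)
    return {name: frozenset(chars) for name, chars in sets.items()}


# A few sample records in the UnicodeData.txt field layout.
SAMPLE = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "002E;FULL STOP;Po;0;CS;;;;;N;PERIOD;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]

if __name__ == "__main__":
    print(sorted(category_sets(SAMPLE)))
```

The same functions accept an open file object, since they only iterate over lines.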

Frequently Asked Questions (FAQ)

Q: I got an installation error involving “pkg_resources.VersionConflict (setuptools xx.xx.xx”. How do I fix this?

You have a very old version of setuptools, and we won’t be able to compile our Cython extension with it. Run pip install --upgrade setuptools and try installing Cypunct again.

Q: Wouldn’t this be way faster if it were written in pure C?

Yes, it would. I’m too lazy to hand-code a C CPython extension, but it’s on my todo list. Right now, Cypunct is “fast enough”, and I can move on to other things in my daily life.

However, if you want to take on the challenge of rewriting Cypunct in C and having the exact same functionality as the current Cython version, I’ll send you $100 USD.

Release history

0.1.1

Jul 3, 2017

0.1.0

Jul 3, 2017

Download files

Source Distribution

cypunct-0.1.1.tar.gz (185.6 kB)

Uploaded Jul 3, 2017 Source

File details

Details for the file cypunct-0.1.1.tar.gz.

File metadata

Download URL: cypunct-0.1.1.tar.gz
Upload date: Jul 3, 2017
Size: 185.6 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for cypunct-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6f3999419b1a6541c223991b64f7255e3c519e568aaf93cd04109c1df3240056`
MD5	`16121e30b3385ed5135dbd68f2eda173`
BLAKE2b-256	`e32623e1e676fc8b1c86cdf5360969942cd1bbd2ad86606fee1911d5b4edcdfc`
