WordSegment is an Apache2-licensed module for English word segmentation, based on a trillion-word corpus. This package is a modified version of grantjenks/python-wordsegment, partly ported to Rust.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Features

  • ~Pure-Python~ Partly Rust

  • Fully documented

  • 100% Test Coverage

  • Includes unigram and bigram data

  • Command line interface for batch processing

  • Easy to hack (e.g. different scoring, new data, different language)

  • Developed on Python 2.7

  • Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4, 3.5, 3.6 and PyPy, PyPy3

  • Tested on Windows, Mac OS X, and Linux

  • Tested using Travis CI and AppVeyor CI

Quickstart

Installing WordSegment is simple with pip:

$ pip install wordsegment-rs

You can access documentation in the interpreter with Python’s built-in help function:

>>> import wordsegment
>>> help(wordsegment)

Tutorial

In your own Python programs, you’ll mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegment import load, segment
>>> load()
>>> segment('thisisatest')
['this', 'is', 'a', 'test']

The load function reads and parses the unigrams and bigrams data from disk. Loading the data only needs to be done once.
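
As a rough illustration of what loading involves, the sketch below parses lines of the form word<TAB>count into a dictionary. The tab-separated layout and the name parse_counts are assumptions made for this example, not a description of the package's internals.

```python
import io


def parse_counts(lines):
    """Parse 'key<TAB>count' lines into a dict of float counts (assumed format)."""
    counts = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # Skip blank lines.
        # Split on the last tab so keys containing spaces stay intact.
        key, _, count = line.rpartition('\t')
        counts[key] = float(count)
    return counts


# Two counts quoted elsewhere in this document, in the assumed file layout.
sample = io.StringIO('the\t23135851162\nof the\t2766332391\n')
toy_counts = parse_counts(sample)
print(toy_counts['of the'])  # 2766332391.0
```

Parsing like this is comparatively slow for hundreds of thousands of lines, which is one reason to load the data only once per process.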

WordSegment also provides a command-line interface for batch processing. This interface accepts two arguments: in-file and out-file. Lines from in-file are iteratively segmented, joined by a space, and written to out-file. Input and output default to stdin and stdout respectively.

$ echo thisisatest | python -m wordsegment
this is a test
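
The batch behaviour described above can be sketched in a few lines. This is a minimal sketch, not the package's actual command-line code: segment_lines and toy_segment are hypothetical names, and toy_segment is a stand-in so the example runs without the data files.

```python
import io


def segment_lines(infile, outfile, segment):
    """Segment each input line and write the words joined by spaces."""
    for line in infile:
        words = segment(line.strip())
        outfile.write(' '.join(words) + '\n')


def toy_segment(text):
    # Stand-in segmenter for the demo; it only knows one input.
    return ['this', 'is', 'a', 'test'] if text == 'thisisatest' else [text]


out = io.StringIO()
segment_lines(io.StringIO('thisisatest\n'), out, toy_segment)
print(out.getvalue(), end='')  # this is a test
```

With the real module you would call load() once, then pass segment together with sys.stdin and sys.stdout (or open file objects) in place of the stand-ins.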

If you want to run WordSegment as a kind of server process then use Python’s -u option for unbuffered output. You can also set PYTHONUNBUFFERED=1 in the environment.

>>> import subprocess as sp
>>> wordsegment = sp.Popen(
...     ['python', '-um', 'wordsegment'],
...     stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.STDOUT,
...     universal_newlines=True)  # Text-mode pipes, so str can be written on Python 3.
>>> _ = wordsegment.stdin.write('thisisatest\n')
>>> wordsegment.stdin.flush()  # Send the buffered line to the child process.
>>> wordsegment.stdout.readline()
'this is a test\n'
>>> _ = wordsegment.stdin.write('workswithotherlanguages\n')
>>> wordsegment.stdin.flush()
>>> wordsegment.stdout.readline()
'works with other languages\n'
>>> wordsegment.stdin.close()
>>> wordsegment.wait()  # Process exit code.
0

The maximum segmented word length is 24 characters. Neither the unigram nor the bigram data contains words exceeding that length. The corpus also excludes punctuation, and all letters are lowercased. Before segmenting text, clean is called to transform the input to a canonical form:

>>> from wordsegment import clean
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
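
The canonical form can be approximated with a short function. This is a sketch under the assumption that only ASCII lowercase letters and digits are kept; clean_sketch and ALLOWED are hypothetical names, and the real clean may treat some characters differently.

```python
import string

# Assumption: only ASCII letters and digits survive cleaning.
ALLOWED = set(string.ascii_lowercase + string.digits)


def clean_sketch(text):
    """Lowercase the text and drop every character outside ALLOWED."""
    return ''.join(ch for ch in text.lower() if ch in ALLOWED)


print(clean_sketch('She said, "Python rocks!"'))  # shesaidpythonrocks
```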

Sometimes it's interesting to explore the unigram and bigram counts themselves. These are stored in Python dictionaries that map each word or phrase to its count.

>>> import wordsegment as ws
>>> ws.load()
>>> ws.UNIGRAMS['the']
23135851162.0
>>> ws.UNIGRAMS['gray']
21424658.0
>>> ws.UNIGRAMS['grey']
18276942.0

Above we see that the spelling gray is more common than the spelling grey.
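
To put a number on that comparison, the two counts shown above can be turned into a share. The values are copied from the session output so the snippet runs without loading the data.

```python
# Counts copied from the interpreter session above.
gray = 21424658.0
grey = 18276942.0

# Fraction of the two spellings combined that uses "gray".
share = gray / (gray + grey)
print('gray share: {:.1%}'.format(share))  # gray share: 54.0%
```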

Bigrams are joined by a space:

>>> import heapq
>>> from pprint import pprint
>>> from operator import itemgetter
>>> pprint(heapq.nlargest(10, ws.BIGRAMS.items(), itemgetter(1)))
[('of the', 2766332391.0),
 ('in the', 1628795324.0),
 ('to the', 1139248999.0),
 ('on the', 800328815.0),
 ('for the', 692874802.0),
 ('and the', 629726893.0),
 ('to be', 505148997.0),
 ('is a', 476718990.0),
 ('with the', 461331348.0),
 ('from the', 428303219.0)]

Some bigrams begin with <s>, a marker for the start of a sentence, so '<s> where' counts how often "where" appears as the first word:

>>> ws.BIGRAMS['<s> where']
15419048.0
>>> ws.BIGRAMS['<s> what']
11779290.0
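
One way such counts can drive segmentation, in the spirit of Norvig's chapter, is sketched below. Every number here is a toy value and the names score, best, TOTAL and LIMIT are chosen for this example; it is not the package's implementation. The <s> marker serves as the "previous word" for the first word of the input.

```python
from functools import lru_cache

# Toy counts for illustration only; real data lives in ws.UNIGRAMS / ws.BIGRAMS.
UNIGRAMS = {'this': 50.0, 'is': 80.0, 'a': 100.0, 'test': 20.0, '<s>': 40.0}
BIGRAMS = {'<s> this': 10.0, 'this is': 20.0, 'is a': 30.0, 'a test': 5.0}
TOTAL = 1000.0  # Hypothetical corpus size.
LIMIT = 24      # Maximum word length, as stated in this document.


def score(word, previous=None):
    """Probability of word, conditioned on previous when the bigram is known."""
    if previous is not None:
        bigram = previous + ' ' + word
        if bigram in BIGRAMS and previous in UNIGRAMS:
            return BIGRAMS[bigram] / UNIGRAMS[previous]
    if word in UNIGRAMS:
        return UNIGRAMS[word] / TOTAL
    # Unknown words get a penalty that grows with their length.
    return 10.0 / (TOTAL * 10 ** len(word))


@lru_cache(maxsize=None)
def best(text, previous='<s>'):
    """Return (probability, words) for the most probable split of text."""
    if not text:
        return 1.0, ()
    candidates = []
    for size in range(1, min(len(text), LIMIT) + 1):
        word, rest = text[:size], text[size:]
        rest_prob, rest_words = best(rest, word)
        candidates.append((score(word, previous) * rest_prob, (word,) + rest_words))
    return max(candidates)


print(list(best('thisisatest')[1]))  # ['this', 'is', 'a', 'test']
```

Multiplying raw probabilities underflows on long inputs, so a practical version would typically sum log probabilities instead.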

The unigram and bigram data are stored in the wordsegment directory, in the files unigrams.txt and bigrams.txt respectively.

WordSegment License

Copyright 2018 Grant Jenks

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

