
English word segmentation.

Project description

WordSegmentation is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

Inspired by Grant Jenks' wordsegment (https://pypi.python.org/pypi/wordsegment). Based on the word-weighting algorithm from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.

Features

  • Pure-Python

  • Divide-and-conquer segmentation, so there is no maximum length limit on the input text.

  • Dynamic programming keeps the segmentation within polynomial time complexity.

  • Candidate segmentations are scored using the Google Web Trillion Word Corpus.

  • Developed on Python 2.7

  • Tested on CPython 2.6, 2.7, 3.4.

Quickstart

Installing WordSegmentation is simple with pip:

$ pip install wordsegmentation

The required dependency networkx can also be installed with pip:

$ pip install networkx

Tutorial

In your own Python programs, you’ll mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegmentation import WordSegment
>>> ws = WordSegment()

>>> ws.segment('universityofwashington')
['university', 'of', 'washington']
>>> ws.segment('thisisatest')
['this', 'is', 'a', 'test']
>>> ws.segment('thisisanURLcontaining123345and-&**^&butitstillworks')
['this', 'is', 'an', 'url', 'containing', '123345', 'and', '-&**^&', 'but', 'it', 'still', 'works']
>>> ws.segment('NoMatterHowLongThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextMightBe')
['no', 'matter', 'how', 'long', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'might', 'be']

Bug Report

Weihan@Github

weihan.github AT gmail.com

Tech Details

In the code, the segmentation algorithm consists of the following steps (a minimal sketch of the pipeline follows the list):

  1. divide and conquer – safely divide the input string into substrings. This removes the input-length limit that would otherwise dramatically slow down performance. For example, “facebook123helloworld” is treated as three sub-problems: “facebook”, “123”, and “helloworld”.

  2. for each substring, use dynamic programming to compute the optimal sequence of words.

  3. combine the sub-problem results and return the segmentation of the original string.
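
The steps above can be illustrated with a short, self-contained sketch. This is not the module's actual implementation (which scores words against the Google corpus data and uses networkx); the tiny UNIGRAM_COUNTS table and the score function are stand-in assumptions used only for illustration:

import math
import re

# Hypothetical unigram counts standing in for the filtered Google corpus.
UNIGRAM_COUNTS = {
    'face': 50, 'facebook': 80, 'book': 60,
    'hello': 70, 'world': 65, 'hell': 20,
}
TOTAL = sum(UNIGRAM_COUNTS.values())

def score(word):
    # Relative frequency of a known word; unknown words get a small
    # penalty that shrinks as the candidate gets longer.
    return UNIGRAM_COUNTS.get(word, 0.1 / len(word) ** 2) / TOTAL

def divide(text):
    # Step 1: split on non-alphabetic runs, e.g. 'facebook123helloworld'
    # -> ['facebook', '123', 'helloworld'].
    return [chunk for chunk in re.split(r'([^a-zA-Z]+)', text) if chunk]

def segment_chunk(chunk):
    # Step 2: dynamic programming over split positions -- O(n^2) candidate
    # words, each scored once.
    n = len(chunk)
    best = [(0.0, 0)] + [(float('-inf'), 0)] * n  # (log score, split point)
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start][0] + math.log(score(chunk[start:end]))
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Recover the best segmentation by walking the split points backwards.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(chunk[start:end])
        end = start
    return list(reversed(words))

def segment(text):
    # Step 3: segment each alphabetic chunk and recombine the results.
    result = []
    for chunk in divide(text.lower()):
        result.extend(segment_chunk(chunk) if chunk.isalpha() else [chunk])
    return result

print(segment('facebook123helloworld'))
# ['facebook', '123', 'hello', 'world']

With a real corpus behind score, the same dynamic program picks the highest-scoring split for each chunk, and the divide step keeps the quadratic work bounded by the length of each alphabetic run rather than by the whole input.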

The segmentation algorithm used in this module achieves a time complexity of O(n^2). Compared to existing segmentation algorithms, this module does better in the following respects:

  1. it can handle very long input; there is no arbitrary maximum length limit on the input string.

  2. segmentation finishes in polynomial time via dynamic programming.

  3. by default, the algorithm uses a filtered Google corpus that contains only English words found in a dictionary.

An extreme example is the classic English scriptio continua segmentation problem shown below:

>>> ws.segment('MARGARETAREYOUGRIEVINGOVERGOLDENGROVEUNLEAVINGLEAVESLIKETHETHINGSOFMANYOUWITHYOURFRESHTHOUGHTSCAREFORCANYOUAHASTHEHEARTGROWSOLDERITWILLCOMETOSUCHSIGHTSCOLDERBYANDBYNORSPAREASIGHTHOUGHWORLDSOFWANWOODLEAFMEALLIEANDYETYOUWILLWEEPANDKNOWWHYNOWNOMATTERCHILDTHENAMESORROWSSPRINGSARETHESAMENORMOUTHHADNONORMINDEXPRESSEDWHATHEARTHEARDOFGHOSTGUESSEDITISTHEBLIGHTMANWASBORNFORITISMARGARETYOUMOURNFOR')

The algorithm solves this in polynomial time, and the output is:

['margaret', 'are', 'you', 'grieving', 'over', 'golden', 'grove', 'un', 'leaving', 'leaves', 'like', 'the', 'things', 'of', 'man', 'you', 'with', 'your', 'fresh', 'thoughts', 'care', 'for', 'can', 'you', 'a', 'has', 'the', 'he', 'art', 'grows', 'older', 'it', 'will', 'come', 'to', 'such', 'sights', 'colder', 'by', 'and', 'by', 'nor', 'spa', 're', 'a', 'sigh', 'though', 'worlds', 'of', 'wan', 'wood', 'leaf', 'me', 'allie', 'and', 'yet', 'you', 'will', 'weep', 'and', 'know', 'why', 'now', 'no', 'matter', 'child', 'the', 'name', 'sorrows', 'springs', 'are', 'the', 'same', 'nor', 'mouth', 'had', 'non', 'or', 'mind', 'expressed', 'what', 'he', 'art', 'heard', 'of', 'ghost', 'guessed', 'it', 'is', 'the', 'blight', 'man', 'was', 'born', 'for', 'it', 'is', 'margaret', 'you', 'mourn', 'for']



Download files


Source Distribution

wordsegmentation-0.3.5.tar.gz (4.9 MB)

Built Distribution


wordsegmentation-0.3.5-py2.py3-none-any.whl (8.2 MB, Python 2 / Python 3)

File details

Details for the file wordsegmentation-0.3.5.tar.gz.


File hashes

Hashes for wordsegmentation-0.3.5.tar.gz:

SHA256:      1ba8cc24567816cee2df32844aed6d55e1863b7d3eb548ce5188ea23b7d9caf4
MD5:         102ada6b311e0e817a9435ccb4cbab62
BLAKE2b-256: 543deaf59a59d8a27b54332cecbebfc34cc6c885f29e39b8c9ebc9d04e71045c


File details

Details for the file wordsegmentation-0.3.5-py2.py3-none-any.whl.


File hashes

Hashes for wordsegmentation-0.3.5-py2.py3-none-any.whl:

SHA256:      3c55f55941678627a879678a22c01687db27915e73e53a0d2763901f679d9150
MD5:         f07aafe599a8776d8ec43fe2f4673c8b
BLAKE2b-256: 97cf83b624050e4f9d7d3d99fdf15c2ec38af149b4e205bab0ecf28d6772b528

