Project Description

WordSegmentation is an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

Inspired by Grant Jenks's wordsegment (https://pypi.python.org/pypi/wordsegment). Based on the word-scoring algorithm from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium.

Features

  • Pure-Python
  • The segmentation algorithm uses divide and conquer, so there is no maximum length limit on the input text.
  • The segmentation algorithm uses dynamic programming to achieve polynomial time complexity.
  • Candidate segmentations are scored against the Google Web Trillion Word Corpus.
  • Developed on Python 2.7
  • Tested on CPython 2.6, 2.7, 3.4.

Quickstart

Installing wordsegmentation is simple with pip:

$ pip install wordsegmentation

The networkx dependency is also required:

$ pip install networkx

Tutorial

In your own Python programs, you’ll mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegmentation import WordSegment
>>> ws = WordSegment()

>>> ws.segment('universityofwashington')
['university', 'of', 'washington']
>>> ws.segment('thisisatest')
['this', 'is', 'a', 'test']
>>> ws.segment('thisisanURLcontaining123345and-&**^&butitstillworks')
['this', 'is', 'an', 'url', 'containing', '123345', 'and', '-&**^&', 'but', 'it', 'still', 'works']
>>> ws.segment('NoMatterHowLongThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextThisTextMightBe')
['no', 'matter', 'how', 'long', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'this', 'text', 'might', 'be']

Bug Report

Weihan@Github

weihan.github AT gmail.com

Tech Details

In the code, the segmentation algorithm consists of the following steps (a toy sketch follows the list):

  1. Divide and conquer – safely split the input string into substrings. This removes the input length limit that would otherwise dramatically slow performance. For example, “facebook123helloworld” is treated as three sub-problems – “facebook”, “123”, and “helloworld”.
  2. Dynamic programming – for each substring, compute the optimal word segmentation.
  3. Combine the sub-problem results and return the segmentation for the original string.
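
Below is a minimal, hypothetical sketch of these three steps. It is not the module's actual implementation: the tiny COUNTS table and the score function are stand-ins for the real corpus-derived scoring, and the names segment_chunk and segment are invented for illustration.

import re

# Toy frequency table standing in for the Google corpus counts (assumption).
COUNTS = {'this': 5, 'is': 5, 'a': 5, 'test': 5,
          'facebook': 3, 'hello': 4, 'world': 4}
TOTAL = float(sum(COUNTS.values()))

def score(word):
    # Unigram probability stand-in; unseen words get a tiny,
    # length-penalized score so known words are preferred.
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1e-9 / 10 ** len(word)

def segment_chunk(text):
    # Step 2: dynamic programming over one alphabetic substring.
    # best[i] holds (probability, words) for the optimal split of
    # text[:i]; trying every span (j, i) is what makes this O(n^2).
    best = [(1.0, [])]
    for i in range(1, len(text) + 1):
        best.append(max(
            (best[j][0] * score(text[j:i]), best[j][1] + [text[j:i]])
            for j in range(i)))
    return best[-1][1]

def segment(text):
    # Steps 1 and 3: split the input into alphabetic and non-alphabetic
    # runs, solve each alphabetic chunk independently, then recombine.
    parts = []
    for chunk in re.findall(r'[a-z]+|[^a-z]+', text.lower()):
        parts.extend(segment_chunk(chunk) if chunk.isalpha() else [chunk])
    return parts

print(segment('facebook123helloworld'))
# ['facebook', '123', 'hello', 'world']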

The segmentation algorithm used in this module achieves a time complexity of O(n^2). Compared to existing segmentation algorithms, this module does better on the following aspects:

  1. It can handle very long input; there is no arbitrary maximum length limit on the input string.
  2. Segmentation finishes in polynomial time via dynamic programming (see the timing sketch after this list).
  3. By default, the algorithm uses a filtered Google corpus, which contains only English words that can be found in a dictionary.
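
To check the O(n^2) claim empirically, a rough timing harness against the installed module might look like the following (a hypothetical test, not part of the package; absolute numbers vary by machine):

>>> import timeit
>>> from wordsegmentation import WordSegment
>>> ws = WordSegment()
>>> for n in (1, 2, 4, 8):
...     text = 'thisisatest' * n
...     secs = timeit.timeit(lambda: ws.segment(text), number=10)
...     print(len(text), round(secs, 4))

Doubling the length of a single alphabetic chunk should roughly quadruple the time if the per-chunk cost is quadratic.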
An extreme example is to solve the classic English scriptio continua segmentation problem shown below:

>>> ws.segment('MARGARETAREYOUGRIEVINGOVERGOLDENGROVEUNLEAVINGLEAVESLIKETHETHINGSOFMANYOUWITHYOURFRESHTHOUGHTSCAREFORCANYOUAHASTHEHEARTGROWSOLDERITWILLCOMETOSUCHSIGHTSCOLDERBYANDBYNORSPAREASIGHTHOUGHWORLDSOFWANWOODLEAFMEALLIEANDYETYOUWILLWEEPANDKNOWWHYNOWNOMATTERCHILDTHENAMESORROWSSPRINGSARETHESAMENORMOUTHHADNONORMINDEXPRESSEDWHATHEARTHEARDOFGHOSTGUESSEDITISTHEBLIGHTMANWASBORNFORITISMARGARETYOUMOURNFOR')

Our algorithm solves this in polynomial time, and the output is:

['margaret', 'are', 'you', 'grieving', 'over', 'golden', 'grove', 'un', 'leaving', 'leaves', 'like', 'the', 'things', 'of', 'man', 'you', 'with', 'your', 'fresh', 'thoughts', 'care', 'for', 'can', 'you', 'a', 'has', 'the', 'he', 'art', 'grows', 'older', 'it', 'will', 'come', 'to', 'such', 'sights', 'colder', 'by', 'and', 'by', 'nor', 'spa', 're', 'a', 'sigh', 'though', 'worlds', 'of', 'wan', 'wood', 'leaf', 'me', 'allie', 'and', 'yet', 'you', 'will', 'weep', 'and', 'know', 'why', 'now', 'no', 'matter', 'child', 'the', 'name', 'sorrows', 'springs', 'are', 'the', 'same', 'nor', 'mouth', 'had', 'non', 'or', 'mind', 'expressed', 'what', 'he', 'art', 'heard', 'of', 'ghost', 'guessed', 'it', 'is', 'the', 'blight', 'man', 'was', 'born', 'for', 'it', 'is', 'margaret', 'you', 'mourn', 'for']

