
Project description

WordSegment is an Apache 2.0 licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

Based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
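The Norvig approach this module builds on can be sketched with a toy unigram model. The counts below are illustrative stand-ins, not the shipped data files, and the code is a minimal sketch rather than the module's actual implementation:

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for the shipped 333,000-word data file.
COUNTS = {
    'this': 5, 'is': 5, 'a': 6, 'test': 3,
    'university': 2, 'of': 8, 'washington': 2,
}
TOTAL = sum(COUNTS.values())

def score(word):
    """log10 probability of a word; unseen words get a length-scaled penalty."""
    count = COUNTS.get(word, 0)
    if count == 0:
        # Norvig-style penalty: unknown words are possible but very unlikely,
        # and longer unknown words are penalized more heavily.
        return math.log10(10.0 / (TOTAL * 10 ** len(word)))
    return math.log10(count / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return the split of `text` whose words maximize the summed log scores."""
    if not text:
        return ()
    splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    return max(((first,) + segment(rest) for first, rest in splits),
               key=lambda words: sum(score(w) for w in words))
```

With real corpus frequencies in place of the toy counts, this is the essence of the scheme: memoized recursion over every split point, scored by unigram log probabilities.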

Features

  • Pure-Python

  • Fully documented

  • 100% Test Coverage

  • Includes unigram and bigram data

  • Command line interface for batch processing

  • Easy to hack (e.g. different scoring, new data, different language)

  • Developed on Python 2.7

  • Tested on CPython 2.6, 2.7, 3.2, 3.3, 3.4 and PyPy 2.5+, PyPy3 2.4+

Quickstart

Installing WordSegment is simple with pip:

$ pip install wordsegmentation

Tutorial

In your own Python programs, you’ll mostly want to use segment to divide a phrase into a list of its parts:

>>> from wordsegmentation import WordSegment
>>> ws = WordSegment(use_google_corpus=True)

>>> ws.segment('universityofwashington')
['university', 'of', 'washington']
>>> ws.segment('thisisatest')
['this', 'is', 'a', 'test']

WordSegment License

Copyright 2015 Weihan Jiang

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Download files

Download the file for your platform.

Source Distribution

wordsegmentation-0.3.tar.gz (5.9 MB)

Uploaded: Source

Built Distribution


wordsegmentation-0.3-py2.py3-none-any.whl (5.9 MB)

Uploaded: Python 2, Python 3

File details

Details for the file wordsegmentation-0.3.tar.gz.

File metadata

File hashes

Hashes for wordsegmentation-0.3.tar.gz
SHA256: f6bdbe497cc06a5c5e960a265f5255f311eaa6da8b3d8aa7a5028651d1c32a46
MD5: 6cf9a4ed8cd34cada0067c44339f4003
BLAKE2b-256: 6a961442581eeae46e34fe409becabdad3eda93461c45217e8a76dcae6de5738


File details

Details for the file wordsegmentation-0.3-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for wordsegmentation-0.3-py2.py3-none-any.whl
SHA256: d23435a67bcb8a8aefedb582996e7f19c077544ca306dff0f7bddd31ac1e569a
MD5: 05350b554bea45998357f393d7e22581
BLAKE2b-256: 56573b5c3d10de1520da6e31be9f49d1b156ec2a723f8e7f8f4f8ca7c81fe0b0

