Fast English word segmentation

Instant Segment: fast English word segmentation in Rust

Instant Segment is a fast, Apache-2.0-licensed library for English word segmentation. It is based on the Python wordsegment project written by Grant Jenks, which is in turn based on code from Peter Norvig's chapter "Natural Language Corpus Data" in the book Beautiful Data (Segaran and Hammerbacher, 2009).

In the microbenchmark included in this repository, Instant Segment is ~500x faster than the Python implementation. The API was carefully constructed so that multiple segmentations can share the underlying state, which allows parallel usage.

How it works

Instant Segment segments a string into words by selecting the splits with the highest probability, given a corpus of words and how often they occur.

For instance, provided that choose and spain occur more frequently than chooses and pain, and that the pair choose spain occurs more frequently than chooses pain, Instant Segment can help identify the domain choosespain.com as ChooseSpain.com, which more likely matches user intent.
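The idea can be illustrated with a small pure-Python sketch in the spirit of Norvig's approach. This is not Instant Segment's actual implementation: the counts are invented for illustration, and the names UNIGRAMS, BIGRAMS, word_score and segment are hypothetical helpers that exist only in this example.

```python
import math
from functools import lru_cache

# Toy counts, invented for illustration only (not real corpus data).
UNIGRAMS = {"choose": 80, "spain": 60, "chooses": 20, "pain": 40}
BIGRAMS = {("choose", "spain"): 10, ("chooses", "pain"): 1}
TOTAL = sum(UNIGRAMS.values())
MAX_WORD_LEN = 24  # upper bound on candidate word length


def word_score(word, prev=None):
    """Log10 probability of `word`, conditioned on `prev` when the pair is known."""
    if prev in UNIGRAMS and (prev, word) in BIGRAMS:
        # Conditional probability P(word | prev) from the bigram count.
        return math.log10(BIGRAMS[(prev, word)] / UNIGRAMS[prev])
    if word in UNIGRAMS:
        return math.log10(UNIGRAMS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the highest-scoring split of `text` under the toy model."""

    @lru_cache(maxsize=None)
    def best(start, prev):
        # Best (score, words) for text[start:], given the previous word.
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, min(len(text), start + MAX_WORD_LEN) + 1):
            word = text[start:end]
            rest_score, rest_words = best(end, word)
            candidates.append((word_score(word, prev) + rest_score, (word,) + rest_words))
        return max(candidates)

    return list(best(0, None)[1])


print(segment("choosespain"))  # ['choose', 'spain']
```

With these toy counts, the split choose + spain scores about -1.30 (log10 of 0.4 times 0.125), while chooses + pain scores about -2.30, so the first split wins.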

Read about how we built and improved Instant Segment for use in production at Instant Domain Search to help our users find relevant domains they can register.

Using the library

Python (>= 3.9)

pip install instant-segment

Rust

[dependencies]
instant-segment = "0.8.1"

Examples

The following examples assume that unigrams and bigrams already exist. The examples (Rust, Python) show how to construct these objects.

import instant_segment

segmenter = instant_segment.Segmenter(unigrams, bigrams)
search = instant_segment.Search()
segmenter.segment("instantdomainsearch", search)
print([word for word in search])

--> ['instant', 'domain', 'search']

use instant_segment::{Search, Segmenter};
use std::collections::HashMap;

let segmenter = Segmenter::new(unigrams, bigrams);
let mut search = Search::default();
let words = segmenter
    .segment("instantdomainsearch", &mut search)
    .unwrap();
println!("{:?}", words.collect::<Vec<&str>>());

--> ["instant", "domain", "search"]

Check out the tests for more thorough examples: Rust, Python

Testing

To run the tests, use:

cargo t -p instant-segment --all-features

You can also test the Python bindings with:

make test-python

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

instant_segment-0.1.9-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.5 kB): PyPy, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp313-cp313-win_amd64.whl (159.3 kB): CPython 3.13, Windows x86-64
instant_segment-0.1.9-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.5 kB): CPython 3.13, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp313-cp313-macosx_11_0_arm64.whl (255.8 kB): CPython 3.13, macOS 11.0+ ARM64
instant_segment-0.1.9-cp313-cp313-macosx_10_12_x86_64.whl (262.7 kB): CPython 3.13, macOS 10.12+ x86-64
instant_segment-0.1.9-cp312-cp312-win_amd64.whl (159.2 kB): CPython 3.12, Windows x86-64
instant_segment-0.1.9-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.1 kB): CPython 3.12, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp312-cp312-macosx_11_0_arm64.whl (255.5 kB): CPython 3.12, macOS 11.0+ ARM64
instant_segment-0.1.9-cp312-cp312-macosx_10_12_x86_64.whl (262.4 kB): CPython 3.12, macOS 10.12+ x86-64
instant_segment-0.1.9-cp311-cp311-win_amd64.whl (158.3 kB): CPython 3.11, Windows x86-64
instant_segment-0.1.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.4 kB): CPython 3.11, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp311-cp311-macosx_11_0_arm64.whl (258.6 kB): CPython 3.11, macOS 11.0+ ARM64
instant_segment-0.1.9-cp311-cp311-macosx_10_12_x86_64.whl (265.7 kB): CPython 3.11, macOS 10.12+ x86-64
instant_segment-0.1.9-cp310-cp310-win_amd64.whl (158.4 kB): CPython 3.10, Windows x86-64
instant_segment-0.1.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.5 kB): CPython 3.10, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp310-cp310-macosx_11_0_arm64.whl (258.7 kB): CPython 3.10, macOS 11.0+ ARM64
instant_segment-0.1.9-cp310-cp310-macosx_10_12_x86_64.whl (265.7 kB): CPython 3.10, macOS 10.12+ x86-64
instant_segment-0.1.9-cp39-cp39-win_amd64.whl (159.0 kB): CPython 3.9, Windows x86-64
instant_segment-0.1.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (293.4 kB): CPython 3.9, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp39-cp39-macosx_11_0_arm64.whl (258.6 kB): CPython 3.9, macOS 11.0+ ARM64
instant_segment-0.1.9-cp39-cp39-macosx_10_12_x86_64.whl (265.6 kB): CPython 3.9, macOS 10.12+ x86-64
instant_segment-0.1.9-cp38-cp38-win_amd64.whl (158.9 kB): CPython 3.8, Windows x86-64
instant_segment-0.1.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (292.9 kB): CPython 3.8, manylinux: glibc 2.17+ x86-64
instant_segment-0.1.9-cp37-cp37m-win_amd64.whl (158.9 kB): CPython 3.7m, Windows x86-64
instant_segment-0.1.9-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (293.3 kB): CPython 3.7m, manylinux: glibc 2.17+ x86-64
