Fast English word segmentation

Instant Segment: fast English word segmentation in Rust

Instant Segment is a fast Apache-2.0 library for English word segmentation. It is based on the Python wordsegment project written by Grant Jenks, which is in turn based on code from Peter Norvig's chapter Natural Language Corpus Data from the book Beautiful Data (Segaran and Hammerbacher, 2009).

For the microbenchmark included in this repository, Instant Segment is ~500x faster than the Python implementation. The API was carefully constructed so that multiple segmentations can share the underlying state to allow parallel usage.

How it works

Instant Segment segments a string into words by selecting the splits with the highest probability, given a corpus of words and their occurrence counts.

For instance, provided that choose and spain occur more frequently than chooses and pain, and that the pair choose spain occurs more frequently than chooses pain, Instant Segment can identify the domain choosespain.com as ChooseSpain.com, which more likely matches user intent.
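
The idea can be illustrated with a small, self-contained Python sketch. This is not Instant Segment's implementation: the counts, the corpus size `TOTAL`, the unknown-word penalty, and the helper names are all made up for illustration, and the scoring is a simplified version of the unigram/bigram approach described in Norvig's chapter.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, chosen only to mirror the choosespain example.
UNIGRAMS = {"choose": 500, "spain": 400, "chooses": 50, "pain": 300}
BIGRAMS = {("choose", "spain"): 20, ("chooses", "pain"): 1}
TOTAL = 1_000_000   # assumed corpus size
MAX_WORD_LEN = 12   # longest candidate word considered


def unigram_score(word):
    """Log10 probability of a word on its own."""
    count = UNIGRAMS.get(word)
    if count is not None:
        return math.log10(count / TOTAL)
    # Unknown words are penalized more heavily the longer they are.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def score(word, previous):
    """Log10 probability of a word given the previous word, if known."""
    if previous is not None:
        pair = BIGRAMS.get((previous, word))
        if pair is not None and previous in UNIGRAMS:
            return math.log10(pair / UNIGRAMS[previous])
    return unigram_score(word)


def segment(text):
    """Return the highest-scoring split of text as a list of words."""

    @lru_cache(maxsize=None)
    def best(start, previous):
        if start == len(text):
            return 0.0, ()
        candidates = []
        last = min(len(text), start + MAX_WORD_LEN)
        for end in range(start + 1, last + 1):
            word = text[start:end]
            rest_score, rest_words = best(end, word)
            total = score(word, previous) + rest_score
            candidates.append((total, (word,) + rest_words))
        return max(candidates)

    return list(best(0, None)[1])


print(segment("choosespain"))  # ['choose', 'spain']
```

With these toy counts, "choose spain" scores about -4.7 while "chooses pain" scores -6.0, so the first split wins.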

Read about how we built and improved Instant Segment for use in production at Instant Domain Search to help our users find relevant domains they can register.

Using the library

Python (>= 3.9)

pip install instant-segment

Rust

[dependencies]
instant-segment = "0.8.1"

Examples

The following examples assume that unigrams and bigrams already exist. The examples (Rust, Python) show how to construct these objects.
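
As a rough sketch of what that construction might look like in Python, the snippet below parses tab-separated count data into pairs. It rests on two assumptions that should be checked against the linked examples: that the data is stored as `word<TAB>count` and `first second<TAB>count` lines, and that `Segmenter` accepts iterables of `(word, count)` and `((first, second), count)` tuples. The helper names and sample data are hypothetical.

```python
def parse_unigrams(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines (assumed format)."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        yield word, float(count)


def parse_bigrams(lines):
    """Yield ((first, second), count) pairs from 'first second<TAB>count' lines."""
    for line in lines:
        pair, count = line.rstrip("\n").split("\t", 1)
        first, second = pair.split(" ", 1)
        yield (first, second), float(count)


# Hypothetical in-memory data standing in for real corpus files.
unigram_lines = ["instant\t1000\n", "domain\t800\n", "search\t1200\n"]
bigram_lines = ["instant domain\t50\n", "domain search\t70\n"]

unigrams = list(parse_unigrams(unigram_lines))
bigrams = list(parse_bigrams(bigram_lines))
```

In practice the lines would come from corpus files rather than in-memory lists.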

import instant_segment

segmenter = instant_segment.Segmenter(unigrams, bigrams)
search = instant_segment.Search()
segmenter.segment("instantdomainsearch", search)
print([word for word in search])

--> ['instant', 'domain', 'search']

use instant_segment::{Search, Segmenter};
use std::collections::HashMap;

let segmenter = Segmenter::new(unigrams, bigrams);
let mut search = Search::default();
let words = segmenter
    .segment("instantdomainsearch", &mut search)
    .unwrap();
println!("{:?}", words.collect::<Vec<&str>>());

--> ["instant", "domain", "search"]

Check out the tests for more thorough examples: Rust, Python

Testing

To run the tests:

cargo t -p instant-segment --all-features

You can also test the Python bindings with:

make test-python

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

instant_segment-0.1.8-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.8 kB)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.7 kB)

Uploaded PyPy manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp312-none-win_amd64.whl (155.8 kB)

Uploaded CPython 3.12 Windows x86-64

instant_segment-0.1.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (286.9 kB)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp312-cp312-macosx_11_0_arm64.whl (247.3 kB)

Uploaded CPython 3.12 macOS 11.0+ ARM64

instant_segment-0.1.8-cp312-cp312-macosx_10_12_x86_64.whl (252.0 kB)

Uploaded CPython 3.12 macOS 10.12+ x86-64

instant_segment-0.1.8-cp311-none-win_amd64.whl (157.1 kB)

Uploaded CPython 3.11 Windows x86-64

instant_segment-0.1.8-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.2 kB)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp311-cp311-macosx_11_0_arm64.whl (248.8 kB)

Uploaded CPython 3.11 macOS 11.0+ ARM64

instant_segment-0.1.8-cp311-cp311-macosx_10_12_x86_64.whl (253.8 kB)

Uploaded CPython 3.11 macOS 10.12+ x86-64

instant_segment-0.1.8-cp310-none-win_amd64.whl (157.2 kB)

Uploaded CPython 3.10 Windows x86-64

instant_segment-0.1.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.2 kB)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp310-cp310-macosx_11_0_arm64.whl (248.8 kB)

Uploaded CPython 3.10 macOS 11.0+ ARM64

instant_segment-0.1.8-cp310-cp310-macosx_10_12_x86_64.whl (253.8 kB)

Uploaded CPython 3.10 macOS 10.12+ x86-64

instant_segment-0.1.8-cp39-none-win_amd64.whl (157.9 kB)

Uploaded CPython 3.9 Windows x86-64

instant_segment-0.1.8-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.7 kB)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp39-cp39-macosx_11_0_arm64.whl (248.6 kB)

Uploaded CPython 3.9 macOS 11.0+ ARM64

instant_segment-0.1.8-cp39-cp39-macosx_10_12_x86_64.whl (253.9 kB)

Uploaded CPython 3.9 macOS 10.12+ x86-64

instant_segment-0.1.8-cp38-none-win_amd64.whl (157.7 kB)

Uploaded CPython 3.8 Windows x86-64

instant_segment-0.1.8-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.6 kB)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

instant_segment-0.1.8-cp37-none-win_amd64.whl (157.2 kB)

Uploaded CPython 3.7 Windows x86-64

instant_segment-0.1.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.6 kB)

Uploaded CPython 3.7m manylinux: glibc 2.17+ x86-64
