LeNLP

Natural Language Processing toolbox for Python with Rust

LeNLP is a toolkit dedicated to natural language processing (NLP). It provides optimized and parallelized functions in Rust for use in Python, offering high performance and ease of integration.

Installation

Install LeNLP with pip:

pip install lenlp

Quick Start

Sparse Module

The sparse module offers a variety of vectorizers and transformers for text data. The matrices they produce are scipy.sparse.csr_matrix objects, optimized for memory usage and speed. The vectorizers can be used as drop-in replacements for scikit-learn vectorizers.

CountVectorizer

The CountVectorizer converts a list of texts into a sparse matrix of token counts. This is a Rust implementation of the CountVectorizer from scikit-learn.

from lenlp import sparse

vectorizer = sparse.CountVectorizer(
    ngram_range=(3, 5), # range of n-grams
    analyzer="char_wb", # word, char, char_wb
    normalize=True, # lowercase and strip accents
    stop_words=["based"], # list of stop words
)

You can fit the vectorizer and transform a list of texts into a sparse matrix of token counts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
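
To make the char_wb analyzer concrete, here is a minimal pure-Python sketch of what counting character n-grams inside word boundaries can look like. It follows scikit-learn's description of char_wb (each word padded with a space, n-grams never spanning two words). The helper names char_wb_ngrams and count_documents are hypothetical, and whether LeNLP's Rust code handles edge cases such as short words the same way is an assumption.

```python
from collections import Counter


def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Return character n-grams taken inside word boundaries.

    Each word is lowercased and padded with one space on both sides, so
    n-grams never span two words. Sizes longer than a padded word are skipped.
    """
    min_n, max_n = ngram_range
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # range() is empty when n exceeds the padded word length.
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


def count_documents(texts, ngram_range=(3, 5)):
    """Build one Counter of n-gram counts per input text."""
    return [Counter(char_wb_ngrams(text, ngram_range)) for text in texts]


counts = count_documents(["Hello World", "Rust based vectorizer"], ngram_range=(3, 3))
print(counts[0])
```

A vectorizer then maps each distinct n-gram to a column index and stores the counts in a sparse matrix, one row per text.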

Benchmark:

LeNLP CountVectorizer versus scikit-learn CountVectorizer, fit_transform with the char analyzer.

TfidfVectorizer

The TfidfVectorizer converts a list of texts into a sparse matrix of tf-idf weights, implemented in Rust.

from lenlp import sparse

vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
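
As an illustration of the weighting itself, the sketch below computes tf-idf on pre-tokenized documents in plain Python. It assumes scikit-learn's default formula (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The function name tfidf is hypothetical, and it is an assumption that LeNLP uses the same variant.

```python
import math
from collections import Counter


def tfidf(docs):
    """Map each tokenized document to a dict of L2-normalized tf-idf weights."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = {
            token: count * (math.log((1 + n_docs) / (1 + df[token])) + 1.0)
            for token, count in tf.items()
        }
        # Scale the row to unit Euclidean length (guard against empty docs).
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({token: w / norm for token, w in row.items()})
    return rows


weights = tfidf([["rust", "vectorizer"], ["rust", "python"]])
print(weights[0])
```

In this example "rust" appears in both documents, so it receives a lower weight than "vectorizer", which appears in only one.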

Benchmark:

LeNLP TfidfVectorizer versus scikit-learn TfidfVectorizer, fit_transform with the char analyzer.

BM25Vectorizer

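The BM25Vectorizer converts texts into a sparse matrix of BM25 weights. BM25 adds term-frequency saturation and document-length normalization, which often makes it a stronger ranking signal than raw counts or tf-idf weights in retrieval tasks.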

from lenlp import sparse

vectorizer = sparse.BM25Vectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
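
For readers unfamiliar with BM25, the following plain-Python sketch shows one common formulation (Okapi BM25 with a non-negative idf) on pre-tokenized documents. The function name bm25 and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration; the source does not state which variant or defaults LeNLP uses.

```python
import math
from collections import Counter


def bm25(docs, k1=1.5, b=0.75):
    """Map each tokenized document to a dict of BM25 term weights."""
    n_docs = len(docs)
    avg_len = (sum(len(doc) for doc in docs) / n_docs) or 1.0
    # Document frequency: number of documents containing each token.
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        row = {}
        for token, count in tf.items():
            idf = math.log((n_docs - df[token] + 0.5) / (df[token] + 0.5) + 1.0)
            # The tf factor saturates: repeated occurrences add less and less.
            row[token] = idf * count * (k1 + 1.0) / (count + length_norm)
        rows.append(row)
    return rows


weights = bm25([["rust", "fast"], ["rust", "safe"]])
print(weights[0])
```

Here "fast" outweighs "rust" in the first document because it is rarer across the collection, while the saturation term keeps a token repeated four times from counting four times as much.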

Benchmark:

LeNLP BM25Vectorizer versus LeNLP TfidfVectorizer, fit_transform with the char analyzer. scikit-learn has no BM25Vectorizer counterpart.

FlashText

The flash module allows for efficient keyword extraction from texts. It implements the FlashText algorithm as described in the paper "Replace or Retrieve Keywords In Documents At Scale".

from lenlp import flash

flash_text = flash.FlashText(
    normalize=True # remove accents and lowercase
) 

# Add keywords we want to retrieve:
flash_text.add(["paris", "bordeaux", "toulouse"])

Extract keywords and their positions from sentences:

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

flash_text.extract(sentences)

Output:

[[('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)],
 [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]]

The FlashText algorithm is highly efficient, significantly faster than regular expressions for keyword extraction. LeNLP's implementation normalizes input documents by removing accents and converting to lowercase to enhance keyword extraction.
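
The core idea behind FlashText can be sketched in a few lines of Python: keywords are stored in a character trie, and each text is scanned once, matching only at word boundaries. The names build_trie and extract are hypothetical, normalization is reduced to lowercasing, and the sketch assumes lowercasing does not change string length (true for ASCII input). It is an illustration of the technique, not LeNLP's Rust implementation.

```python
END = "_end"  # sentinel key; longer than one character, so it never clashes with a trie edge


def build_trie(keywords):
    """Store lowercased keywords in a nested-dict character trie."""
    trie = {}
    for keyword in keywords:
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[END] = keyword.lower()
    return trie


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def extract(text, trie):
    """Return (keyword, start, end) for each keyword found as a whole word."""
    lowered = text.lower()
    matches, i, n = [], 0, len(lowered)
    while i < n:
        # A match may only start at the beginning of a word.
        if i > 0 and is_word_char(lowered[i - 1]):
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Record the longest keyword that also ends on a word boundary.
            if END in node and (j == n or not is_word_char(lowered[j])):
                best = (node[END], i, j)
        if best:
            matches.append(best)
            i = best[2]
        else:
            i += 1
    return matches


trie = build_trie(["paris", "bordeaux", "toulouse"])
print(extract("Paris is north of bordeaux", trie))
```

Because the trie is walked character by character, the scan cost depends on the text length rather than on the number of keywords, which is where the advantage over a large regular-expression alternation comes from.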

Benchmark:

LeNLP FlashText is benchmarked versus the official implementation of FlashText.

Extras

Counter

The counter module converts a list of texts into a list of dictionaries of token counts, one dictionary per text.

from lenlp import counter

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

counter.count(
    sentences,
    ngram_range=(1, 1), # Range of n-grams
    analyzer="word", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"] # List of stop words
)

Output:

[{'compared': 1,
  'south': 1,
  'city': 1,
  'toulouse': 1,
  'bordeaux': 2,
  'france': 1},
 {'toulouse': 1,
  'france': 1,
  'capital': 1,
  'paris': 1,
  'north': 1,
  'compared': 1,
  'bordeaux': 1}]
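
A rough pure-Python equivalent for the word analyzer can help clarify what is being counted. The function name count_tokens is hypothetical, normalization is simplified to lowercasing and dropping punctuation, and removing stop words before building n-grams is an assumption rather than documented LeNLP behavior.

```python
import re
from collections import Counter


def count_tokens(texts, ngram_range=(1, 1), stop_words=()):
    """Return one dict of word n-gram counts per input text."""
    stop = set(stop_words)
    min_n, max_n = ngram_range
    results = []
    for text in texts:
        # Simplified normalization: lowercase, then drop punctuation.
        cleaned = re.sub(r"[^\w\s]", "", text.lower())
        words = [word for word in cleaned.split() if word not in stop]
        grams = Counter()
        for n in range(min_n, max_n + 1):
            for start in range(len(words) - n + 1):
                grams[" ".join(words[start:start + n])] += 1
        results.append(dict(grams))
    return results


print(count_tokens(
    ["Rust is fast, and Rust is safe"],
    stop_words=["is", "and"],
))
```

Dictionary key order is not significant, which is why the two dictionaries in the output above list their tokens in different orders.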

Normalizer

The normalizer module normalizes a list of texts by removing accents and converting to lowercase. In the example output, punctuation is stripped as well.

from lenlp import normalizer

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

normalizer.normalize(sentences)

Output:

['toulouse is a city in france its in the south compared to bordeaux and bordeaux',
 'paris is the capital of france its in the north compared to bordeaux and toulouse']
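
The same kind of normalization can be approximated with the standard library, which may be useful for checking expectations on accented input. This is a sketch under assumptions: the function below is a hypothetical stand-in, and the exact rules LeNLP applies (for example to digits or symbols) are not stated in the source.

```python
import unicodedata


def normalize(texts):
    """Lowercase each text, strip accents and punctuation, collapse whitespace."""
    normalized = []
    for text in texts:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        # Keep only letters, digits, and whitespace.
        kept = "".join(ch for ch in stripped.lower() if ch.isalnum() or ch.isspace())
        # Collapse runs of whitespace into single spaces.
        normalized.append(" ".join(kept.split()))
    return normalized


print(normalize(["Café Crème, it's here"]))
```

Decomposing with NFKD before filtering is what lets "é" become "e" instead of being dropped entirely.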
