LeNLP

Natural Language Processing toolbox for Python with Rust

LeNLP is a toolkit dedicated to natural language processing (NLP). It provides optimized and parallelized functions in Rust for use in Python, offering high performance and ease of integration.

Installation

Install LeNLP with pip:

pip install lenlp

Quick Start

Sparse Module

The sparse module offers a variety of vectorizers and transformers for text data. Each vectorizer returns a scipy.sparse.csr_matrix, a format optimized for memory usage and speed. The vectorizers can be used as drop-in replacements for scikit-learn vectorizers.
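
To make the storage format concrete, here is a minimal pure-Python sketch of the three arrays behind a CSR matrix of token counts. The helper `to_csr` is hypothetical and not part of LeNLP; it only illustrates the layout.

```python
from collections import Counter


def to_csr(docs_tokens):
    """Build CSR arrays (data, indices, indptr) and a vocabulary from tokenized documents."""
    vocabulary = {}
    data, indices, indptr = [], [], [0]
    for tokens in docs_tokens:
        counts = Counter(tokens)
        # Assign a column index to each token the first time it is seen.
        for token in counts:
            vocabulary.setdefault(token, len(vocabulary))
        # Store only the non-zero entries of this row, ordered by column index.
        for column, count in sorted((vocabulary[t], c) for t, c in counts.items()):
            indices.append(column)
            data.append(count)
        # indptr[i]:indptr[i + 1] delimits the entries of row i.
        indptr.append(len(indices))
    return data, indices, indptr, vocabulary


data, indices, indptr, vocabulary = to_csr(
    [["hello", "world"], ["hello", "hello", "rust"]]
)
```

With SciPy installed, these arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, len(vocabulary)))`.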

CountVectorizer

The CountVectorizer converts a list of texts into a sparse matrix of token counts. This is a Rust implementation of the CountVectorizer from scikit-learn.

from lenlp import sparse

vectorizer = sparse.CountVectorizer(
    ngram_range=(3, 5), # range of n-grams
    analyzer="char_wb", # word, char, char_wb
    normalize=True, # lowercase and strip accents
    stop_words=["based"], # list of stop words
)

You can fit the vectorizer and transform a list of texts into a sparse matrix of token counts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
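
As an illustration of what the `char_wb` analyzer counts, the sketch below extracts character n-grams inside word boundaries, following scikit-learn's description of `char_wb`. The function `char_wb_ngrams` is hypothetical, and LeNLP's Rust implementation may differ in details such as normalization or the handling of words shorter than n.

```python
from collections import Counter


def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Character n-grams taken inside word boundaries, each word padded with one space."""
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Simplification: a padded word shorter than n yields no n-gram.
            if n > len(padded):
                break
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


counts = Counter(char_wb_ngrams("Hello World", ngram_range=(3, 3)))
```

Because n-grams never span two words, `"o w"` does not appear in `counts`, while the padded edges `" he"` and `"ld "` do.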

Benchmark:

LeNLP CountVectorizer versus Sklearn CountVectorizer fit_transform with char analyzer.

TfidfVectorizer

The TfidfVectorizer converts a list of texts into a sparse matrix of tf-idf weights, implemented in Rust.

from lenlp import sparse

vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
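
For readers unfamiliar with tf-idf, here is a small pure-Python sketch of the weighting. It assumes scikit-learn's default formula (smoothed idf, then L2 normalization per document); whether LeNLP uses exactly this variant is an assumption, and the function `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs_tokens):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in docs_tokens for token in set(tokens))
    # Smoothed idf, as in scikit-learn's default settings (assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in docs_tokens:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


rows = tfidf([["rust", "fast"], ["rust", "python"]])
```

In this example `"rust"` occurs in both documents, so it receives a lower weight than `"fast"` or `"python"` within each row.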

Benchmark:

LeNLP TfidfVectorizer versus Sklearn TfidfVectorizer fit_transform with char analyzer.

BM25Vectorizer

The BM25Vectorizer converts texts into a sparse matrix of BM25 weights. BM25 saturates term frequency and normalizes for document length, which often makes it a stronger baseline than tf-idf or raw counts for retrieval tasks.

from lenlp import sparse

vectorizer = sparse.BM25Vectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
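
The sketch below shows how BM25 weights can be computed. It assumes the common Okapi BM25 formula with a non-negative idf and the conventional defaults `k1=1.5` and `b=0.75`; the parameters and exact variant used by LeNLP are not stated here, so treat these as assumptions. The function `bm25` is hypothetical.

```python
import math
from collections import Counter


def bm25(docs_tokens, k1=1.5, b=0.75):
    """Return one {token: BM25 weight} dict per document (expects a non-empty corpus)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(tokens) for tokens in docs_tokens) / n_docs
    df = Counter(token for tokens in docs_tokens for token in set(tokens))
    # Non-negative idf variant: ln(1 + (N - df + 0.5) / (df + 0.5)).
    idf = {t: math.log(1 + (n_docs - d + 0.5) / (d + 0.5)) for t, d in df.items()}
    rows = []
    for tokens in docs_tokens:
        # Documents longer than average are penalized, controlled by b.
        length_norm = 1 - b + b * len(tokens) / avg_len
        rows.append({
            # The tf term saturates: it approaches (k1 + 1) as tf grows.
            t: idf[t] * tf * (k1 + 1) / (tf + k1 * length_norm)
            for t, tf in Counter(tokens).items()
        })
    return rows


rows = bm25([["a", "b"], ["a", "a", "a", "c"]])
```

Here `"a"` occurs three times in the second document but its weight is far less than three times the weight in the first document, which is the saturation effect that distinguishes BM25 from raw counts.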

Benchmark:

LeNLP BM25Vectorizer versus LeNLP TfidfVectorizer fit_transform with char analyzer. scikit-learn does not provide a BM25Vectorizer counterpart.

FlashText

The flash module provides efficient keyword extraction from texts. It implements the FlashText algorithm as described in the paper Replace or Retrieve Keywords In Documents At Scale.

from lenlp import flash

flash_text = flash.FlashText(
    normalize=True # remove accents and lowercase
) 

# Add keywords we want to retrieve:
flash_text.add(["paris", "bordeaux", "toulouse"])

Extract keywords and their positions from sentences:

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

flash_text.extract(sentences)

Output:

[[('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)],
 [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]]

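The FlashText algorithm scans each document once, so its cost does not grow with the number of keywords. This makes it significantly faster than regular expressions for keyword extraction when the keyword list is large. LeNLP's implementation normalizes input documents by removing accents and converting to lowercase, so that keywords match regardless of case or accents.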
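
To give an idea of how this kind of extraction works, here is a simplified pure-Python sketch of trie-based keyword matching on word boundaries. It is not LeNLP's Rust implementation; the functions `build_trie` and `extract` are hypothetical, and only lowercasing is applied (no accent removal).

```python
def build_trie(keywords):
    """Store each keyword character by character; '_end_' marks a complete keyword."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword:
            node = node.setdefault(char, {})
        node["_end_"] = keyword
    return trie


def extract(trie, text):
    """Return (keyword, start, end) for the longest matches that sit on word boundaries."""
    # Assumes lowercasing does not change the string length (true for ASCII).
    lowered = text.lower()
    matches, i, n = [], 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word.
        if lowered[i].isalnum() and (i == 0 or not lowered[i - 1].isalnum()):
            node, j, best = trie, i, None
            while j < n and lowered[j] in node:
                node = node[lowered[j]]
                j += 1
                # Accept a keyword only if it ends on a word boundary.
                if "_end_" in node and (j == n or not lowered[j].isalnum()):
                    best = (node["_end_"], i, j)
            if best:
                matches.append(best)
                i = best[2]
                continue
        i += 1
    return matches


trie = build_trie(["paris", "bordeaux", "toulouse"])
found = extract(trie, "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse")
```

Each character of the text is visited a bounded number of times regardless of how many keywords are stored, which is why the approach scales well with large keyword lists.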

Benchmark:

LeNLP FlashText is benchmarked versus the official implementation of FlashText.

Extras

Counter

The counter module converts a list of texts into a list of dictionaries of token counts, one dictionary per text.

from lenlp import counter

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

counter.count(
    sentences,
    ngram_range=(1, 1), # Range of n-grams
    analyzer="word", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"] # List of stop words
)

Output:

[{'compared': 1,
  'south': 1,
  'city': 1,
  'toulouse': 1,
  'bordeaux': 2,
  'france': 1},
 {'toulouse': 1,
  'france': 1,
  'capital': 1,
  'paris': 1,
  'north': 1,
  'compared': 1,
  'bordeaux': 1}]
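
For comparison, the same kind of counts can be sketched with the standard library. The function `count_words` is hypothetical and covers only the word analyzer; it assumes stop words are removed before n-grams are formed and it does not strip accents, so it may differ from LeNLP on other inputs.

```python
import re
from collections import Counter


def count_words(texts, ngram_range=(1, 1), stop_words=()):
    """Count word n-grams per text after lowercasing and removing punctuation."""
    stop = set(stop_words)
    min_n, max_n = ngram_range
    results = []
    for text in texts:
        # Lowercase and drop punctuation; accents are left untouched in this sketch.
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        words = [word for word in words if word not in stop]
        results.append(Counter(
            " ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)
        ))
    return results


sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]
counts = count_words(
    sentences,
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"],
)
```

Note that `"it's"` becomes `"its"` once punctuation is removed, which is why `"its"` is listed among the stop words.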

Normalizer

The normalizer module normalizes a list of texts by removing accents and punctuation and converting to lowercase.

from lenlp import normalizer

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

normalizer.normalize(sentences)

Output:

['toulouse is a city in france its in the south compared to bordeaux and bordeaux',
 'paris is the capital of france its in the north compared to bordeaux and toulouse']
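
A comparable normalization can be sketched with Python's `unicodedata` module. This is an approximation inferred from the output above, not LeNLP's actual Rust code, and the standalone function `normalize` is hypothetical.

```python
import unicodedata


def normalize(text):
    """Strip accents and punctuation, then lowercase."""
    # Decompose accented characters so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    kept = (
        char for char in decomposed
        # Drop combining marks (accents) and anything that is not a letter, digit or space.
        if not unicodedata.combining(char) and (char.isalnum() or char.isspace())
    )
    return "".join(kept).lower()


normalized = [normalize(text) for text in ["Équipe à Orléans!", "France, it's in the south"]]
```

The decomposition step is what turns `"É"` into a plain `"e"`: the accent is split off as a combining mark and then discarded.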
