LeNLP

Natural Language Processing toolbox for Python with Rust

LeNLP is a toolkit dedicated to natural language processing (NLP). It provides optimized and parallelized functions in Rust for use in Python, offering high performance and ease of integration.

Installation

Install LeNLP with pip:

pip install lenlp

Quick Start

Sparse Module

The sparse module offers a variety of vectorizers and transformers for text data. Each vectorizer returns a scipy.sparse.csr_matrix, a format optimized for memory usage and speed. The vectorizers can be used as drop-in replacements for their scikit-learn counterparts.

CountVectorizer

The CountVectorizer converts a list of texts into a sparse matrix of token counts. This is a Rust implementation of the CountVectorizer from scikit-learn.

from lenlp import sparse

vectorizer = sparse.CountVectorizer(
    ngram_range=(3, 5), # range of n-grams
    analyzer="char_wb", # word, char, char_wb
    normalize=True, # lowercase and strip accents
    stop_words=["based"], # list of stop words
)

You can fit the vectorizer and transform a list of texts into a sparse matrix of token counts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
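
To make the `analyzer` and `ngram_range` options more concrete, here is a minimal pure-Python sketch of what `char_wb` n-gram counting looks like. It is an illustration only: the helper name `char_wb_ngrams` is hypothetical, it follows the scikit-learn convention of padding each word with one space, and LeNLP's Rust implementation may treat edge cases (short words, stop words, accents) differently.

```python
from collections import Counter


def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Illustrative helper (not part of LeNLP): character n-grams that
    never cross word boundaries, with each word padded by one space."""
    min_n, max_n = ngram_range
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word.
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


# Count 3-grams only, to keep the result small.
counts = Counter(char_wb_ngrams("Hello World", ngram_range=(3, 3)))
print(sorted(counts))
```

Because every word is processed separately, no n-gram spans the gap between "hello" and "world"; that is the difference between `char_wb` and plain `char`.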

Benchmark:

LeNLP CountVectorizer versus Sklearn CountVectorizer fit_transform with char analyzer.

TfidfVectorizer

The TfidfVectorizer converts a list of texts into a sparse matrix of tf-idf weights, implemented in Rust.

from lenlp import sparse

vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
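
For readers unfamiliar with tf-idf, the sketch below computes the weights for pre-tokenized documents in plain Python. It assumes the smoothed idf and L2 row normalization that scikit-learn uses by default; the source does not state which exact formula LeNLP applies, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Illustrative helper (not LeNLP's code).
    docs: list of token lists. Returns one {token: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed idf (assumed): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize each row so documents of different length compare fairly.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


rows = tfidf([["rust", "based", "vectorizer"], ["rust", "toolkit"]])
print(rows[0])
```

A token that occurs in every document ("rust" here) receives a lower weight than one that occurs in a single document.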

Benchmark:

LeNLP TfidfVectorizer versus Sklearn TfidfVectorizer fit_transform with char analyzer.

BM25Vectorizer

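The BM25Vectorizer converts texts into a sparse matrix of BM25 weights. BM25 dampens the effect of repeated terms and corrects for document length, so it often ranks documents better than tf-idf or raw counts in retrieval tasks.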

from lenlp import sparse

vectorizer = sparse.BM25Vectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
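
As a rough guide to what a BM25 weight is, the following sketch implements one common formulation (Okapi BM25 with a non-negative idf) for pre-tokenized documents. The parameter values `k1=1.5` and `b=0.75` and the function name `bm25_weights` are assumptions made for illustration; the source does not say which variant or defaults LeNLP uses.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.5, b=0.75):
    """Illustrative helper (not LeNLP's code).
    docs: non-empty list of token lists. Returns one {token: weight} dict per doc."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs  # average document length
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        rows.append({
            t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            * c * (k1 + 1) / (c + length_norm)
            for t, c in tf.items()
        })
    return rows


weights = bm25_weights([["rust", "rust", "fast"], ["rust", "python", "slow"]])
print(weights[0])
```

Unlike raw counts, the term-frequency part saturates: "rust" appears twice in the first document but receives less than twice the weight it gets in the second.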

Benchmark:

LeNLP BM25Vectorizer versus LeNLP TfidfVectorizer fit_transform with char analyzer. BM25Vectorizer counterpart is not available in Sklearn.

FlashText

The flashtext module allows for efficient keyword extraction from texts. It implements the FlashText algorithm as described in the paper Replace or Retrieve Keywords In Documents At Scale.

from lenlp import flash

flash_text = flash.FlashText(
    normalize=True # remove accents and lowercase
) 

# Add keywords we want to retrieve:
flash_text.add(["paris", "bordeaux", "toulouse"])

Extract keywords and their positions from sentences:

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

flash_text.extract(sentences)

Output:

[[('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)],
 [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]]

The FlashText algorithm scans each document once, so its running time is largely independent of the number of keywords. This makes it significantly faster than regular expressions for keyword extraction, especially with large keyword lists. With normalize=True, LeNLP's implementation lowercases input documents and removes accents before matching.
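
The core idea can be sketched in a few lines of Python: keywords are stored in a character trie, and each text is walked once while following trie edges. This is a simplified illustration, not LeNLP's Rust code; the names `build_trie` and `extract_keywords` are hypothetical, and only lowercasing (no accent removal) is applied.

```python
def build_trie(keywords):
    """Store keywords in a nested-dict trie; "_end" marks a complete keyword."""
    trie = {}
    for keyword in keywords:
        node = trie
        for ch in keyword:
            node = node.setdefault(ch, {})
        node["_end"] = keyword
    return trie


def extract_keywords(text, trie):
    """Return (keyword, start, end) tuples for whole-word matches."""
    text = text.lower()
    matches, i, n = [], 0, len(text)
    while i < n:
        # Only start a match at the beginning of a word.
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, j, found = trie, i, None
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept only if the keyword ends at a word boundary.
            if "_end" in node and (j == n or not text[j].isalnum()):
                found = (node["_end"], i, j)  # keep the longest match
        if found:
            matches.append(found)
            i = found[2]
        else:
            i += 1
    return matches


trie = build_trie(["paris", "bordeaux", "toulouse"])
sentence = "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux"
print(extract_keywords(sentence, trie))
# [('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)]
```

Adding more keywords grows the trie but does not add passes over the text, which is where the advantage over trying one regular expression per keyword comes from.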

Benchmark:

LeNLP FlashText is benchmarked versus the official implementation of FlashText.

Extras

Counter

The counter module converts a list of texts into a list of dictionaries of token counts, one per text.

from lenlp import counter

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

counter.count(
    sentences,
    ngram_range=(1, 1), # Range of n-grams
    analyzer="word", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"] # List of stop words
)

Output:

[{'compared': 1,
  'south': 1,
  'city': 1,
  'toulouse': 1,
  'bordeaux': 2,
  'france': 1},
 {'toulouse': 1,
  'france': 1,
  'capital': 1,
  'paris': 1,
  'north': 1,
  'compared': 1,
  'bordeaux': 1}]
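
For comparison, the same kind of result can be approximated with the standard library alone. The sketch below handles only the word analyzer with unigrams, and the helper name `count_tokens` is hypothetical; it is meant to show what the output represents, not how LeNLP computes it.

```python
import re
from collections import Counter


def count_tokens(texts, stop_words=()):
    """Illustrative helper (not LeNLP's code): per-document word counts."""
    stop = set(stop_words)
    results = []
    for text in texts:
        # Lowercase and drop punctuation, so "it's" becomes "its".
        cleaned = re.sub(r"[^\w\s]", "", text.lower())
        tokens = [tok for tok in cleaned.split() if tok not in stop]
        results.append(dict(Counter(tokens)))
    return results


result = count_tokens(
    ["Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux"],
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"],
)
print(result)
```

Note that "its" has to be listed as a stop word without the apostrophe, because punctuation is removed before stop words are filtered.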

Normalizer

The normalizer module normalizes a list of texts by removing accents and converting to lowercase. As the output below shows, punctuation is removed as well.

from lenlp import normalizer

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

normalizer.normalize(sentences)

Output:

[
    'toulouse is a city in france its in the south compared to bordeaux and bordeaux',
    'paris is the capital of france its in the north compared to bordeaux and toulouse',
]
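
A comparable normalization can be written with Python's `unicodedata` module, which may help when reasoning about what the normalized text will look like. This is a sketch under the assumption that accents are removed via Unicode decomposition; the function name `normalize_text` is hypothetical and LeNLP's exact rules may differ.

```python
import unicodedata


def normalize_text(text):
    """Illustrative helper (not LeNLP's code): strip accents, lowercase,
    drop punctuation, and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Keep letters, digits, and whitespace only.
    kept = "".join(ch for ch in lowered if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


print(normalize_text("Café, it's À Paris"))
# cafe its a paris
```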
