LeNLP

Natural Language Processing toolbox for Python with Rust

LeNLP is a toolkit dedicated to natural language processing (NLP). It provides optimized and parallelized functions in Rust for use in Python, offering high performance and ease of integration.

Installation

Install LeNLP from PyPI with pip:

pip install lenlp

Quick Start

Sparse Module

The sparse module offers a variety of vectorizers and transformers for text data. Each vectorizer returns a scipy.sparse.csr_matrix, a format optimized for memory usage and speed. The vectorizers can be used as drop-in replacements for scikit-learn vectorizers.

CountVectorizer

The CountVectorizer converts a list of texts into a sparse matrix of token counts. This is a Rust implementation of the CountVectorizer from scikit-learn.

from lenlp import sparse

vectorizer = sparse.CountVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

You can fit the vectorizer and transform a list of texts into a sparse matrix of token counts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
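
To make the char_wb analyzer more concrete, here is a minimal pure-Python sketch of character n-gram counting inside word boundaries. It is not LeNLP's Rust implementation: it assumes the scikit-learn convention of padding each word with one space, ignores stop words and accent stripping, and skips words shorter than the n-gram size. The helper names char_wb_ngrams and count_matrix are hypothetical.

```python
from collections import Counter


def char_wb_ngrams(text, ngram_range=(3, 5)):
    """Return character n-grams taken inside word boundaries.

    Assumption: each word is lowercased and padded with one space on
    both sides, as scikit-learn does for analyzer="char_wb".
    """
    min_n, max_n = ngram_range
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


def count_matrix(texts, ngram_range=(3, 5)):
    """Build a vocabulary and one sparse row ({column: count}) per text."""
    counts = [Counter(char_wb_ngrams(t, ngram_range)) for t in texts]
    vocabulary = {g: j for j, g in enumerate(sorted(set().union(*counts)))}
    rows = [{vocabulary[g]: c for g, c in row.items()} for row in counts]
    return vocabulary, rows


vocabulary, rows = count_matrix(["Hello World", "Rust based vectorizer"])
```

Each row maps a column index to a count, which is the same information a csr_matrix row stores, only without the compact array layout.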

Benchmark:

LeNLP CountVectorizer versus Sklearn CountVectorizer fit_transform with char analyzer.

TfidfVectorizer

The TfidfVectorizer converts a list of texts into a sparse matrix of tf-idf weights, implemented in Rust.

from lenlp import sparse

vectorizer = sparse.TfidfVectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
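
For readers who want to see what a tf-idf weight is, the sketch below computes one in plain Python. It assumes the smoothed idf and L2 row normalization that scikit-learn's TfidfVectorizer applies by default; whether LeNLP uses exactly these formulas is an assumption, and the function name tfidf is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(token for doc in docs for token in set(doc))
    # Assumed formula: smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        # L2-normalize the row so that documents of any length are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf([["rust", "vectorizer"], ["rust", "python"]])
```

In this toy corpus "rust" appears in both documents, so it receives a lower weight than "vectorizer" or "python", which each appear in only one.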

Benchmark:

LeNLP TfidfVectorizer versus Sklearn TfidfVectorizer fit_transform with char analyzer.

BM25Vectorizer

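The BM25Vectorizer converts texts into a sparse matrix of BM25 weights. BM25 dampens the effect of repeated terms and normalizes for document length, which often makes it more accurate than tf-idf and count weights for retrieval tasks.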

from lenlp import sparse

vectorizer = sparse.BM25Vectorizer(
    ngram_range=(3, 5), # Range of n-grams
    analyzer="char_wb", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["based"] # List of stop words
)

Fit the vectorizer and transform texts:

X = [
    "Hello World", 
    "Rust based vectorizer"
]

matrix = vectorizer.fit_transform(X)

Or use separate calls:

vectorizer.fit(X)
matrix = vectorizer.transform(X)
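
As an illustration of how a BM25 weight differs from a raw count, the following sketch implements the common Okapi BM25 term weight in plain Python. The parameter values k1=1.5 and b=0.75, the non-negative idf variant, and the function name bm25 are assumptions chosen for the example; LeNLP's defaults and exact formula may differ.

```python
import math
from collections import Counter


def bm25(docs, k1=1.5, b=0.75):
    """docs: list of token lists. Returns one {token: weight} dict per doc.

    k1 controls term-frequency saturation and b controls document-length
    normalization (both values are assumptions for this sketch).
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    df = Counter(t for d in docs for t in set(d))
    rows = []
    for doc in docs:
        # Longer-than-average documents get a larger denominator.
        length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        row = {}
        for token, tf in Counter(doc).items():
            # Assumed idf variant: always non-negative.
            idf = math.log((n_docs - df[token] + 0.5) / (df[token] + 0.5) + 1.0)
            row[token] = idf * tf * (k1 + 1.0) / (tf + length_norm)
        rows.append(row)
    return rows


rows = bm25([["rust", "rust", "rust", "fast"], ["rust", "python"]])
```

Although "rust" occurs three times in the first document, its weight stays below that of "fast": the term is common across the corpus, and repeated occurrences saturate instead of growing linearly.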

Benchmark:

LeNLP BM25Vectorizer versus LeNLP TfidfVectorizer fit_transform with char analyzer. Sklearn has no BM25Vectorizer counterpart.

FlashText

The flash module extracts keywords from texts efficiently. It implements the FlashText algorithm as described in the paper Replace or Retrieve Keywords In Documents At Scale.

from lenlp import flash

flash_text = flash.FlashText(
    normalize=True # remove accents and lowercase
) 

# Add keywords we want to retrieve:
flash_text.add(["paris", "bordeaux", "toulouse"])

Extract keywords and their positions from sentences:

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

flash_text.extract(sentences)

Output:

[[('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)],
 [('paris', 0, 5), ('bordeaux', 62, 70), ('toulouse', 76, 84)]]

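The FlashText algorithm is highly efficient: its running time depends on the length of the document rather than on the number of keywords, which makes it significantly faster than regular expressions when there are many keywords. With normalize=True, LeNLP's implementation also removes accents and lowercases input documents before matching, so keywords are found regardless of case or accents.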
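
The idea behind FlashText can be sketched in a few lines of Python: keywords are stored in a character trie, and each document is scanned once, accepting a match only when it starts and ends on a word boundary. This is an illustrative sketch, not LeNLP's Rust implementation; the "_end_" marker and the function names are hypothetical, and accent removal is omitted (lowercasing is assumed not to change string length, which holds for ASCII text).

```python
def build_trie(keywords):
    """Store keywords in a nested-dict trie, one level per character."""
    trie = {}
    for keyword in keywords:
        node = trie
        for char in keyword:
            node = node.setdefault(char, {})
        node["_end_"] = keyword  # hypothetical end-of-keyword marker
    return trie


def is_word_char(char):
    return char.isalnum() or char == "_"


def extract(text, trie):
    """Return (keyword, start, end) tuples found in a single pass."""
    text = text.lower()
    matches, i, n = [], 0, len(text)
    while i < n:
        # A keyword may only start at the beginning of a word.
        if is_word_char(text[i]) and (i == 0 or not is_word_char(text[i - 1])):
            node, j, best = trie, i, None
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept only if the match also ends at a word boundary.
                if "_end_" in node and (j == n or not is_word_char(text[j])):
                    best = (node["_end_"], i, j)
            if best:
                matches.append(best)
                i = best[2]
                continue
        i += 1
    return matches


trie = build_trie(["paris", "bordeaux", "toulouse"])
found = extract(
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    trie,
)
```

On this sentence the sketch yields [('toulouse', 0, 8), ('bordeaux', 60, 68), ('bordeaux', 74, 82)]. The scan never loops over the keyword list, which is why the cost does not grow with the number of keywords.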

Benchmark:

LeNLP FlashText is benchmarked versus the official implementation of FlashText.

Extras

Counter

The counter module converts a list of texts into a list of dictionaries of token counts, one dictionary per text.

from lenlp import counter

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

counter.count(
    sentences,
    ngram_range=(1, 1), # Range of n-grams
    analyzer="word", # Options: word, char, char_wb
    normalize=True, # Lowercase and strip accents
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"] # List of stop words
)

Output:

[{'compared': 1,
  'south': 1,
  'city': 1,
  'toulouse': 1,
  'bordeaux': 2,
  'france': 1},
 {'toulouse': 1,
  'france': 1,
  'capital': 1,
  'paris': 1,
  'north': 1,
  'compared': 1,
  'bordeaux': 1}]
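
For comparison, the same word-level counts can be reproduced with the Python standard library. The sketch below is a plain-Python approximation, not LeNLP's implementation: it lowercases, drops punctuation with a regular expression, and skips accent stripping. The function name count_words is hypothetical.

```python
import re
from collections import Counter


def count_words(texts, stop_words=()):
    """Return one {token: count} dict per text (analyzer="word", unigrams)."""
    stop = set(stop_words)
    results = []
    for text in texts:
        # Assumed normalization: lowercase and remove punctuation.
        cleaned = re.sub(r"[^\w\s]", "", text.lower())
        tokens = (w for w in cleaned.split() if w not in stop)
        results.append(dict(Counter(tokens)))
    return results


sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]
counts = count_words(
    sentences,
    stop_words=["its", "in", "is", "of", "the", "and", "to", "a"],
)
```

Note that "it's" becomes "its" once punctuation is removed, which is why the stop word list contains "its" rather than "it's".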

Normalizer

The normalizer module normalizes a list of texts by removing accents and converting to lowercase. As the output below shows, punctuation is removed as well.

from lenlp import normalizer

sentences = [
    "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
    "Paris is the capital of France, it's in the north compared to bordeaux, and toulouse",
]

normalizer.normalize(sentences)

Output:

['toulouse is a city in france its in the south compared to bordeaux and bordeaux',
 'paris is the capital of france its in the north compared to bordeaux and toulouse']
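
A comparable normalization can be sketched with the standard library's unicodedata module. This is an approximation under stated assumptions (NFKD decomposition to strip accents, a regular expression to drop punctuation) and not LeNLP's Rust code; the function name normalize_text is hypothetical.

```python
import re
import unicodedata


def normalize_text(text):
    """Strip accents, lowercase, and remove punctuation from one string."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    # Keep only word characters and whitespace.
    return re.sub(r"[^\w\s]", "", without_accents.lower())


normalized = [
    normalize_text(s)
    for s in [
        "Toulouse is a city in France, it's in the south compared to bordeaux, and bordeaux",
        "Café, Élan!",
    ]
]
```

The second input becomes "cafe elan", showing the accent and punctuation handling on non-ASCII text.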

Release history

1.1.1 (Jun 2, 2024)
1.1.0 (Jun 1, 2024)
1.0.6 (Jun 1, 2024)
1.0.5 (Jun 1, 2024), this version
1.0.4 (May 31, 2024)
1.0.3 (May 26, 2024)
1.0.2 (May 26, 2024)
1.0.1 (May 26, 2024)
1.0.0 (May 26, 2024)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lenlp-1.0.5.tar.gz (10.2 kB view details)

Uploaded Jun 1, 2024 Source

Built Distributions

lenlp-1.0.5-cp311-cp311-win_amd64.whl (385.8 kB view details)

Uploaded Jun 1, 2024 CPython 3.11 Windows x86-64

lenlp-1.0.5-cp311-cp311-macosx_14_0_universal2.whl (510.3 kB view details)

Uploaded Jun 1, 2024 CPython 3.11 macOS 14.0+ universal2 (ARM64, x86-64)

lenlp-1.0.5-cp310-cp310-win_amd64.whl (386.7 kB view details)

Uploaded Jun 1, 2024 CPython 3.10 Windows x86-64

lenlp-1.0.5-cp310-cp310-manylinux2014_x86_64.whl (747.3 kB view details)

Uploaded Jun 1, 2024 CPython 3.10

lenlp-1.0.5-cp310-cp310-macosx_14_0_universal2.whl (509.4 kB view details)

Uploaded Jun 1, 2024 CPython 3.10 macOS 14.0+ universal2 (ARM64, x86-64)

lenlp-1.0.5-cp39-cp39-win_amd64.whl (386.3 kB view details)

Uploaded Jun 1, 2024 CPython 3.9 Windows x86-64

lenlp-1.0.5-cp39-cp39-macosx_14_0_universal2.whl (509.6 kB view details)

Uploaded Jun 1, 2024 CPython 3.9 macOS 14.0+ universal2 (ARM64, x86-64)

lenlp-1.0.5-cp38-cp38-win_amd64.whl (386.2 kB view details)

Uploaded Jun 1, 2024 CPython 3.8 Windows x86-64

lenlp-1.0.5-cp38-cp38-macosx_14_0_universal2.whl (509.7 kB view details)

Uploaded Jun 1, 2024 CPython 3.8 macOS 14.0+ universal2 (ARM64, x86-64)

File details

Details for the file lenlp-1.0.5.tar.gz.

File metadata

Download URL: lenlp-1.0.5.tar.gz
Upload date: Jun 1, 2024
Size: 10.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`49bd6ff5abd0bc9c124031813178fecf2a3132809936b3fee5e0b63f0ebd47c8`
MD5	`f0b8a7dcae525dcb89b6bd7a6f1a42ba`
BLAKE2b-256	`5154eebdd632966698028b29faf2309977e88e00d425a4841959e70f822c2407`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp311-cp311-win_amd64.whl.

File metadata

Download URL: lenlp-1.0.5-cp311-cp311-win_amd64.whl
Upload date: Jun 1, 2024
Size: 385.8 kB
Tags: CPython 3.11, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp311-cp311-win_amd64.whl
Algorithm	Hash digest
SHA256	`bcb2d912adbd32b00450958bdfd42f554f9cd212fdd48fcdcea21b56ff079ce1`
MD5	`e17d953713c1a94b12b3d3cc03dbe630`
BLAKE2b-256	`b01da2e2229d33870027268a2b275615f84bbcca4eded71eaac7625978255245`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp311-cp311-macosx_14_0_universal2.whl.

File metadata

Download URL: lenlp-1.0.5-cp311-cp311-macosx_14_0_universal2.whl
Upload date: Jun 1, 2024
Size: 510.3 kB
Tags: CPython 3.11, macOS 14.0+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp311-cp311-macosx_14_0_universal2.whl
Algorithm	Hash digest
SHA256	`3eeb9f3a221b992d91f69856c537135797345bd7500b714ad20b7c2ffdf7725d`
MD5	`b7901e96e6f3e8c8243ccd4bd6c72682`
BLAKE2b-256	`5aa4417113a962d4a0c1200187ac1650f479685daf977056dbd0ef72b860c838`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp310-cp310-win_amd64.whl.

File metadata

Download URL: lenlp-1.0.5-cp310-cp310-win_amd64.whl
Upload date: Jun 1, 2024
Size: 386.7 kB
Tags: CPython 3.10, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp310-cp310-win_amd64.whl
Algorithm	Hash digest
SHA256	`83bc990ee4d0665974b75277aa046fa9c25b133732869387a6b13af302993893`
MD5	`4315a5520fe273f609456ea886703478`
BLAKE2b-256	`ac207c5ae911366fcf3b1869c67eed4fc669513bf043a44ac7f3858b8c998232`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

Download URL: lenlp-1.0.5-cp310-cp310-manylinux2014_x86_64.whl
Upload date: Jun 1, 2024
Size: 747.3 kB
Tags: CPython 3.10
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp310-cp310-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`987f485a1c98c66b7de299d7ed2ae7da3b3067992e8945c18bb36b00def0da95`
MD5	`53d888a016878532dae3cd6f9cece5bd`
BLAKE2b-256	`85e8c73f65fe4a659f2eeea5cabe92d185df0dc35659623582003958de8f88e5`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp310-cp310-macosx_14_0_universal2.whl.

File metadata

Download URL: lenlp-1.0.5-cp310-cp310-macosx_14_0_universal2.whl
Upload date: Jun 1, 2024
Size: 509.4 kB
Tags: CPython 3.10, macOS 14.0+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp310-cp310-macosx_14_0_universal2.whl
Algorithm	Hash digest
SHA256	`1b775acf609e893702eb2eecb4d7bcd7517d01382550bd5a1560a525f710a253`
MD5	`69795471a9f5bd0405be2bbf3fd077e4`
BLAKE2b-256	`068a7a40ef1a42f0c65115a466355642efb0f0955279d1e925a92c6d8d026e78`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp39-cp39-win_amd64.whl.

File metadata

Download URL: lenlp-1.0.5-cp39-cp39-win_amd64.whl
Upload date: Jun 1, 2024
Size: 386.3 kB
Tags: CPython 3.9, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp39-cp39-win_amd64.whl
Algorithm	Hash digest
SHA256	`9ccbe7a2700d0f343a5434ef0d54f66437662ca91dcbb1580511216204648362`
MD5	`ddd92b5d4d6925c73fcd8df9fca80e9b`
BLAKE2b-256	`fe01ab184f9538168c6b76ad29af7ed9d9f6cabac076cfc30d15be55409d5841`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp39-cp39-macosx_14_0_universal2.whl.

File metadata

Download URL: lenlp-1.0.5-cp39-cp39-macosx_14_0_universal2.whl
Upload date: Jun 1, 2024
Size: 509.6 kB
Tags: CPython 3.9, macOS 14.0+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp39-cp39-macosx_14_0_universal2.whl
Algorithm	Hash digest
SHA256	`f93f274691e99f7943b6746241fdeb4795ad2646aed0336d1b9ce9fc3de3c076`
MD5	`d3b2f57578809cd1a8cd01f18beec730`
BLAKE2b-256	`d91000602c5de62b15931d7fbc70b491bd0d0770c61e7cc69da45a61715043c2`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp38-cp38-win_amd64.whl.

File metadata

Download URL: lenlp-1.0.5-cp38-cp38-win_amd64.whl
Upload date: Jun 1, 2024
Size: 386.2 kB
Tags: CPython 3.8, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp38-cp38-win_amd64.whl
Algorithm	Hash digest
SHA256	`881e50d94f512eae8bdb8bc0020dcd251c3a3a643c29921602123042724e907e`
MD5	`404dcb145ac50c22131491db830124bb`
BLAKE2b-256	`589f2073d3acfb17182e7019c6e678461943f04b2597d19a979c158d2593a93c`

See more details on using hashes here.

File details

Details for the file lenlp-1.0.5-cp38-cp38-macosx_14_0_universal2.whl.

File metadata

Download URL: lenlp-1.0.5-cp38-cp38-macosx_14_0_universal2.whl
Upload date: Jun 1, 2024
Size: 509.7 kB
Tags: CPython 3.8, macOS 14.0+ universal2 (ARM64, x86-64)
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for lenlp-1.0.5-cp38-cp38-macosx_14_0_universal2.whl
Algorithm	Hash digest
SHA256	`2ff0a63640336d7fddb500e25d0e139c6c0b3d35dd49c7aecd7de918e09d7044`
MD5	`196515c3c12b0954d956945e2a54d5ad`
BLAKE2b-256	`f3e535c949efc4a7b5933e6ae7f241e416893d5fe890456a78536b62cb1ab46f`

See more details on using hashes here.
