
A fast NLP tokenizer that detects tokens and removes duplications and punctuation

Project description

doc2term

Build Status pypi license

A fast NLP tokenizer that detects sentences, words, numbers, urls, hostnames, emails, filenames, and phone numbers. The tokenizer integrates and standardizes documents, removing punctuation and duplications.

Installation

pip install doc2term

Compilation

Installation requires compiling the original C code with gcc.

Usage

Example notebook: doc2term

Example

>>> import doc2term

>>> doc2term.doc2term_str("Actions speak louder than words. ... ")
"Actions speak louder than words ."
>>> doc2term.doc2term_str("You can't judge a book by its cover. ... from thoughtcatalog.com")
"You can't judge a book by its cover . from thoughtcatalog.com"
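The token classes listed above (urls, emails, numbers, words) can be approximated in pure Python. This is an illustrative regex sketch only, not doc2term's actual C implementation; the pattern names and rules here are simplified assumptions:

```python
import re

# Simplified, illustrative patterns for a few of the token classes
# doc2term detects. Not the library's actual rules.
TOKEN_PATTERNS = [
    ("url",    r"https?://\S+"),
    ("email",  r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    ("number", r"\b\d+(?:\.\d+)?\b"),
    ("word",   r"\b[A-Za-z]+(?:'[A-Za-z]+)?\b"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(text):
    """Yield (token_class, token) pairs; punctuation between matches is skipped."""
    for m in TOKEN_RE.finditer(text):
        yield m.lastgroup, m.group()
```

For example, `list(tokenize("mail a@b.com"))` classifies `a@b.com` as an email token while `mail` falls through to the generic word pattern, since the alternation tries the more specific classes first.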

Project details


Release history

This version

0.1

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

doc2term-0.1.tar.gz (8.3 kB)

Uploaded Source

File details

Details for the file doc2term-0.1.tar.gz.

File metadata

  • Download URL: doc2term-0.1.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.0.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.9.2

File hashes

Hashes for doc2term-0.1.tar.gz

  • SHA256: 1b684765faecccd53c1c1509e09e476b4c5bbd2698520dbad2cac25bc540e654
  • MD5: 240a7a0fa821a2c959dbd1047a2589fb
  • BLAKE2b-256: ab40088c628f964db9636e947ab8fe5686f0b47865ff44edfd5af81f9638e68d

See more details on using hashes here.
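The published hashes can be used to verify that a downloaded distribution is intact. A minimal sketch using Python's standard `hashlib` (the filename and expected digest are taken from the release details above):

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

EXPECTED = "1b684765faecccd53c1c1509e09e476b4c5bbd2698520dbad2cac25bc540e654"
# After downloading doc2term-0.1.tar.gz, compare:
#   sha256_of_file("doc2term-0.1.tar.gz") == EXPECTED
```

A mismatch means the file was corrupted or tampered with in transit and should not be installed.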
