Pre-process documents for Natural Language Processing using spaCy models

Project description

document_processing

This package provides functions to pre-process text for various NLP tasks. It uses spaCy and its models to analyse the text.

Behaviour

The entry point of this package is process_documents, which takes the Series of documents to process and the name of the spaCy model to load for transforming the texts.

From a processed document, you can extract tokens, lemmas and entities with the get_tokens_lemmas_entities_from_document function. Pass it a document returned by the entry-point function, together with one of the pre-processing functions described below.
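To make the extraction step concrete, here is a minimal sketch of what pulling tokens, lemmas and entities out of a spaCy document might look like. It is not this package's implementation: the helper name extract_tokens_lemmas_entities, its signature and the StandInDoc class are assumptions made for illustration. The helper only relies on attributes that spaCy documents expose (iteration over tokens, token.text, token.lemma_, doc.ents, ent.text, ent.label_), and the stand-in objects let the example run without downloading a model.

```python
from types import SimpleNamespace
from typing import Callable, List, Tuple


def extract_tokens_lemmas_entities(
    doc, preprocess: Callable[[List[str]], List[str]]
) -> Tuple[List[str], List[str], List[Tuple[str, str]]]:
    """Hypothetical helper: collect tokens, lemmas and entities from a Doc-like object."""
    # The pre-processing function is applied to token texts and lemmas alike.
    tokens = preprocess([token.text for token in doc])
    lemmas = preprocess([token.lemma_ for token in doc])
    # Entities are kept as (text, label) pairs and are not pre-processed here.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return tokens, lemmas, entities


class StandInDoc:
    """Stand-in for a spaCy Doc (assumption): iterable of tokens plus an `ents` attribute."""

    def __init__(self, tokens, ents):
        self._tokens = tokens
        self.ents = ents

    def __iter__(self):
        return iter(self._tokens)


doc = StandInDoc(
    tokens=[
        SimpleNamespace(text="Cats", lemma_="cat"),
        SimpleNamespace(text="love", lemma_="love"),
        SimpleNamespace(text="Paris", lemma_="Paris"),
    ],
    ents=[SimpleNamespace(text="Paris", label_="GPE")],
)


def lowercase(items: List[str]) -> List[str]:
    """Trivial pre-processing function used only for this demo."""
    return [item.lower() for item in items]


tokens, lemmas, entities = extract_tokens_lemmas_entities(doc, lowercase)
print(tokens)    # ['cats', 'love', 'paris']
print(lemmas)    # ['cat', 'love', 'paris']
print(entities)  # [('Paris', 'GPE')]
```

With spaCy and a model installed, a real document such as spacy.load("en_core_web_sm")("Cats love Paris") could be passed in place of the stand-in.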

Pre-processing functions

  • preprocess_list_of_texts: processes tokens, removing stopwords, non-standard characters, etc.
  • preprocess_list_of_tweets: same as above, and also removes all tokens that look like HTTP links, which are often present in Tweets.
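
The difference between the two functions above can be illustrated with a small standard-library sketch. This is not the package's code: the clean_tokens helper, its drop_links flag, the tiny stopword set and both regular expressions are assumptions chosen for the example, and the package itself may filter tokens differently.

```python
import re
from typing import Iterable, List

# Tiny illustrative stopword set (assumption); spaCy models ship much larger ones.
STOPWORDS = {"a", "an", "and", "at", "is", "of", "the", "to"}

# A token is treated as a link if it starts with http://, https:// or www.
URL_PATTERN = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)

# Anything other than lowercase ASCII letters and digits counts as non-standard here.
NON_STANDARD = re.compile(r"[^a-z0-9]")


def clean_tokens(tokens: Iterable[str], drop_links: bool = False) -> List[str]:
    """Hypothetical helper: lowercase, strip odd characters, drop stopwords and (optionally) links."""
    cleaned = []
    for token in tokens:
        if drop_links and URL_PATTERN.match(token):
            continue  # tweet-style behaviour: discard link-like tokens entirely
        normalised = NON_STANDARD.sub("", token.lower())
        if normalised and normalised not in STOPWORDS:
            cleaned.append(normalised)
    return cleaned


tweet = ["Check", "the", "docs", "at", "https://example.org/x", "!!", "#NLP"]

print(clean_tokens(tweet))                   # ['check', 'docs', 'httpsexampleorgx', 'nlp']
print(clean_tokens(tweet, drop_links=True))  # ['check', 'docs', 'nlp']
```

Without link removal, the URL survives as a meaningless token once its punctuation is stripped, which is why a tweet-specific variant is useful.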

Download files

Source Distribution

document_processing-1.0.0.202203292014.tar.gz (15.9 kB view details)

Uploaded Source

Built Distribution

File details

Details for the file document_processing-1.0.0.202203292014.tar.gz.

File metadata

  • Download URL: document_processing-1.0.0.202203292014.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for document_processing-1.0.0.202203292014.tar.gz
Algorithm Hash digest
SHA256 03e81381f7ce2a3392669a7465343ea2d7d4781903a1f61047bea15da8813c0d
MD5 de0d337851c7cf9ae80b6cf73010f7ab
BLAKE2b-256 68a98fbcf1a8215da5e54eea6aa9ab6d7db7a105ee877b63344280040da6160e

File details

Details for the file document_processing-1.0.0.202203292014-py3-none-any.whl.

File metadata

  • Download URL: document_processing-1.0.0.202203292014-py3-none-any.whl
  • Upload date:
  • Size: 16.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for document_processing-1.0.0.202203292014-py3-none-any.whl
Algorithm Hash digest
SHA256 d39be4b3c83e3cf107ce224e1bbf6ceac95111ba076a4fb5b1f3e58f8337950b
MD5 e8c0375b7d501b3e15313e6ac62a283a
BLAKE2b-256 8f5e95593952a480a55f2dd32f9736cb9c7304d7208bb60989011a6c974e9d3b
