Text preprocessing package for use in NLP tasks

TextCL

Introduction

The TextCL package aims to clean text data for later use in Natural Language Processing tasks. It can be used as an initial step in text analysis as well as in predictive, classification or text generation models.

The quality of the models strongly depends on the quality of the input data. Common problems in the data sets include:

  • If data come from an optical character recognition (OCR) platform, text in tables and columns is usually not processed correctly and adds noise to the models.
  • Parts of large texts may contain sentences in languages other than the model's target language, and these have to be filtered out.
  • Real-world texts often have duplicated sentences due to the use of templates. In text generation tasks, this can cause model overfitting and duplications in generated texts or summaries.
  • Data sets may contain text that is different from the main topic, such as a weather forecast in an accounting report.

Features

The TextCL package allows the user to perform the following text pre-processing tasks:

  • Split texts into sentences.
  • Language filtering, for removing sentences that are not in the target language.
  • Perplexity filtering, for removing linguistically unconnected sentences that can be produced by OCR modules. For example: Sustainability Report 2019 36 3%?!353? 1. 5°C 1} 33%.
  • Duplicate sentence filtering using Jaccard similarity, for removing duplicate sentences from the text.
  • Unsupervised outlier detection, for revealing texts that fall outside the main topic distribution of the data set. Four methods are included with the package for this purpose.
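
To illustrate the sentence-splitting and duplicate-filtering steps listed above, here is a minimal sketch that uses only the Python standard library. It is not TextCL's own implementation or API: the function names (split_sentences, token_set, jaccard, drop_near_duplicates), the regex-based splitter and the 0.8 threshold are all assumptions made for this example.

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def token_set(sentence: str) -> Set[str]:
    """Lowercased word tokens of a sentence, as a set."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(sentences: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a sentence only if it is below the threshold against every kept sentence."""
    kept: List[str] = []
    kept_tokens: List[Set[str]] = []
    for sentence in sentences:
        tokens = token_set(sentence)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return kept


# Hypothetical input: the second sentence repeats the first with one extra word.
text = (
    "The report covers fiscal year 2019. "
    "The report covers the fiscal year 2019. "
    "Revenue grew in all regions."
)
sentences = split_sentences(text)
unique = drop_near_duplicates(sentences, threshold=0.8)
print(unique)
```

In this sketch the first matching sentence is kept and later near-duplicates are dropped, so the result depends on sentence order. Lowering the threshold removes looser paraphrases at the cost of more false positives.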

Documentation

Requirements

  • Python >= 3.8
  • pytorch_pretrained_bert >= 0.6.2
  • langdetect >= 1.0.8
  • numpy >= 1.21.3
  • pandas >= 1.4.3
  • lxml >= 4.6.2
  • protobuf >= 3.14.0
  • nltk >= 3.4.5

How to install

From PyPI

pip install textcl

From source/GitHub

pip install git+https://github.com/alinapetukhova/textcl.git#egg=textcl
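
After either install command, a quick way to confirm that the package is visible to the current interpreter is to query its installed version. The sketch below relies only on the standard-library importlib.metadata module (available from Python 3.8); the helper name installed_version is an assumption made for this example, and the distribution name textcl is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("textcl")
if version is None:
    print("textcl is not installed in this environment")
else:
    print(f"textcl {version} is installed")
```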

License

MIT License

