NLPkit - Transformers for text classification

A library of scikit-learn compatible text transformers that are ready to be integrated into an NLP pipeline for various classification tasks.

Project structure

.
├── nlpkit
│   ├── __init__.py
│   └── nlp_feature_extraction
│       ├── __init__.py
│       ├── liwc
│       ├── word_embeddings_features.py
│       ├── syntax_features.py
│       ├── ner_features.py
│       ├── pos_features.py
│       ├── liwc_features.py
│       ├── text_statistics_features.py
│       └── tests
│           ├── __init__.py
│           ├── test_data
│           ├── test_liwc_feature_extraction.py
│           ├── test_ner_feature_extraction.py
│           ├── test_pos_feature_extraction.py
│           ├── test_syntax_feature_extraction.py
│           └── test_word_embeddings_feature_extraction.py
├── examples
├── README.md
├── LICENCE
├── requirements.txt
└── setup.py

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

  1. Python 3.6
  2. Stanford CoreNLP Server (for some transformers)
  3. Pre-trained word vectors for the w2v transformer

Stanford CoreNLP Server with Docker

Stanford CoreNLP is required for constituency parsing, POS and NER tagging.

The easiest way to have a CoreNLP server running is to use Docker. You can find a Dockerfile and instructions for running the server at Stanford CoreNLP Server - Docker.

Word vectors

If you don't want to train your own word embeddings, you can download pre-trained word vectors from the Stanford GloVe project. For example, the Wikipedia model has a vocabulary of 400K words, each represented by a 300-dimensional vector. The word vectors come in the GloVe format and need to be converted into the word2vec format. The two formats are almost identical, and you can use gensim to do the conversion:

python -m gensim.scripts.glove2word2vec -i glove.txt -o word2vec.txt
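
After the conversion, the vectors can be loaded with gensim and passed to the word-embedding transformers. A minimal sketch, assuming the word2vec.txt file produced by the command above:

from gensim.models import KeyedVectors

# Load the converted vectors (plain-text word2vec format)
word2vec = KeyedVectors.load_word2vec_format('word2vec.txt', binary=False)
print(word2vec['king'].shape)  # e.g. (300,) for a 300-dimensional model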

List of transformers

  • POSTagPreprocessor: Pre-processes text documents by tagging each word in the form word_TAG, e.g. what_WP. Can be used to generate POS-tagged n-grams
  • NERPreprocessor: Pre-processes text documents by replacing named entities with generic tags e.g. PERSON, LOCATION
  • WordEmbedsDocVectorizer: Converts text documents to word2vec-based document vector representations. It maps the words of a document to word2vec vectors and averages them across dimensions to produce a document vector (see the sketch after this list)
  • POSExtractor: Extracts Parts of Speech (POS) counts for a collection of text documents
  • CFGExtractor: Extracts the Context Free Grammar (CFG) production rules found in a collection of text documents
  • NamedEntitiesCounter: Extracts Named Entity counts per entity type (e.g. PERSON) for a collection of text documents
  • LIWCExtractor: Extracts proportions of words that fall in the various LIWC categories for a collection of text documents
  • TextStatsExtractor: Calculates various text statistics and readability scores for a collection of text documents
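
The averaging idea behind WordEmbedsDocVectorizer can be illustrated with a short sketch. This is illustrative only; the library's own transformer additionally handles tokenization, out-of-vocabulary words, and optional tf-idf weighting:

import numpy as np

def average_doc_vector(tokens, word_vectors, dim=300):
    # Keep only tokens present in the embedding vocabulary
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)  # a document with no known words maps to the zero vector
    return np.mean(vecs, axis=0)  # element-wise mean across the word vectors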

Usage

All the custom transformers extend scikit-learn's BaseEstimator and TransformerMixin and implement the fit and transform methods.
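
As a rough sketch of that pattern, here is a hypothetical transformer (for illustration only, not part of the library):

from sklearn.base import BaseEstimator, TransformerMixin

class WordCountExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical example: emits one feature per document, its word count."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return [[len(doc.split())] for doc in X]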

# POSExtractor, backed by a running CoreNLP server
from nltk.parse import CoreNLPParser
from nlpkit.nlp_feature_extraction import POSExtractor  # adjust to your installed import path

corpus = ["What a great movie!", "I did not enjoy it."]  # any list of text documents
sf_parser = CoreNLPParser(url='http://localhost:9000/', tagtype='pos')
pos_extractor = POSExtractor(sf_parser)
X = pos_extractor.fit_transform(corpus)

They can also be used in pipelines, e.g.:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from nlpkit.nlp_feature_extraction import TextPreprocessor, WordEmbedsDocVectorizer  # path may vary

# word2vec is a gensim KeyedVectors model, e.g. loaded as shown above
pipeline = Pipeline([
    ('pre', TextPreprocessor(stemming=False)),
    ('w2v', WordEmbedsDocVectorizer(word2vec, tfidf_weights=True)),
    ('clf', SVC(kernel='linear', C=1, probability=True))
])
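
Once assembled, the pipeline trains and predicts like any other scikit-learn estimator; train_docs, y_train and test_docs below stand in for your own data:

pipeline.fit(train_docs, y_train)
predictions = pipeline.predict(test_docs)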

For more, you can run the examples included in the examples folder.

Tests

The pytest framework is used for unit testing. All of the custom text transformers in this project come with an extensive set of unit tests. To run the tests, use:
pytest src

Project repository

https://github.com/evanll/nlpkit-ml

Author

Written by Evan Lalopoulos (evan.lalopoulos.2017@my.bristol.ac.uk) as part of his thesis on fake news detection using NLP.

Evan Lalopoulos - evanll
