Skip to main content

NTAP - CSSL

Project description

Neural Text Analysis Pipeline (NTAP)

(Project is currently in pre-release phase)

A python-based pipeline for applying neural methods to text analysis.

Overview

This pipeline is for the wider application of advanced methodologies for text analysis. It uses python packages sklearn, gensim, and tensorflow for the development of established and cutting-edge machine learning methods, respectively.

Installation

  1. NTAP requires python3.(4-6) to be installed download 3.6
  2. It is recommended to use a virtual environment to manage python libraries and dependencies (but not required) To install with pip: $ pip install virtualenv or $ sudo pip install virtualenv

Set up a virtualenv environment and install packages from requirements.txt (or requirements-gpu.txt):

$ virtualenv myenv -p path/to/python/interpreter
$ source myenv/bin/activate
$ pip install -r requirements.txt

External Data

NTAP makes use of a number of external resources, such as word vectors and Stanford's CoreNLP. Download them first, and set the appropriate environment variables (see below)

  1. Word2vec download Set environment variable (bash): export WORD2VEC_PATH=path/to/GoogleNews-vectors-negative300.bin.gz
  2. GloVe Vectors download export GLOVE_PATH=path/to/glovefile.txt
  3. CoreNLP download export CORENLP=path/to/stanford-corenlp-full-YYYY-MM-DD/
  4. Dictionaries Set up a directory to contain any directories you want to use in NTAP, such as Moral Foundations Dictionary (MFD) or LIWC categories. export DICTIONARIES=path/to/dictionaries/directory/
  5. Access Keys To use the entity-tagging and linking system (provided by Tagme) sign up for the service and set up your access key: export TAGME="<my_access_key"

Processing Pipeline

This component takes raw text as input and produces data ready for entry into either baseline or other machine learning methods.

  • cleaning Functionality includes text cleaning,

Baseline Pipeline

The baseline implements methods which separately generate features for supervised models.

Implemented feature methods:

  • TFIDF
  • LDA
  • Dictionary (Word Count)
  • Distributed Dictionary Representations
  • Bag of Means (averaged word embeddings)
    • Word2Vec (skipgram)
    • Glove (300-d trained on Wikipedia)
    • FastText (currently not supported)

Given the prediction task (classification or regression) two baseline methods are implemented:

  • SVM classification
  • ElasticNet Regression

Methods Pipeline

Evaluation and Analysis Pipeline

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ntap-1.0.3.tar.gz (17.2 kB view hashes)

Uploaded Source

Built Distribution

ntap-1.0.3-py3-none-any.whl (16.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page