
ntap: Neural Text Analysis Pipeline

ntap is a Python package built on top of tensorflow, sklearn, pandas, gensim, nltk, and other libraries. It provides the core functionality of text analysis using modern NLP methods.

Data loading and Text featurization

All ntap functionalities use the Dataset object class, which is responsible for loading datasets from file, cleaning text, transforming text into features, and saving results to file.

Dataset(source, tokenizer="wordpunct", vocab_size=5000, embed="glove",
		min_token=5, stopwords=None, stem=False, lower=True, max_len=100,
		include_nums=False, include_symbols=False, num_topics=100, 


  • source: str, path to single data file. Supported formats: newline-delimited .json, .csv, .tsv, saved Pandas DataFrame as .pkl file
  • tokenizer: str, selects which tokenizer to use. If None, tokenizes on whitespace. Options are based on nltk word tokenizers: "wordpunct", ... (others not currently supported)
  • vocab_size: int, keep the top vocab_size types, by frequency. Used in bag-of-words features, as well as neural methods. If None, use all of vocabulary.
  • embed: str, select which word embedding to use for initialization of embedding layer. Currently only glove is supported
  • min_token: int, indicates the minimum size, by number of tokens, for a document to be included after calling clean.
  • stopwords: iterable or str, set of words to exclude. Default is None, which excludes no words. Options include lists/sets, as well as strings indicating the use of a saved list: nltk is the only currently supported option, and indicates the default nltk English list
  • stem: bool or str, if False then do not stem/lemmatize, otherwise follow the stemming procedure named by stem. The only current option is "snowball"
  • lower: bool, if True then cast all alpha characters to lowercase
  • max_len: int, maximum length, by number of valid tokens, for a document to be included during modeling. None will result in the maximum length being calculated by the existing document set
  • include_nums: bool, if True, then do not discard tokens which contain numeric characters. Examples of this include dates, figures, and other numeric datatypes.
  • include_symbols: bool, if True, then do not discard tokens which contain non-alphanumeric symbols
  • num_topics: int, sets default number of topics to use if lda method is called at a later point.
  • lda_max_iter: int, sets default number of iterations of Gibbs sampling to run during LDA model fitting
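
The interaction between the tokenizing and filtering parameters above can be sketched in plain Python. This is a minimal illustration using only the standard library, not ntap's implementation: the helper names tokenize, build_vocab, and filter_by_length are hypothetical, and the regular expression is the word/punctuation pattern that nltk's wordpunct tokenizer is based on.

```python
import re
from collections import Counter

def tokenize(text, lower=True, include_nums=False, include_symbols=False):
    """Split text into word and punctuation tokens, then drop unwanted ones."""
    if lower:
        text = text.lower()
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    kept = []
    for tok in tokens:
        # include_nums=False discards tokens containing numeric characters
        if not include_nums and any(ch.isdigit() for ch in tok):
            continue
        # include_symbols=False discards tokens containing non-alphanumeric symbols
        if not include_symbols and not tok.isalnum():
            continue
        kept.append(tok)
    return kept

def build_vocab(docs, vocab_size=None):
    """Keep the vocab_size most frequent types; None keeps the whole vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [tok for tok, _ in counts.most_common(vocab_size)]

def filter_by_length(docs, min_token=5, max_len=100):
    """Keep documents with at least min_token and at most max_len tokens."""
    return [
        doc for doc in docs
        if len(doc) >= min_token and (max_len is None or len(doc) <= max_len)
    ]
```

In this sketch a document is a list of tokens, so the length criteria count tokens that survived filtering.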


The Dataset class has a number of methods for control over the internal functionality of the class, which are called by Method objects. The most important stand-alone methods are the following:

  • Dataset.set_params(**kwargs):
    • Can be called at any time to reset a subset of the parameters in Dataset
    • TODO: call specific refitting (i.e. __learn_vocab)
  • Dataset.clean(column, remove=["hashtags", "mentions", "links"], mode="remove"):
    • Removes any tokens (before calling tokenizer) matching the descriptions in the remove list. Then tokenizes documents in column, defines the vocabulary, and prunes documents from the Dataset instance that do not match the length criteria. All of these steps are controlled by the parameters stored in Dataset
    • column: str, name of the column containing the text
    • remove: list of str, each item indicates a type of token to remove. If None or list is empty, no tokens are removed
    • mode: str, for later iterations, could potentially store hashtag or links. Currently only option is remove
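
As a rough illustration of the removal step, the sketch below strips hashtags, mentions, and links with regular expressions before any tokenization. The patterns and the strip_tokens helper are assumptions made for this example; ntap's own matching rules may differ.

```python
import re

# Hypothetical patterns for the three token types named in the remove list
PATTERNS = {
    "links": re.compile(r"https?://\S+"),
    "mentions": re.compile(r"@\w+"),
    "hashtags": re.compile(r"#\w+"),
}

def strip_tokens(text, remove=("hashtags", "mentions", "links")):
    """Delete every match of the requested token types, then tidy whitespace."""
    requested = set(remove or ())
    # Links are handled first so that '#' or '@' inside a URL is not split off
    for kind in ("links", "mentions", "hashtags"):
        if kind in requested:
            text = PATTERNS[kind].sub(" ", text)
    return " ".join(text.split())

print(strip_tokens("Check this #news from @bob at https://example.com now"))
```

Passing remove=None or an empty list leaves the text unchanged apart from whitespace normalization, mirroring the behavior described for the remove argument.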

The Dataset object supports a number of feature methods (e.g. LDA, TFIDF), which can be called directly by the user, or implicitly during a Method construction (see Method documentation)

  • Dataset.lda(column, method="mallet", save_model=None, load_model=None):
    • Uses gensim wrapper of Mallet java application. Currently only this is supported, though other implementations of LDA can be added. save_model and load_model are currently unsupported
    • column: str, text column
    • method: only "mallet" is supported
    • save_model: str, indicate path to save trained topic model. Not yet implemented
    • load_model: str, indicate path to load trained topic model. Not yet implemented
  • Dataset.ddr(column, dictionary, **kwargs):
    • Currently the only feature method that must be called in advance (later versions will store the dictionary internally)
    • column: column in Dataset containing text. Does not have to be tokenized.
    • dictionary: str, path to dictionary file. Current supported types are .json and .csv. .dic to be added in a later version
    • possible kwargs include embed, which can be used to set the embedding source (e.g. embed="word2vec", but this feature has not yet been added)
  • Dataset.tfidf(column):
    • Uses the gensim TFIDF implementation. If the vocabulary has already been learned, it is reused. If not, the vocabulary is relearned and the document-term matrix is computed
    • column: str, text column
  • Later methods will include BERT, GLOVE embedding averages
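
To make the tfidf feature more concrete, here is a small standard-library sketch that builds a document-term matrix and weights it by term frequency times inverse document frequency. It is only meant to show the idea: the tfidf function below is hypothetical, and gensim's own weighting and normalization may differ from this plain tf * log(N / df) form.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, matrix), one row per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each type occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts for this document
        matrix.append([tf[tok] * math.log(n_docs / df[tok]) for tok in vocab])
    return vocab, matrix

vocab, matrix = tfidf([["hate", "speech"], ["free", "speech"]])
print(vocab)
print(matrix)
```

A type that occurs in every document (here "speech") gets weight zero, which is why TFIDF features emphasize terms that distinguish documents from one another.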


Below is a set of use cases for the Dataset object. Methods like SVM are covered elsewhere, and are included here only for illustrative purposes.

from ntap.data import Dataset
from ntap.models import RNN, SVM

gab_data = Dataset("./my_data/gab.tsv")
other_gab_data = Dataset("./my_data/gab.tsv", vocab_size=20000, stem="snowball", max_len=1000)
other_gab_data.clean("text") # using stored parameters
other_gab_data.set_params(include_nums=True) # reset parameter
other_gab_data.clean("text") # rerun using updated parameters

gab_data.set_params(num_topics=50, lda_max_iter=100)
base_gab = SVM("hate ~ lda(text)", data=gab_data)
base_gab2 = SVM("hate ~ lda(text)", data=other_gab_data)

Base Models

For supervised learning tasks, ntap currently provides two baseline methods, SVM and LM. SVM uses sklearn's Support Vector Machine classifier, while LM uses either ElasticNet (regularized linear regression) or LinearRegression from sklearn. Both models support the same core modeling functions: CV, train, and predict, with CV optionally supporting grid search.

All methods are created using an R-like formula syntax. Base models like SVM and LM only support single target models, while other models support multiple targets.


SVM(formula, data, C=1.0, class_weight=None, dual=False, penalty='l2', loss='squared_hinge', tol=0.0001, max_iter=1000, random_state=None)

LM(formula, data, alpha=0.0, l1_ratio=0.5, max_iter=1000, tol=0.001, random_state=None)


  • formula: str, contains a single ~ symbol, separating the left-hand side (the target/dependent variable) from the right-hand side (a series of +-delimited terms). Each right-hand side term can be either a column in the Dataset object given to the constructor, or a feature call of the form <featurename>(<column>).
  • data: Dataset, an existing Dataset instance
  • tol: float, stopping criteria (difference in loss between epochs)
  • max_iter: int, max iterations during training
  • random_state: int
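
The formula syntax can be illustrated with a small parser. This is a sketch of how such a string could be split into targets and right-hand side terms, not ntap's actual parsing code; parse_formula and FEATURE_CALL are hypothetical names.

```python
import re

# Matches a feature call such as "lda(text)" and captures (feature, column)
FEATURE_CALL = re.compile(r"^(\w+)\((\w+)\)$")

def parse_formula(formula):
    """Return (targets, terms); each term is (feature_name or None, column)."""
    if formula.count("~") != 1:
        raise ValueError("formula must contain exactly one '~'")
    lhs, rhs = formula.split("~")
    targets = [t.strip() for t in lhs.split("+") if t.strip()]
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if not term:
            continue
        match = FEATURE_CALL.match(term)
        if match:
            terms.append((match.group(1), match.group(2)))  # feature call
        else:
            terms.append((None, term))  # plain column reference
    return targets, terms

print(parse_formula("hate ~ lda(text)"))
print(parse_formula("hate + moral ~ seq(text) + speaker_party"))
```

A base model such as SVM or LM would accept only formulas whose target list has a single entry, while a multi-target model would accept longer lists.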


  • C: float, corresponds to the sklearn "C" parameter in SVM Classifier
  • dual: bool, corresponds to the sklearn "dual" parameter in SVM Classifier
  • penalty: string, regularization function to use, corresponds to the sklearn "penalty" parameter
  • loss: string, loss function to use, corresponds to the sklearn "loss" parameter


  • alpha: float, controls regularization. alpha=0.0 corresponds to Least Squares regression. alpha=1.0 is the default ElasticNet setting
  • l1_ratio: float, trade-off between L1 and L2 regularization. If l1_ratio=1.0 then it is LASSO, if l1_ratio=0.0 it is Ridge
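
The roles of alpha and l1_ratio can be seen in the regularization term that sklearn documents for ElasticNet, which is added to the squared-error loss. The function below is a hand-written sketch of that penalty for illustration; elastic_net_penalty is a hypothetical helper, not part of ntap or sklearn.

```python
def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Penalty term: alpha * l1_ratio * |w|_1 + 0.5 * alpha * (1 - l1_ratio) * |w|_2^2."""
    l1 = sum(abs(w) for w in weights)   # L1 norm of the coefficients
    l2 = sum(w * w for w in weights)    # squared L2 norm of the coefficients
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

weights = [1.0, -2.0]
print(elastic_net_penalty(weights, alpha=0.0))                # no regularization
print(elastic_net_penalty(weights, alpha=1.0, l1_ratio=1.0))  # pure L1 (LASSO)
print(elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.0))  # pure L2 (Ridge)
```

With alpha=0.0 the penalty vanishes, which is why that setting corresponds to ordinary least squares.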


A number of functions are common to both LM and SVM

  • set_params(**kwargs)
  • CV:
    • Cross validation that implicitly supports grid search. If a list of parameter values (instead of a single value) is given, CV runs grid search over all possible combinations of parameters
    • LM: CV(data, num_folds=10, metric="r2", random_state=None)
    • SVM: CV(data, num_epochs, num_folds=10, stratified=True, metric="accuracy")
      • num_epochs: number of epochs/iterations to train. This should be revised
      • num_folds: number of cross folds
      • stratified: if true, split data using stratified folds (even split with reference to target variable)
      • metric: metric on which to compare different CV results from different parameter grids (if no grid search is specified, no comparison is done and metric is disregarded)
    • Returns: An instance of Class CV_Results
      • Contains information of all possible classification (or regression) metrics, for each CV fold and the mean across folds
      • Contains saved parameter set
  • train
    • Not currently advised for user application. Use CV instead
  • predict
    • Not currently advised for user application. Use CV instead
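
The grid search performed by CV, where list-valued parameters are expanded into every combination, can be sketched with itertools.product. The expand_grid helper is hypothetical and only illustrates the expansion step; how ntap stores and iterates over parameters internally is not shown here.

```python
from itertools import product

def expand_grid(params):
    """Expand a dict of parameters into one dict per combination of values."""
    names = list(params)
    # Treat a single value as a one-element list so it is held fixed
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

for setting in expand_grid({"C": [1.0, 0.5], "penalty": "l2"}):
    print(setting)
```

Each resulting dict would then be evaluated with the same folds, and the chosen metric used to compare the settings.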


from ntap.data import Dataset
from ntap.models import SVM

data = Dataset("./my_data.csv")
model = SVM("hate ~ tfidf(text)", data=data)
basic_cv_results = model.CV(num_folds=5)
model.set_params(C=[1., .8, .5, .2, .01]) # setting param
grid_searched = model.CV(num_folds=5)


One basic model has been implemented for ntap: RNN. Later models will include CNN and other neural variants. All model classes (CNN, RNN, etc.) have the following methods: CV, predict, and train.

Model formulas using text in a neural architecture should use the following syntax: "<dependent_variable> ~ seq(<text_column>)"


RNN(formula, data, hidden_size=128, cell="biLSTM", rnn_dropout=0.5, embedding_dropout=None,
	optimizer='adam', learning_rate=0.001, rnn_pooling='last', embedding_source='glove', 


  • formula
    • similar to base methods, but supports multiple targets (multi-task learning). The format for this would be: "hate + moral ~ seq(text)"
  • data: Dataset object
  • hidden_size: int, number of hidden units in the 1-layer RNN-type model
  • cell: str, type of RNN cell. Default is a bidirectional Long Short-Term Memory (LSTM) cell. Options include biLSTM, LSTM, GRU, and biGRU (bidirectional Gated Recurrent Unit)
  • rnn_dropout: float, proportion of parameters in the network to randomly zero-out during dropout, in a layer applied to the outputs of the RNN. If None, no dropout is applied (not advised)
  • embedding_dropout: str, not implemented
  • optimizer: str, optimizer to use during training. Options are: adam, sgd, momentum, and rmsprop
  • learning_rate: learning rate during training
  • rnn_pooling: str or int. If int, the model uses self-attention: a feed-forward layer of size rnn_pooling is applied to the outputs of the RNN layer to produce the attention alphas. If str, the options are last (default RNN behavior, where the last hidden vector is taken as the sentence representation and prior states are discarded), mean (average of the hidden states across the entire sequence), and max (select the max hidden vector)
  • embedding_source: str, currently only glove is implemented
  • random_state: int
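
The rnn_pooling options can be illustrated on a toy sequence of hidden states. The sketch below is plain Python rather than tensorflow and makes two assumptions: max is taken element-wise over time steps, and the attention scores are passed in directly instead of being produced by a feed-forward layer. The pool function is hypothetical.

```python
import math

def pool(hidden_states, mode="last", scores=None):
    """hidden_states: list of equal-length vectors, one per time step."""
    if mode == "last":
        return list(hidden_states[-1])
    if mode == "mean":
        n = len(hidden_states)
        return [sum(col) / n for col in zip(*hidden_states)]
    if mode == "max":
        # Assumption: element-wise maximum across time steps
        return [max(col) for col in zip(*hidden_states)]
    if mode == "attention":
        # Softmax over one unnormalized score per time step gives the alphas
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        return [sum(a * h for a, h in zip(alphas, col)) for col in zip(*hidden_states)]
    raise ValueError("unknown pooling mode: " + str(mode))

states = [[1.0, 4.0], [3.0, 2.0]]
for mode in ("last", "mean", "max"):
    print(mode, pool(states, mode))
print("attention", pool(states, "attention", scores=[0.0, 0.0]))
```

In the attention case the alphas sum to one, so the pooled vector is a weighted average of the hidden states.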


  • CV(data, num_folds, num_epochs, comp='accuracy', model_dir=None)
    • Automatically performs grid search if multiple values are given for a particular parameter
    • data: Dataset on which to perform CV
    • num_folds: int
    • comp: str, metric on which to compare different parameter grids (does not apply if no grid search)
    • model_dir: if None, trained models are saved in a temp directory and then discarded after script exits. Otherwise, CV attempts to save each model in the path given by model_dir.
    • Returns: CV_results instance with best model stats (if grid search), and best parameters (not supported)
  • train(data, num_epochs=30, batch_size=256, indices=None, model_path=None)
    • method called by CV, can be called independently. Can train on all data (indices=None) or a specified subset. If model_path is None, does not save model, otherwise attempt to save model at model_path
    • indices: either None (train on all data) or list of int, where each value is an index in the range (0, len(data) - 1)
  • predict(data, model_path, indices=None, batch_size=256, retrieve=list())
    • Predicts on new data. Requires a saved model to exist at model_path.
    • indices: either None (predict on all data) or list of int, where each value is an index in the range (0, len(data) - 1)
    • retrieve: contains list of strings which indicate which model variables to retrieve during prediction. Includes: rnn_alpha (if attention model) and hidden_states (any model)
    • Returns: dictionary with {variable_name: value_list}. Contents are predicted values for each target variable and any model variables that are given in retrieve.

from ntap.data import Dataset
from ntap.models import RNN

data = Dataset("./my_data.csv")
base_lstm = RNN("hate ~ seq(text)", data=data)
attention_lstm = RNN("hate ~ seq(text)", data=data, rnn_pooling=100) # attention
context_lstm = RNN("hate ~ seq(text) + speaker_party", data=data) # categorical variable
base_lstm.set_params(hidden_size=[200, 50], learning_rate=[0.01, 0.05]) # enable grid search during CV

# Grid search over the parameter lists set above (num_folds and num_epochs values are illustrative)
base_results = base_lstm.CV(data, num_folds=5, num_epochs=30)

# Train model and save. Predict for 6 specific instances and get alphas
attention_lstm.train(data, model_path="./trained_model")
predictions = attention_lstm.predict(data, model_path="./trained_model",
							indices=[0,1,2,3,4,5], retrieve=["rnn_alphas"])
for alphas in predictions["rnn_alphas"]:
	print(alphas)  # prints list of floats, each the weight of a word in the ith document

Coming soon...

MIL(formula, data, ...)

  • not implemented

HAN(formula, data, ...)

  • not implemented


  • not implemented

Tagme(token="system", p=0.15, tweet=False)

  • not implemented
  • token: str, personal Tagme token. Users can retrieve a token by creating an account. The default ("system") assumes the Tagme token was set during installation of NTAP.

Members:

  • get_tags(list-like of strings)
    • Stores abstracts and categories as member variables
  • reset()
  • abstracts: dictionary of {entity_id: abstract text ...}
  • categories: dictionary of {entity_id: [category1, category2, ...]}

data = Dataset("path.csv")
abstracts, categories = data.get_tagme(tagme_token=ntap.tagme_token, p=0.15, tweet=False)
# tagme saved as data object at data.entities
data.background_features(method='pointwise-mi', ...)  # assumes data.tagme is set; creates features
# saves features at data.background

background_mod = RNN("purity ~ seq(words) + background", data=data)

TACIT(path_to_tacit_directory, params to create tacit session)

  • not implemented. Wrapper around a TACIT instance

