ntap: Neural Text Analysis Pipeline
`ntap` is a Python package built on top of `tensorflow`, `sklearn`, `pandas`, `gensim`, `nltk`, and other libraries to facilitate the core functionalities of text analysis using modern methods from NLP.
Data Loading and Text Featurization
All `ntap` functionality uses the `Dataset` class, which is responsible for loading datasets from file, cleaning text, transforming text into features, and saving results to file.
ntap.data.Dataset

```python
Dataset(source, tokenizer="wordpunct", vocab_size=5000, embed="glove",
        min_token=5, stopwords=None, stem=False, lower=True, max_len=100,
        include_nums=False, include_symbols=False, num_topics=100,
        lda_max_iter=500)
```
Parameters

- `source`: str, path to a single data file. Supported formats: newline-delimited `.json`, `.csv`, `.tsv`, or a Pandas DataFrame saved as a `.pkl` file
- `tokenizer`: str, selects which tokenizer to use. If `None`, tokenizes on whitespace. Options are based on the `nltk` word tokenizers: `"wordpunct"`, ... (others not currently supported)
- `vocab_size`: int, keep the top `vocab_size` types, by frequency. Used in bag-of-words features as well as neural methods. If `None`, use the full vocabulary
- `embed`: str, selects which word embedding to use for initializing the embedding layer. Currently only `glove` is supported
- `min_token`: int, the minimum size, in number of tokens, for a document to be included after calling `clean`
- `stopwords`: iterable or str, set of words to exclude. Default is `None`, which excludes no words. Options include lists/sets, as well as strings indicating the use of a saved list: `nltk` is the only currently supported option, and indicates the default `nltk` English list
- `stem`: bool or str. If `False`, do not stem/lemmatize; otherwise follow the stemming procedure named by `stem`. Options are `snowball`
- `lower`: bool, if `True`, cast all alphabetic characters to lowercase
- `max_len`: int, the maximum length, in number of valid tokens, for a document to be included during modeling. `None` results in the maximum length being calculated from the existing document set
- `include_nums`: bool, if `True`, do not discard tokens that contain numeric characters. Examples include dates, figures, and other numeric datatypes
- `include_symbols`: bool, if `True`, do not discard tokens that contain non-alphanumeric symbols
- `num_topics`: int, sets the default number of topics to use if the `lda` method is called at a later point
- `lda_max_iter`: int, sets the default number of iterations of Gibbs sampling to run during LDA model fitting
Methods

The `Dataset` class has a number of methods for controlling its internal functionality, most of which are called by Method objects. The most important stand-alone methods are the following:
`Dataset.set_params(**kwargs)`

- Can be called at any time to reset a subset of the parameters in `Dataset`
- TODO: call specific refitting (i.e. `__learn_vocab`)
`Dataset.clean(column, remove=["hashtags", "mentions", "links"], mode="remove")`

- Removes any tokens (before calling the tokenizer) matching the descriptions in the `remove` list, then tokenizes the documents in `column`, defines the vocabulary, and prunes documents from the `Dataset` instance that do not match the length criteria. All of these are governed by the parameters stored in `Dataset`
- `column`: str, indicates the column name of the text field
- `remove`: list of str, each item indicating a type of token to remove. If `None` or an empty list, no tokens are removed
- `mode`: str; later iterations could potentially store hashtags or links. Currently the only option is `remove`
The `Dataset` object supports a number of feature methods (e.g. LDA, TFIDF), which can be called directly by the user or implicitly during Method construction (see the Method documentation). A direct-call sketch follows the feature method descriptions below.
`Dataset.lda(column, method="mallet", save_model=None, load_model=None)`

- Uses the `gensim` wrapper of the `Mallet` Java application. Currently this is the only supported implementation, though other implementations of LDA can be added. `save_model` and `load_model` are currently unsupported
- `column`: str, text column
- `method`: only "mallet" is supported
- `save_model`: str, path at which to save the trained topic model. Not yet implemented
- `load_model`: str, path from which to load a trained topic model. Not yet implemented
`Dataset.ddr(column, dictionary, **kwargs)`

- The only feature method that must be called in advance (currently; later versions will store the dictionary internally); a usage sketch follows this list
- `column`: column in `Dataset` containing text. Does not have to be tokenized
- `dictionary`: str, path to a dictionary file. Currently supported types are `.json` and `.csv`; `.dic` will be added in a later version
- Possible `kwargs` include `embed`, which can be used to set the embedding source (e.g. `embed="word2vec"`, but this feature has not yet been added)
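Since no `ddr` example appears below, here is a minimal sketch of a direct call; the file paths and the `text` column name are placeholders, and the dictionary is assumed to be a `.json` file mapping category names to word lists.

```python
from ntap.data import Dataset

data = Dataset("./my_data.csv")
data.clean("text")
# hypothetical dictionary path; DDR represents each document and each
# dictionary category in embedding space and scores their similarity
data.ddr("text", dictionary="./dictionaries/moral_foundations.json")
```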
`Dataset.tfidf(column)`

- Uses the `gensim` TFIDF implementation. If the vocabulary has been learned previously, it is used; if not, it is relearned and the DocTerm matrix is computed
- `column`: str, text column
- Later methods will include BERT and GloVe embedding averages
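As a sketch of the direct-call pattern for the other feature methods (assuming a text column named `text` and, for `lda`, a working Mallet installation):

```python
from ntap.data import Dataset

data = Dataset("./my_data.csv")
data.clean("text")
data.lda("text")    # fits a Mallet topic model using the stored num_topics
data.tfidf("text")  # builds the TFIDF document-term matrix over the learned vocab
```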
Examples

Below is a set of use cases for the `Dataset` object. Methods like `SVM` are covered elsewhere and are included here only for illustrative purposes.
```python
from ntap.data import Dataset
from ntap.models import RNN, SVM

gab_data = Dataset("./my_data/gab.tsv")
other_gab_data = Dataset("./my_data/gab.tsv", vocab_size=20000, stem="snowball", max_len=1000)
gab_data.clean()
other_gab_data.clean()  # using stored parameters

other_gab_data.set_params(include_nums=True)  # reset parameter
other_gab_data.clean()  # rerun using updated parameters

gab_data.set_params(num_topics=50, lda_max_iter=100)
base_gab = SVM("hate ~ lda(text)", data=gab_data)
base_gab2 = SVM("hate ~ lda(text)", data=other_gab_data)
```
Base Models

For supervised learning tasks, `ntap` currently provides two baseline methods: `SVM` and `LM`. `SVM` uses `sklearn`'s implementation of the Support Vector Machine classifier, while `LM` uses either `ElasticNet` (supporting regularized linear regression) or `LinearRegression` from `sklearn`. Both models support the same core modeling functions: `CV`, `train`, and `predict`, with `CV` optionally supporting grid search.

All methods are created using an R-like formula syntax. Base models like `SVM` and `LM` only support single-target models, while other models support multiple targets.
ntap.models.SVM

```python
SVM(formula, data, C=1.0, class_weight=None, dual=False, penalty='l2',
    loss='squared_hinge', tol=0.0001, max_iter=1000, random_state=None)

LM(formula, data, alpha=0.0, l1_ratio=0.5, max_iter=1000, tol=0.001,
   random_state=None)
```
Parameters
- formula: str, contains a single
~
symbol, separating the left-hand side (the target/dependent variable) from the right-hand side (a series of+
-delineated text tokens). The right hand side tokens can be either a column in Dataset object given to the constructor, or a feature call in the following form:<featurename>(<column>)
. data
: Dataset, an existing Dataset instancetol
: float, stopping criteria (difference in loss between epochs)max_iter
: int, max iterations during trainingrandom_state
: int
SVM:

- `C`: float, corresponds to the `sklearn` "C" parameter in the SVM classifier
- `dual`: bool, corresponds to the `sklearn` "dual" parameter in the SVM classifier
- `penalty`: str, regularization function to use; corresponds to the `sklearn` "penalty" parameter
- `loss`: str, loss function to use; corresponds to the `sklearn` "loss" parameter
LM:

- `alpha`: float, controls regularization. `alpha=0.0` corresponds to Least Squares regression; `alpha=1.0` is the default `ElasticNet` setting
- `l1_ratio`: float, trade-off between L1 and L2 regularization. If `l1_ratio=1.0` it is LASSO; if `l1_ratio=0.0` it is Ridge
Functions

A number of functions are common to both `LM` and `SVM`:

`set_params(**kwargs)`

`CV`

- Cross-validation that implicitly supports grid search. If a list of parameter values (instead of a single value) is given, `CV` runs grid search over all possible combinations of parameters
- `LM`: `CV(data, num_folds=10, metric="r2", random_state=None)`
- `SVM`: `CV(data, num_epochs, num_folds=10, stratified=True, metric="accuracy")`
- `num_epochs`: number of epochs/iterations to train. This should be revised
- `num_folds`: number of cross-validation folds
- `stratified`: if `True`, split data using stratified folds (an even split with respect to the target variable)
- `metric`: metric on which to compare CV results across parameter grids (if no grid search is specified, no comparison is done and `metric` is disregarded)
- Returns: an instance of class `CV_Results`
  - Contains all applicable classification (or regression) metrics, for each CV fold and the mean across folds
  - Contains the saved parameter set

`train`

- Not currently advised for user application. Use `CV` instead

`predict`

- Not currently advised for user application. Use `CV` instead
Examples

```python
from ntap.data import Dataset
from ntap.models import SVM

data = Dataset("./my_data.csv")
model = SVM("hate ~ tfidf(text)", data=data)

basic_cv_results = model.CV(num_folds=5)
basic_cv_results.summary()

model.set_params(C=[1., .8, .5, .2, .01])  # set a parameter grid
grid_searched = model.CV(num_folds=5)
grid_searched.summary()
grid_searched.params
```
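No `LM` example appears above, so here is a minimal sketch of the analogous regression workflow; it assumes a hypothetical continuous target column named `score` in the data file and mirrors the `SVM` example's calling conventions.

```python
from ntap.data import Dataset
from ntap.models import LM

data = Dataset("./my_data.csv")
lm_model = LM("score ~ tfidf(text)", data=data)  # "score" is a hypothetical column
lm_results = lm_model.CV(num_folds=5)  # LM's CV defaults to metric="r2"
lm_results.summary()
```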
Models

One basic model has been implemented in `ntap`: `RNN`. Later models will include `CNN` and other neural variants. All model classes (`CNN`, `RNN`, etc.) have the following methods: `CV`, `predict`, and `train`.

Model formulas that use text in a neural architecture should use the following syntax: `"<dependent_variable> ~ seq(<text_column>)"`
ntap.models.RNN

```python
RNN(formula, data, hidden_size=128, cell="biLSTM", rnn_dropout=0.5,
    embedding_dropout=None, optimizer='adam', learning_rate=0.001,
    rnn_pooling='last', embedding_source='glove', random_state=None)
```
Parameters

- `formula`: similar to the base methods, but supports multiple targets (multi-task learning); a sketch follows this list. The format is: `"hate + moral ~ seq(text)"`
- `data`: `Dataset` object
- `hidden_size`: int, number of hidden units in the 1-layer RNN-type model
- `cell`: str, type of RNN cell. Default is a bidirectional Long Short-Term Memory (LSTM) cell. Options include `biLSTM`, `LSTM`, `GRU`, and `biGRU` (bidirectional Gated Recurrent Unit)
- `rnn_dropout`: float, proportion of parameters in the network to randomly zero out during dropout, applied in a layer over the outputs of the RNN. If `None`, no dropout is applied (not advised)
- `embedding_dropout`: str, not implemented
- `optimizer`: str, optimizer to use during training. Options are `adam`, `sgd`, `momentum`, and `rmsprop`
- `learning_rate`: float, learning rate during training
- `rnn_pooling`: str or int. If int, the model has self-attention, and a feed-forward layer of size `rnn_pooling` is applied to the outputs of the RNN layer to produce the attention alphas. If str, possible options are `last` (default RNN behavior, where the last hidden vector is taken as the sentence representation and prior states are discarded), `mean` (average hidden states across the entire sequence), and `max` (select the max hidden vector)
- `embedding_source`: str, currently only `glove` (others not implemented)
- `random_state`: int
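As a quick sketch of the multi-task formula (assuming a dataset with both `hate` and `moral` columns; the path is a placeholder):

```python
from ntap.data import Dataset
from ntap.models import RNN

data = Dataset("./my_data.csv")
# one shared sequence encoder trained against both targets
multi_task = RNN("hate + moral ~ seq(text)", data=data)
```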
Functions

`CV(data, num_folds, num_epochs, comp='accuracy', model_dir=None)`

- Automatically performs grid search if multiple values are given for a particular parameter
- `data`: `Dataset` on which to perform CV
- `num_folds`: int
- `comp`: str, metric on which to compare different parameter grids (does not apply if there is no grid search)
- `model_dir`: if `None`, trained models are saved in a temp directory and discarded after the script exits. Otherwise, `CV` attempts to save each model in the path given by `model_dir`
- Returns: a `CV_Results` instance with the best model's stats (if grid search) and the best parameters (not supported)
`train(data, num_epochs=30, batch_size=256, indices=None, model_path=None)`

- Method called by `CV`; can also be called independently. Can train on all data (`indices=None`) or a specified subset. If `model_path` is `None`, the model is not saved; otherwise `train` attempts to save the model at `model_path`
- `indices`: either `None` (train on all data) or a list of int, where each value is an index in the range `(0, len(data) - 1)`
`predict(data, model_path, indices=None, batch_size=256, retrieve=list())`

- Predicts on new data. Requires a saved model to exist at `model_path`
- `indices`: either `None` (predict on all data) or a list of int, where each value is an index in the range `(0, len(data) - 1)`
- `retrieve`: list of strings indicating which model variables to retrieve during prediction. Options include `rnn_alpha` (if attention model) and `hidden_states` (any model)
- Returns: a dictionary of `{variable_name: value_list}`. Contents are the predicted values for each target variable and any model variables given in `retrieve`
Examples

```python
from ntap.data import Dataset
from ntap.models import RNN

data = Dataset("./my_data.csv")
base_lstm = RNN("hate ~ seq(text)", data=data)
attention_lstm = RNN("hate ~ seq(text)", data=data, rnn_pooling=100)  # attention
context_lstm = RNN("hate ~ seq(text) + speaker_party", data=data)  # categorical variable

# Grid search and print results from best parameters
base_lstm.set_params(hidden_size=[200, 50], learning_rate=[0.01, 0.05])  # enable grid search during CV
base_results = base_lstm.CV()
base_results.summary()

# Train model and save. Predict for 6 specific instances and get alphas
attention_lstm.train(data, model_path="./trained_model")
predictions = attention_lstm.predict(data, model_path="./trained_model",
                                     indices=[0, 1, 2, 3, 4, 5], retrieve=["rnn_alphas"])
for alphas in predictions["rnn_alphas"]:
    print(alphas)  # prints a list of floats, each the weight of a word in the ith document
```
Coming soon...

- `MIL(formula, data, ...)`: not implemented
- `HAN(formula, data, ...)`: not implemented
- `CNN()`: not implemented
ntap.data.Tagme

Not implemented.

```python
Tagme(token="system", p=0.15, tweet=False)
```

- `token` (str): personal `Tagme` token. Users can retrieve a token by creating an account. Default behavior (`"system"`) assumes the `Tagme` token has been set during installation of `ntap`

Members:

- `get_tags(list-like of strings)`
  - Stores `abstracts` and `categories` as member variables
- `reset()`
- `abstracts`: dictionary of `{entity_id: abstract text, ...}`
- `categories`: dictionary of `{entity_id: [category1, category2, ...]}`
```python
data = Dataset("path.csv")
data.tokenize(tokenizer='tweettokenize')
abstracts, categories = data.get_tagme(tagme_token=ntap.tagme_token, p=0.15, tweet=False)
# tagme results are saved as a data object at data.entities
data.background_features(method='pointwise-mi', ...)  # assumes data.tagme is set; creates features
# saves features at data.background
background_mod = RNN("purity ~ seq(words) + background", data=data)
background_mod.CV(kfolds=10)
```
ntap.data.TACIT

Not implemented. Wrapper around a TACIT instance.

```python
TACIT(path_to_tacit_directory, params to create a TACIT session)
```