Content extractor for files containing text
Project description
content-extractor-pi is a Python module that extracts a user-defined piece of content from a set of documents. This piece of content can be a paragraph that deals with a certain topic, headers, page numbers, et cetera. content-extractor-pi does need some examples of the desired content, supplied by a domain expert, but our focus on few-shot learning means that about 10 examples are usually enough, even for a corpus that may contain thousands of documents.
Installation
The easiest way to install content-extractor-pi is using pip:
pip install content-extractor-pi
Documentation
The main object of content-extractor-pi is ContentExtractor, and the only argument it expects is a pre-trained word embedding model. In the following example I'm using the pre-trained Google News word2vec model available here.
from content_extractor import contextractor as cte
from gensim.models import KeyedVectors

# Load the pre-trained word2vec vectors (binary format) from a local path.
W2V_MODEL = KeyedVectors.load_word2vec_format(
    '/your/path/to/GoogleNews-vectors-negative300.bin.gz',
    binary=True)

# Pass the embedding model to the extractor.
cont_ext = cte.ContentExtractor(W2V_MODEL)
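To give a rough idea of why a word embedding model is needed, here is a minimal sketch of one common way to turn a text example into a fixed-length feature vector: averaging the vectors of its known words. This is an assumption for illustration only, not necessarily what ContentExtractor does internally. The function embed_text and the toy_vectors table are hypothetical; a plain dict stands in for the real model so the snippet runs without downloading anything.

```python
def embed_text(text, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        # No known token: fall back to an all-zero vector of the right size.
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 3-dimensional stand-in for a real embedding table.
toy_vectors = {
    "page": [1.0, 0.0, 0.0],
    "header": [0.0, 1.0, 0.0],
    "number": [0.0, 0.0, 1.0],
}

# "page" and "number" are known, "12" is skipped.
feature = embed_text("Page number 12", toy_vectors, dim=3)
```

A gensim KeyedVectors object supports the same `in` test and index lookup as the dict used here. Lowercasing is a simplification for the toy table; the Google News vectors are case-sensitive.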
ContentExtractor.train_model method
The train_model method extracts and scales features for the provided text examples contained in train_df, creates synthetic samples of the target class, and trains the model at the core of content_extractor.
Parameters
- train_df: pandas DataFrame containing the text examples in one column and the corresponding labels in the other one
- train_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in train_df
- y_name, default="label": column name of train_df where the labels are stored
- text_name, default="text": column name of train_df where the text examples are stored
- use_pca, default=False: apply principal component analysis to the scaled extracted features; more info can be found here.
- gamma, default=1: Kernel coefficient for sklearn.svm.SVC
- C, default=0.1: Regularization parameter for sklearn.svm.SVC
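The preprocessing steps that train_model is described as performing (scaling features, then creating synthetic samples of the target class) can be sketched in plain Python. This is an illustration under stated assumptions, not the package's code: the source does not say which scaling or oversampling technique is used, so the sketch uses z-score scaling and simple interpolation between pairs of target-class examples. The helpers scale_features and synthesize and the toy feature values are hypothetical.

```python
import random
from statistics import mean, pstdev


def scale_features(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against zero variance so constant columns map to 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def synthesize(target_rows, n_new, rng):
    """Create n_new points on segments between random pairs of target rows."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(target_rows, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy feature vectors (hypothetical values); label 1 marks the desired content.
features = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0], [9.0, 30.0], [10.0, 32.0]]
labels = [0, 0, 0, 1, 1]

scaled = scale_features(features)
target = [row for row, label in zip(scaled, labels) if label == 1]
extra = synthesize(target, n_new=3, rng=random.Random(0))

# Append the synthetic target-class rows to the scaled training data.
balanced_features = scaled + extra
balanced_labels = labels + [1] * len(extra)
```

In the real pipeline, the scaled and augmented features would then be passed to sklearn.svm.SVC with the gamma and C values listed above.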
ContentExtractor.extract_content method
The extract_content method
Download files
Source Distribution
Hashes for content_extractor-pi-0.0.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a6e133fa5e4fe4bbaca4e6d54929dc888f139b3e95a3b3dee8d105fce02af064
MD5 | fc14f604e4aa618d34ac281f1faad970
BLAKE2b-256 | 8bef149ed08061e12ea3c0a1fcefc92df0574f39cb5f84af7a3dbd92f7b3c039
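If you download the source distribution manually, the published SHA256 digest can be checked with Python's standard hashlib module. The following is a minimal sketch; the helper names sha256_of_file and matches_published_digest are hypothetical, and the local file path is whatever location you saved the archive to.

```python
import hashlib

# SHA256 digest published for content_extractor-pi-0.0.5.tar.gz.
EXPECTED_SHA256 = (
    "a6e133fa5e4fe4bbaca4e6d54929dc888f139b3e95a3b3dee8d105fce02af064"
)


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected
```

For example, matches_published_digest("content_extractor-pi-0.0.5.tar.gz") should return True for an intact download saved under that name in the current directory.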