Content extractor for files containing text
Project description
content-extractor-pi is a Python module that extracts a user-defined piece of content from a set of documents. The content can be a paragraph on a certain topic, headers, page numbers, and so on. content-extractor-pi needs some examples of the desired content, supplied by a domain expert, but because it focuses on few-shot learning, about 10 examples are usually enough, even for a corpus that may contain thousands of documents.
Installation
The easiest way to install content-extractor-pi is using pip:
pip install content-extractor-pi
Documentation
The main object of content-extractor-pi is ContentExtractor, and the only argument it expects is a pre-trained word embedding model. The example below uses the pre-trained Google News word2vec model.
ContentExtractor.train_model method
The train_model method extracts and scales features for the provided text examples contained in train_df, creates synthetic samples of the target class, and trains the model at the core of content_extractor.
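The description above mentions synthetic samples of the target class but does not name the technique. As a rough illustration only, and not necessarily what content-extractor-pi does internally, one common way to enlarge a small positive class is to interpolate between pairs of existing feature vectors (a simplified, SMOTE-like idea). The function name and the toy vectors below are hypothetical.

```python
import random

def interpolate_samples(minority, n_new, seed=0):
    """Create n_new synthetic vectors by interpolating between random pairs
    of minority-class feature vectors. Simplified on purpose: pairs are
    picked at random rather than among nearest neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical 2-D feature vectors for three positive examples.
positives = [[0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
new_points = interpolate_samples(positives, n_new=5)
```

Each synthetic point lies on a line segment between two real positives, so it stays inside the region the labeled examples already cover.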
Parameters
- train_df: pandas DataFrame containing the text examples in one column and the corresponding labels in the other one
- train_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in train_df
- y_name, default="label": column name of train_df where the labels are stored
- text_name, default="text": column name of train_df where the text examples are stored
- use_pca, default=False: whether to apply principal component analysis (PCA) to the scaled extracted features
- gamma, default=1: Kernel coefficient for sklearn.svm.SVC
- C, default=0.1: Regularization parameter for sklearn.svm.SVC
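To give a feel for the gamma parameter, here is a small sketch of the RBF kernel, which is the default kernel of sklearn.svm.SVC. It assumes the default kernel is the one in use; the parameter list above does not say which kernel is selected. The helper rbf_kernel is written here for illustration and is not part of content-extractor-pi.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
y = [1.0, 1.0]

# A larger gamma makes similarity fall off faster with distance,
# so each training example influences a smaller neighbourhood.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, y, gamma))
```

In scikit-learn, the strength of regularization is inversely proportional to C, so a small value such as the default 0.1 regularizes more strongly than a large one.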
ContentExtractor.extract_content method
The extract_content method extracts and scales features for the provided text examples contained in target_df and returns the ones labeled as 1 by the model.
Parameters
- target_df: pandas DataFrame containing all the text examples available for extraction
- target_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in target_df
- text_name, default="text": column name of target_df where the text examples are stored
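Both methods are described as scaling the extracted features. A minimal sketch of one plausible scheme follows: standardization with statistics fitted on the training features and reused on the target features. Whether content-extractor-pi scales this way is an assumption, and the helper names and toy numbers are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column (mean, std) computed from the training rows."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid division by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def apply_scaler(rows, stats):
    """Standardize each value with the column statistics from fit_scaler."""
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# Hypothetical feature rows (two features per text example).
train_features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
target_features = [[2.0, 40.0]]

stats = fit_scaler(train_features)
scaled_train = apply_scaler(train_features, stats)
scaled_target = apply_scaler(target_features, stats)
```

Reusing the training statistics keeps the target features on the same scale the model saw during training.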
Example
The following example is a full implementation that extracts the desired content from a set of more than 62,000 paragraphs drawn from the World Bank Loan Agreements Corpus, using just 11 examples manually labeled as 1. The extracted paragraphs are stored in the target_examples list.
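from content_extractor import contextractor as cte
from gensim.models import KeyedVectors
import pandas as pd

# Load the pre-trained Google News word2vec embeddings (adjust the path).
W2V_MODEL = KeyedVectors.load_word2vec_format(
    '/your/path/to/GoogleNews-vectors-negative300.bin.gz', binary=True)

# train_df holds the labeled examples; target_df holds all paragraphs to search.
train_df = pd.read_csv("data/train_df.csv")
target_df = pd.read_csv("data/target_df.csv")

# Train on the labeled examples, then extract the paragraphs labeled 1.
cont_ext = cte.ContentExtractor(W2V_MODEL)
cont_ext.train_model(train_df)
target_examples = cont_ext.extract_content(target_df)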
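The example reads its input from CSV files. The sketch below shows one way such a training file could be prepared with the standard library, using the default column names "text" and "label". The paragraphs are invented for illustration, and the use of 0 for non-target examples is an assumption (the documentation only states that target content is labeled 1).

```python
import csv
import os
import tempfile

# Hypothetical labeled paragraphs: 1 marks the desired content,
# 0 (an assumed convention) marks everything else.
examples = [
    ("The Borrower shall repay the principal amount in semi-annual installments.", 1),
    ("Table of Contents", 0),
    ("Page 3 of 27", 0),
]

def write_training_csv(path, rows, text_name="text", y_name="label"):
    """Write (text, label) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([text_name, y_name])
        writer.writerows(rows)

# Written to a temporary directory here; the example above expects data/train_df.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "train_df.csv")
write_training_csv(csv_path, examples)
```

If the file uses other column names, they can be passed to train_model through the text_name and y_name parameters.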
Project details
Download files
Source Distribution
Hashes for content_extractor-pi-0.0.7.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 69113132380c9bfb22e3dd1e8381938592c4f0a7035879538e95d360875da838
MD5 | 144ef18eb25a3643c56450e8ba30cbc0
BLAKE2b-256 | a990701121804f4473476c57b0166ac6cff9dc1322d626dd0cfad54381b38335