content-extractor-pi

Content extractor for files containing text

Project description

content-extractor-pi is a Python module that extracts a user-defined piece of content from a set of documents. The target content can be a paragraph that deals with a certain topic, headers, page numbers, et cetera. content-extractor-pi needs some examples of the desired content, supplied by a domain expert, but its focus on few-shot learning means that about 10 examples are usually enough, even for a corpus that contains thousands of documents.

Installation

The easiest way to install content-extractor-pi is using pip:

pip install content-extractor-pi

Documentation

The main object of content-extractor-pi is ContentExtractor. The only argument it expects is a pre-trained word embedding model. The example in the Example section uses the pre-trained Google News word2vec model.
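To make the role of the word embedding model more concrete, the sketch below shows one common way to turn a paragraph into a fixed-length feature vector: averaging the vectors of its words. This is an assumption about the general approach, not a description of what ContentExtractor does internally. The toy embedding table and the average_embedding helper are hypothetical.

```python
from typing import Dict, List

# Toy 3-dimensional embedding table (hypothetical values, for illustration only).
TOY_VECTORS: Dict[str, List[float]] = {
    "loan": [1.0, 0.0, 2.0],
    "agreement": [3.0, 2.0, 0.0],
    "bank": [2.0, 4.0, 4.0],
}


def average_embedding(text: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the known words in `text`; zeros if none are known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# "signed" is not in the table, so only "loan" and "agreement" contribute.
paragraph_vector = average_embedding("Loan agreement signed", TOY_VECTORS, dim=3)
```

A gensim KeyedVectors object supports the same membership test (word in model) and lookup (model[word]), so the same idea carries over to a real pre-trained model.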

ContentExtractor.train_model method

The train_model method extracts and scales features for the provided text examples contained in train_df, creates synthetic samples of the target class, and trains the model at the core of content_extractor.
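The scaling and synthetic-sample steps can be illustrated with a minimal, standard-library sketch. It assumes standardization to zero mean and unit variance, and creates synthetic samples by interpolating between random pairs of target-class rows. That is a simplified version of the idea behind SMOTE, which picks among nearest neighbours instead. The text above does not say which scaler or oversampling technique train_model uses, so the standardize and synthesize helpers are hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List

Vector = List[float]


def standardize(rows: List[Vector]) -> List[Vector]:
    """Scale every column to zero mean and unit variance (constant columns become 0)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def synthesize(minority: List[Vector], n_new: int, seed: int = 0) -> List[Vector]:
    """Create n_new points, each on the segment between two random minority rows."""
    rng = random.Random(seed)
    new_rows = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        new_rows.append([x + t * (y - x) for x, y in zip(a, b)])
    return new_rows


# Toy feature rows for the target class (made-up numbers).
positives = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
augmented = positives + synthesize(positives, n_new=5)
```

Because each synthetic row lies between two real rows, the augmented set stays inside the range of the original target-class examples while giving the classifier more positive samples to learn from.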

Parameters

train_df: pandas DataFrame containing the text examples in one column and the corresponding labels in another
train_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in train_df
y_name, default="label": column name of train_df where the labels are stored
text_name, default="text": column name of train_df where the text examples are stored
use_pca, default=False: whether to apply principal component analysis (PCA) to the scaled extracted features
gamma, default=1: Kernel coefficient for sklearn.svm.SVC
C, default=0.1: Regularization parameter for sklearn.svm.SVC
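As an illustration of the layout train_df is expected to have with the default text_name and y_name, the sketch below writes a few labeled paragraphs as CSV with a "text" and a "label" column and reads them back. The example paragraphs are made up, and using 0 for the non-target class is an assumption (the text only states that the desired content is labeled as 1).

```python
import csv
import io

# Hypothetical labeled paragraphs: 1 marks the desired content, 0 everything else.
examples = [
    ("The Borrower shall repay the principal amount in semi-annual installments.", 1),
    ("Page 12", 0),
    ("IN WITNESS WHEREOF the parties have signed this Agreement.", 0),
]

# Write to an in-memory buffer so the sketch is self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label"])  # default text_name and y_name
writer.writerows(examples)
csv_text = buffer.getvalue()

# Reading it back shows the two columns train_model looks for by default.
rows = list(csv.DictReader(io.StringIO(csv_text)))
positive_texts = [row["text"] for row in rows if row["label"] == "1"]
```

Writing the same rows to a file instead of the in-memory buffer gives a CSV that pandas.read_csv can load into a DataFrame with these two columns.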

ContentExtractor.extract_content method

The extract_content method extracts and scales features for the provided text examples contained in target_df and returns the ones labeled as 1 by the model.

Parameters

target_df: pandas DataFrame containing all the available text examples
target_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in target_df
text_name, default="text": column name of target_df where the text examples are stored

Example

The following example is a full implementation that extracts the desired content from a set of more than 62,000 paragraphs drawn from the World Bank Loan Agreements Corpus, using just 11 examples manually labeled as 1. The extracted paragraphs are stored in the target_examples list.

from content_extractor import contextractor as cte
from gensim.models import KeyedVectors
import pandas as pd

# Load the pre-trained Google News word2vec vectors (binary format).
W2V_MODEL = KeyedVectors.load_word2vec_format('/your/path/to/GoogleNews-vectors-negative300.bin.gz',
                                              binary=True)

# Labeled training examples and the full set of paragraphs to search.
train_df = pd.read_csv("data/train_df.csv")
target_df = pd.read_csv("data/target_df.csv")

# Train on the labeled examples, then extract the paragraphs labeled as 1.
cont_ext = cte.ContentExtractor(W2V_MODEL)
cont_ext.train_model(train_df)
target_examples = cont_ext.extract_content(target_df)

Project details

Release history

0.1.0: Aug 25, 2021
0.0.9: Aug 24, 2021
0.0.8: Aug 24, 2021
0.0.7: Aug 23, 2021 (this version)
0.0.6: Aug 23, 2021
0.0.5: Aug 23, 2021
0.0.4: Aug 23, 2021
0.0.3: Aug 23, 2021

Download files

Source Distribution

content_extractor-pi-0.0.7.tar.gz (4.6 kB), uploaded Aug 23, 2021

Hashes for content_extractor-pi-0.0.7.tar.gz

Algorithm	Hash digest
SHA256	`69113132380c9bfb22e3dd1e8381938592c4f0a7035879538e95d360875da838`
MD5	`144ef18eb25a3643c56450e8ba30cbc0`
BLAKE2b-256	`a990701121804f4473476c57b0166ac6cff9dc1322d626dd0cfad54381b38335`