content-extractor-pi·PyPI

Content extractor for files containing text

Project description

content-extractor-pi is a Python module which aims to extract a certain piece of content defined by the user in a set of documents. This piece of content can be a paragraph that deals with a certain topic, headers, page numbers et cetera. content-extractor-pi does need some examples of the desired content, supplied by a domain expert, but our focus on few shot learning means ~10 examples is usually enough out a corpus that may contain 1000s of documents.

Installation

The easiest way to install content-extractor-pi is using pip:

pip install content-extractor-pi

Documentation

The main object of content-extractor-pi is ContentExtractor and its only attribute that it expects is a pre-trained word embedding model. In the example I'm using the pre-trained google news word-2-vec model available here.

ContentExtractor.train_model method

The train_model method extracts and scales features for the provided text examples contained in train_df, creates synthetic samples of the target class, and trains the model at the core of content_extractor.

Parameters

train_df: pandas DataFrame containing the text examples in one column and the corresponding labels in the other one
train_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in train_df
y_name, default="label": column name of train_df where the labels are stored
text_name, default="text": column name of train_df where the text examples are stored
use_pca, default=False: apply Principal component analysis to the scaled extracted features, more info can be find here.
gamma, default=1: Kernel coefficient for sklearn.svm.SVC
C, default=0.1: Regularization parameter for sklearn.svm.SVC

ContentExtractor.extract_content method

The extract_content method extracts and scales features for the provided text examples contained in target_df and returns the ones labeled as 1 by the model.

Parameters

target_df: pandas DataFrame containing all the text examples that we have at disposal
target_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in target_df
text_name, default="text": column name of target_df where the text examples are stored

Example

In the following example we have a full implementation that leads to extracting the desired content from a set of more than 62000 paragraphs originated from the World Bank Loan Agreements Corpus, using just 11 examples manually labeled as 1. We aim to extract the objective paragraph that is a short segment describing how the the loan it's going to be spent, below you can find an example.

The desired paragraphs are stored in the target_examples list. You can find train_df.csv and target_df.csv here

from content_extractor import contextractor as cte
from gensim.models import KeyedVectors
import pandas as pd

W2V_MODEL = KeyedVectors.load_word2vec_format('/your/path/to/GoogleNews-vectors-negative300.bin.gz',
                                              binary=True)
train_df = pd.read_csv("data/train_df.csv")
target_df = pd.read_csv("data/target_df.csv")

cont_ext = cte.ContentExtractor(W2V_MODEL)
cont_ext.train_model(train_df)
target_examples = cont_ext.extract_content(target_df)

Project details

Release history Release notifications | RSS feed

This version

0.1.0

Aug 25, 2021

0.0.9

Aug 24, 2021

0.0.8

Aug 24, 2021

0.0.7

Aug 23, 2021

0.0.6

Aug 23, 2021

0.0.5

Aug 23, 2021

0.0.4

Aug 23, 2021

0.0.3

Aug 23, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

content_extractor-pi-0.1.0.tar.gz (5.1 kB view details)

Uploaded Aug 25, 2021 Source

File details

Details for the file content_extractor-pi-0.1.0.tar.gz.

File metadata

Download URL: content_extractor-pi-0.1.0.tar.gz
Upload date: Aug 25, 2021
Size: 5.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.2

File hashes

Hashes for content_extractor-pi-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`bb3ba6b056bce4abda3f4647b1ba094520e89151d681b4ceef2df8e96b397c0b`
MD5	`70b1e4a7baa0b82efb7701c18604ebbc`
BLAKE2b-256	`6addbe571d2efb313b1d884c1caf0076b0a11b3fb169c2f3801506405ee50916`

See more details on using hashes here.

content-extractor-pi 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

Installation

Documentation

ContentExtractor.train_model method

Parameters

ContentExtractor.extract_content method

Parameters

Example

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes