Skip to main content

Content extractor for files containing text

Project description

content-extractor-pi is a Python module which aims to extract a certain piece of content defined by the user in a set of documents. This piece of content can be a paragraph that deals with a certain topic, headers, page numbers et cetera. content-extractor-pi does need some examples of the desired content, supplied by a domain expert, but our focus on few shot learning means ~10 examples is usually enough out a corpus that may contain 1000s of documents.

Installation

The easiest way to install content-extractor-pi is using pip:

pip install content-extractor-pi

Documentation

The main object of content-extractor-pi is ContentExtractor and its only attribute that it expects is a pre-trained word embedding model. In the example I'm using the pre-trained google news word-2-vec model available here.

ContentExtractor.train_model method

The train_model method extracts and scales features for the provided text examples contained in train_df, creates synthetic samples of the target class, and trains the model at the core of content_extractor.

Parameters

  • train_df: pandas DataFrame containing the text examples in one column and the corresponding labels in the other one
  • train_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in train_df
  • y_name, default="label": column name of train_df where the labels are stored
  • text_name, default="text": column name of train_df where the text examples are stored
  • use_pca, default=False: apply Principal component analysis to the scaled extracted features, more info can be find here.
  • gamma, default=1: Kernel coefficient for sklearn.svm.SVC
  • C, default=0.1: Regularization parameter for sklearn.svm.SVC

ContentExtractor.extract_content method

The extract_content method extracts and scales features for the provided text examples contained in target_df and returns the ones labeled as 1 by the model.

Parameters

  • target_df: pandas DataFrame containing all the text examples that we have at disposal
  • target_additional_features, default=None: pandas DataFrame containing additional features describing the text examples contained in target_df
  • text_name, default="text": column name of target_df where the text examples are stored

Example

In the following example we have a full implementation that leads to extracting the desired content from a set of more than 62000 paragraphs originated from the World Bank Loan Agreements Corpus, using just 11 examples manually labeled as 1. We aim to extract the objective paragraph that is a short segment describing how the the loan it's going to be spent, below you can find an example.

The desired paragraphs are stored in the target_examples list. You can find train_df.csv and target_df.csv here

from content_extractor import contextractor as cte
from gensim.models import KeyedVectors
import pandas as pd

W2V_MODEL = KeyedVectors.load_word2vec_format('/your/path/to/GoogleNews-vectors-negative300.bin.gz',
                                              binary=True)
train_df = pd.read_csv("data/train_df.csv")
target_df = pd.read_csv("data/target_df.csv")

cont_ext = cte.ContentExtractor(W2V_MODEL)
cont_ext.train_model(train_df)
target_examples = cont_ext.extract_content(target_df)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

content_extractor-pi-0.1.0.tar.gz (5.1 kB view details)

Uploaded Source

File details

Details for the file content_extractor-pi-0.1.0.tar.gz.

File metadata

  • Download URL: content_extractor-pi-0.1.0.tar.gz
  • Upload date:
  • Size: 5.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.2

File hashes

Hashes for content_extractor-pi-0.1.0.tar.gz
Algorithm Hash digest
SHA256 bb3ba6b056bce4abda3f4647b1ba094520e89151d681b4ceef2df8e96b397c0b
MD5 70b1e4a7baa0b82efb7701c18604ebbc
BLAKE2b-256 6addbe571d2efb313b1d884c1caf0076b0a11b3fb169c2f3801506405ee50916

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page