DiscoverPagination

A Python package for discovering where numbered pages begin and end in documents.

Repository

https://github.com/wharton/DiscoverPagination

Background

In the Research and Analytics Department we are asked to handle several different types of text processing assignments. These usually take the form of "please extract the X section from Y document type 10k times." Some of these documents have a Table of Contents, but the ToC is difficult to use because we do not know where each page falls in the text.

This package is designed to discover where pages are marked and then use those page numbers to retrieve sections of text. Much of the work we do involves SEC filings, which are in a type of XML format. The package is optimized for that type of document, but should do well in other cases.

Requirements

  • Python 3.6
  • fuzzywuzzy: Fuzzy matching
  • python-Levenshtein: Speeds up fuzzy matching library
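
To give a feel for what fuzzy matching contributes, the sketch below scores candidate lines against a known page-marker line and picks the closest one. It uses the standard library's difflib.SequenceMatcher as a stand-in, not fuzzywuzzy itself, and the helper name and sample lines are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score between two strings (difflib stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical data: a marker line already found, plus lines that might hold the next marker.
known_marker = "<P><FONT>4</FONT></P>"
candidates = [
    "<P><FONT>The company reported revenue</FONT></P>",
    "<P><FONT> 5</FONT></P>",
    "<HR>",
]

# The candidate that most resembles the known marker is the best guess.
best = max(candidates, key=lambda line: similarity(known_marker, line))
print(best)
```

The line that shares the marker's surrounding markup scores highest, even though its digit and spacing differ.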

Quickstart

Install

$ pip install DiscoverPagination

Usage

$ python
>>> from discoverpagination import PaginatedDocument
>>> with open('./example_texts/0001193125-08-010038.txt') as inputfile:
...     doc = PaginatedDocument(inputfile, clean_xml=True)
>>> pages = doc[20:22]
>>> print(pages)
[' <P><FONT>19 </FONT></P>\n', '\n', '\n', '<p>\n', '<HR>\n', '\n', ' <P><FONT>The ...

Methods

Page discovery takes several steps and relies on a few assumptions.

Assumptions

  1. Pages are marked.
  2. Page markings are in sequential order.
  3. Page markings use numeric characters.
  4. Each page is numbered at the end of the page.
  5. Page numbers do not occur mixed with text. (There is an attempt to handle this case.)
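
Assumptions 3 and 5 can be illustrated with a small check that treats a line as a page marker only when nothing but digits remains once the markup is removed. This is a minimal sketch, not the package's own code: the regular expressions and the function name page_number_on_line are hypothetical.

```python
import re

# Hypothetical pattern: anything that looks like an XML/HTML tag.
TAG_RE = re.compile(r"<[^>]+>")


def page_number_on_line(line):
    """Return the page number if the line holds only digits after tags are stripped, else None."""
    text = TAG_RE.sub("", line).strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None


print(page_number_on_line(" <P><FONT>19 </FONT></P>\n"))         # 19
print(page_number_on_line("<P>See page 19 for details</P>\n"))   # None
```

A number embedded in a sentence is rejected, which is the simple case of assumption 5; handling numbers mixed with text would need a looser rule.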

Steps

  1. The document is read from a file.
  2. (Optional) XML documents are cleaned of tag attributes.
  3. The document is scanned line by line for page markers, starting with "1" (configurable).
  4. As each number is found, its line index and text are stored in a dict keyed by page number.
  5. The expected page number is incremented after each match, until no document lines remain.
  6. The document is rescanned in reverse order to find page markers.
  7. Page markers that the two scans place on the same or nearby lines are kept.
  8. A common "best_match" format is determined by ranking each type of marker line.
  9. Missing page numbers are searched for with this "best_match" format in the areas where they should be. For example, a missing page 5 is searched for between pages 4 and 6 with the best pattern.
  10. If pages are still missing, fuzzy matching is used to guess their location based on placement and pattern.
  11. The document is returned and can be referenced by slicing: doc[10:12] gets the lines for pages 10 to 12.
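
Steps 3 to 7 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the package's implementation: every function name is hypothetical, a marker is assumed to be a line holding only digits once tags are stripped, the reverse scan is assumed to start from the highest page the forward scan found, and only markers on exactly the same line are kept (the "nearby" tolerance, the best_match ranking, and the fuzzy matching are left out).

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # hypothetical tag pattern


def marker_value(line):
    """Return the integer on a digits-only line (after tag stripping), else None."""
    text = TAG_RE.sub("", line).strip()
    return int(text) if re.fullmatch(r"[0-9]+", text) else None


def forward_scan(lines, start=1):
    """Steps 3-5: scan top to bottom for start, start + 1, ... and record line indexes."""
    found, expected = {}, start
    for index, line in enumerate(lines):
        if marker_value(line) == expected:
            found[expected] = index
            expected += 1
    return found


def reverse_scan(lines, last_page):
    """Step 6: scan bottom to top for last_page, last_page - 1, ... down to 1."""
    found, expected = {}, last_page
    for index in range(len(lines) - 1, -1, -1):
        if expected >= 1 and marker_value(lines[index]) == expected:
            found[expected] = index
            expected -= 1
    return found


def agreed_markers(lines, start=1):
    """Step 7 (simplified): keep pages whose marker line is identical in both scans."""
    forward = forward_scan(lines, start)
    if not forward:
        return {}
    backward = reverse_scan(lines, max(forward))
    return {page: i for page, i in forward.items() if backward.get(page) == i}


def page_lines(lines, markers, page):
    """Lines from just after the previous marker through this page's own marker.

    Falls back to the start of the document when the previous marker is unknown.
    """
    begin = markers.get(page - 1, -1) + 1
    return lines[begin:markers[page] + 1]


# Hypothetical sample: the table cell "2" on line 2 is a false marker for page 2.
sample = [
    "Cover text\n",
    "<P><FONT>1</FONT></P>\n",
    "<TD>2</TD>\n",
    "More text\n",
    "<P><FONT>2</FONT></P>\n",
    "Final text\n",
    "<P><FONT>3</FONT></P>\n",
]
markers = agreed_markers(sample)
print(markers)  # {1: 1, 3: 6}
```

In the sample, the forward scan settles on the stray table cell for page 2 while the reverse scan settles on the real marker, so page 2 is dropped as unconfirmed. That is the kind of gap the later pattern-based and fuzzy steps are described as filling.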

Tests

$ python setup.py test

Reference

fuzzywuzzy
python-Levenshtein
SEC EDGAR

Contributors

Douglas H. King

License

MIT
