A simple information retrieval system for pdf documents

Presentation

irspdf is a simple textual information retrieval system for PDF documents.

Text is extracted from each PDF with pdfplumber.
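
As a rough illustration of this step, the sketch below reads every page of a PDF with pdfplumber and joins the page texts. The helper names `join_pages` and `extract_text` are hypothetical and are not part of irspdf; how irspdf itself calls pdfplumber is not described here, so treat this as one plausible approach.

```python
def join_pages(page_texts):
    """Join per-page strings, skipping pages that yielded no text."""
    return "\n".join(text for text in page_texts if text)


def extract_text(pdf_path):
    """Return the text of every page in the PDF at pdf_path (hypothetical helper)."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None or "" for pages without a text layer,
        # so join_pages filters those out.
        return join_pages(page.extract_text() for page in pdf.pages)
```

Pages without extractable text, such as scanned images, contribute nothing to the result, so a PDF made only of scans would yield an empty string.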

Standard text preprocessing for information retrieval is applied:

  • Stop word removal
  • Stemming
  • Punctuation removal
  • Lowercase conversion
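
The four steps listed above can be sketched with the standard library alone. This is a minimal illustration, not irspdf's code: the stop list is a tiny made-up sample, and `toy_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. The order in which irspdf applies the steps is also an assumption.

```python
import string

# Tiny illustrative stop list (assumption); real systems use a few hundred words.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem each token."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return [toy_stem(tok) for tok in tokens if tok not in STOP_WORDS]
```

For example, `preprocess("The Ranking of PDFs, indexed!")` returns `['rank', 'pdf', 'index']`. The same function would be applied to both documents and queries so that their terms match.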

The ranking function used is BM25.
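
For readers unfamiliar with BM25, here is a minimal sketch of one common variant. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF formula are assumptions chosen for illustration; the exact variant and parameters used by irspdf are not stated here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Length normalisation; falls back to 1.0 if every document is empty.
        norm = 1 - b + b * len(doc) / avg_len if avg_len else 1.0
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(score)
    return scores
```

Documents are then ranked by descending score. Rare query terms weigh more through the IDF factor, repeated terms give diminishing returns through `k1`, and long documents are penalised through `b`.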

Installation

Install with pip

pip install irspdf

Or install from GitHub

git clone https://github.com/Jibril-Frej/irspdf.git
cd irspdf && python setup.py install

Usage

Build a collection

from irspdf import build
build(folder_path, collection_path)

folder_path : path of the folder containing all the PDF files to include in the collection.

collection_path : file where the collection will be saved

Query the collection

from irspdf import query
query(collection_path)

collection_path : file where the collection is saved

Update the collection

from irspdf import update
update(folder_path, collection_path)

folder_path : path of the folder containing all the PDF files to add to the collection.

collection_path : file where the original collection is saved

Useful links

Documentation: https://irspdf.readthedocs.io/en/latest/

Source Code: https://github.com/Jibril-Frej/irspdf

Package: https://pypi.org/project/irspdf/
