Converter from URLs, PDFs and Wikipedia pages to clean text documents, one sentence per line.

Project description

sentify is a simple and fast open source Python toolkit that combines, in one step, the tedious tasks of fetching documents, converting them to text and segmenting them into clean text files with one sentence per line.

I put it together because this step is an often unavoidable "stepping stone" on the way to the really interesting NLP and AI tasks we care about these days.

The collected clean sentences are ready for NLP and ML tasks, including passing them to Generative AI for summarization, relation extraction and QA.

It handles local and remote txt and pdf files, URLs, and Wikipedia pages given by their title.
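To make the "one sentence per line" target format concrete, here is a minimal sketch using only the Python standard library. It is not sentify's implementation or API: the helper name to_sentence_lines and the naive regex rule (split after ., ! or ? when followed by whitespace and an uppercase letter or quote) are assumptions made for illustration, and a rule this simple will mis-split abbreviations such as "Dr. Smith".

```python
import re

# Hypothetical illustration, NOT sentify's code: a naive sentence boundary is
# whitespace preceded by '.', '!' or '?' and followed by an uppercase letter
# or an opening quote.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def to_sentence_lines(text: str) -> str:
    """Return `text` with one (naively detected) sentence per line."""
    # Collapse all whitespace runs, including stray line breaks that
    # typically survive PDF or HTML text extraction.
    flat = " ".join(text.split())
    sentences = [s.strip() for s in _SENTENCE_END.split(flat) if s.strip()]
    return "\n".join(sentences)


if __name__ == "__main__":
    raw = "Cats purr.  Dogs bark!\nDo birds sing? Yes."
    print(to_sentence_lines(raw))
```

The resulting text can be written to a file and read back line by line, so each sentence becomes one record for downstream NLP or ML steps.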

See the code at

https://github.com/ptarau/sentify/blob/main/sentify/main.py

for the simple, all-in-one API.

Get it from GitHub or fetch it from PyPI with

pip3 install sentify

See tests/tests.py for testing out the API on several use cases.

Enjoy,

Paul Tarau

January, 2024

Download files

Source Distribution

sentify-1.0.2.tar.gz (8.3 kB)

File details

Details for the file sentify-1.0.2.tar.gz.

File metadata

  • Download URL: sentify-1.0.2.tar.gz
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for sentify-1.0.2.tar.gz
Algorithm     Hash digest
SHA256        8bc1db647e370d7febcf6d7d65350e79c81d68c8538660a50b3020a0ce10e310
MD5           10f8c09e3a0ac0d84abf7db57c95f0b4
BLAKE2b-256   8fd8ab8170bfd59c1862982cdd195f8f4a940540a27c3d14b39cb62bb26d2673
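
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard hashlib module; the function names sha256_of and matches_published_hash are hypothetical, and the assumption that sentify-1.0.2.tar.gz sits in the current directory is only for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing for sentify-1.0.2.tar.gz.
EXPECTED_SHA256 = "8bc1db647e370d7febcf6d7d65350e79c81d68c8538660a50b3020a0ce10e310"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals `expected` (case-insensitive)."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location of the downloaded archive; adjust as needed.
    archive = Path("sentify-1.0.2.tar.gz")
    if archive.exists():
        print("hash ok" if matches_published_hash(archive) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Note that pip can also enforce hashes itself when they are pinned in a requirements file and installation is run with --require-hashes.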
