
Converts URLs, PDFs, and Wikipedia pages into clean text documents with one sentence per line.

Project description

sentify is a simple, fast, open-source Python toolkit that combines into one step the tedious tasks of fetching documents, converting them to text, and segmenting them into clean text files with one sentence per line.
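
To illustrate what "one sentence per line" output looks like, here is a minimal standard-library sketch. It is not sentify's implementation: the function name `to_sentence_lines` and the regex-based splitting rule are assumptions made for this example, and a real segmenter handles abbreviations and other edge cases that this one does not.

```python
import re


def to_sentence_lines(text: str) -> str:
    """Naive illustration only (hypothetical helper, not part of sentify).

    Collapses all whitespace, then splits after '.', '!' or '?' when the
    next token starts with a capital letter, a quote, or an opening paren.
    """
    flat = " ".join(text.split())
    if not flat:
        return ""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z\"'(])", flat)
    return "\n".join(part.strip() for part in parts if part.strip())


if __name__ == "__main__":
    sample = (
        "The tool fetches a document.  It converts the document\n"
        "to text. Then it writes one sentence per line!"
    )
    print(to_sentence_lines(sample))
```

A rule this simple will wrongly split after abbreviations such as "Dr. Smith", which is one reason to rely on a dedicated toolkit rather than a hand-written regex.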

I put it together because this step is an often unavoidable "stepping stone" on the way to the really interesting NLP and AI tasks we care about these days.

The collected clean sentences are ready for NLP and ML tasks, including passing them to Generative AI for summarization, relation extraction and QA.

It handles local and remote txt and pdf files, URLs, and Wikipedia pages given by their title.
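
The input kinds listed above could be told apart roughly as in the sketch below. This is an assumption about how such dispatch might look, not sentify's actual logic: `classify_source` and its return labels are hypothetical names, and the real API is in the `main.py` file linked further down.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Hypothetical helper: guess what kind of input a string refers to.

    Anything that is neither an http(s) URL nor a .txt/.pdf path is
    assumed to be a Wikipedia page title.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        remote_suffix = Path(parsed.path).suffix.lower()
        if remote_suffix == ".pdf":
            return "remote_pdf"
        if remote_suffix == ".txt":
            return "remote_txt"
        return "url"
    local_suffix = Path(source).suffix.lower()
    if local_suffix == ".pdf":
        return "local_pdf"
    if local_suffix == ".txt":
        return "local_txt"
    return "wikipedia_title"


if __name__ == "__main__":
    for item in ("https://example.org/paper.pdf", "notes.txt", "Alan Turing"):
        print(item, "->", classify_source(item))
```

The fallback to a Wikipedia title is a design choice for this sketch only; a stricter version might require the caller to state the input kind explicitly.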

See code at

https://github.com/ptarau/sentify/blob/main/sentify/main.py

for the simple, all-in-one API.

Get it from GitHub or install it from PyPI with

pip3 install sentify

See tests/tests.py for testing out the API on several use cases.

Enjoy,

Paul Tarau

January, 2024

Download files


Source Distribution

sentify-0.9.3.tar.gz (7.2 kB view details)

Uploaded Source

File details

Details for the file sentify-0.9.3.tar.gz.

File metadata

  • Download URL: sentify-0.9.3.tar.gz
  • Upload date:
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for sentify-0.9.3.tar.gz
Algorithm Hash digest
SHA256 71950899e3a3bb571ae1cbcf5d657f873193230b47beab06a6be858f7b7a688f
MD5 01ca9a2490237d3de69e7cbcead90b19
BLAKE2b-256 e45d9e530452e0149867cc8480db60228879c4bd3e35dcc8d3cb9e7d246b5d7f
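
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of_file` and `matches_expected` are made up for this example, and the file is assumed to have been saved locally as `sentify-0.9.3.tar.gz`.

```python
import hashlib

# SHA256 digest copied from the hash listing above.
EXPECTED_SHA256 = "71950899e3a3bb571ae1cbcf5d657f873193230b47beab06a6be858f7b7a688f"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("sentify-0.9.3.tar.gz")` in the download directory should return `True` only if the file is byte-for-byte identical to the one the listed digest was computed from.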
