A tool to extract the main logic from a document

Project description

docusense

A Python library for extracting the main logic from a document using NLP transformers and stop words. With multilingual transformers, the extraction process should work reasonably well on many types of documents.

The purpose of this library is to extract the main logical sense of documents without needing to open them.

Features

  • summary extraction
  • questions & answers from the document
  • strongest keywords from the document

Setup

Install using pip

pip install docusense

Terminal usage

Possible parameters

  • path: path to the document file to summarize.
  • text: document text for summarization.
  • lang: language used for stop words. Stop words of the selected language are removed.
  • question: question to answer from the text.
  • min_length: Min length of generated summary.
  • max_length: Max length of generated summary.
  • max_answer_len: Max length of answer.
  • n_keywords: Number of keywords to return.
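The parameters above map naturally onto command-line flags. The following is a minimal sketch of how such a parser could look with argparse. It is not the actual source of extract.py: the function name build_parser, the use of None as every default, and the rule that exactly one of --path or --text must be given are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented parameter names."""
    parser = argparse.ArgumentParser(
        description="Extract summary, answer and keywords from a document."
    )
    # Assumption: the document comes either from a file or from inline text.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--path", help="Path to the document file.")
    source.add_argument("--text", help="Document text passed inline.")
    # Defaults are left as None because the real defaults are not documented.
    parser.add_argument("--lang", default=None, help="Language used for stop words.")
    parser.add_argument("--question", default=None, help="Question to answer from the text.")
    parser.add_argument("--min_length", type=int, default=None, help="Min length of the summary.")
    parser.add_argument("--max_length", type=int, default=None, help="Max length of the summary.")
    parser.add_argument("--max_answer_len", type=int, default=None, help="Max length of the answer.")
    parser.add_argument("--n_keywords", type=int, default=None, help="Number of keywords to return.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--path", "path/to/file", "--n_keywords", "3"])
print(args.path, args.n_keywords)
```

Parsing the numeric options with type=int means a non-numeric value is rejected before any model is loaded.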

Simple case

Logs the summary, the answer to the question asked, and the keywords.

python extract.py --path "path/to/file"

Examples as Python code

Simple case

from docusense.sense import SenseExtractor

text = """
Good evening,
Carriers constantly violate the passenger and baggage transport rules.
In the evenings from 21:00 to 24:00 drivers are specially late to leave
around the ring for 3-4 minutes (all this time the buses are standing still in the ring with the engines running,
at the same time creating additional air and noise pollution close to residential houses in the evenings
and during the night).
Also, some buses often arrive a few minutes earlier than indicated
in the schedule. Apparently, the transporters live and work in a parallel world, somewhere at night
traffic jams occur in the district.
The company "Communication Services" has been informing about violations for several months, but
does not take any action, although it is required to carry out the control of public transport carriers and
ensure compliance with passenger and baggage regulations.
The director of the municipal administration also does not carry out any control, rules 3
point - To instruct the Director of Administration to control how this is carried out
solution.
Please provide an answer: why are the same violations repeated every night and how
it is ensured that the carriers comply with the passenger and baggage transport rules.
"""

extractor = SenseExtractor()
output = extractor(text=text)
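When the document lives in a file, the text has to be read before it is passed to the extractor. Below is a small sketch of a helper for that. The name extract_from_file and the UTF-8 default are assumptions, and the helper is not part of docusense. It accepts any callable that takes a text keyword argument, which is how SenseExtractor is called in the example above.

```python
from pathlib import Path
from typing import Any, Callable


def extract_from_file(
    path: str, extractor: Callable[..., Any], encoding: str = "utf-8"
) -> Any:
    """Read a text file and pass its contents to the extractor as text=..."""
    text = Path(path).read_text(encoding=encoding)
    if not text.strip():
        # Fail early instead of running the models on an empty document.
        raise ValueError(f"{path} contains no text")
    return extractor(text=text)


# Hypothetical usage with the library (requires docusense to be installed):
#   from docusense.sense import SenseExtractor
#   output = extract_from_file("path/to/file", SenseExtractor())
```

Taking the extractor as an argument lets the same instance be reused across many files.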

Divided use

You can use each part of the extractor separately.

Summarizer

text = """
Good evening,
Carriers constantly violate the passenger and baggage transport rules.
In the evenings from 21:00 to 24:00 drivers are specially late to leave
around the ring for 3-4 minutes (all this time the buses are standing still in the ring with the engines running,
at the same time creating additional air and noise pollution close to residential houses in the evenings
and during the night).
Also, some buses often arrive a few minutes earlier than indicated
in the schedule. Apparently, the transporters live and work in a parallel world, somewhere at night
traffic jams occur in the district.
The company "Communication Services" has been informing about violations for several months, but
does not take any action, although it is required to carry out the control of public transport carriers and
ensure compliance with passenger and baggage regulations.
The director of the municipal administration also does not carry out any control, rules 3
point - To instruct the Director of Administration to control how this is carried out
solution.
Please provide an answer: why are the same violations repeated every night and how
it is ensured that the carriers comply with the passenger and baggage transport rules.
"""

# Summarizer: generate a summary of at least min_length
from docusense.summary import Summarizer
summarizer = Summarizer()
summary = summarizer(text, min_length=20)

# Questions and answers: look for the answer to a question in the text
from docusense.qa import QAExtractor
qa_extractor = QAExtractor()
answer = qa_extractor(text, question="What is the question in the text?")

# Keywords: return the n_keywords strongest keywords, ignoring stop words for lang
from docusense.keywords import KeywordsExtractor
keywords_extractor = KeywordsExtractor()
keywords = keywords_extractor(text, lang='english', n_keywords=3)
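To give an intuition for what the lang and n_keywords parameters do, here is a rough sketch of stop-word filtering followed by frequency ranking. It is not how KeywordsExtractor is implemented; that implementation is not shown here. The function top_keywords, the tiny STOP_WORDS set and the frequency-based ranking are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; a real list is per-language and much larger.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in",
    "is", "are", "for", "from", "at", "with",
}


def top_keywords(text: str, n_keywords: int = 3, stop_words: set = STOP_WORDS) -> list:
    """Return the n_keywords most frequent tokens that are not stop words."""
    # Lowercase and keep runs of letters only (digits and punctuation are dropped).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Skip stop words and very short tokens before counting.
    counts = Counter(t for t in tokens if t not in stop_words and len(t) > 2)
    return [word for word, _ in counts.most_common(n_keywords)]


print(top_keywords("The buses are late and the buses are loud in the ring.", n_keywords=2))
```

Removing stop words first keeps frequent function words such as "the" from crowding out the words that carry the meaning of the document.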

Development

  1. Install Poetry for your platform: https://python-poetry.org/docs/#installation
  2. Run poetry install
