Legal Pre-processing

Pre-processing tools for documents with legal content. Authors: Daniel Henrique Arruda Boeing and Israel Oliveira.

Supported Python versions: 3.7, 3.8, 3.9.

Usage:

Download the JSON files used in the examples below.

$ mkdir -p data_dicts && cd data_dicts

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalRegExPatterns.json

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalStopwords.json

$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/TesauroRevisado.json

Load the helper class and the dictionaries.

>>> from legal_pre_processing.utils import LoadDicts
>>>
>>> dicts = LoadDicts('data_dicts/')
>>> dicts.List
['LegalRegExPatterns', 'TesauroRevisado', 'LegalStopwords']
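
To inspect the downloaded files without the package, the loading step can be approximated with the standard library. This is a minimal sketch, not the LoadDicts implementation: the function name `load_json_dicts` is hypothetical, and it assumes only that each dictionary is a `.json` file whose name (without extension) becomes the attribute name, as the `dicts.List` output above suggests.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def load_json_dicts(folder):
    """Load every *.json file in `folder`, keyed by file name without extension.

    Hypothetical helper for illustration; the real LoadDicts may behave differently.
    """
    loaded = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            loaded[path.stem] = json.load(handle)
    # Expose each dictionary as an attribute, plus a `List` of the loaded names.
    return SimpleNamespace(List=list(loaded), **loaded)
```

Calling `load_json_dicts('data_dicts/')` would then give an object whose `List` attribute holds the names of the files found in that folder.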

Load the LegalPreprocess class and instantiate it.

>>> from legal_pre_processing.legal_pre_processing import LegalPreprocess
>>>
>>> model = LegalPreprocess(domain_stopwords=dicts.LegalStopwords, tesauro=dicts.TesauroRevisado, regex_pattern=dicts.LegalRegExPatterns)

Load a PDF file with PyMuPDF (or another extractor) and process one page:

>>> import fitz
>>>
>>> doc = fitz.open('some_pdf_file_with_legal_content.pdf')
>>> page_number = 1  # 1-based number of the page to extract
>>> page = doc[page_number-1].get_text()
>>> print(page)
"...Com a concordância das partes foi utilizada prova emprestada em relação aos depoimentos de algumas testemunhas de defesa (decisões de 28/10/2016,  07/11/2016, de 10/11/2016 e de 09/02/2017, nos eventos 114, 175 e 199, e depoimentos nos eventos 187, 200, 287 e 513)...."
>>> page_preprocess = model.ProcessText(page)
>>> print(page_preprocess)
"...concordancia utilizada PROVA_EMPRESTADA relacao depoimentos algumas testemunhas defesa decisoes eventos depoimentos eventos..."
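
The output above suggests several transformations: accents are stripped, dates and numbers are dropped, stopwords are removed, and a multi-word legal term is collapsed into a single token. The sketch below reproduces that effect on a shortened sentence using only the standard library. It is an illustration under assumptions, not the LegalPreprocess implementation: the function `toy_preprocess`, the stopword set, the thesaurus mapping, and the date pattern are all made up for this example and do not reflect the contents of the downloaded JSON files.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks, e.g. 'concordância' becomes 'concordancia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_preprocess(text, stopwords, thesaurus, patterns):
    """Illustrative pipeline only; not the library's ProcessText."""
    text = strip_accents(text.lower())
    # Remove spans matched by the regex patterns (here: dates).
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # Replace multi-word thesaurus terms with a single upper-case token.
    for term, token in thesaurus.items():
        text = text.replace(term, token)
    # Keep alphabetic tokens (underscores allowed) that are not stopwords.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return " ".join(tok for tok in tokens if tok not in stopwords)


sample = "Com a concordância das partes foi utilizada prova emprestada (decisões de 28/10/2016)"
result = toy_preprocess(
    sample,
    stopwords={"com", "a", "das", "partes", "foi", "de"},  # assumed, for illustration
    thesaurus={"prova emprestada": "PROVA_EMPRESTADA"},    # assumed, for illustration
    patterns=[r"\d{2}/\d{2}/\d{4}"],                       # assumed date pattern
)
print(result)  # concordancia utilizada PROVA_EMPRESTADA decisoes
```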

Use the available heuristics:

>>> from heuristics import Heuristics
>>> path_pdf = 'example-of-rotated-text-in-latex.pdf'
>>> h = Heuristics(path_pdf)
>>> h.set_all_heuristics()
>>> txt = h.Extract()

Input parameters of the Heuristics class:

pdf_path : str
    Path to the PDF file.
th_font : float, optional
    Threshold (between 0 and 1) for filtering out outlier font types.
    (default is 0.9)
th_size : float, optional
    Threshold (between 0 and 1) for filtering out outlier font sizes.
    (default is 0.9)
filter_font_by_cum : bool, optional
    Filter font-type outliers by cumulative sum.
    If False, filter by individual count. (default is True)
filter_size_by_cum : bool, optional
    Filter font-size outliers by cumulative sum.
    If False, filter by individual count. (default is True)
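
The parameter descriptions leave the exact filtering rule open. The sketch below shows one plausible reading, offered as an assumption rather than the library's actual logic: in cumulative mode the most frequent values are kept until together they cover the threshold share of all text spans, and in individual mode a value is kept only if its own share is at least `1 - threshold`. The function name `frequent_values` and both rules are hypothetical.

```python
from collections import Counter


def frequent_values(values, threshold=0.9, by_cumulative=True):
    """Return the values treated as non-outliers under the assumed rules."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return set()
    kept = set()
    if by_cumulative:
        covered = 0.0
        # Take the most common values until they cover `threshold` of all spans.
        for value, count in counts.most_common():
            if covered >= threshold:
                break
            kept.add(value)
            covered += count / total
    else:
        # Keep a value only if its own share is at least 1 - threshold.
        for value, count in counts.items():
            if count / total >= 1 - threshold:
                kept.add(value)
    return kept


# Font sizes of 100 hypothetical text spans.
sizes = [10.0] * 80 + [12.0] * 8 + [14.0] * 7 + [6.0] * 5
kept_cumulative = frequent_values(sizes, 0.9, by_cumulative=True)
kept_individual = frequent_values(sizes, 0.9, by_cumulative=False)
```

With these sample sizes the cumulative rule keeps 10.0, 12.0 and 14.0 and discards only the 6.0 spans, while the individual rule keeps 10.0 alone, which shows how much the choice of mode can matter.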

  • Remove duplicated phrases:
>>> h.set_filter_duplicated_phrases()
  • Keep only horizontal text:
>>> h.set_let_horinzontal_text()
  • Remove text set in rarely used font types:
>>> h.set_filter_outlier_font_types()
  • Remove text set in rarely used font sizes:
>>> h.set_filter_outlier_font_sizes()
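
For readers who want to see what two of these heuristics could look like on raw extractor output, the sketch below filters a page dictionary shaped like the one PyMuPDF returns from `page.get_text("dict")`, where each line carries a `dir` writing-direction vector. It keeps only left-to-right horizontal lines and drops exact repeats. The function name `horizontal_unique_lines` is hypothetical, and this is not necessarily how the Heuristics class implements these steps.

```python
def horizontal_unique_lines(page_dict, tolerance=1e-3):
    """Return the text of horizontal lines, skipping exact duplicates."""
    seen = set()
    result = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            # "dir" is (cos, sin) of the writing angle; (1, 0) is horizontal.
            dx, dy = line.get("dir", (1.0, 0.0))
            if abs(dx - 1.0) > tolerance or abs(dy) > tolerance:
                continue
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text and text not in seen:
                seen.add(text)
                result.append(text)
    return result


# Synthetic page: a repeated header line and one line rotated by 90 degrees.
page_dict = {
    "blocks": [
        {"lines": [
            {"dir": (1.0, 0.0), "spans": [{"text": "Tribunal header"}]},
            {"dir": (0.0, -1.0), "spans": [{"text": "Rotated stamp"}]},
            {"dir": (1.0, 0.0), "spans": [{"text": "Body "}, {"text": "text"}]},
            {"dir": (1.0, 0.0), "spans": [{"text": "Tribunal header"}]},
        ]},
        {"type": 1},  # image block without lines
    ]
}
lines = horizontal_unique_lines(page_dict)
```

With a real document, the same function could be fed `doc[0].get_text("dict")` from PyMuPDF.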

TODO:

  • Update the README with a project image and shields (see random-forest-mc).
  • Activate LGTM (see random-forest-mc).

Download files

Source distribution: legal-pre-processing-0.3.2.tar.gz (9.2 kB)

Built distribution: legal_pre_processing-0.3.2-py3-none-any.whl (8.5 kB, Python 3)
