Pre-processing tools for documents with legal content.
Legal Pre-processing
Authors: Daniel Henrique Arruda Boeing and Israel Oliveira.
Usage:
Download the JSON files that can be used as examples:
$ mkdir -p legal_dicts && cd legal_dicts
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalRegExPatterns.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalStopwords.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/TesauroRevisado.json
Load the helper class and load the dictionaries (the argument is the path to the directory holding the JSON files):
>>> from legal_pre_processing.utils import LoadDicts
>>>
>>> dicts = LoadDicts('legal_dicts/')
>>> dicts.List
['LegalRegExPatterns', 'TesauroRevisado', 'LegalStopwords']
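The internals of LoadDicts are not shown here. Judging only from the session above (one attribute per JSON file, plus a List of their names), a rough stand-in could look like the sketch below. The class name JsonDictLoader and every detail of it are assumptions made for illustration, not the library's code.

```python
import json
from pathlib import Path


class JsonDictLoader:
    """Hypothetical stand-in for LoadDicts: one attribute per JSON file."""

    def __init__(self, folder):
        self.List = []
        # Sorted for a deterministic order; the real class may order differently.
        for path in sorted(Path(folder).glob("*.json")):
            with path.open(encoding="utf-8") as fh:
                # Expose the parsed content under the file name without ".json".
                setattr(self, path.stem, json.load(fh))
            self.List.append(path.stem)
```

With the three example files in place, JsonDictLoader('legal_dicts/').LegalStopwords would hold the parsed content of LegalStopwords.json.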
Load the LegalPreprocess class and instantiate it:
>>> from legal_pre_processing.legal_pre_processing import LegalPreprocess
>>>
>>> model = LegalPreprocess(domain_stopwords=dicts.LegalStopwords, tesauro=dicts.TesauroRevisado, regex_pattern=dicts.LegalRegExPatterns)
Load a PDF file with PyMuPDF (or another extractor) and run some tests:
>>> import fitz
>>>
>>> doc = fitz.open('some_pdf_file_with_legal_content.pdf')
>>> page_number = 1  # 1-based number of the page to read
>>> page = doc[page_number-1].get_text()
>>> print(page)
"...Com a concordância das partes foi utilizada prova emprestada em relação aos depoimentos de algumas testemunhas de defesa (decisões de 28/10/2016, 07/11/2016, de 10/11/2016 e de 09/02/2017, nos eventos 114, 175 e 199, e depoimentos nos eventos 187, 200, 287 e 513)...."
>>> page_preprocess = model.ProcessText(page)
>>> print(page_preprocess)
"...concordancia utilizada PROVA_EMPRESTADA relacao depoimentos algumas testemunhas defesa decisoes eventos depoimentos eventos..."
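The output above suggests that ProcessText strips accents, drops dates and numbers, removes stopwords, and collapses thesaurus terms into single upper-case tokens. The sketch below imitates that behaviour with the standard library only. The function preprocess_sketch, its parameters, and the tiny stopword, term, and pattern sets are all hypothetical; the real method may order or implement these steps differently.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics, e.g. 'concordância' -> 'concordancia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess_sketch(text, stopwords, terms, patterns):
    """Hypothetical pipeline; not the actual ProcessText implementation."""
    text = strip_accents(text.lower())
    # 1. Drop spans matched by domain regexes (here: dates).
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    # 2. Collapse multi-word thesaurus terms into single upper-case tokens,
    #    longest terms first so shorter ones cannot split them.
    for term in sorted(terms, key=len, reverse=True):
        token = term.upper().replace(" ", "_")
        text = re.sub(r"\b" + re.escape(term) + r"\b", token, text)
    # 3. Keep tokens made of letters or underscores that are not stopwords.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return " ".join(t for t in tokens if t not in stopwords)


sample = ("Com a concordância das partes foi utilizada prova emprestada "
          "(decisões de 28/10/2016)")
result = preprocess_sketch(
    sample,
    stopwords={"com", "a", "das", "partes", "foi", "de"},  # toy list
    terms=["prova emprestada"],                             # toy thesaurus
    patterns=[r"\d{2}/\d{2}/\d{4}"],                        # toy date regex
)
print(result)
```

Terms are matched after lower-casing and accent stripping, so thesaurus entries would need the same normalization before being passed in.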
Use the available heuristics:
>>> from heuristics import Heuristics
>>> path_pdf = 'example-of-rotated-text-in-latex.pdf'
>>> h = Heuristics(path_pdf)
>>> h.set_all_heuristics()
>>> txt = h.Extract()
Class Heuristics, input parameters:
pdf_path : str
    Path to the PDF file.
th_font : float, optional
    Threshold (between 0 and 1) for filtering outliers among font types.
    (default is 0.9)
th_size : float, optional
    Threshold (between 0 and 1) for filtering outliers among font sizes.
    (default is 0.9)
filter_font_by_cum : bool, optional
    Filter font-type outliers by the accumulated sum.
    If False, filter by individual counting. (default is True)
filter_size_by_cum : bool, optional
    Filter font-size outliers by the accumulated sum.
    If False, filter by individual counting. (default is True)
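The parameter descriptions do not spell out how the thresholds are applied. The sketch below shows one plausible reading, and it is an assumption throughout. With the accumulated sum, the most frequent fonts are kept until their combined share reaches the threshold. With individual counting, a font is kept when its own share is at least 1 - threshold. The function fonts_to_keep is hypothetical and is not part of the package.

```python
from collections import Counter


def fonts_to_keep(font_per_span, threshold=0.9, by_cum=True):
    """Return the fonts treated as non-outliers (assumed semantics)."""
    counts = Counter(font_per_span)
    total = sum(counts.values())
    if total == 0:
        return set()
    keep = set()
    if by_cum:
        # Assumed: keep the most frequent fonts until their accumulated
        # share of all spans reaches the threshold.
        cumulative = 0.0
        for font, n in counts.most_common():
            if cumulative >= threshold:
                break
            keep.add(font)
            cumulative += n / total
    else:
        # Assumed: keep each font whose own share is at least 1 - threshold.
        for font, n in counts.items():
            if n / total >= 1 - threshold:
                keep.add(font)
    return keep


fonts = ["Times"] * 85 + ["Arial"] * 10 + ["Courier"] * 5
print(fonts_to_keep(fonts))                                # cumulative rule
print(fonts_to_keep(fonts, threshold=0.8, by_cum=False))   # individual rule
```

The same logic would apply to font sizes by passing a list of sizes instead of font names.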
- Remove duplicated phrases:
>>> h.set_filter_duplicated_phrases()
- Keep only horizontal text:
>>> h.set_let_horinzontal_text()
- Remove text set in rarely used font types:
>>> h.set_filter_outlier_font_types()
- Remove text set in rarely used font sizes:
>>> h.set_filter_outlier_font_sizes()
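To give an idea of what the first two heuristics involve, here is a minimal sketch that works on the dictionary returned by PyMuPDF's page.get_text("dict"), where each text line carries a "dir" entry holding its writing direction. The helper names horizontal_lines and drop_duplicated_phrases are hypothetical, and the Heuristics class may implement both steps quite differently.

```python
def horizontal_lines(page_dict, tol=1e-3):
    """Collect the text of lines written left to right (dir close to (1, 0))."""
    lines = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            cos, sin = line["dir"]
            if abs(cos - 1.0) <= tol and abs(sin) <= tol:
                lines.append("".join(span["text"] for span in line["spans"]))
    return lines


def drop_duplicated_phrases(phrases):
    """Keep the first occurrence of each phrase; blank phrases are dropped."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = " ".join(phrase.split()).lower()  # ignore case and spacing
        if key and key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique
```

With PyMuPDF installed, horizontal_lines(doc[0].get_text("dict")) would return the horizontal lines of the first page, which can then be passed to drop_duplicated_phrases to remove repeated headers or footers.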
References:
- PyMuPDF documentation (based on version 1.18.15).
- Legal Thesaurus (*Vocabulário Jurídico*).