Legal Pre-processing
Preprocessing tools for documents with legal content. Authors: Daniel Henrique Arruda Boeing and Israel Oliveira.
Usage:
Download the JSON files that can be used as examples.
$ mkdir -p data_dicts && cd data_dicts
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalRegExPatterns.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalStopwords.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/TesauroRevisado.json
Load the helper class and load the dictionaries.
>>> from legal_pre_processing.utils import LoadDicts
>>>
>>> dicts = LoadDicts('data_dicts/')  # folder created in the download step
>>> dicts.List
['LegalRegExPatterns', 'TesauroRevisado', 'LegalStopwords']
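For readers who want to inspect the downloaded files without the helper class, the sketch below shows one way a folder of JSON dictionaries could be loaded and exposed by file name. It is not the implementation of LoadDicts; the function name load_json_dicts is hypothetical, and the only assumption carried over from the example above is that each dictionary is reachable as an attribute named after its file.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def load_json_dicts(folder):
    """Load every *.json file in `folder` (hypothetical helper, not LoadDicts)."""
    loaded = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            # The file name without ".json" becomes the attribute name.
            loaded[path.stem] = json.load(fh)
    # `List` mirrors the attribute shown in the example above.
    return SimpleNamespace(List=list(loaded), **loaded)
```

Calling load_json_dicts('data_dicts/') would then give an object whose List attribute holds the file names found, sorted alphabetically.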
Load the class LegalPreprocess and instantiate it.
>>> from legal_pre_processing.legal_pre_processing import LegalPreprocess
>>>
>>> model = LegalPreprocess(domain_stopwords=dicts.LegalStopwords, tesauro=dicts.TesauroRevisado, regex_pattern=dicts.LegalRegExPatterns)
Load a PDF file with PyMuPDF (or another extractor) and run some tests:
>>> import fitz
>>>
>>> doc = fitz.open('some_pdf_file_with_legal_content.pdf')
>>> page_number = 1  # illustrative value: 1-based number of the page to read
>>> page = doc[page_number-1].get_text()
>>> print(page)
"...Com a concordância das partes foi utilizada prova emprestada em relação aos depoimentos de algumas testemunhas de defesa (decisões de 28/10/2016, 07/11/2016, de 10/11/2016 e de 09/02/2017, nos eventos 114, 175 e 199, e depoimentos nos eventos 187, 200, 287 e 513)...."
>>> page_preprocess = model.ProcessText(page)
>>> print(page_preprocess)
"...concordancia utilizada PROVA_EMPRESTADA relacao depoimentos algumas testemunhas defesa decisoes eventos depoimentos eventos..."
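Judging only from the input and output shown above, the preprocessing appears to strip accents, drop dates and numbers, remove stopwords, and join thesaurus terms into single upper-case tokens. The sketch below reproduces those visible effects with the standard library. It is not the code behind ProcessText; the function names and the tiny stopword and thesaurus samples are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics, e.g. 'concordância' becomes 'concordancia'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, stopwords, thesaurus_terms):
    """Illustrative pipeline only; the real ProcessText may work differently."""
    text = strip_accents(text.lower())
    # Replace multi-word thesaurus terms with one upper-case token,
    # longest terms first so shorter terms do not split longer ones.
    for term in sorted(thesaurus_terms, key=len, reverse=True):
        norm = strip_accents(term.lower())
        token = norm.upper().replace(" ", "_")
        text = re.sub(r"\b" + re.escape(norm) + r"\b", token, text)
    # Keeping alphabetic tokens only drops dates, numbers and punctuation.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return " ".join(t for t in tokens if t.lower() not in stopwords)


# Hypothetical sample data, far smaller than the JSON dictionaries.
sample_stopwords = {"com", "a", "das", "partes", "foi", "em", "aos", "de"}
sample_terms = {"prova emprestada"}
sample = ("Com a concordância das partes foi utilizada prova emprestada "
          "em relação aos depoimentos (decisões de 28/10/2016)")
print(preprocess(sample, sample_stopwords, sample_terms))
# concordancia utilizada PROVA_EMPRESTADA relacao depoimentos decisoes
```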
Use the available heuristics:
>>> from heuristics import Heuristics
>>> path_pdf = 'example-of-rotated-text-in-latex.pdf'
>>> h = Heuristics(path_pdf)
>>> h.set_all_heuristics()
>>> txt = h.Extract()
Class Heuristics, input parameters:
    pdf_path : str
        Path to the PDF file.
    th_font : float, optional
        Threshold (between 0 and 1) for filtering outliers of font types.
        (default is 0.9)
    th_size : float, optional
        Threshold (between 0 and 1) for filtering outliers of font sizes.
        (default is 0.9)
    filter_font_by_cum : bool, optional
        Filters outliers by the accumulated sum, for font types.
        If False, filters by individual counting. (default is True)
    filter_size_by_cum : bool, optional
        Filters outliers by the accumulated sum, for font sizes.
        If False, filters by individual counting. (default is True)
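To make the idea of filtering by the accumulated sum more concrete, here is one possible reading of it: rank font types (or sizes) by how often they occur and keep the most frequent ones until their accumulated share reaches the threshold. This is an interpretation of the parameter descriptions above, not the code used by Heuristics, and the function name keep_by_cumulative_share is hypothetical.

```python
from collections import Counter


def keep_by_cumulative_share(values, threshold=0.9):
    """Return the most frequent values whose accumulated share reaches `threshold`.

    Assumed interpretation of "filter by the accumulated sum"; the
    library may define the cut differently.
    """
    counts = Counter(values)
    total = sum(counts.values())
    kept, cumulative = set(), 0.0
    for value, count in counts.most_common():
        if cumulative >= threshold:
            break  # everything after this point is treated as an outlier
        kept.add(value)
        cumulative += count / total
    return kept


# Hypothetical font usage: one entry per text span on the page.
fonts = ["Times"] * 80 + ["Arial"] * 15 + ["Courier"] * 5
print(sorted(keep_by_cumulative_share(fonts, threshold=0.9)))
# ['Arial', 'Times']
```

With this reading, a higher threshold keeps more of the rare fonts, and a threshold of 1 keeps every font.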
- Remove duplicated phrases:
>>> h.set_filter_duplicated_phrases()
- Keep only horizontal text:
>>> h.set_let_horinzontal_text()
- Remove text set in rarely used font types:
>>> h.set_filter_outlier_font_types()
- Remove text set in rarely used font sizes:
>>> h.set_filter_outlier_font_sizes()
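As a rough illustration of the first two heuristics in the list above, the sketch below works on a dictionary shaped like the result of PyMuPDF's page.get_text("dict"), where each text line carries a "dir" pair (cosine, sine) giving its writing direction. It is a sketch under that assumption, not the code behind Heuristics; the function names and the sample dictionary are hypothetical.

```python
def horizontal_lines(page_dict, tol=1e-3):
    """Yield the text of lines written left to right."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped.
        for line in block.get("lines", []):
            cos, sin = line["dir"]
            if abs(cos - 1.0) < tol and abs(sin) < tol:
                yield "".join(span["text"] for span in line["spans"])


def drop_duplicates(phrases):
    """Keep the first occurrence of each phrase, ignoring case and spacing."""
    seen, unique = set(), []
    for phrase in phrases:
        key = " ".join(phrase.split()).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique


# Hand-made sample in the assumed PyMuPDF "dict" layout.
sample_page = {"blocks": [
    {"lines": [
        {"dir": (1.0, 0.0), "spans": [{"text": "Tribunal de Justiça"}]},
        {"dir": (0.0, -1.0), "spans": [{"text": "rotated stamp"}]},
        {"dir": (1.0, 0.0), "spans": [{"text": "tribunal  de justiça"}]},
    ]},
    {"type": 1},  # image block without text lines
]}
print(drop_duplicates(horizontal_lines(sample_page)))
# ['Tribunal de Justiça']
```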
TODO:
- Update the README with a project image and shields (see random-forest-mc).
- Activate LGTM (see random-forest-mc).
References:
- PyMuPDF documentation (based on version 1.18.15).
- Legal Thesaurus (*Vocabulário Jurídico*).