Pre-processing tools for documents with legal content.
Legal Pre-processing
Authors: Daniel Henrique Arruda Boeing and Israel Oliveira.
Usage:
Download the JSON files that can be used as examples:
$ mkdir -p data_dicts && cd data_dicts
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalRegExPatterns.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/LegalStopwords.json
$ wget https://gitlab.com/israel.oliveira.softplan/legal-pre-processing/-/raw/master/data/TesauroRevisado.json
Load the helper class and load the dictionaries:
>>> from legal_pre_processing.utils import LoadDicts
>>>
>>> dicts = LoadDicts('data_dicts/')
>>> dicts.List
['LegalRegExPatterns', 'TesauroRevisado', 'LegalStopwords']
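Conceptually, `LoadDicts` maps each JSON file in the directory to an attribute named after the file. A minimal, hypothetical sketch of that behavior (the package's actual implementation may differ):

```python
import json
import os

class LoadDictsSketch:
    """Hypothetical sketch: load every *.json file in a directory and
    expose its contents as an attribute named after the file, listing
    the loaded names in self.List."""

    def __init__(self, dict_path):
        self.List = []
        for fname in sorted(os.listdir(dict_path)):
            if fname.endswith('.json'):
                name = fname[: -len('.json')]
                with open(os.path.join(dict_path, fname), encoding='utf-8') as f:
                    setattr(self, name, json.load(f))
                self.List.append(name)
```

With the three files above in place, `dicts.LegalStopwords`, `dicts.TesauroRevisado`, and `dicts.LegalRegExPatterns` would hold the parsed JSON contents.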
Import the class LegalPreprocess and instantiate it:
>>> from legal_pre_processing.legal_pre_processing import LegalPreprocess
>>>
>>> model = LegalPreprocess(domain_stopwords=dicts.LegalStopwords, tesauro=dicts.TesauroRevisado, regex_pattern=dicts.LegalRegExPatterns)
Load a PDF file with PyMuPDF (or another extractor) and run some tests:
>>> import fitz
>>>
>>> doc = fitz.open('some_pdf_file_with_legal_content.pdf')
>>> page_number = 1
>>> page = doc[page_number - 1].get_text()
>>> print(page)
"...Com a concordância das partes foi utilizada prova emprestada em relação aos depoimentos de algumas testemunhas de defesa (decisões de 28/10/2016, 07/11/2016, de 10/11/2016 e de 09/02/2017, nos eventos 114, 175 e 199, e depoimentos nos eventos 187, 200, 287 e 513)...."
>>> page_preprocess = model.ProcessText(page)
>>> print(page_preprocess)
"concordancia utilizada PROVA_EMPRESTADA relacao depoimentos algumas testemunhas defesa decisoes eventos depoimentos eventos"
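The transformation above combines several steps: multi-word thesaurus expressions are collapsed into single uppercase tokens (e.g. "prova emprestada" becomes `PROVA_EMPRESTADA`), accents are stripped, and stopwords, digits, and regex-matched patterns (dates, event numbers) are removed. A rough, hypothetical sketch of the idea, not the package's actual implementation:

```python
import re
import unicodedata

def process_text_sketch(text, stopwords, tesauro):
    """Hypothetical sketch of the pipeline: merge thesaurus expressions
    into single uppercase tokens, strip accents, then drop digits,
    punctuation, and stopwords."""
    # Collapse multi-word legal expressions into single tokens first.
    for expression in tesauro:
        token = expression.upper().replace(' ', '_')
        text = re.sub(re.escape(expression), token, text, flags=re.IGNORECASE)
    # Strip accents (e.g. 'concordância' -> 'concordancia').
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Keep only word tokens (and the '_' of merged thesaurus terms).
    words = re.findall(r'[A-Za-z_]+', text)
    return ' '.join(w for w in words if w.lower() not in stopwords)
```

Applied to the opening of the excerpt above, this sketch already reproduces the shape of the output: `process_text_sketch("Com a concordância das partes foi utilizada prova emprestada", {'com', 'a', 'das', 'partes', 'foi'}, ['prova emprestada'])` yields `'concordancia utilizada PROVA_EMPRESTADA'`.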
Use the available heuristics:
>>> from heuristics import Heuristics
>>> path_pdf = 'example-of-rotated-text-in-latex.pdf'
>>> h = Heuristics(path_pdf)
>>> h.set_all_heuristics()
>>> txt = h.Extract()
- Remove duplicated phrases:
>>> h.set_filter_duplicated_phrases()
- Keep only horizontal text:
>>> h.set_let_horinzontal_text()
- Remove text in rarely used font types:
>>> h.set_filter_outlier_font_types()
- Remove text in rarely used font sizes:
>>> h.set_filter_outlier_font_sizes()
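The font-based filters rely on per-span font metadata, such as PyMuPDF's `page.get_text('dict')` output, where each text span carries a size and font name. A hypothetical sketch of the outlier-font-size idea, keeping only spans whose size is common across the document (the threshold and span format here are illustrative assumptions, not the package's API):

```python
from collections import Counter

def filter_outlier_font_sizes_sketch(spans, keep_ratio=0.05):
    """Hypothetical sketch: drop text spans whose font size accounts for
    less than keep_ratio of all spans. Rarely used sizes often belong to
    headers, footers, footnotes, or watermark artifacts.
    Each span is a (text, font_size) pair."""
    counts = Counter(size for _, size in spans)
    total = len(spans)
    common = {size for size, n in counts.items() if n / total >= keep_ratio}
    return [text for text, size in spans if size in common]
```

For example, in a document with twenty spans at 11 pt and one footnote at 7.5 pt, the footnote span falls below the 5% threshold and is filtered out.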
References:
- PyMuPDF documentation (based on version 1.18.15).
- Legal Thesaurus (*Vocabulário Jurídico*).
Hashes for legal-pre-processing-0.2.1.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 5e341e815bd9332a3b6befb38c79b03e6d0e158c8d77e837d5dd984ace3b09f3
MD5 | 5bbd299ad073f4f3088a0f96ef3ac5fd
BLAKE2b-256 | e01d3710e01f211edd32c7c7cc8b406354eec053c06571f4a6f756944a0670db
Hashes for legal_pre_processing-0.2.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 8f55a48562a88733fd83cbe080daaef7d3a083c97f644a5d984b165102bab638
MD5 | 7316a16c74187ce1017c56e285fb6978
BLAKE2b-256 | 1828ca343d09b14e91f5c31218f5a567186af8e8d671e89480eceb729918a58b