ALACORDER beta 72
Getting Started with Alacorder
Alacorder processes case detail PDFs into data tables suitable for research purposes. Alacorder also generates compressed text archives from the source PDFs to speed future data collection from the same set of cases.
Installation
Alacorder can run on most devices. If your device can run Python 3.11 or later, it can run Alacorder.
- On Windows, open Command Prompt and enter pip install alacorder or pip3 install alacorder.
  - To start the interface, enter python -m alacorder or python3 -m alacorder.
- On Mac, open the Terminal and enter pip install alacorder or pip3 install alacorder.
  - To start the interface, enter python -m alacorder or python3 -m alacorder.
- Install Anaconda Distribution to install Alacorder if the above methods do not work, or if you would like to open an interactive browser notebook equipped with Alacorder on your desktop.
  - After installation, create a virtual environment, open a terminal, and then repeat these instructions. If your copy of Alacorder is corrupted, use pip uninstall alacorder or pip3 uninstall alacorder and then reinstall it. There may be a newer version available.
Alacorder should automatically download and install dependencies upon setup, but you can also install the full list of dependencies yourself with pip: pandas, numpy, PyPDF2, openpyxl, xlrd, xlwt, build, setuptools, xarray, jupyter, numexpr, tabulate, matplotlib, and bottleneck.
pip uninstall -y alacorder
pip install alacorder
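Before installing, it can help to confirm that your interpreter meets the Python 3.11 requirement and to see whether Alacorder is already importable. The sketch below uses only the standard library. The helper names meets_requirement and is_installed are illustrative and are not part of Alacorder.

```python
import importlib.util
import sys

MIN_VERSION = (3, 11)  # minimum Python version stated in this guide


def meets_requirement(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def is_installed(package="alacorder"):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package) is not None


print("Python version OK:", meets_requirement())
print("alacorder installed:", is_installed())
```

If the first line reports False, upgrade Python before running pip. If the second reports False after installation, pip may have installed into a different interpreter than the one you are running.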
Using the guided interface
Once you have a Python environment up and running, you can launch the guided interface in two ways:
- Run the module from your command line: Depending on your Python configuration, enter python -m alacorder or python3 -m alacorder to launch the command line interface in module mode.
- Import the alacorder module in Python: Use the import statement from alacorder import __main__ to start the command line interface.
Alacorder can be used without writing any code, and exports to common formats like Excel (.xls, .xlsx), Stata (.dta), CSV (.csv), and JSON (.json).
- Alacorder compresses case text into pickle archives (.pkl.xz) to save storage and processing time. If you need to unpack a pickle archive without importing alac, use a .xz compression tool, then read the pickle into Python with the standard library module pickle.
- Once installed, enter python -m alacorder or python3 -m alacorder to start the interface. If you are using IPython, launch the IPython shell and enter from alacorder import __main__ to launch the guided interface.
from alacorder import __main__
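The note above about unpacking a .pkl.xz archive without alac can be sketched with the standard library modules lzma and pickle. This is a minimal illustration, not Alacorder's own loader: read_archive is a hypothetical helper, and the sample dictionary is a made-up stand-in so the example runs without any case data.

```python
import lzma
import pickle
import tempfile
from pathlib import Path


def read_archive(path):
    """Decompress an xz-compressed pickle and return the stored object."""
    with lzma.open(path, "rb") as handle:
        return pickle.load(handle)


# Build a small stand-in archive so the example runs without Alacorder.
sample = {"case_number": "PLACEHOLDER-0001", "text": "placeholder case text"}
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "sample.pkl.xz"
    with lzma.open(archive_path, "wb") as handle:
        pickle.dump(sample, handle)
    restored = read_archive(archive_path)

print(restored == sample)
```

If the archive holds a pandas object, pandas must be installed for pickle to rebuild it. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.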
Special Queries with alac
For more advanced queries, the alac module can extract fields and tables from case records with just a few lines of code.
- Call alac.config(input_path, table_path = '', archive_path = '') and assign it to a variable to hold your configuration object. This tells the imported Alacorder methods where and how to input and output. If table_path and archive_path are left blank, alac.parse…() methods will print to console and return the DataFrame object.
- Call alac.writeArchive(config) to export a full text archive. It's recommended that you create a full text archive (.pkl.xz) file before making tables from your data. Full text archives can be scanned faster than PDF directories and require significantly less storage. Full text archives can be imported to Alacorder the same way as PDF directories.
- Call alac.parseTables(config) to export detailed case information tables. If export type is .xls or .xlsx, the cases, fees, and charges tables will be exported.
- Call alac.parseCharges(config) to export the charges table only.
- Call alac.parseFees(config) to export the fees table only.
- Call alac.parseCases(config) to export case information without charge information.
import warnings
warnings.filterwarnings('ignore')
from alacorder import alac
pdf_directory = "/Users/crimson/Desktop/Tutwiler/"
archive = "/Users/crimson/Desktop/Tutwiler.pkl.xz"
tables = "/Users/crimson/Desktop/Tutwiler.xlsx"
# make full text archive from PDF directory
# archive_path is the third parameter of alac.config(), so pass it by keyword
c = alac.config(pdf_directory, archive_path=archive)
alac.writeArchive(c)
print("Full text archive complete. Now processing case information into tables at " + tables)
# then scan full text archive for spreadsheet
d = alac.config(archive, tables)
alac.parseTables(d)
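The example above hardcodes three related paths. As a small convenience, the archive and table paths can be derived from the PDF directory with pathlib. This is a sketch only: derive_paths is a hypothetical helper, not an Alacorder function, and its naming scheme simply mirrors the paths used above.

```python
from pathlib import PurePosixPath


def derive_paths(pdf_directory, table_ext=".xlsx"):
    """Return (archive_path, table_path) named after the PDF directory."""
    # PurePosixPath keeps the example deterministic on any OS;
    # use pathlib.Path for real paths on your own machine.
    directory = PurePosixPath(pdf_directory)
    archive = directory.with_name(directory.name + ".pkl.xz")
    tables = directory.with_name(directory.name + table_ext)
    return str(archive), str(tables)


archive, tables = derive_paths("/Users/crimson/Desktop/Tutwiler/")
print(archive)
print(tables)
```

The two returned strings can then be passed to alac.config() in place of the handwritten archive and tables variables.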
Custom Parsing with alac.parse()
If you need to conduct a custom search of case records, Alacorder has the tools you need to extract custom fields from case PDFs without any fuss. Try out alac.parse() to search thousands of cases in seconds.
from alacorder import alac
import re
archive = "/Users/crimson/Desktop/Tutwiler.pkl.xz"
tables = "/Users/crimson/Desktop/Tutwiler.xlsx"
def findName(text):
    # Look for the name after "VS." or "V." first
    name = ""
    match = re.search(r'(?a)(VS\.|V\.{1})(.+)(Case)*', text, re.MULTILINE)
    if match:
        name = match.group(2).replace("Case Number:", "").strip()
    else:
        # Otherwise fall back to the text between "DOB" and "Name"
        match = re.search(r'(?:DOB)(.+)(?:Name)', text, re.MULTILINE)
        if match:
            name = match.group(1).replace(":", "").replace("Case Number:", "").strip()
    return name
c = alac.config(archive, tables)
alac.parse(c, findName)
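A custom function passed to alac.parse() receives the full text of one case and returns the extracted value, as findName does above. The sketch below follows the same pattern for a different field. The function name findDOB, the regular expression, and the sample text are all assumptions for illustration. Real case text may be laid out differently, so test the pattern against your own records first.

```python
import re

# Assumed layout: "DOB:" followed by a date in MM/DD/YYYY form.
DOB_PATTERN = re.compile(r"DOB:\s*(\d{2}/\d{2}/\d{4})")


def findDOB(text):
    """Return the first date that follows 'DOB:' in the text, or '' if none."""
    match = DOB_PATTERN.search(text)
    return match.group(1) if match else ""


# Made-up text standing in for the text of one case.
sample_text = "STATE OF ALABAMA VS. DOE JOHN\nDOB: 01/02/1980 Name: DOE JOHN"
print(findDOB(sample_text))
```

Under the same assumptions, the function would be handed to the parser the same way as findName, with alac.parse(c, findDOB).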
Method | Description |
---|---|
getPDFText(path) -> text | Returns full text of case |
getCaseInfo(text) -> [case_number, name, alias, date_of_birth, race, sex, address, phone] | Returns basic case details |
getFeeSheet(text, cnum = '') -> [total_amtdue, total_balance, total_d999, feecodes_w_bal, all_fee_codes, table_string, feesheet: pd.DataFrame()] | Returns fee sheet and summary as str and pd.DataFrame |
getCharges(text, cnum = '') -> [convictions_string, disposition_charges, filing_charges, cerv_eligible_convictions, pardon_to_vote_convictions, permanently_disqualifying_convictions, conviction_count, charge_count, cerv_charge_count, pardontovote_charge_count, permanent_dq_charge_count, cerv_convictions_count, pardontovote_convictions_count, charge_codes, conviction_codes, all_charges_string, charges: pd.DataFrame()] | Returns charges table and summary as str, int, and pd.DataFrame |
getCaseNumber(text) -> case_number | Returns case number |
getName(text) -> name | Returns name |
getFeeTotals(text) -> [total_row, tdue, tpaid, tbal, tdue] | Returns totals without parsing fee sheet |
Working with case data in Python
Out of the box, Alacorder exports to .xls, .xlsx, .csv, .json, and .dta. But you can use alac, pandas, and other Python libraries to create your own data collection workflows and design custom exports.
The snippet below reads a directory of case PDFs and, for each case, prints the convictions string (the first value returned by getCharges) as it goes.
from alacorder import alac
c = alac.config("/Users/crimson/Desktop/Tutwiler/","/Users/crimson/Desktop/Tutwiler.xls")
for path in c['contents']:
    text = alac.getPDFText(path)
    cnum = alac.getCaseNumber(text)
    charges_outputs = alac.getCharges(text, cnum)
    # charges_outputs[0] is the convictions string; skip cases where it is empty
    if len(charges_outputs[0]) > 1:
        print(charges_outputs[0])
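To keep results instead of printing them, a loop like the one above can collect one row per case and write the rows out at the end. The sketch below shows only the collecting and writing step, using the standard library csv module. The rows hold made-up placeholder values standing in for what alac.getCaseNumber() and alac.getCharges() would return, and write_rows is a hypothetical helper.

```python
import csv
import io


def write_rows(rows, handle):
    """Write (case_number, convictions) pairs to an open text handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["case_number", "convictions"])
    writer.writerows(rows)


# Placeholder rows; in a real loop these would be (cnum, charges_outputs[0]).
rows = [
    ("CASE-0001", "PLACEHOLDER CHARGE A"),
    ("CASE-0002", "PLACEHOLDER CHARGE B; PLACEHOLDER CHARGE C"),
]

# An in-memory buffer keeps the example self-contained; to write a file,
# pass open("convictions.csv", "w", newline="") instead.
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Collecting rows first and writing once keeps file handling out of the parsing loop, which makes it easier to switch the output format later.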
Extending Alacorder with pandas and other tools
Alacorder runs on pandas, a Python library you can use to perform calculations, process text data, and make tables and charts. pandas can read from and write to all major data storage formats, and it can connect to a wide variety of services for easy export. When Alacorder table data is exported to .pkl.xz, it is stored as a pd.DataFrame and can be imported into other Python modules and scripts with pd.read_pickle() like below:
import pandas as pd
contents = pd.read_pickle("/path/to/pkl")
If you would like to visualize data without exporting to Excel or another format, create a Jupyter notebook and import a data visualization library like matplotlib. The resources below can help you get started. Jupyter lets you create interactive notebooks for data analysis and other purposes. It can be installed using pip install jupyter or pip3 install jupyter and launched using jupyter notebook. Your device may already be equipped to view .ipynb notebooks.
Resources
- pandas cheat sheet
- regex cheat sheet
- anaconda (tutorials on python data analysis)
- The Python Tutorial
- jupyter introduction
© 2023 Sam Robson