Excel spreadsheet crawler and table parser for data discovery, extraction, and querying
Project description
eparse
Description
Excel spreadsheet crawler and table parser for data extraction and querying.
Features
Command-line interface
Recursive Excel file discovery
Sub-tabular data extraction (logical tables)
SQLite and PostgreSQL database interfaces
CLI query tool
Summary data metrics
Installation
To install eparse, you can use pip and the latest version on PyPI:
$ pip install eparse
Or you can clone this repo and install from source:
$ git clone https://github.com/ChrisPappalardo/eparse.git
$ cd eparse
$ pip install .
If you plan to use the postgres interface, you also need to install the postgres package psycopg2. Instructions can be found here. This package is optional, and you can use the other interfaces such as the SQLite3 interface without having to install psycopg2.
The easiest way to install the psycopg2 package for your particular environment may be to install the pre-compiled binary driver as follows:
$ pip install psycopg2-binary
If you see an error while trying to use a postgres endpoint such as postgres://user:pass@host:port/my_db that mentions the postgres driver is missing, then you know you haven’t properly installed (and compiled) psycopg2.
Usage
eparse can be used as either a python library or from the command-line. You can view supported CLI commands and usage with --help as follows:
$ eparse --help
Usage: eparse [OPTIONS] COMMAND [ARGS]...
excel parser
Options:
-i, --input TEXT input source
-o, --output TEXT output destination
-f, --file TEXT file(s) or dir(s) to target
-d, --debug use debug mode
-l, --loose find tables loosely
-r, --recursive find files recursively
-t, --truncate truncate dataframe output
-v, --verbose increase output verbosity
--help Show this message and exit.
Commands:
migrate migrate eparse table
parse parse table(s) found in sheet for target(s)
query query eparse output
scan scan for excel files in target
You can also use eparse from python like so:
from eparse.core import get_df_from_file
print([table for table in get_df_from_file('myfile.xlsx')])
102 Date Principal Repayment Date Principal Repayment
103 44834 700757.679004 44926 430013.148303
104 44926 71957.776108 45016 100576.127808
105 45016 147578.19262 45107 898008.340095
106 45107 32801.363072 45199 841656.13896
...
Scan
To scan one or more directories for Excel files with descriptive information, you can use the scan command like so:
$ eparse -v -f <path_to_files> scan
Increase the verbosity with additional flags, such as -vvv, for more descriptive information about the file(s), including sheet names.
Parse
Excel files can be parsed as follows:
$ eparse -v -f <path_to_files> parse
This mode will list each table found in each Excel file to the command-line. This mode is useful for initial discovery for parseable data.
eparse uses a simple algorithm for identifying tables. Table “corners” are identified as cells that contain empty cells above and to the right (or sheet boundaries). A densely or sparsely populated 2x2 table must follow in order for data to be extracted in relation to that cell. eparse will automatically adjust for rowspan labels and empty table corners and the dense vs. sparse criterion can be controlled with the --loose flag.
eparse was written to accomodate various types of output formats and endpoints, including null:///, stdout:///, sqlite3:///db_name, and postgres://user:password@host:port/db_name.
null
This mode is useful for validating files and generating descriptive info, and is the default. The command above with -v is an example of this mode, which lists out the tables found.
stdout
This mode is good for viewing data extracted from Excel files in the console. For example, you could view all tables found in Sheet1 with the following command:
$ eparse -f <path_to_files> -o stdout:/// parse -s "Sheet1"
eparse uses pandas to handle table data. You can view larger tables without truncation using the -t flag as follows:
$ eparse -t -f <path_to_files> -o stdout:/// parse -s "Sheet1"
Data in table format is useful for human viewing, but a serialized form is better for data interfacing. Serialize your output with the -z flag as follows:
$ eparse -t -f <path_to_files> -o stdout:/// parse -z
Each cell of extracted table data is serialized as follows:
row - 0-indexed table row number
column - 0-indexed table column number
value - the value of the cell as a str
type - the implied python type of the data found
c_header - the column header
r_header - the row header
excel_RC - the RC reference from the spreadsheet (e.g. B10)
sheet - the name of the sheet
f_name - the name of the file
sqlite3
eparse uses the peewee package for ORM and database integration. The interfaces module contains an ExcelParse model that provides data persistence and a common interface.
To create a SQLite3 database with your parsed Excel data, use the following command:
$ mkdir .files
$ eparse -f <path_to_files> -o sqlite3:/// parse -z
This command will automatically generate a unique database filename using the uuid python package in the .files/ sub-directory of the working directory. You may need to create this directory before running this command, as shown.
You can also specify a path and filename of your choosing, like so:
$ mkdir .files
$ eparse -f <path_to_files> -o sqlite3:///path/filename.db parse -z
postgres
eparse also supports postgresql integrations. As mentioned above, you will need psycopg2 installed for postgresql integrations to work. The eparse BaseDatabaseInterface abstracts the implementation details, so you would use this interface the same way you use the others, with the exception of the endpoint.
To use a postgresql database as the source and/or destination of your data, you would supply an --input and/or --output endpoint to the tool as follows:
$ eparse -o postgres://user:password@host:port/db_name ...
Where details like user, host, port are provided to you by your db administrator. eparse will create the necessary table(s) and indexes for you when inserting data into the database.
Query
Once you have stored parsed data, you can begin to query it using the peewee ORM. This can be done with the tool or directly with the database.
For example, query distinct column header names from a generated SQLite3 database as follows:
$ eparse -i sqlite3:///.files/<db_file> -o stdout:/// query -m get_c_header
c_header Total Rows Data Types Distinct Values
0 ABC-col 150 2 76
1 DEF-col 3981 3 15
2 GHI-col 20 1 2
.. ... ... ... ...
This command will give descriptive information of each distinct c_header found, including total rows, unique data types, and distinct values.
You can also get raw un-truncated data as follows:
$ eparse -t -i sqlite3:///.files/<db_file> -o stdout:/// query
Filtering data on content is easy. Use the --filter option as follows:
$ eparse -i sqlite3:///.files/<db_file> -o stdout:/// query --filter f_name "somefile.xlsx"
The above command will filter all rows from an Excel file named somefile.xlsx. You can use any of the following django-style filters:
__eq equals X
__lt less than X
__lte less than or equal to X
__gt greater than X
__gte greater than or equal to X
__ne not equal to X
__in X is in
__is is X
__like like expression, such as %somestr%, case sensitive
__ilike like expression, such as %somestr%, case insensitive
__regexp regular expression matching such as ^.*?foo.*?$
Filters are applied to the ORM fields like so:
--filter row__gte 4 all extracted table rows >= 5
--filter f_name__ilike "%foo%" all data from filenames with foo
--filter value__ne 100 all data with values other than 100
Queried data can even be stored into a new database for creating curated data subsets, as follows:
$ eparse -i sqlite3:///.files/<db_file> \
-o sqlite3:///.files/<subq_db_file> \
query --filter f_name "somefile.xlsx"
Since database files the tool generates when using sqlite3:/// are SQLite native, you can also use SQLite database client tools and execute raw SQL like so:
$ sudo apt-get install -y sqlite3-tools
$ sqlite3 .files/<db_file>
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "excelparse" ("id" INTEGER NOT NULL PRIMARY KEY, "row" INTEGER NOT NULL, "column" INTEGER NOT NULL, "value" VARCHAR(255) NOT NULL, "type" VARCHAR(255) NOT NULL, "c_header" VARCHAR(255) NOT NULL, "r_header" VARCHAR(255) NOT NULL, "excel_RC" VARCHAR(255) NOT NULL, "name" VARCHAR(255) NOT NULL, "sheet" VARCHAR(255) NOT NULL, "f_name" VARCHAR(255) NOT NULL);
sqlite> .header on
sqlite> SELECT * FROM excelparse LIMIT 1;
id|row|column|value|type|c_header|r_header|excel_RC|name|sheet|f_name
1|0|0|ABC|<class 'str'>|SomeCol|SomeRow|B2|MyTable|Sheet1|myfile.xlsm
Migrate
eparse wouldn’t be a solid tool without the ability to migrate your eparse databases for future code changes. You can apply migrations that ship with future versions of eparse as follows:
$ eparse -i sqlite3:///.files/<db_file> migrate -m <migration>
applied <migration>
It is up to you to determine the migrations you need based on the eparse version you are upgrading from and to. Migrations can be found in eparse/migrations.py
Unstructured
If you would like to use eparse to partition xls[x] files alongside unstructured, you can do so with our contributed partition and partition_xlsx modules. Simply import the partition function from eparse.contrib.unstructured.partition and use it instead of partition from unstructured.partition.auto like so:
from eparse.contrib.unstructured.partition import partition
elements = partition(filename='some_file.xlsx', eparse_mode='...')
Valid eparse_mode settings are available in eparse.contrib.unstructured.xlsx._eparse_modes.
Contributing
As an open-source project, contributions are always welcome. Please see Contributing for more information.
License
eparse is licensed under the MIT License. See the LICENSE file for more details.
Contact
Thanks for your support of eparse. Feel free to contact me at cpappala@gmail.com or connect with me on Github.
History
0.7.3 (2023-08-23)
Updated partition and partition_xlsx for latest unstructured
0.7.2 (2023-08-22)
Relaxed unstructured dependency requirement for compatibility
0.7.1 (2023-08-20)
Added unstructured contrib to core package
Uopdaed README with instructions
0.7.0 (2023-08-18)
Added html conversion functions and an html data interface
Added get_table_digest to core
0.6.3 (2023-06-30)
Updated get_df_from_file to accept filename or io object
0.6.1 (2023-06-22)
Added get_df_from_file helper function
Added contrib with unstructured partition handler
0.5.0 (2023-06-14)
Optimized database interfaces
0.4.0 (2023-06-14)
Added postgresql interface
0.3.0 (2023-06-14)
Added inteface classes
Added migrations
0.2.0 (2023-06-12)
Added migrate command
Added 0.1.2 to 0.2.0 migration
0.1.2 (2023-06-07)
Updated README
0.1.1 (2023-06-06)
Updated requirements
Updated README
0.1.0 (2023-06-06)
First release on PyPI.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file eparse-0.7.3.tar.gz
.
File metadata
- Download URL: eparse-0.7.3.tar.gz
- Upload date:
- Size: 82.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2b0dab30ff0683939545a9926656f5a245ceade2988b011818a055d8ee276cb1 |
|
MD5 | dd520d71e242a256243cbb34908ad56c |
|
BLAKE2b-256 | c3a4c3a7cf1f6cfe980ee0c957a629c5e07705d6a94ce06c81b361d4faaa59d8 |
File details
Details for the file eparse-0.7.3-py2.py3-none-any.whl
.
File metadata
- Download URL: eparse-0.7.3-py2.py3-none-any.whl
- Upload date:
- Size: 19.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b4b104174f34c7b983d35d2ce161219428348239c83d87e2c8bffd79b68f172b |
|
MD5 | 4af60541a6f26804a87c3abd2cefb122 |
|
BLAKE2b-256 | 26e8acf68d42e11c192225db3c32be6b6690841d26caff9ee3482b6ac0cd37b4 |