
Save large amounts of data to small xlsx files

Project description

Yet another Python XLSX writer

... yes, this is reinventing the wheel again and again, but ...

So I decided to write my own xlsx export for two reasons:

First and foremost, the two existing engines available in pandas that I use (openpyxl, xlsxwriter) do not store large files efficiently. The problem arises when I must load a large number of records, sometimes beyond the Excel limit (2^20 = 1048576 rows), and then send the result over email or Teams (quite often the easiest way to share data...). The files get way too big.

Secondly, I just want to understand xlsx internals and use the simplest possible code to handle the files. As a side effect, it is simpler and faster than using some other libraries.

As a simple benchmark, consider a sample file of 700k+ records and 18 columns. Standard pandas creates a file of about 40MB. The simple_xlsx_writer file is as small as 8MB, which makes it much more email friendly.

(Of course, when the file is saved again after modification in Excel it gets much bigger, but that's not the point.)

Usage

The project consists of submodules:

  • writer
  • postgresql_handler
  • oracle_handler

writer

This generic module exposes function(s) to write raw data (an array of arrays) into an Excel file.

This should be clear from the helper function:

def write_dummy(base_path: str, target_name: str) -> None:
    data = [["A", "B", "C"], ["TEST", 1.23, "2024-10-01 12:34:56"], ["TEST", 200, "2024-10-01 12:34:56"]]
    write_raw_data(base_path, target_name, data)

Note that the only supported data types are str, int and float, which reflects the way data is stored in the xlsx file.

So you may have to prepare the input array yourself or use the other submodules (see below).

There is also the helper function write_dummy, shown above, that saves a predefined tiny file under the given name.
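If you prepare the input array yourself, values outside the supported types have to be coerced first. A minimal sketch of such a preprocessing step (an illustrative helper, not part of the package):

```python
from datetime import datetime
from decimal import Decimal

def normalize_row(row):
    # Illustrative helper (not part of the package): coerce arbitrary
    # values to the str/int/float types the writer accepts.
    out = []
    for value in row:
        if value is None:
            out.append("")                      # no None support -> empty string
        elif isinstance(value, bool):
            out.append(str(value))              # bool is an int subclass; keep it readable
        elif isinstance(value, (str, int, float)):
            out.append(value)                   # already supported
        elif isinstance(value, Decimal):
            out.append(float(value))            # exact decimals -> float
        elif isinstance(value, datetime):
            out.append(value.strftime("%Y-%m-%d %H:%M:%S"))
        else:
            out.append(str(value))              # last resort: string representation
    return out

print(normalize_row([None, Decimal("1.5"), datetime(2024, 10, 1, 12, 34, 56)]))
# → ['', 1.5, '2024-10-01 12:34:56']
```

Apply it row by row before passing the array to write_raw_data.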

postgresql_handler

If you use a PostgreSQL database, you can use a helper function that reads a query result into the required structure.

First of all, you may wish to verify the connection. I prefer to do it this way:

print("db time: " + postgresql_handler.get_sysdate(username, password, host, port, dbname).strftime("%Y-%m-%d %H:%M:%S"))

To save query results run:

postgresql_handler.write_query(query, base_path, "all_tables_pg", username, password, host, port, dbname)

oracle_handler

If you use an Oracle database, the usage is very similar to PostgreSQL, except for naming and the parameter list.

You may verify the connection using:

print("db time: " + oracle_handler.get_sysdate(username, password, dsn).strftime("%Y-%m-%d %H:%M:%S"))

To save query results run:

oracle_handler.write_query(query, base_path, "all_tables", username, password, dsn)

Example

See: main_pg.py

...

username = input("username: ")
password = getpass.getpass()
host = input("host: ")
port = int(input("port: "))
dbname = input("database name: ")

# verify connection
print("db time: " + postgresql_handler.get_sysdate(username, password, host, port, dbname).strftime("%Y-%m-%d %H:%M:%S"))

# fetch all tables' metadata
query = "select * from information_schema.tables"
base_path = os.path.dirname(__file__)
postgresql_handler.write_query(query, base_path, "all_tables_pg", username, password, host, port, dbname)

...

See: main_ora.py

...    

username = input("username: ")
password = getpass.getpass()
dsn = input("DSN: ")

# verify connection
print("db time: " + oracle_handler.get_sysdate(username, password, dsn).strftime("%Y-%m-%d %H:%M:%S"))

# fetch all tables' metadata
query = "select * from all_tables"
base_path = os.path.dirname(__file__)
oracle_handler.write_query(query, base_path, "all_tables", username, password, dsn)

...

CSV conversion

From time to time I come across large CSV files. Loading one into Excel using Power Query generally works fine, but what if you have to load it over a slow network? The solution is to convert the whole file into a relatively small xlsx file.

Example

...

custom_params = {
    "debug_info_every_rows": 100000,
    "row_limit": 1000000,
    "row_limit_exceed_strategy": "sheets" # truncate / files / sheets
}
base_path = os.path.dirname(__file__)
writer.convert_csv("c:\\some_path\\A_VERY_LARGE_CSV.csv", base_path,
                   "converted_from_csv.xlsx", debug=True, custom_params=custom_params)

...

The debug option is particularly useful when you want to confirm that the data load is progressing.

Exceeding Excel row limit

If your input data exceeds the Excel row limit, or you just want to divide it into smaller chunks, you may use one of two strategies (see: configuration):

  • use files - save each chunk into a separate xlsx file
  • use sheets - save each chunk into a separate worksheet within a single xlsx file
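Conceptually, both strategies boil down to slicing the data into row_limit-sized chunks, each destined for its own file or worksheet. A sketch of the idea (the package's internal logic may differ):

```python
def chunk_rows(rows, row_limit):
    # Conceptual sketch of the "files"/"sheets" strategies: slice the data
    # into chunks of at most row_limit rows; each chunk becomes a separate
    # xlsx file or worksheet, numbered from 1.
    return [rows[i:i + row_limit] for i in range(0, len(rows), row_limit)]

chunks = chunk_rows(list(range(2_500_000)), 1_000_000)
print(len(chunks), [len(c) for c in chunks])
# → 3 [1000000, 1000000, 500000]
```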

Configuration

You may customize operations using the custom_params parameter. The available options are:

| Parameter | Default value | Description |
|---|---|---|
| sheet_name | data | sheet name, or sheet name prefix for multiple sheets (numbered from 1) |
| python_date_format | %Y-%m-%d | date format used when converting data loaded from a database |
| python_datetime_format | %Y-%m-%d %H:%M:%S | as above, but for datetime values |
| python_datetime_remove_zeros | True | whether an empty time part should be removed (to save space and improve readability) |
| python_datetime_remove_zeros_pattern | " 00:00:00" | the pattern of the empty time part to be removed |
| headers | True | whether the input data contains a header row |
| row_limit | 1048575 (2^20-1) | row limit (unchecked!) |
| row_limit_exceed_strategy | truncate | what to do with data exceeding the row limit: truncate (drop data beyond the limit), files (generate multiple files, numbered from 1), or sheets (generate multiple sheets, numbered from 1) |
| debug_info_every_rows | 10000 | print debug info every X rows; applies only when debug=True |
| csv_delimiter | , (comma) | CSV field delimiter |
| csv_quote | " (double quote) | CSV quote character |
| csv_encoding | utf-8 | CSV file encoding |
| file_encoding | utf-8 | target file encoding; may be useful when using special characters |
| oracle_client_mode | thin | Oracle client mode; thick may be required in some setups |

You provide custom options as a Python dict:

custom_params = {
    "sheet_name": "events_",
    "debug_info_every_rows": 100000,
    "row_limit": 1000000,
    "row_limit_exceed_strategy": "sheets"
}
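Presumably, any option you pass overrides the documented default; the rest keep their default values. A hypothetical illustration of that merge (the names match the table above, but the merging logic itself is an assumption, not the package's actual code):

```python
# Defaults taken from the configuration table above (subset).
DEFAULTS = {
    "sheet_name": "data",
    "headers": True,
    "row_limit": 2**20 - 1,
    "row_limit_exceed_strategy": "truncate",
    "debug_info_every_rows": 10000,
}

def effective_params(custom_params=None):
    # Hypothetical merge: start from defaults, let user options win.
    params = dict(DEFAULTS)
    params.update(custom_params or {})
    return params

print(effective_params({"row_limit": 1000000})["row_limit"])
# → 1000000
```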

Installation

Install package using pip:

pip install simple-xlsx-writer

If you wish to use Oracle and/or PostgreSQL connectivity, add the corresponding option(s):

pip install simple-xlsx-writer[postgresql]
pip install simple-xlsx-writer[oracle]

These add the required dependencies: psycopg and oracledb respectively. You can always install them yourself.

To verify installation run:

import os
from simple_xlsx_writer import writer

base_path = os.path.dirname(__file__) # or provide explicit path in interactive mode
writer.write_dummy(base_path, "dummy01")

You should find a dummy01.xlsx file in the given directory, containing:

A B C
TEST 1.23 2024-10-01 12:34:56
TEST 200 2024-10-01 12:34:56
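Since an xlsx file is just a zip archive of XML parts, you can sanity-check the generated file without opening Excel. A small check like this (an illustrative helper, not part of the package) confirms the basic structure:

```python
import zipfile

def looks_like_xlsx(path):
    # An .xlsx file is a zip archive of XML parts; a quick structural check
    # is to confirm the two parts every workbook must contain.
    try:
        with zipfile.ZipFile(path) as z:
            names = set(z.namelist())
    except zipfile.BadZipFile:
        return False
    return "[Content_Types].xml" in names and "xl/workbook.xml" in names
```

For example, `looks_like_xlsx(os.path.join(base_path, "dummy01.xlsx"))` should return True for the file generated above.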

Memory considerations

Keep in mind that the module is performance and size optimized, NOT memory optimized. The complete dataset (whether from a database or a CSV file) is loaded into memory in order to calculate shared strings.
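The shared-strings table that drives the size savings works roughly like this (a simplified sketch; in the real file the table is written as sharedStrings.xml inside the archive). Building it requires seeing every string, which is why the whole dataset is held in memory:

```python
def build_shared_strings(rows):
    # Simplified sketch of the xlsx shared-strings idea: each distinct
    # string is stored once in a lookup table and cells reference it by
    # index, so repeated text costs almost nothing.
    table = {}
    for row in rows:
        for value in row:
            if isinstance(value, str) and value not in table:
                table[value] = len(table)
    return table

rows = [["TEST", 1.23], ["TEST", 200], ["OTHER", 1.23]]
print(build_shared_strings(rows))
# → {'TEST': 0, 'OTHER': 1}
```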

For extremely large files (in Excel terms, not extremely large in general) the memory requirements are substantial.

One real-life example is the conversion of a 1GB CSV file (2.5M rows and 60 columns with lots of text repetitions, Windows 11).

Loading the file into memory takes ~6GB; slicing it into separate sheets briefly bumps memory usage up to as much as 10GB. The resulting single xlsx file is about 60MB (6% of the original). The process takes about 8 minutes (13th-gen i5).
