Facade to collect rows one-by-one into a Polars DataFrame (in the least-bad way)

polars-row-collector

Getting Started Example

Add the library to your dependencies:

uv add polars_row_collector

import polars as pl
from polars_row_collector import PolarsRowCollector

collector = PolarsRowCollector(
    # Note: Schema is optional, but recommended.
    schema={"col1": pl.Int64, "col2": pl.Float64}
)

for item in items:  # `items` is any iterable of your own records
    row = {
        "col1": item.value1,
        "col2": item.value2,
    }
    collector.add_row(row)

df = collector.to_df()

You can think of collector as filling the same niche as the following alternatives:

  • list_of_dfs: list[pl.DataFrame]
  • list_of_dicts: list[dict[str, Any]], then pl.from_dicts(list_of_dicts)

Features

  • Highly performant and memory-optimized.
    • 93% lower memory usage compared to a list-of-dicts approach.
  • Optionally supply a schema for the incoming rows.
  • Thread-safe when the GIL is enabled (the default in Python <= 3.15).
  • Configuration arguments for safety vs. performance tradeoffs:
    • Behaviour if there are missing columns: Enforce all columns present or allow missing columns.
    • Behaviour if there are extra columns: Drop silently or raise.
    • Maintain insertion order.
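
The missing-column and extra-column behaviours listed above can be pictured with a small pure-Python sketch. The function name normalize_row and its parameters allow_missing and on_extra are hypothetical, chosen only for illustration; this page does not name the library's actual configuration arguments, so treat this as a picture of the trade-off rather than the library's API.

```python
from typing import Any


def normalize_row(
    row: dict[str, Any],
    columns: list[str],
    allow_missing: bool = False,  # hypothetical flag, not the library's API
    on_extra: str = "raise",      # hypothetical option: "raise" or "drop"
) -> dict[str, Any]:
    """Return a row containing exactly `columns`, in schema order."""
    if on_extra not in ("raise", "drop"):
        raise ValueError(f"on_extra must be 'raise' or 'drop', got {on_extra!r}")

    # Strict mode: every schema column must be present in the row.
    missing = [c for c in columns if c not in row]
    if missing and not allow_missing:
        raise KeyError(f"missing columns: {missing}")

    # Extra keys either abort the insert or are dropped below.
    extra = [k for k in row if k not in columns]
    if extra and on_extra == "raise":
        raise KeyError(f"unexpected columns: {extra}")

    # Missing values become None; keys outside the schema are left out.
    return {c: row.get(c) for c in columns}


schema_columns = ["col1", "col2"]
strict = normalize_row({"col1": 1, "col2": 2.5}, schema_columns)
lenient = normalize_row(
    {"col1": 1, "debug": "x"},
    schema_columns,
    allow_missing=True,
    on_extra="drop",
)
```

The strict settings catch malformed rows at the point of insertion, at the cost of a few membership checks per row; the lenient settings skip the failure but can hide upstream bugs.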

Example Applications

  • Gathering data in a web scraping/parsing tool.
  • Gathering/batching incoming log messages or event logs before writing in bulk to some destination.
  • Gathering data in a markup/document parsing pipeline (e.g., XML with lots of conditionals).

Benchmarks

  • Benchmark: Collecting 50M rows. Each row has 3 columns.
    • Average Speed: 0.42µs/row for both (consistent).
      • Conclusion: No additional elapsed runtime overhead.
    • Peak memory usage: 93% decrease compared to a naive implementation.
      • Baseline (list-of-dicts): 26,011.93 MiB
      • PolarsRowCollector: 1,860.16 MiB

Baseline (list-of-dicts)

> COLLECT_MODE=dicts uv run perf_scripts/perf_test_script.py

Collected DataFrame. Current RSS: 26,011.93 MiB | Peak RSS: 26,011.93 MiB
Final overall time per row: 0.42µs/row

PolarsRowCollector

> COLLECT_MODE=prc uv run perf_scripts/perf_test_script.py

Collected DataFrame. Current RSS: 1,860.16 MiB | Peak RSS: 1,860.16 MiB
Final overall time per row: 0.42µs/row
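
The 93% figure can be checked directly from the two peak RSS values reported above. The short calculation below also converts them to bytes per row, assuming the 50M-row count from the benchmark description; since RSS covers the whole process (interpreter included), the per-row numbers are rough upper bounds rather than exact storage costs.

```python
# Peak RSS values copied from the benchmark output above, in MiB.
baseline_mib = 26_011.93   # list-of-dicts
collector_mib = 1_860.16   # PolarsRowCollector
n_rows = 50_000_000        # row count stated in the benchmark description

# Relative reduction in peak memory, and the baseline-to-collector ratio.
reduction = 1 - collector_mib / baseline_mib
ratio = baseline_mib / collector_mib

# Whole-process bytes per collected row (1 MiB = 1,048,576 bytes).
baseline_bytes_per_row = baseline_mib * 1_048_576 / n_rows
collector_bytes_per_row = collector_mib * 1_048_576 / n_rows

print(f"Memory reduction: {reduction:.1%}")
print(f"Baseline uses {ratio:.1f}x the memory of the collector")
print(f"Baseline:  {baseline_bytes_per_row:.1f} bytes/row")
print(f"Collector: {collector_bytes_per_row:.1f} bytes/row")
```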

Future Features

  • Intermediate on-disk storage in temporary Parquet files, to support larger-than-memory collections.
  • Further optimize appending many rows at once.
  • Read the DataFrame collected so far while still gathering rows.
  • Documentation.

Disclaimer

As the project's description says, this is the "least-bad way" to accomplish this pattern.

If you can structure your code so that you are not collecting individual rows of a DataFrame, you are likely better off doing so (e.g., by collecting a list[pl.DataFrame]).

However, there are always exceptions to the best practices. In those cases, this library is an ideal choice, and is significantly more memory-efficient than collecting into a list[dict[str, Any]] then converting to a DataFrame later.
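
To give a feel for why a list[dict[str, Any]] is memory-hungry, the standard-library sketch below compares one dict per row against one typed buffer per column. It does not use or reproduce this library's internals, which this page does not describe; the array module is only a stand-in for column-oriented storage in general, and the names N_ROWS, rows, col1, and col2 are illustrative.

```python
import sys
from array import array

N_ROWS = 10_000

# Row-oriented: one dict per row, as in list[dict[str, Any]].
rows = [{"col1": i, "col2": float(i)} for i in range(N_ROWS)]
# getsizeof counts only the list and dict containers, not the int/float
# objects they point to, so this understates the row-oriented cost.
row_bytes = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)

# Column-oriented: one typed buffer per column, values stored inline.
col1 = array("q")  # signed 64-bit integers
col2 = array("d")  # 64-bit floats
for i in range(N_ROWS):
    col1.append(i)
    col2.append(float(i))
col_bytes = sys.getsizeof(col1) + sys.getsizeof(col2)

print(f"row-oriented:    {row_bytes:,} bytes (containers only)")
print(f"column-oriented: {col_bytes:,} bytes")
```

Each dict carries its own hash table and key references, so the per-row overhead is paid N_ROWS times, whereas the typed buffers pay a fixed overhead once per column.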
