Log the preserved history of a CSV or JSON file's contents to DuckDB

git-logger

A Python package and CLI tool that logs the historical versions of a file preserved in a Git repository to a DuckDB database.

The package is heavily inspired by Simon Willison's git-history package.

Installation

Install this library using pip:

pip install git-logger

Python API Usage

The GitLogger class provides a set of functions to log the history of a file in a Git repository to a DuckDB database.

Here's an example of how to use the GitLogger class:

from git_logger.git_history import GitLogger

# Initialize the GitLogger
logger = GitLogger(
    db_name='my_database.db',
    table_name='my_table',
    filepath='path/to/your/file.txt',
    repo_path='path/to/your/git/repo',
    data_type='json'  # or 'csv'
)

# Log the git history to the DuckDB database
logger.log_git_history()

The GitLogger class takes the following parameters:

  • db_name: The name of the DuckDB database file.
  • table_name: The name of the table to store the git history.
  • filepath: The path to the file in the Git repository.
  • repo_path: The path to the Git repository (default is the current directory).
  • data_type: The format of the file, either 'json' or 'csv' (default is 'json').

The log_git_history() method retrieves the git history of the specified file, parses the content of the file, and inserts the data into the DuckDB database. The method also creates the database and table if they don't already exist.
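
The retrieval step described above can be approximated with plain git commands. The following is a minimal sketch and not GitLogger's actual implementation: the function names list_commits, file_at_commit, and parse_log_output are hypothetical, and the sketch assumes the git executable is on the PATH and that filepath is given relative to the repository root.

```python
import subprocess
from typing import List, Tuple


def parse_log_output(output: str) -> List[Tuple[str, str]]:
    """Turn 'hash<TAB>iso-timestamp' lines into (hash, timestamp) pairs."""
    pairs = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        commit_hash, timestamp = line.split("\t", 1)
        pairs.append((commit_hash, timestamp))
    return pairs


def list_commits(repo_path: str, filepath: str) -> List[Tuple[str, str]]:
    """List the commits that touched filepath, oldest first (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse",
         "--format=%H%x09%cI", "--", filepath],  # %x09 is a tab, %cI an ISO 8601 date
        capture_output=True, text=True, check=True,
    )
    return parse_log_output(result.stdout)


def file_at_commit(repo_path: str, commit_hash: str, filepath: str) -> bytes:
    """Return the raw bytes of filepath as it existed at commit_hash."""
    result = subprocess.run(
        ["git", "-C", repo_path, "show", f"{commit_hash}:{filepath}"],
        capture_output=True, check=True,
    )
    return result.stdout
```

Each (hash, timestamp) pair, together with the parsed file content at that commit, is the kind of record that would then be inserted into the database.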

The utils.py file provides two helper functions:

  • parse_schema(d: dict): This function takes a dictionary of data and returns a dictionary of the inferred data types for each key.
  • parse_csv(data): This function takes a byte or string representation of CSV data and returns a list of lists.
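
To illustrate what helpers of this kind do, here is a rough standard-library sketch. It is not the code in utils.py: the names infer_schema and rows_from_csv are hypothetical, and the mapping from Python values to DuckDB type names (BOOLEAN, BIGINT, DOUBLE, VARCHAR) is an assumption chosen for the example.

```python
import csv
import io
from typing import Any, Dict, List, Union


def infer_schema(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each key to a type name inferred from its Python value (assumed mapping)."""
    def type_name(value: Any) -> str:
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            return "BOOLEAN"
        if isinstance(value, int):
            return "BIGINT"
        if isinstance(value, float):
            return "DOUBLE"
        return "VARCHAR"  # fallback for strings and anything else

    return {key: type_name(value) for key, value in record.items()}


def rows_from_csv(data: Union[bytes, str]) -> List[List[str]]:
    """Parse CSV given as bytes or str into a list of rows (lists of strings)."""
    text = data.decode("utf-8") if isinstance(data, bytes) else data
    return list(csv.reader(io.StringIO(text)))


schema = infer_schema({"id": 1, "price": 2.5, "name": "a", "ok": True})
rows = rows_from_csv(b"a,b\n1,2\n")
```

Note that the CSV sketch returns every cell as a string; any type conversion would have to happen in a later step.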

The get_hash(db_name: str, tbl_name: str) function retrieves a list of unique commit hashes from the specified table in the DuckDB database.

Format data callback

The GitLogger class in git_logger/git_history.py supports custom callbacks that format the data before it is inserted into the DuckDB database. Callbacks are supplied through the cbs parameter of the constructor, and the class's callback method invokes them by event name.

Here's an example of how you can add a custom callback to format the data:

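# Assumption: Callback is importable from the same module as GitLogger;
# adjust the import if it lives elsewhere in the package.
from git_logger.git_history import Callback, GitLogger

class MyCallback(Callback):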
    order = 0  # The order in which the callback is executed

    def format_data(self, data):
        # Customize the data format here
        data.data = [row for row in data.data if row['some_column'] > 0]

logger = GitLogger(
    db_name='my_database.db',
    table_name='my_table',
    filepath='path/to/your/file.txt',
    repo_path='path/to/your/git/repo',
    data_type='json',
    cbs=[MyCallback()]
)

logger.log_git_history()

In this example, we define a MyCallback class that inherits from the Callback class. The order attribute determines the order in which the callback is executed (lower values are executed first).

The format_data method is the callback that is executed when the callback('format_data') method is called in the GitLogger class. In this example, we filter the data to only include rows where the some_column value is greater than 0.

You can add multiple callbacks by passing a list of callback instances to the cbs parameter of the GitLogger constructor.
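
The ordering rule described above (lower order values run first) can be demonstrated with a self-contained sketch. This is not the library's dispatch code: SketchCallback, DropNonPositive, AddDoubled, and run_callbacks are hypothetical stand-ins, and the state object is a plain SimpleNamespace with a data attribute.

```python
from types import SimpleNamespace


class SketchCallback:
    """Hypothetical base class: subclasses override format_data and set order."""
    order = 0

    def format_data(self, state):
        pass


class DropNonPositive(SketchCallback):
    order = 0  # runs first

    def format_data(self, state):
        state.data = [row for row in state.data if row["value"] > 0]


class AddDoubled(SketchCallback):
    order = 1  # runs after DropNonPositive

    def format_data(self, state):
        for row in state.data:
            row["doubled"] = row["value"] * 2


def run_callbacks(callbacks, event, state):
    """Invoke the named event on every callback, sorted by ascending order."""
    for cb in sorted(callbacks, key=lambda cb: cb.order):
        getattr(cb, event)(state)


state = SimpleNamespace(data=[{"value": -1}, {"value": 3}])
# The list order does not matter; the order attribute decides the sequence.
run_callbacks([AddDoubled(), DropNonPositive()], "format_data", state)
print(state.data)  # [{'value': 3, 'doubled': 6}]
```

Because the filter runs before the transformation, the row with a negative value is dropped before any derived field is computed.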

CLI Usage

The git-log CLI utility retrieves the git history of a specified file, parses its content, and inserts the data into a DuckDB database, all from the command line.

Usage:

git-log [OPTIONS] FILE_PATH DB_NAME

You can run the git-log utility without installing it by using the uvx tool from uv:

uvx --from git-logger git-log path/to/your/file.json file_history.db --table_name my_table

Arguments:

  • FILE_PATH: The path to the file in the Git repository.
  • DB_NAME: The name of the DuckDB database file.

Options:

  • --table_name TEXT: The name of the table to store the data (default is "hist").
  • --repo_path TEXT: The path to the Git repository (default is the current directory).
  • --flexible-schema: Store JSON data as JSON column type instead of individual columns (default is False).
  • --version: Show the version and exit.

Examples:

git-log path/to/your/file.json file_history.db --table_name my_table

This retrieves the git history of file.json from the Git repository in the current directory, parses the JSON content, and inserts the data into the my_table table in the file_history.db DuckDB database.

git-log path/to/your/file.json file_history.db --table_name my_table --flexible-schema

This will use the flexible schema mode, storing JSON data with inconsistent structures as a single JSON column instead of creating individual columns. This is useful when your JSON file contains objects with varying keys across different git commits.

Flexible Schema Mode

When working with JSON files that have inconsistent schemas across different git commits (e.g., some objects have different keys), you can use the --flexible-schema flag. This mode:

  • Creates a simple table structure with only timestamp (t), hash (h), and JSON data (data) columns
  • Stores the entire JSON array as a single JSON column in DuckDB
  • Avoids binding errors when objects have different numbers of keys
  • Allows you to query the JSON data using DuckDB's native JSON functions

This is particularly useful when your JSON file structure evolves over time in your git history.
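
The idea behind this mode can be sketched in plain Python. The example below is an illustration under stated assumptions, not the library's code: has_consistent_keys and flexible_row are hypothetical helpers, and the row layout simply mirrors the t, h, and data columns listed above.

```python
import json
from typing import Any, Dict, List, Tuple


def has_consistent_keys(records: List[Dict[str, Any]]) -> bool:
    """True when every object has exactly the same set of keys."""
    key_sets = {frozenset(record) for record in records}
    return len(key_sets) <= 1


def flexible_row(timestamp: str, commit_hash: str,
                 records: List[Dict[str, Any]]) -> Tuple[str, str, str]:
    """Build one (t, h, data) row, serializing the whole array as JSON text."""
    return (timestamp, commit_hash, json.dumps(records))


# The second object has an extra key, so a fixed column list would not fit both.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "tags": ["x"]}]
row = flexible_row("2024-01-01T00:00:00+00:00", "abc123", records)
```

Keeping the array as one serialized value sidesteps the mismatch between objects, at the cost of needing JSON functions to extract individual fields at query time.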

Development

To contribute to this library, first check out the code. Then create a new virtual environment:

cd git-logger
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
