Merger is a tool that scans a directory, filters files using customizable patterns, and merges readable content into a single output file.

Merger CLI

Merger is a command-line utility for developers that scans a directory, filters files using customizable ignore patterns, and merges all readable content into a single structured JSON output file. It supports custom file parsers, so it can be extended to formats such as .pdf or any domain-specific format.


Summary

  1. Core Features
  2. Dependencies
  3. Installation with PyPI
  4. Build and Install Locally
  5. Usage
  6. Custom Parsers
  7. CLI Options
  8. License

Core Features

  • Recursive merge of all readable files under a root directory.
  • Glob-based ignore patterns using .gitignore-style syntax.
  • Automatic binary validation and parsing.
  • Modular parser system for custom formats.
  • CLI support for installation, removal, and listing of custom parsers.
  • Structured JSON merged output, including a file tree.
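
As an illustration of the first three features, here is a minimal sketch of a recursive scan with glob-based ignores and a binary check. It is not Merger's implementation: the function names are hypothetical, `fnmatch` only approximates .gitignore semantics (no negation or anchoring), and the NUL-byte test is a common heuristic assumed here for illustration.

```python
import fnmatch
import os
from pathlib import Path
from typing import Iterator, List


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Return True if the whole path or any single component matches a pattern."""
    parts = rel_path.split("/")
    return any(
        fnmatch.fnmatch(rel_path, pat) or any(fnmatch.fnmatch(p, pat) for p in parts)
        for pat in patterns
    )


def looks_binary(chunk: bytes) -> bool:
    """Heuristic (an assumption): a NUL byte in the first chunk suggests binary data."""
    return b"\x00" in chunk


def iter_readable_files(root: Path, patterns: List[str]) -> Iterator[Path]:
    """Yield files under root that are neither ignored nor apparently binary."""
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [
            d for d in dirnames if not is_ignored((rel_dir / d).as_posix(), patterns)
        ]
        for name in sorted(filenames):
            rel = (rel_dir / name).as_posix()
            if is_ignored(rel, patterns):
                continue
            path = root / rel
            with open(path, "rb") as fh:
                if looks_binary(fh.read(1024)):
                    continue
            yield path
```

Calling `list(iter_readable_files(Path("./src"), ["*.log", "__pycache__"]))` would return the paths such a scan keeps.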

Dependencies

Component   Version   Notes
Python      ≥ 3.8     Required

All dependencies are listed in requirements.txt.


Installation with PyPI

pip install merger-cli

Build and Install Locally

1. Clone the repository

git clone https://github.com/diogotoporcov/merger-cli.git
cd merger-cli

2. Create and activate a virtual environment

Linux / macOS

python -m venv .venv
source .venv/bin/activate

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\Activate.ps1

3. Install dependencies

pip install -r requirements.txt

4. Install as CLI tool

pip install .

Usage

Basic merge

merger ./src

Custom ignore patterns

merger ./project ./output.json --ignore "*.log" "__pycache__" "*.tmp"

Custom ignore file

merger . ./output.json --merger-ignore "C:\Users\USER\Desktop\merger.ignore"

Verbose output

merger ./src ./merger.json --log-level DEBUG
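
When Merger is driven from a script, the commands above can be assembled as an argument list. The sketch below uses only the options shown in this README; the helper names `build_merger_command` and `run_merger` are hypothetical, and running the command assumes merger-cli is installed and on PATH.

```python
import subprocess
from typing import List, Optional, Sequence


def build_merger_command(
    input_dir: str,
    output_path: Optional[str] = None,
    ignore: Sequence[str] = (),
    log_level: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list from the options documented in this README."""
    cmd = ["merger", input_dir]
    if output_path is not None:
        cmd.append(output_path)
    if ignore:
        cmd.append("--ignore")
        cmd.extend(ignore)
    if log_level is not None:
        cmd.extend(["--log-level", log_level])
    return cmd


def run_merger(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs merger-cli on PATH)."""
    return subprocess.run(cmd, check=False).returncode
```

Passing a list rather than a shell string means patterns such as `*.log` reach Merger unexpanded, without extra quoting.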

Custom Parsers

Merger uses parser strategies to support parsing of non-text file formats.


Parser Abstract Class

All parsers must inherit from Parser:

from merger.parsing.parser import Parser

Required structure:

  • EXTENSIONS: Set[str]
  • CHUNK_BYTES_FOR_VALIDATION: Optional[int]
  • validate(cls, file_chunk_bytes, *, file_path=None, logger=None) -> bool
  • parse(cls, file_bytes, *, file_path=None, logger=None) -> str
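
The structure above can be sketched with a small parser for gzip-compressed text. This is an illustrative example, not one shipped with Merger: `GzipTextParser` is a hypothetical name, and it assumes that CHUNK_BYTES_FOR_VALIDATION is the number of leading bytes passed to validate. The fallback base class exists only so the sketch runs without merger-cli installed.

```python
import gzip
import logging
from pathlib import Path
from typing import Optional, Set, Type, Union

try:
    from merger.parsing.parser import Parser
except ImportError:
    # Stand-in so the sketch runs on its own; this is not the real base class.
    class Parser:  # type: ignore[no-redef]
        pass


class GzipTextParser(Parser):
    EXTENSIONS: Set[str] = {".gz"}
    # Assumption: number of leading bytes given to validate(); gzip's magic number is 2 bytes.
    CHUNK_BYTES_FOR_VALIDATION: Optional[int] = 2

    @classmethod
    def validate(
        cls,
        file_chunk_bytes: Union[bytes, bytearray],
        *,
        file_path: Optional[Path] = None,
        logger: Optional[logging.Logger] = None,
    ) -> bool:
        """Accept the file only if it starts with the gzip magic number."""
        return bytes(file_chunk_bytes[:2]) == b"\x1f\x8b"

    @classmethod
    def parse(
        cls,
        file_bytes: Union[bytes, bytearray],
        *,
        file_path: Optional[Path] = None,
        logger: Optional[logging.Logger] = None,
    ) -> str:
        """Decompress the payload and decode it as UTF-8 text."""
        return gzip.decompress(bytes(file_bytes)).decode("utf-8", errors="replace")


# Module-level reference to the parser class.
parser_cls: Type[Parser] = GzipTextParser
```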

Installing a Custom Parser

merger --install-module path/to/parser.py

To uninstall a module:

merger --uninstall-module <module_id>

To remove all modules (quote the * so the shell does not expand it to file names):

merger --uninstall-module "*"

To list installed modules:

merger --list-modules

Custom Parser Example (PDF)

import logging
from pathlib import Path
from typing import Union, Optional, Any, Set, Type

import fitz  # PyMuPDF (pip install PyMuPDF)

from merger.parsing.parser import Parser


class PdfParser(Parser):
    EXTENSIONS: Set[str] = {".pdf"}
    CHUNK_BYTES_FOR_VALIDATION: Optional[int] = None

    @classmethod
    def validate(
        cls,
        file_chunk_bytes: Union[bytes, bytearray],
        *,
        file_path: Optional[Path] = None,
        logger: Optional[logging.Logger] = None
    ) -> bool:
        """
        Validate that the given file represents a readable PDF document.

        Args:
            file_chunk_bytes: Binary contents of the file being validated, sufficient to perform validation.
            file_path: Path of the file being validated.
            logger: Optional logger instance for logging.

        Returns:
            bool: True if the file is a readable PDF, False otherwise.
        """
        try:
            with fitz.open(file_path) as doc:
                _ = doc[0]
            return True

        except Exception:
            return False

    @classmethod
    def parse(
        cls,
        file_bytes: Union[bytes, bytearray],
        *,
        file_path: Optional[Path] = None,
        logger: Optional[logging.Logger] = None,
    ) -> str:
        """
        Extracts and concatenates text from all pages of a PDF file.

        Args:
            file_bytes: Binary contents of the file being parsed.
            file_path: Path of the file being parsed.
            logger: Optional logger instance for logging.

        Returns:
            str: Text of all pages, joined with single spaces.
        """
        texts = []
        with fitz.open(stream=file_bytes) as doc:
            for page in doc:
                text = page.get_text()
                if text:
                    text = text.replace("\n\n", "")
                    texts.append(text)

        full_text = " ".join(texts)
        return full_text


parser_cls: Type[Parser] = PdfParser

The module must expose a parser_cls object referencing the parser class.

This implementation is available at examples/custom_parsers/pdf_parser.py.
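
To illustrate why the module-level parser_cls name matters, here is a sketch of how a loader could import a parser file by path and pick up that attribute. It uses the standard importlib machinery and is an assumption about the general approach, not Merger's actual loading code. `load_parser_class` is a hypothetical helper.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_parser_class(module_path: Path):
    """Import a Python file by path and return its module-level parser_cls."""
    spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {module_path}")
    module: ModuleType = importlib.util.module_from_spec(spec)
    # Execute the file so its top-level names, including parser_cls, are defined.
    spec.loader.exec_module(module)
    try:
        return module.parser_cls
    except AttributeError:
        raise AttributeError(f"{module_path} does not define parser_cls") from None
```

A file that omits parser_cls fails with a clear error here, which is the kind of check an installer might run before accepting a module.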


Output Format

The merged result is a single JSON file containing:

  • Directory tree
  • File and directory names
  • Relative paths to the root
  • Extracted text content
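
The exact key names of the output are not listed here, so the sketch below inspects a merged file without assuming any schema: it prints an outline of whatever JSON structure it finds. The helpers `outline` and `outline_file` are hypothetical and use only the standard json module.

```python
import json
from typing import Any, List


def outline(node: Any, name: str = "<root>", depth: int = 0, max_depth: int = 3) -> List[str]:
    """Return indented lines describing the shape of a parsed JSON value."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}{name}: object (keys: {len(node)})"]
        if depth < max_depth:
            for key, value in node.items():
                lines.extend(outline(value, key, depth + 1, max_depth))
        return lines
    if isinstance(node, list):
        lines = [f"{pad}{name}: array (items: {len(node)})"]
        # Describe only the first element to keep the outline short.
        if node and depth < max_depth:
            lines.extend(outline(node[0], "[0]", depth + 1, max_depth))
        return lines
    return [f"{pad}{name}: {type(node).__name__}"]


def outline_file(path: str) -> List[str]:
    """Load a JSON file (for example ./merger.json) and outline its structure."""
    with open(path, encoding="utf-8") as fh:
        return outline(json.load(fh))
```

Running `print("\n".join(outline_file("./merger.json")))` after a merge shows the keys that are actually present before any code is written against them.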

CLI Options

Option                    Description
input_dir                 Root directory to scan for files
output_path               Path to save merged JSON output (default: ./merger.json)
-i, --install-module      Install a custom parser module
-u, --uninstall-module    Uninstall a parser module by ID (* removes all)
-l, --list-modules        List installed parser modules
--ignore                  Glob-style ignore patterns
-mi, --merger-ignore      Ignore file (default: ./merger.ignore)
--version                 Show installed version
-ll, --log-level          Set logging verbosity

License

This project is licensed under the MIT License. See LICENSE for details.
