
Separate two files into lines observed in both files, only in the first file, or only in the second file. Implemented in Cython.

Project description

The filediffs package

filediffs takes two files and separates them into

  1. lines found in both files
  2. lines found only in file 1
  3. lines found only in file 2
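
The three categories above can be illustrated with a small pure-Python sketch. It is not the package's Cython implementation: the function name separate_lines is hypothetical, both files are read fully into memory, and duplicate lines are collapsed, which is an assumption about the desired behaviour.

```python
from pathlib import Path


def separate_lines(path_1, path_2):
    """Return (both, only_1, only_2) as lists of unique lines.

    Illustrative only: loads both files into memory and drops duplicates.
    """
    lines_1 = Path(path_1).read_text().splitlines()
    lines_2 = Path(path_2).read_text().splitlines()
    set_1, set_2 = set(lines_1), set(lines_2)

    # dict.fromkeys keeps the first occurrence of each line, in file order.
    unique_1 = list(dict.fromkeys(lines_1))
    unique_2 = list(dict.fromkeys(lines_2))

    both = [line for line in unique_1 if line in set_2]
    only_1 = [line for line in unique_1 if line not in set_2]
    only_2 = [line for line in unique_2 if line not in set_1]
    return both, only_1, only_2
```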

Code inspired by https://www.splinter.com.au/reconcilingcomparing-huge-data-sets-with-c/

Installation

For package installation, Poetry is used.

The file pyproject.toml defines the Python version, requirements, build instructions, package description, and so on.

You can create a virtual environment for the package with poetry by

  1. installing poetry with pip install poetry
  2. calling poetry install to install from poetry.lock

To create a .tar.gz file and wheels for publishing the package, use poetry build.

To publish the package, use poetry publish. The PyPI credentials have to be configured first (see https://python-poetry.org/docs/repositories/#configuring-credentials).

Implementation:

Implemented in Cython.

Lines found in both files are not kept in memory but written to disk every 5,000,000 lines to conserve memory.

This way, even very large files can be separated. Only the diff has to fit in memory.
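
The idea of keeping only the diff in memory can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the package's Cython code: it assumes the two files are read in an interleaved fashion, that matched lines are buffered and flushed every flush_every lines, and that repeated lines are matched by count. The name separate_streaming and its parameters are hypothetical.

```python
from itertools import zip_longest


def separate_streaming(path_1, path_2, out_both, flush_every=5_000_000):
    """Write lines common to both files to out_both.

    Returns (only_in_1, only_in_2) as lists of unique bytes lines.
    Only unmatched lines and the write buffer are held in memory.
    """
    only_1, only_2 = {}, {}  # line -> number of unmatched occurrences
    buffer = []              # matched lines waiting to be written

    def match(line, own, other):
        line = line.rstrip(b"\r\n")
        if other.get(line):
            # The other file already produced this line: it is common.
            other[line] -= 1
            if not other[line]:
                del other[line]
            buffer.append(line + b"\n")
        else:
            own[line] = own.get(line, 0) + 1

    with open(path_1, "rb") as f1, open(path_2, "rb") as f2, \
            open(out_both, "wb") as out:
        for line_1, line_2 in zip_longest(f1, f2):
            if line_1 is not None:
                match(line_1, only_1, only_2)
            if line_2 is not None:
                match(line_2, only_2, only_1)
            if len(buffer) >= flush_every:
                out.writelines(buffer)  # flush matched lines to disk
                buffer.clear()
        out.writelines(buffer)  # write whatever is left

    return list(only_1), list(only_2)
```

With this strategy, memory use grows with the number of lines that have not yet found a partner, so two nearly identical files in similar order need very little memory regardless of their size.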

The file build_cython_setup.py defines the Cython build process. The cpp files can be built using python build_cython_setup.py build_ext --inplace.

Usage:

Because the function is implemented in Cython, the file paths have to be passed to it from Python as bytestrings.

from filediffs.filediffs import file_diffs
lines_only_in_file_1, lines_only_in_file_2 = file_diffs(
    filename_1=b'path/to/file1.txt',
    filename_2=b'path/to/file2.txt',
    outpath_lines_present_in_both_files=b'output_path/to/lines_in_both.txt',
    outpath_lines_present_only_in_file1=b'output_path/to/lines_only_in_file1.txt',
    outpath_lines_present_only_in_file2=b'output_path/to/lines_only_in_file2.txt',
)
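
If the paths are available as str or pathlib.Path objects, they can be encoded with os.fsencode from the standard library before the call. The wrapper below is a hypothetical convenience sketch, not part of the package; it only reuses the keyword arguments shown above and imports filediffs lazily, so it requires the package to be installed when called.

```python
import os


def to_bytes_paths(*paths):
    """Encode str, bytes, or os.PathLike paths to bytes."""
    return tuple(os.fsencode(p) for p in paths)


def file_diffs_from_paths(file_1, file_2, out_both, out_only_1, out_only_2):
    """Call file_diffs with paths given as str or pathlib.Path (hypothetical helper)."""
    # Imported here so the helper above works without the package installed.
    from filediffs.filediffs import file_diffs

    f1, f2, both, only_1, only_2 = to_bytes_paths(
        file_1, file_2, out_both, out_only_1, out_only_2
    )
    return file_diffs(
        filename_1=f1,
        filename_2=f2,
        outpath_lines_present_in_both_files=both,
        outpath_lines_present_only_in_file1=only_1,
        outpath_lines_present_only_in_file2=only_2,
    )
```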

Inside the package directory, an example script filediffs_script.py is provided.

It can be used to separate files from the terminal:

# To separate two files, simply pass both file paths to `filediffs/filediffs_script.py`
python filediffs/filediffs_script.py filediffs/tests/data/file_1.txt filediffs/tests/data/file_2.txt

# To define the filenames of the separated files, use the script's optional arguments.
python filediffs/filediffs_script.py filediffs/tests/data/file_1.txt filediffs/tests/data/file_2.txt --out_filename_both out_both.txt --out_filename_only_in_file1 out_file1_only.txt --out_filename_only_in_file2 out_file2_only.txt
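
The source of filediffs_script.py is not shown here. As a rough illustration, a command-line parser accepting the same arguments might look like the following sketch; the positional argument names, the default output filenames, and the function name build_parser are assumptions, and only the three option names are taken from the commands above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Separate two files into shared and unique lines."
    )
    parser.add_argument("filename_1", help="path to the first file")
    parser.add_argument("filename_2", help="path to the second file")
    # Default filenames below are placeholders, not the script's real defaults.
    parser.add_argument("--out_filename_both", default="lines_in_both.txt")
    parser.add_argument("--out_filename_only_in_file1", default="lines_only_in_file1.txt")
    parser.add_argument("--out_filename_only_in_file2", default="lines_only_in_file2.txt")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["file_1.txt", "file_2.txt", "--out_filename_both", "out_both.txt"]
)
```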


