Skip to main content

filedep: A small python tool to check file dependency

Project description

filedep: A small python tool to check file dependency

Motivation

When doing empirical analysis, you may encounter the following issue about file dependency.

  • Suppose code.py (or code.do, code.sas, code.m, code.R, etc) reads data from indata.csv, does some data cleaning, and then saves the intermediary data as outdata.csv.

  • After using outdata.csv to run some statistical tests, you want to change the data cleaning procedure a bit, so you modify code.py.

  • If there are only one code file and two data files, you will easily remember to re-run code.py to update the output data outdata.csv.

  • However, suppose that outdata.csv is then used by code2.py to write finaldata.csv. Then, people may easily forget to re-run code2.py as well to update finaldata.csv.

  • As a result, this may cause the illusion that results change after you run the same set of code twice. For example, you forget to update finaldata.csv initially, but then accidentally update finaldata.csv some time later. Then, you find that results change after you run the same set of code.

To resolve this issue, I build this simple package to check file dependencies based on last modified time. Users can define file dependencies, such as code.py using pre1.csv and pre2.csv as input to write post1.csv and post2.csv. Then, the function in the package will check if the last modified times of both pre1.csv and pre2.csv are before that of code.py and the last modified times of both post1.csv and post2.csv are after that of code.py. If any file dependency is broken, the broken ones will be printed or saved to a file.

Installation

Use pip to install the package as follows:

pip install filedep

Usage

Import the package using

import filedep

The key function is check_dep(deps, outfile=sys.stdout, reterr=False). The first argument is a list of dependencies (defined below). The second argument specifies where to print error information if any file dependency is broken. The default is sys.stdout. The third argument specifies if broken dependencies are returned from the function. This is mainly for testing purposes. The default is False, i.e., broken dependencies are only printed.

The file dependencies have to be provided by the user using the format defined below. In the template folder, there is a template to define dependencies and use check_dep() function to check.

Example 1. No broken file dependencies

The following code creates several empty files:

import filedep
import time
import os
from os.path import join as pj

PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)

def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass

# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)

# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv')
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'),
            pj(PATH, 'post12.csv'),
        ],
    ),
]
filedep.check_dep(deps)

In deps, we define a single dependency as follows: under the directory C:\test_check_dep, code1.py reads pre11.csv and pre12.csv to produce post11.csv and post12.csv. Then, the last modified times of both pre11.csv and pre12.csv must be before that of code1.py and those of both post11.csv and post12.csv must be after that of code1.py

Since the dependency is satisfied by construction, the output is

All file dependencies are verified!

Example 2. Broken file dependencies

The following code creates several empty files and define two broken dependencies:

import filedep
import time
import os
from os.path import join as pj

PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)

def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass

# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
# Note code1.py is newer than post11.csv
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)

# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'),
        ],
    ),  
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv'),
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'), 
            pj(PATH, 'post12.csv'),
        ],
    )
]
filedep.check_dep(deps)

Here, we define 2 dependencies. The second one is the same as that in the previous example, but the first one defines a simpler dependency: code1.py uses pre11.csv to produce post11.csv. Since by construction post11.csv is "touched" before code1.py, both dependencies are broken. Hence, the output is

There are 2 broken file dependencies!!! 
[1]
                                           Last Modified Time
  Input:
    C:\test_check_dep\pre11.csv      : 2021-10-14 14:25:11.011976
  Code:
    C:\test_check_dep\code1.py       : 2021-10-14 14:25:11.451668
  Output:
    C:\test_check_dep\post11.csv     : 2021-10-14 14:25:11.342247
[2]
                                           Last Modified Time
  Input:
    C:\test_check_dep\pre11.csv      : 2021-10-14 14:25:11.011976
    C:\test_check_dep\pre12.csv      : 2021-10-14 14:25:11.125543
  Code:
    C:\test_check_dep\code1.py       : 2021-10-14 14:25:11.451668
  Output:
    C:\test_check_dep\post11.csv     : 2021-10-14 14:25:11.342247
    C:\test_check_dep\post12.csv     : 2021-10-14 14:25:11.559796

where the last modified date of each file in each broken dependency is shown.

Format of file dependency

The first argument of check_dep() is a list of dependencies. Its format should be as follows:

  • It is a list of tuples.
  • Each tuple has three elements.
    • The first element is a list of str.
    • The second element is a str.
    • The third element is a list of str.
    • Each str is an absolute path of an existing file.

As an example, the following code defines two dependencies:

deps = [
    (
        ['pre1.txt'], 'code1.py', ['post1.txt']
    ),
    (
        ['pre21.txt', 'pre22.txt'], 'code2.py', ['post21.txt', 'post22.txt']
    )
]
  • The first one says that code1.py uses pre1.txt as input and outputs post1.txt. As a result, the last modified date of the three files should satisfy pre1.txt<=post1.txt and code1.py<=post1.txt.
  • The second one says that code2.py uses pre21.txt and pre22.txt as input and outputs post21.txt and post22.txt. As a result, the last modified date of the three files should satisfy max(pre21.txt,pre22.txt,code1.py)<=min(post21.txt,post22.txt) where max (min) represent the maximum (minimum) date.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

filedep-0.0.4.tar.gz (5.8 kB view details)

Uploaded Source

File details

Details for the file filedep-0.0.4.tar.gz.

File metadata

  • Download URL: filedep-0.0.4.tar.gz
  • Upload date:
  • Size: 5.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/3.10.0 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.8.12

File hashes

Hashes for filedep-0.0.4.tar.gz
Algorithm Hash digest
SHA256 a6853f7c8248db315198a482864398005da53e98fdd872649bd0f464a20b3db1
MD5 1eb61e2ea1058838ad9b530b7c2eb07f
BLAKE2b-256 879084f632243e44ec8dafd21c46c77b53a5077b30e183661fe5c0dfeed0fca1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page