iterfiles

Find and process files in a Pythonic way, without boilerplate code. Implements for_each_file and other common scenarios.

>>> from iterfiles import for_each_file
>>> for_each_file('example', print, pattern='*/*.txt')

This prints the paths of all *.txt files in the first-level subdirectories of example.

Say we have the following directory structure:

example/
    shapes.txt
    aa/
        colors.dat    # not a txt!
        numbers.txt
        pets.txt
    bb/
        names.txt
        cc/
            cars.txt

The output will be:

example/aa/numbers.txt
example/aa/pets.txt
example/bb/names.txt

Filter directories and files via glob()

The full pattern syntax of pathlib.Path.glob is supported.

Print all *.txt files in all subdirectories:

>>> for_each_file('example', print, pattern='**/*.txt')
example/shapes.txt
example/aa/numbers.txt
example/aa/pets.txt
example/bb/names.txt
example/bb/cc/cars.txt

Print only the *.txt files in the top-level directory:

>>> for_each_file('example', print, pattern='*.txt')
example/shapes.txt
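
The three patterns above can be compared side by side using pathlib alone. This is a minimal sketch that does not use iterfiles: it rebuilds the example tree in a temporary directory and calls Path.glob directly, on the assumption that iterfiles passes the pattern through to it. The helper names build_example_tree and matches are hypothetical, and results are sorted because glob itself does not guarantee an order.

```python
import tempfile
from pathlib import Path


def build_example_tree(root: Path) -> None:
    """Create the example/ tree from this README (file contents are placeholders)."""
    for rel in [
        "shapes.txt",
        "aa/colors.dat",
        "aa/numbers.txt",
        "aa/pets.txt",
        "bb/names.txt",
        "bb/cc/cars.txt",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder")


def matches(root: Path, pattern: str) -> list:
    """Return sorted, POSIX-style relative paths of files matching the pattern."""
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "example"
    build_example_tree(root)
    top_level = matches(root, "*.txt")      # files directly in example/
    first_level = matches(root, "*/*.txt")  # files exactly one directory down
    recursive = matches(root, "**/*.txt")   # files at any depth, including the top

print(top_level)
print(first_level)
print(recursive)
```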

Files as an iterable

Iterate over pathlib.Path objects:

>>> from iterfiles import iter_files
>>> [x.name for x in iter_files('example', '**/*.txt')]
['shapes.txt', 'numbers.txt', 'pets.txt', 'names.txt', 'cars.txt']

…or iterate over the text contents of the files directly. For example, to combine the first word of each file:

>>> from iterfiles import iter_texts
>>> ', '.join(x.split(' ')[0] for x in iter_texts('example', pattern='**/*.txt'))
'Square, One, Cat, Alice, Toyota'
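
For readers curious what iterating over text contents amounts to, here is a rough stdlib-only approximation. It is a sketch, not the package's implementation: iter_texts_sketch is a hypothetical name, and the UTF-8 encoding, the sorted order, and the sample file contents are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Iterator


def iter_texts_sketch(root, pattern="**/*", encoding="utf-8") -> Iterator[str]:
    """Yield the text of every file under root that matches the glob pattern.

    Hypothetical stand-in for illustration; iterfiles may behave differently.
    """
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():  # skip directories that happen to match
            yield path.read_text(encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("Square Circle", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("One Two", encoding="utf-8")
    # Consume the generator while the temporary files still exist.
    first_words = ", ".join(
        text.split(" ")[0] for text in iter_texts_sketch(root, "**/*.txt")
    )

print(first_words)
```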

Pasting all files together into a corpus

Use for_each_text to work with plain text contents directly.

>>> from iterfiles import for_each_text
>>> with open('corpus.txt', 'w') as corpus:
...     for_each_text('example', corpus.write, pattern='**/*.txt')

Convert files from one directory to another directory

Say you want to extract OCR text from a large collection of *.pdf files into *.txt files.

You have a wonderful function pdftotext(pdf_filename, txt_filename) from another package, and it does the job well. But how do you apply it to a nested directory tree?

>>> from iterfiles import convert_files
>>> convert_files('input_pdfs', 'output_txts', pdftotext, pattern='**/*.pdf', rename=lambda p: p.with_suffix('.txt'))

That’s all. The output has the same directory structure and the same file names, but with the .txt suffix instead of .pdf.

Of course, convert_files can be used for any kind of conversion.
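
To make the directory mirroring concrete, the sketch below shows one way such a conversion loop could be written with pathlib. It is not the package's code: convert_files_sketch is a hypothetical name, and it assumes that rename receives the output Path and that the converter is called with two string file names. shutil.copyfile stands in for a real converter such as pdftotext.

```python
import shutil
import tempfile
from pathlib import Path


def convert_files_sketch(in_dir, out_dir, func, pattern="**/*", rename=None):
    """Apply func(src, dst) to each matching file, mirroring the input tree.

    Hypothetical illustration; argument handling in iterfiles may differ.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    # Collect the matches up front, before any conversion runs.
    sources = sorted(p for p in in_dir.glob(pattern) if p.is_file())
    for src in sources:
        dst = out_dir / src.relative_to(in_dir)  # same relative location
        if rename is not None:
            dst = rename(dst)                    # e.g. swap the suffix
        dst.parent.mkdir(parents=True, exist_ok=True)
        func(str(src), str(dst))


with tempfile.TemporaryDirectory() as tmp:
    src_root = Path(tmp) / "input"
    dst_root = Path(tmp) / "output"
    (src_root / "nested").mkdir(parents=True)
    (src_root / "nested" / "doc.pdf").write_text("fake pdf")

    convert_files_sketch(
        src_root,
        dst_root,
        shutil.copyfile,  # stand-in converter: just copies the bytes
        pattern="**/*.pdf",
        rename=lambda p: p.with_suffix(".txt"),
    )
    produced = sorted(
        p.relative_to(dst_root).as_posix() for p in dst_root.rglob("*") if p.is_file()
    )

print(produced)
```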

Convert text files

If both input and output are plain text, use convert_texts and forget about reading and writing files. For example, here’s a snippet that converts all files to uppercase:

>>> from iterfiles import convert_texts
>>> convert_texts('example', 'output', str.upper, pattern='**/*.txt')

Gotchas and Limitations

  • Any unhandled exception raised by your function will break the loop. Make sure to suppress the exceptions you can tolerate. Error handling (such as logging) is out of scope for this package.

  • The list of files matching the glob is collected up front, (almost) instantly, before any processing takes place. If you add files to the directory during a long processing run, the new files will not be detected on the fly. If you remove files during processing, before they have had a chance to be processed, you will see an error.

  • Only files are considered. Directories are traversed in search of files, and during conversion directories are created when necessary, but that’s it. You can’t do anything custom with directories.

  • The package has not been tested with symlinks, and its behavior with symlinks is undefined.
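
Since error handling is left to the caller, one option for the first gotcha is to wrap your function before passing it in. The decorator below is a hypothetical helper, not part of iterfiles: it catches only the exception types you list, logs them, and returns None so the loop can continue. The logger name and the parse_number example are placeholders.

```python
import functools
import logging

logger = logging.getLogger("iterfiles_example")  # placeholder logger name


def tolerate(*exceptions):
    """Build a decorator that logs the listed exceptions instead of raising them."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions as exc:
                # Anything not listed in `exceptions` still propagates.
                logger.warning("Skipping %r: %s", args, exc)
                return None
        return wrapper
    return decorator


@tolerate(UnicodeDecodeError, ValueError)
def parse_number(text):
    return int(text)


results = [parse_number(t) for t in ["1", "oops", "3"]]
print(results)
```

A function wrapped this way could then be handed to for_each_file or for_each_text in place of the original. Whether a None return value is acceptable for the converting functions is not stated in this README, so check before using it there.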

Requirements

  • Python 3.6+

  • No dependencies

History

0.1.0 (2021-02-02)

  • First release on PyPI.
