TextDirectory allows you to combine multiple text files into one.
TextDirectory
TextDirectory allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching files for certain criteria and applying transformations to the aggregated text.
TextDirectory can be used both as a standalone tool (via the CLI) and as a Python library.
Of course, everything TextDirectory does could be achieved in bash or PowerShell. However, there are certain use-cases (e.g. when used as a library) in which it might be useful.
Free software: MIT license
Documentation: https://textdirectory.readthedocs.io.
Features
Aggregating multiple text files
Matching based on length (character, tokens), content, and random sampling
Transforming the aggregated text (e.g. transforming the text to lowercase)
Version | Filters | Transformations
---|---|---
0.1.0 | filter_by_max_chars(n int); filter_by_min_chars(n int); filter_by_max_tokens(n int); filter_by_min_tokens(n int); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n int; replace=False) | transformation_lowercase
0.1.1 | filter_by_chars_outliers(n sigmas int) | transformation_remove_nl
0.1.2 | filter_by_filename_contains(str) | transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spaCy model)
0.1.3 | filter_by_similar_documents(reference_file str; threshold float) | transformation_remove_non_ascii; transformation_remove_non_alphanumerical
Quickstart
Install TextDirectory via pip: pip install textdirectory
TextDirectory, as exemplified below, works with a two-stage model. After loading in your data (directory) you can iteratively select the files you want to process. In a second step you can perform transformations on the text before finally aggregating it.
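The two-stage idea can be illustrated without the library. The following is a minimal sketch, not TextDirectory's actual implementation: `min_tokens` and `run_pipeline` are hypothetical names, and tokens are assumed to be whitespace-separated words.

```python
from pathlib import Path
import tempfile


def min_tokens(n):
    """Build a stage-1 filter that keeps files with at least n tokens."""
    def keep(path):
        return len(path.read_text(encoding="utf-8").split()) >= n
    return keep


def run_pipeline(directory, filters, transformations):
    """Select files first, then transform and aggregate their text."""
    # Stage 1: start with every .txt file and narrow the selection.
    files = sorted(Path(directory).glob("*.txt"))
    for keep in filters:
        files = [f for f in files if keep(f)]

    # Stage 2: transform each selected text and join the results.
    parts = []
    for f in files:
        text = f.read_text(encoding="utf-8")
        for transform in transformations:
            text = transform(text)
        parts.append(text)
    return "\n".join(parts)


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "a.txt").write_text("Alpha beta gamma delta EPSILON", encoding="utf-8")
    Path(demo_dir, "b.txt").write_text("Too short", encoding="utf-8")
    print(run_pipeline(demo_dir, [min_tokens(5)], [str.lower]))
```

Only `a.txt` survives the filter stage, so the transformation stage never touches `b.txt`.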
As a Command-Line Tool
TextDirectory comes equipped with a CLI.
The syntax for filters and transformations is the same. Steps are chained with slashes (/), and parameters are passed after commas (,): filter_by_min_tokens,5/filter_by_random_sampling,2.
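To make the chaining syntax concrete, here is a small sketch of how such a string could be split into steps. `parse_chain` is a hypothetical helper written for illustration and is not part of TextDirectory; it leaves all parameters as strings.

```python
def parse_chain(spec):
    """Split 'name,param/name,param' into a list of (name, params) tuples."""
    steps = []
    for part in spec.split("/"):
        part = part.strip()
        if not part:
            continue  # ignore empty segments such as a trailing slash
        name, *params = part.split(",")
        steps.append((name, params))
    return steps


steps = parse_chain("filter_by_min_tokens,5/filter_by_random_sampling,2")
for name, params in steps:
    print(name, params)
```

Each step carries its name and a list of raw parameters, so a caller would still need to convert values such as "5" to the expected type.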
Example 1: A Very Simple Aggregation
textdirectory --directory testdata --output_file aggregated.txt
This takes all .txt files in testdata and aggregates them into a file called aggregated.txt.
Example 2: Applying Filters and Transformations
In this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.
textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase
After applying two filters (filter_by_min_tokens and filter_by_random_sampling), we apply the transformation_lowercase transformation.
The resulting file will contain the content of two files that each have at least five tokens.
As a Python Library
In order to demonstrate TextDirectory as a Python library, we’ll recreate the second example from above:
import textdirectory

# Stage 1: load the directory and select the files to process
td = textdirectory.TextDirectory(directory='testdata')
td.load_files(recursive=False, filetype='txt', sort=True)
td.filter_by_min_tokens(5)
td.filter_by_random_sampling(2)

# Stage 2: stage the transformation, then aggregate into one file
td.stage_transformation(['transformation_lowercase'])
td.aggregate_to_file('aggregated.txt')
If we wanted to keep working with the actual aggregated text, we could have called text = td.aggregate_to_memory().
ToDo
Increasing test coverage
Writing better documentation
Adding better error handling (raw exceptions are, well …)
Adding logging
Implementing autodoc (via Sphinx)
Behaviour
We are not holding the actual texts in memory. This leads to much more disk read activity (and time inefficiency), but saves memory.
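The trade-off can be seen in a small sketch that writes the aggregate one source file at a time, so only a single file's text is in memory at any moment. This is an illustration of the general approach, not the library's code; `aggregate_streaming` is a hypothetical name.

```python
from pathlib import Path
import tempfile


def aggregate_streaming(paths, output_file):
    """Append each file's text to output_file, reading one file at a time."""
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            # Only this file's text is held in memory; it is re-read from
            # disk every time the aggregation runs.
            out.write(Path(path).read_text(encoding="utf-8"))
            out.write("\n")


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    first = Path(demo_dir, "first.txt")
    second = Path(demo_dir, "second.txt")
    first.write_text("first", encoding="utf-8")
    second.write_text("second", encoding="utf-8")
    target = Path(demo_dir, "aggregated.out")
    aggregate_streaming([first, second], target)
    print(target.read_text(encoding="utf-8"))
```

Keeping only paths between steps means every filter that inspects content has to open the files again, which is where the extra disk activity comes from.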
transformation_usas_en_semtag relies on the web version of Paul Rayson’s USAS Tagger. Don’t use this transformation for large amounts of text. Give credit, and consider using their commercial product Wmatrix.
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.1.0 (2018-04-26)
Initial release
First release on PyPI.
0.1.1 (2018-04-27)
added filter_by_chars_outliers
added transformation_remove_nl
0.1.2 (2018-04-29)
added transformation_postag
added transformation_usas_en_semtag
added transformation_uppercase
added filter_by_filename_contains
added parameter support for transformations
0.1.3 (2018-04-30)
filter_by_random_sampling now has a “replacement” option
changed from tabulate to an embedded function
added transformation_remove_non_ascii
added transformation_remove_non_alphanumerical
added filter_by_similar_documents
0.1.4 (2018-04-02)
fixed an object mutation problem in the tabulate function
Hashes for textdirectory-0.1.4-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bd674a12328042cc8f55b65ca270805564a06aab1b9b32b9b84bd866cc805692
MD5 | 3f2c6b8e69090834b20985c855f830b4
BLAKE2b-256 | 229f938d8b38eaa2e04849e2ccf829951a9da2813a8b8db109f8b6264d5e6b45