text-selection
Command-line interface (CLI) to select lines of a text file.
Features
- dataset
  - create: create a dataset based on a text file
  - export-statistics: exporting statistics to a CSV
- subsets
  - add: add subsets
  - remove: remove subsets
  - rename: rename subset
  - select-all: select all lines
  - select-fifo: select lines FIFO-style
  - select-greedily: select lines greedily regarding units
  - select-greedily-ep: select lines greedily regarding units (epoch-based)
  - select-uniformly: select lines with units uniformly distributed
  - select-randomly: select lines randomly
  - filter-duplicates: filter duplicate lines
  - filter-by-regex: filter lines by regex
  - filter-by-text: filter lines by text
  - filter-by-weight: filter lines by weight
  - filter-by-vocabulary: filter lines by unit vocabulary
  - filter-by-count: filter lines by global unit frequencies
  - filter-by-unit-freq: filter lines by unit frequencies per line
  - filter-by-line-nr: filter lines by line number
  - sort-by-line-nr: sort lines by line number
  - sort-by-text: sort lines by text
  - sort-by-weight: sort lines by weights
  - sort-by-shuffle: shuffle lines
  - reverse: reverse lines
  - export: export lines
- weights
  - create-from-file: create weights from file
  - create-uniform: create uniform weights
  - create-from-count: create weights from unit count
  - divide: divide weights
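To give an idea of what "select lines greedily regarding units" can mean, here is a minimal sketch in plain Python. It is not the package's implementation. It assumes that a unit is a whitespace-separated word and that "greedy" means repeatedly picking the line that adds the most units not yet covered. The function name `select_greedily` and its `limit` parameter are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Set


def select_greedily(lines: List[str], limit: int) -> List[int]:
    """Return indices of up to `limit` lines, each adding the most unseen units."""
    # Assumption: a "unit" is a whitespace-separated word.
    units_per_line: Dict[int, Set[str]] = {
        i: set(line.split()) for i, line in enumerate(lines)
    }
    covered: Set[str] = set()
    selected: List[int] = []
    remaining = set(units_per_line)
    while remaining and len(selected) < limit:
        # Pick the line with the most new units; ties go to the lowest index.
        best = max(remaining, key=lambda i: (len(units_per_line[i] - covered), -i))
        if not units_per_line[best] - covered:
            break  # no remaining line contributes a new unit
        selected.append(best)
        covered |= units_per_line[best]
        remaining.remove(best)
    return selected


lines = ["a b", "a b c d", "c d", "e"]
print(select_greedily(lines, limit=2))  # [1, 3]
```

The second line is chosen first because it covers four units at once. After that only the last line still adds something new, so it is picked next.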
Roadmap
- add tests
- refactoring
- outsourcing the greedy and KLD iterators
Installation
pip install text-selection --user
Usage
usage: text-selection-cli [-h] [-v] {dataset,subsets,weights} ...

CLI to select lines of a text file.

positional arguments:
  {dataset,subsets,weights}  description
    dataset                  dataset commands
    subsets                  subsets commands
    weights                  weights commands

optional arguments:
  -h, --help                 show this help message and exit
  -v, --version              show program's version number and exit
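The help text above has the shape that Python's `argparse` produces for a parser with sub-commands. The following sketch shows how such a top-level layout could be built. It is an illustration only and not taken from the project's source: the function name `build_parser`, the `dest` name `group`, and the version string are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose top level mirrors the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="text-selection-cli",
        description="CLI to select lines of a text file.",
    )
    # Placeholder version string; the real tool reports its own version.
    parser.add_argument("-v", "--version", action="version", version="0.0.0")
    subparsers = parser.add_subparsers(
        dest="group",
        metavar="{dataset,subsets,weights}",
        help="description",
    )
    # One sub-parser per command group; the actual commands of each group
    # (for example those listed under Features) are omitted here.
    for name in ("dataset", "subsets", "weights"):
        subparsers.add_parser(name, help=f"{name} commands")
    return parser


args = build_parser().parse_args(["subsets"])
print(args.group)  # subsets
```

In a layout like this, each group parser would in turn register its own sub-commands, so that a call takes the form of a group name followed by a command name.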
Dependencies
tqdm
numpy
scipy
pandas
ordered_set>=4.1.0
Contributing
If you notice an error, please don't hesitate to open an issue.
Development setup
# update
sudo apt update
# install Python 3.8, 3.9, 3.10 and 3.11 so that the tests can be run against each version
sudo apt install python3-pip \
python3.8 python3.8-dev python3.8-distutils python3.8-venv \
python3.9 python3.9-dev python3.9-distutils python3.9-venv \
python3.10 python3.10-dev python3.10-distutils python3.10-venv \
python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user
# check out repo
git clone https://github.com/stefantaubert/text-selection.git
cd text-selection
# create virtual environment
python3.8 -m pipenv install --dev
Running the tests
# first install the tool as described in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd text-selection
# activate environment
python3.8 -m pipenv shell
# run tests
tox
Final lines of test result output:
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
License
MIT License
Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
Citation
If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see About => Cite this repository).
Changelog
- v0.0.3 (2023-05-30)
  - Changed
    - Improved speed for filtering OOV/IV words by up to ~20k words/s
  - Added
    - Added subsets select-randomly
    - Added subsets sort-by-shuffle
    - Added option --skip-existing to subsets add
  - Bugfix
    - Fixed evaluation of "from subsets" to ensure that the subsets exist
    - Fixed subsets remove, which didn't work
- v0.0.2 (2023-01-13)
  - Added
    - Added creation of weights from lines
    - Added --limit to select duplicates
    - Added exit code
  - Changed
    - Set --limit positional where applicable
    - Don't output expected warning from numpy on KLD selection
  - Bugfixes
- v0.0.1 (2022-05-25)
  - Initial release