A script for counting word families in a text file.
word_family_counter
A Python script for counting word families in a text file using advanced morphological analysis with spaCy.
Features
- Processes text files to count word families
- Uses spaCy for advanced linguistic analysis and lemmatization
- Handles contractions, compound words, and various text preprocessing tasks
- Supports multiple languages (depending on available spaCy models)
- Provides detailed output with word family frequencies
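The core idea behind the features above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes a word family can be approximated by a spaCy lemma, and the function names `spacy_lemmas` and `count_word_families` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def spacy_lemmas(text: str, model: str = "en_core_web_sm") -> list[str]:
    """Return lowercase lemmas for the alphabetic tokens in `text`.

    spaCy is imported lazily so the counting helper below can be used
    (and tested) without spaCy or a language model installed.
    """
    import spacy

    nlp = spacy.load(model)
    return [token.lemma_.lower() for token in nlp(text) if token.is_alpha]


def count_word_families(lemmas: Iterable[str]) -> Counter:
    """Count occurrences per lemma; assumption: one lemma = one word family."""
    return Counter(lemma.lower() for lemma in lemmas if lemma.isalpha())


if __name__ == "__main__":
    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    families = count_word_families(spacy_lemmas("She runs. He ran. They are running."))
    print(families.most_common())
```

Keeping the counting step separate from the spaCy call makes it easy to swap in a different model or lemmatizer.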
Installation
- Clone the repository:
    git clone https://github.com/BlueBirdBack/word_family_counter.git
    cd word_family_counter
- Create a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install the required dependencies:
    pip install -r requirements.txt
- Download the spaCy language model:
    python -m spacy download en_core_web_sm
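To confirm that the model download succeeded before running the script, a quick check like the following may help. It is a sketch that relies on spaCy models being installed as regular importable Python packages; the helper name `model_installed` is hypothetical and not part of this project.

```python
import importlib.util


def model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if model_installed():
        print("en_core_web_sm is installed")
    else:
        print("Model missing; run: python -m spacy download en_core_web_sm")
```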
Usage
Run the script with a text file as an argument:
python src/word_family_counter/main.py path/to/your/text_file.txt
Optional arguments:
- --verbose: Increase output verbosity for debugging purposes
- --language: Specify the spaCy model to use (default: en_core_web_sm)
Example:
python src/word_family_counter/main.py sample.txt --verbose --language en_core_web_md
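A command-line interface with these arguments could be declared with `argparse` roughly as follows. This is an assumption about how such a parser might look, not a copy of the project's `main.py`; the function name `build_parser` and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Count word families in a text file."
    )
    # Positional argument: the text file to analyse.
    parser.add_argument("file", help="path to the text file")
    # Flag: off by default, turned on when --verbose is passed.
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="increase output verbosity for debugging",
    )
    # Option: name of the spaCy model, with the documented default.
    parser.add_argument(
        "--language",
        default="en_core_web_sm",
        help="spaCy model to use (default: en_core_web_sm)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.verbose, args.language)
```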
Output
The script will display:
- Total number of words in the text
- Total number of unique word families
- A list of word families sorted by frequency (descending) and then alphabetically
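The ordering described above (frequency descending, ties broken alphabetically) can be expressed with a single sort key. The sketch below assumes the counts are held in a `collections.Counter`; the names `sort_families` and `format_report` and the exact report wording are hypothetical, not the script's real output format.

```python
from collections import Counter


def sort_families(counts: Counter) -> list[tuple[str, int]]:
    """Sort by count (highest first), then by family name (A to Z)."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_report(counts: Counter) -> str:
    """Build a plain-text summary with totals followed by the sorted families."""
    lines = [
        f"Total words: {sum(counts.values())}",
        f"Unique word families: {len(counts)}",
    ]
    lines += [f"{family}: {count}" for family, count in sort_families(counts)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(Counter({"walk": 2, "run": 2, "be": 5, "apple": 1})))
```

Negating the count in the sort key gives descending frequency while leaving the alphabetical tie-break ascending.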
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Contact
BlueBirdBack - avery@bluebirdback.com
Project Link: https://github.com/BlueBirdBack/word_family_counter
File details
Details for the file word_family_counter-0.1.0.tar.gz.
File metadata
- Download URL: word_family_counter-0.1.0.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5cf38cae80dd38d8530f326691bd1ca6d585fae941441fff92b0ebd0bffe3c71
MD5 | 88a4b67722e4457d6a80bc57b9765aaa
BLAKE2b-256 | debfc638e956a22ce33734dd72433a82f65985bc1578f90b48c000298f7adf00
File details
Details for the file word_family_counter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: word_family_counter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.11.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1c4105237cdf6ad4aa6ce2aa6c073c68043ee508e7c7f9ed997ddb1477d4ebc8
MD5 | 3ec3a9f80ae01279de4b83bcc4f28a2d
BLAKE2b-256 | fa731a9e6b0df873c99760847a74b4dacc7d14bcd653c697573390a11855f353