Text processing and analysis for HathiTrust Research Center
HTRC-Text-Processing Library
A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.
Table of Contents
About htrc-text-processing
Library
Detailed Description goes here.
Installation
To install, run:
pip install htrc-text-processing
That's it! This library requires Python 3.6 or later. If you are new to Python, note that you will need pip to run the command above.
Usage
-
Function: get_zips()
Finds the zip files at the end of the pairtree, moves them to a new folder, expands them there, and then removes the zips.
Inputs:
- Path (string) to directory that holds the pairtree.
- Path (string) to directory that will hold the folders from expanded zips.
htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
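To make the described behavior concrete, here is a minimal standard-library sketch of the same idea: walk a directory tree, move each zip file to an output folder, expand it, and delete the zip. It is not the library's implementation. The function name `expand_pairtree_zips` and the choice to expand each archive into a folder named after the zip are assumptions made for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def expand_pairtree_zips(pairtree_root, output_dir):
    """Hypothetical stand-in for the behavior described above."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    expanded = []
    # sorted() materializes the list first, so moving files
    # does not disturb the directory walk.
    for zip_path in sorted(Path(pairtree_root).rglob("*.zip")):
        moved = output / zip_path.name
        shutil.move(str(zip_path), str(moved))
        # Assumption: one folder per archive, named after the zip file.
        target = output / moved.stem
        with zipfile.ZipFile(moved) as archive:
            archive.extractall(target)
        moved.unlink()  # remove the zip once it has been expanded
        expanded.append(target)
    return expanded
```

The function returns the list of folders it created, which could then serve as a list of page directories for later steps.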
-
Function: normalize_txt_file_names()
Cleans and normalizes page file names.
Example: turns
39002088672754_000001.txt
into
00000001.txt
htrc_text_processing.normalize_txt_file_names('txt path or dir to txts')
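The renaming rule is only shown by example, so the following is a hedged sketch of one rule that reproduces it: take the trailing run of digits in the file name and zero-pad it to eight characters. The helper name `normalized_page_name` and the fixed width of 8 are assumptions inferred from the example, not the library's documented behavior.

```python
import re
from pathlib import Path


def normalized_page_name(file_name, width=8):
    """Hypothetical helper: derive a zero-padded page file name."""
    path = Path(file_name)
    # Match the digits at the very end of the stem (the page sequence number).
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        return None  # no page number found, leave the decision to the caller
    return f"{int(match.group(1)):0{width}d}{path.suffix}"


print(normalized_page_name("39002088672754_000001.txt"))  # 00000001.txt
```

A name that is already normalized, such as 00000001.txt, passes through unchanged under this rule.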
-
Function: clean_vol()
Cleans the page files of each volume and concatenates them into a single text file per volume.
Inputs:
- List of paths (strings) to directories that hold page files, one directory per volume
- Path (string) to the output directory where the clean, single-file volumes will be stored
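The concatenation step can be pictured with a short sketch. This is not the library's code: the cleaning rules are not documented here, so the example only strips surrounding whitespace from each page, and both the function name `concatenate_pages` and the output naming (`<volume directory name>.txt`) are assumptions.

```python
from pathlib import Path


def concatenate_pages(page_dirs, output_dir):
    """Hypothetical sketch: join each volume's pages into one text file."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for page_dir in map(Path, page_dirs):
        # Zero-padded page names sort correctly as plain strings.
        pages = sorted(page_dir.glob("*.txt"))
        # Placeholder "cleaning": strip whitespace around each page.
        text = "\n".join(p.read_text(encoding="utf-8").strip() for p in pages)
        # Assumption: the output file is named after the volume directory.
        target = output / f"{page_dir.name}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Sorting by name is the reason normalized, zero-padded page file names matter: without padding, page 10 would sort before page 2.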
-
Function: check_vol()
Inputs:
- List of paths to page directories
- Path to the output directory that holds the cleaned volumes
Output:
- List of the page directories that have not been cleaned yet
new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
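A check of this kind is useful for resuming an interrupted run. The sketch below shows one way such a filter could work, assuming each cleaned volume is stored as `<page directory name>.txt` in the output directory. That naming convention and the function name `uncleaned_page_dirs` are assumptions for illustration, not the library's documented behavior.

```python
from pathlib import Path


def uncleaned_page_dirs(page_dirs, clean_out_dir):
    """Hypothetical sketch: keep only directories with no cleaned output yet."""
    # Assumption: a cleaned volume is saved as "<page directory name>.txt".
    done = {p.stem for p in Path(clean_out_dir).glob("*.txt")}
    return [d for d in page_dirs if Path(d).name not in done]
```

The returned list keeps the original order and element types, so it can be passed straight back into the cleaning step.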
Issues? Please file them here.