Text processing and analysis for HathiTrust Research Center
HTRC-Text-Processing Library
A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.
Table of Contents
About htrc-text-processing
Library
Detailed Description goes here.
Installation
To install, run:
pip install htrc-text-processing
That's it! This library is written for Python 3.6+. If you are new to Python, note that you'll need pip installed first.
Usage
-
Function:
get_zips()
A function that finds the zip files at the end of the pairtree, moves them to a new folder, expands them there, and then removes the zips.
Inputs:
- Path (string) to directory that holds the pairtree.
- Path (string) to directory that will hold the folders from expanded zips.
htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
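To make the described behavior concrete, here is a minimal standard-library sketch of the same idea: walk a directory tree, move every zip it contains into an output folder, expand it, and delete the archive. It is not the library's implementation. The name `expand_pairtree_zips` and the one-folder-per-zip layout are assumptions made for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def expand_pairtree_zips(pairtree_root, output_dir):
    """Hypothetical helper (not part of htrc-text-processing).

    Moves each zip under pairtree_root into output_dir, expands it into a
    folder named after the zip, and deletes the archive. Returns the list
    of folders that were created.
    """
    root = Path(pairtree_root)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    expanded = []
    # sorted() materializes the listing first, so moving files
    # does not disturb the directory walk.
    for zip_path in sorted(root.rglob("*.zip")):
        moved = out / zip_path.name
        shutil.move(str(zip_path), str(moved))

        # Assumption: one output folder per zip, named after the archive stem.
        target = out / moved.stem
        with zipfile.ZipFile(moved) as archive:
            archive.extractall(target)

        moved.unlink()  # remove the zip once it has been expanded
        expanded.append(target)
    return expanded
```

Returning the created folders makes it easy to hand the result on as a list of page directories for later steps.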
-
Function:
normalize_txt_file_names()
A function that cleans and normalizes page file names.
Example: turns 39002088672754_000001.txt into 00000001.txt
htrc_text_processing.normalize_txt_file_names('<path to a .txt file or to a directory of .txt files>')
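The renaming rule in the example can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the page number is the run of digits just before `.txt` (after the last underscore, if any) and that names are zero-padded to eight digits, as in the example. The helper names `normalized_page_name` and `rename_pages` are hypothetical.

```python
import re
from pathlib import Path

# Assumption: page files end in "<digits>.txt", optionally after an underscore.
_PAGE_NUMBER = re.compile(r"(\d+)\.txt$")


def normalized_page_name(file_name, width=8):
    """Hypothetical helper: return the zero-padded page-number file name."""
    last_part = file_name.rsplit("_", 1)[-1]
    match = _PAGE_NUMBER.search(last_part)
    if match is None:
        raise ValueError(f"no page number found in {file_name!r}")
    return f"{int(match.group(1)):0{width}d}.txt"


def rename_pages(directory):
    """Hypothetical helper: rename every .txt file in directory in place."""
    for path in sorted(Path(directory).glob("*.txt")):
        new_path = path.with_name(normalized_page_name(path.name))
        if new_path != path:
            path.rename(new_path)


print(normalized_page_name("39002088672754_000001.txt"))  # 00000001.txt
```

Because an already normalized name maps to itself, running the sketch twice on the same directory leaves it unchanged.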
-
Function:
clean_vol()
Inputs:
- List of paths (strings) to directories that hold page files, one directory per volume
- Path (string) to output directory where clean single text files will be stored after cleaning and concatenating pages together
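The cleaning rules themselves are not described here, so the sketch below covers only the concatenation step: joining each volume's page files, in file-name order, into one text file per volume. It is a standard-library illustration, not the library's implementation. The helper name `concatenate_pages` and the output naming (`<page directory name>.txt`) are assumptions.

```python
from pathlib import Path


def concatenate_pages(page_dirs, output_dir):
    """Hypothetical helper (not part of htrc-text-processing).

    Joins the .txt pages of each directory in page_dirs into a single
    file in output_dir and returns the paths that were written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for page_dir in map(Path, page_dirs):
        # File-name order matches page order once names are zero-padded.
        pages = sorted(page_dir.glob("*.txt"))
        text = "\n".join(p.read_text(encoding="utf-8") for p in pages)

        # Assumption: the volume file is named after its page directory.
        volume_file = out / f"{page_dir.name}.txt"
        volume_file.write_text(text, encoding="utf-8")
        written.append(volume_file)
    return written
```

Sorting by file name is only a safe stand-in for page order when the page names share a fixed width, which is what normalized names provide.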
-
Function:
check_vol()
Inputs:
- List of page directories
- Output directory for the cleaned volumes
Output:
- List of the page directories that have not been cleaned yet
new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
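One way to picture this check is as a filter over the page directory list that keeps only the directories with no matching file in the output directory. The sketch below is an illustration, not the library's code. How the library matches a page directory to its cleaned volume is not stated here, so the `<page directory name>.txt` naming and the helper name `uncleaned_page_dirs` are assumptions.

```python
from pathlib import Path


def uncleaned_page_dirs(page_dirs, cleaned_dir):
    """Hypothetical helper (not part of htrc-text-processing).

    Returns the entries of page_dirs that have no cleaned volume file
    in cleaned_dir, preserving their original order.
    """
    cleaned = Path(cleaned_dir)
    # Assumption: a cleaned volume is stored as "<page directory name>.txt".
    return [
        d for d in page_dirs
        if not (cleaned / f"{Path(d).name}.txt").is_file()
    ]
```

A filter like this lets an interrupted run resume by processing only the volumes that are still missing from the output directory.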
Issues? Please file them here.
Hashes for htrc-text-processing-0.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ccaf0d8439db1b9e7363b6c152cc0b306e7ec0f2654381bf348a8ad54b109441
MD5 | 9626469dc16aace3b8b2de10927184b4
BLAKE2b-256 | 9eb2c8513a11019ab8c1e81452e82ac0c2285e0794a0e7c4f9ae0b5961cca583
Hashes for htrc_text_processing-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 93d48f18c175f232809dcb43d41de059d7b42568e0696b61382f0babe8e28933
MD5 | 183e3177871dd519e29e033a71a23178
BLAKE2b-256 | c236a0bec736653acf7e891e0762b30d220aa3c8d3d6c730d26c587dd2adcaf1