Text processing and analysis for HathiTrust Research Center

HTRC-Text-Processing Library

A tool for processing pairtree-format data from the 17 million digitized works at HathiTrust.

Table of Contents

  1. About htrc-text-processing Library
  2. Installation
  3. Usage
  4. Examples

About htrc-text-processing Library

Detailed Description goes here.

Installation

To install, run:

pip install htrc-text-processing

That's it! This library is written for Python 3.6+. If you are new to Python, note that the command above requires pip.

Usage

  • Function: get_zips()

    A function that finds the zip files at the end of the pairtree, moves them to a new folder and expands them, removing the zips.

    Inputs:

    1. Path (string) to directory that holds the pairtree.
    2. Path (string) to directory that will hold the folders from expanded zips.

    htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
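
The behavior described above (find the zips, move them, expand them, delete them) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name `collect_and_expand_zips` is hypothetical, and expanding each archive into a folder named after the zip is an assumption. It also assumes the output directory lies outside the pairtree.

```python
import os
import shutil
import zipfile


def collect_and_expand_zips(pairtree_root, output_dir):
    """Move every .zip found under pairtree_root into output_dir and expand it.

    Hypothetical stand-in for illustration; not the library's get_zips().
    Returns the sorted list of folders created from the expanded zips.
    """
    os.makedirs(output_dir, exist_ok=True)
    expanded = []
    for dirpath, _dirnames, filenames in os.walk(pairtree_root):
        for name in filenames:
            if not name.lower().endswith(".zip"):
                continue
            # Move the archive out of the pairtree first.
            moved = shutil.move(os.path.join(dirpath, name),
                                os.path.join(output_dir, name))
            # Assumption: one folder per archive, named after the zip.
            target = os.path.join(output_dir, os.path.splitext(name)[0])
            with zipfile.ZipFile(moved) as archive:
                archive.extractall(target)
            # Remove the zip once its contents are on disk.
            os.remove(moved)
            expanded.append(target)
    return sorted(expanded)
```

Walking the whole tree avoids having to decode the pairtree path segments: any zip found at any depth is treated as a volume archive.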
    
  • Function: normalize_txt_file_names()

    A function that cleans and normalizes page file names.

    Example: turns 39002088672754_000001.txt into 00000001.txt

    htrc_text_processing.normalize_txt_file_names('txt path or dir to txts') 
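
To make the renaming rule in the example concrete, here is a small sketch of one rule that reproduces it. The exact rule the library applies is not documented here, so this is an assumption: take the trailing run of digits in the file name and zero-pad it to eight digits. The names `normalized_page_name` and `normalize_directory` are hypothetical.

```python
import os
import re


def normalized_page_name(file_name, width=8):
    """Return the page file name reduced to its zero-padded sequence number.

    Assumed rule (not confirmed by the library): keep the trailing digits
    of the stem and pad them to `width` digits.
    """
    stem, ext = os.path.splitext(file_name)
    match = re.search(r"(\d+)$", stem)
    if match is None:
        # No trailing digits: leave the name untouched.
        return file_name
    return f"{int(match.group(1)):0{width}d}{ext}"


def normalize_directory(directory):
    """Rename every .txt file in `directory` using normalized_page_name()."""
    renamed = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue
        new_name = normalized_page_name(name)
        if new_name != name:
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, new_name))
            renamed.append((name, new_name))
    return renamed
```

With this rule, `normalized_page_name('39002088672754_000001.txt')` returns `'00000001.txt'`, matching the example above.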
    
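  • Function: clean_vol()

    A function that cleans the page files of each volume and concatenates them into a single text file per volume.

    Inputs:

    1. List of paths (strings) to directories that hold page files, one directory per volume
    2. Path (string) to the output directory where the clean single text files will be stored after the pages are cleaned and concatenated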
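
The concatenation half of this step can be sketched as follows. The cleaning rules themselves are not described in this README, so the sketch only strips trailing whitespace from each page, which is a placeholder and not the library's cleaning logic. The function name `concatenate_pages` and the `<volume folder name>.txt` output naming are assumptions.

```python
import os


def concatenate_pages(page_dir, output_dir):
    """Join the .txt pages of one volume, in file-name order, into one file.

    Hypothetical illustration; the real clean_vol() also cleans the text,
    which is not reproduced here.
    """
    os.makedirs(output_dir, exist_ok=True)
    # Assumption: the output file is named after the volume folder.
    volume_name = os.path.basename(os.path.normpath(page_dir))
    out_path = os.path.join(output_dir, volume_name + ".txt")
    page_files = sorted(f for f in os.listdir(page_dir) if f.endswith(".txt"))
    with open(out_path, "w", encoding="utf-8") as out_file:
        for page_file in page_files:
            page_path = os.path.join(page_dir, page_file)
            with open(page_path, encoding="utf-8") as page:
                # Placeholder "cleaning": drop trailing whitespace only.
                out_file.write(page.read().rstrip() + "\n")
    return out_path
```

Sorting by file name gives the correct page order only when the names are zero-padded, which is what the file-name normalization step provides.
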

  • Function: check_vol()

    A function that reports which volumes have not been cleaned yet.

    Inputs:

    1. List of page directories
    2. Output directory that holds the cleaned volumes

    Output:

    1. List of page directories that have not been cleaned yet

    new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
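
One way such a check could work is sketched below. How the library matches a page directory to its cleaned file is not stated here, so this assumes a cleaned volume is stored as `<volume folder name>.txt` in the output directory. The function name `uncleaned_page_dirs` is hypothetical.

```python
import os


def uncleaned_page_dirs(page_dirs, clean_out_dir):
    """Return the page directories that have no cleaned file yet.

    Assumption (not confirmed by the library): a cleaned volume is saved
    as <volume folder name>.txt inside clean_out_dir.
    """
    done = set()
    if os.path.isdir(clean_out_dir):
        # Stems of the files already present in the output directory.
        done = {os.path.splitext(name)[0] for name in os.listdir(clean_out_dir)}
    return [
        page_dir
        for page_dir in page_dirs
        if os.path.basename(os.path.normpath(page_dir)) not in done
    ]
```

A check like this makes an interrupted cleaning run resumable: the returned list can be passed straight back to the cleaning step.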
    

Issues? Please file them here.
