Text processing and analysis for HathiTrust Research Center

HTRC-Text-Processing Library

A tool for processing the pairtree-format data of the 17 million digitized works at HathiTrust.

Table of Contents

  1. About htrc-text-processing Library
  2. Installation
  3. Usage
  4. Examples

About htrc-text-processing Library

Detailed Description goes here.

Installation

To install, run:

pip install htrc-text-processing

That's it! This library is written for Python 3.6+. If you are new to Python, note that the command above requires pip.

Usage

  • Function: get_zips()

    A function that finds the zip files at the end of the pairtree, moves them to a new folder and expands them, removing the zips.

    Inputs:

    1. Path (string) to directory that holds the pairtree.
    2. Path (string) to directory that will hold the folders from expanded zips.

    htrc_text_processing.get_zips('<path to pairtree parent/s>', '<path to output directory>')
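
To make the steps above concrete, here is a minimal standard-library sketch of the same idea: find the zips, move them, expand them, delete them. It is not the library's implementation. The function name `collect_and_expand_zips` and the choice to expand each zip into a folder named after the zip are assumptions made for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def collect_and_expand_zips(pairtree_root, output_dir):
    """Hypothetical helper: move every .zip under pairtree_root into
    output_dir, expand it into a folder named after the zip, then
    delete the zip. Returns the list of expanded folders."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    expanded = []
    # sorted() materializes the search before any file is moved.
    for zip_path in sorted(Path(pairtree_root).rglob("*.zip")):
        moved = output_dir / zip_path.name
        shutil.move(str(zip_path), str(moved))
        target = output_dir / moved.stem  # assumption: folder named after the zip
        with zipfile.ZipFile(moved) as archive:
            archive.extractall(target)
        moved.unlink()  # remove the zip once its contents are expanded
        expanded.append(target)
    return expanded
```

The sketch assumes zip file names are unique across the pairtree; two zips with the same name would be expanded into the same folder.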
    
  • Function: normalize_txt_file_names()

    A function that cleans and normalizes page file names.

    Example: turns 39002088672754_000001.txt into 00000001.txt

    htrc_text_processing.normalize_txt_file_names('txt path or dir to txts') 
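
The renaming in the example can be sketched with the standard library as follows. This is an illustration only, not the library's code: the names `normalized_page_name` and `normalize_page_files` are hypothetical, and the rule used here (take the trailing digits of the file name and zero-pad them to eight places) is an assumption inferred from the single example above.

```python
import re
from pathlib import Path


def normalized_page_name(file_name, width=8):
    """Hypothetical helper: build a zero-padded page name from the
    trailing digits of the file name, keeping the extension."""
    path = Path(file_name)
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        return path.name  # no trailing digits: leave the name unchanged
    return f"{int(match.group(1)):0{width}d}{path.suffix}"


def normalize_page_files(directory):
    """Rename every .txt file in directory to its normalized page name."""
    for path in sorted(Path(directory).glob("*.txt")):
        new_name = normalized_page_name(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))


print(normalized_page_name("39002088672754_000001.txt"))  # 00000001.txt
```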
    
  • Function: clean_vol()

    Inputs:

    1. List of paths (strings) to directories that hold page files, one directory per volume
    2. Path (string) to the output directory where the cleaned text files are stored, one per volume, after the pages are cleaned and concatenated
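
Going by the inputs listed above, a call would presumably take the form `htrc_text_processing.clean_vol(page_directory_list, clean_vol_out_dir)`. This README does not describe the cleaning steps themselves, so the sketch below only illustrates the concatenation half: it joins the pages of one volume, in file-name order, into a single text file. The function name `concatenate_pages`, the whitespace stripping that stands in for cleaning, and the `<directory name>.txt` output name are all assumptions.

```python
from pathlib import Path


def concatenate_pages(page_dir, output_dir):
    """Hypothetical helper: join the .txt pages in page_dir, in
    file-name order, into one file inside output_dir."""
    page_dir = Path(page_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(page_dir.glob("*.txt"))
    # Stand-in for real cleaning: strip surrounding whitespace per page.
    text = "\n".join(p.read_text(encoding="utf-8").strip() for p in pages)
    out_path = output_dir / f"{page_dir.name}.txt"  # assumed naming scheme
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Sorting by file name gives the correct page order only when the names are zero-padded, which is one reason to normalize page file names first.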

  • Function: check_vol()

    Inputs:

    1. List of page directories, one per volume
    2. Output directory that holds the cleaned volumes

    Output:

    1. List of the page directories that have not been cleaned yet

    new_page_directory_list = htrc_text_processing.check_vol(page_directory_list, clean_vol_out_dir)
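
A check of this kind could be sketched as below. This is not the library's code: the function name `pending_page_dirs` is hypothetical, and the sketch assumes that a cleaned volume is stored as `<page directory name>.txt` in the output directory, which this README does not confirm.

```python
from pathlib import Path


def pending_page_dirs(page_dirs, clean_out_dir):
    """Hypothetical helper: return the page directories that have no
    matching cleaned file in clean_out_dir yet."""
    # Assumption: a cleaned volume is named after its page directory.
    done = {p.stem for p in Path(clean_out_dir).glob("*.txt")}
    return [d for d in page_dirs if Path(d).name not in done]
```

Filtering the list this way lets an interrupted cleaning run resume without redoing the volumes that are already finished.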
    

Issues? Please file them here.
