
A package for picking the juiciest text morsels out of a pile of documents.


rosinenpicker


Manual

Welcome to rosinenpicker! This tool is like a magical sieve that helps you find golden nuggets (or "Rosinen") of information within a mountain of documents. It's designed for anyone who needs to extract specific pieces of information without diving deep into the technicalities.

Understanding Key Terms

  • Command Line: A text-based interface to operate your computer. Imagine telling your computer exactly what to do by typing in commands.
  • YAML: A simple configuration file format used by rosinenpicker to understand your instructions. It's easy to read and write.
  • Arguments: Special instructions you provide to rosinenpicker when you start it, telling it where to find its instructions (YAML file) and where to store its findings.

Getting Started

  1. Python 3.11 is a prerequisite: Make sure you have Python 3.11 or above installed. There are various ways to install Python, but I recommend Miniconda.

  2. Installation: First, let's bring rosinenpicker to your computer. Open your command line and type:

    pip install rosinenpicker
    
  3. Running the Program: To launch rosinenpicker, enter the following:

    rosinenpicker -c path/to/your_config.yml -d path/to/your_database.db
    

    Replace path/to/your_config.yml with the actual path to your configuration file, and path/to/your_database.db with where you'd like to save the findings. (If not specified, the configuration and database files are assumed to be config.yml and matches.db in your current directory; also, the database is automatically created if it is not present on your system.)

Creating Your YAML Configuration

Here's a sample configuration to guide rosinenpicker:

title: 'My Document Search'
strategies:
  strategy1:
    processed_directory: '/path/to/documents'
    file_name_pattern: '.*\.pdf'
    file_format: 'pdf'
    terms:
      term1: 'apple pie'
    export_format: 'csv'
    export_path: '/path/to/export.csv'

This tells rosinenpicker to look in /path/to/documents for PDF files containing "apple pie" and save the results in a CSV file at /path/to/export.csv. For further information, check out the sample configuration file in this repository; it contains additional comments you may find useful.
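Before running a search, it can help to check which file names a file_name_pattern such as '.*\.pdf' would select. The following is only a sketch in plain Python: the helper matching_files is hypothetical, and it assumes the pattern has to match the whole file name, which may differ from how rosinenpicker applies it internally.

```python
import re


def matching_files(pattern: str, file_names: list[str]) -> list[str]:
    """Hypothetical helper: keep the names that the pattern matches in full.

    Assumption: the whole file name must match (re.fullmatch);
    rosinenpicker itself may apply the pattern differently.
    """
    regex = re.compile(pattern)
    return [name for name in file_names if regex.fullmatch(name)]


if __name__ == "__main__":
    names = ["a.pdf", "b.txt", "report.pdf.bak"]
    print(matching_files(r".*\.pdf", names))
```

Note the backslash before the dot: without it, the dot would match any character rather than a literal ".".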

Going deeper

Of course, extracting the literal term "apple pie" from documents is not very useful on its own, but you can do much more. Instead of "apple pie" you can enter a regular expression, e.g. "\d{8}" to extract numbers consisting of exactly eight digits. In addition, if you enter an expression containing "@@@" (which stands for "variable string"), only the text matched in place of "@@@" is returned. For example, "Name: @@@" returns whatever follows "Name:".
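To get a feel for the two kinds of terms, here is a minimal sketch in plain Python. It is not rosinenpicker's actual implementation: the function extract is hypothetical, and it assumes that "@@@" behaves like a capture group whose content is returned.

```python
import re

PLACEHOLDER = "@@@"  # the "variable string" marker described above


def extract(term: str, text: str) -> list[str]:
    """Hypothetical helper: return the matches of a term within text."""
    if PLACEHOLDER in term:
        # Assumption: '@@@' acts like a capture group; only its content is returned.
        before, after = term.split(PLACEHOLDER, 1)
        # Lazy match if something follows '@@@', otherwise take the rest of the line.
        pattern = f"{before}(.+?){after}" if after else f"{before}(.+)"
        return [m.group(1).strip() for m in re.finditer(pattern, text)]
    # Plain regular expression: return every full match.
    return [m.group(0) for m in re.finditer(term, text)]


if __name__ == "__main__":
    sample = "Name: Ada Lovelace\nInvoice 12345678 dated 2024\n"
    print(extract(r"\d{8}", sample))
    print(extract("Name: @@@", sample))
```

In this sketch the text before and after "@@@" is still treated as a regular expression, so any special characters there would need escaping.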

Even further fine-grained control

Optionally, you can refine a term further by appending one or more of the following flags after the string '===' (three equal signs):

  • m (multiline) will allow multiline pattern matching (default: off)
  • l (linebreak to space) will replace linebreaks with space (only applies for multiline matching, default: off)
  • c(x) (crop length to x) will crop the length of the returned string to x (default: off)
  • ? will mark the term as optional (default: off, i.e. without the question mark the term is assumed to be required); if set, the optional key move_to_directory (see sample configuration file) will ignore this term.

You can use any one of the above options alone or combine several of them; the order does not matter as long as each option is written with its flag as shown above. For example, the term start@@@finish===mc(100)l? searches for text between the patterns "start" and "finish" across multiple lines, replaces line breaks with spaces, crops the returned text to 100 characters, and marks the term as optional (i.e. not required). It could equally be written as start@@@finish===lc(100)?m.
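The flag syntax above can be illustrated with a small parser. This is a sketch under stated assumptions, not the parser rosinenpicker uses: the names TermOptions and parse_term are hypothetical, and the code simply splits at the last '===' and looks for each flag in any order.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class TermOptions:
    """Hypothetical container for a term and its flags."""
    pattern: str
    multiline: bool = False            # flag 'm'
    linebreak_to_space: bool = False   # flag 'l'
    crop_length: Optional[int] = None  # flag 'c(x)'
    optional: bool = False             # flag '?'


def parse_term(term: str) -> TermOptions:
    """Split a term at the last '===' and read the flags in any order."""
    pattern, sep, flags = term.rpartition("===")
    if not sep:
        # No '===' present: the whole string is the pattern, all flags stay off.
        return TermOptions(pattern=term)
    opts = TermOptions(pattern=pattern)
    crop = re.search(r"c\((\d+)\)", flags)
    if crop:
        opts.crop_length = int(crop.group(1))
        # Remove 'c(x)' so it cannot be confused with the other flags.
        flags = flags.replace(crop.group(0), "")
    opts.multiline = "m" in flags
    opts.linebreak_to_space = "l" in flags
    opts.optional = "?" in flags
    return opts


if __name__ == "__main__":
    print(parse_term("start@@@finish===mc(100)l?"))
    print(parse_term("start@@@finish===lc(100)?m"))
```

Both calls produce the same options, which mirrors the statement that flag order does not matter.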

Using rosinenpicker

With your config.yml ready, go back to the command line and run rosinenpicker with the -c and -d arguments as shown above.

Help and Options

For a list of commands and options, type:

rosinenpicker -h

This command displays all you need to know to navigate rosinenpicker:

usage: rosinenpicker [-h] [-c CONFIG] [-d DATABASE] [-v] [-r READOUT]

A package for picking the juciest text morsels out of a pile of documents.

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Path to configuration YAML file.
  -d DATABASE, --database DATABASE
                        Path to SQLite database file.
  -v, --version         Print version and exit.
  -r READOUT, --readout READOUT
                        Only read contents of file and exit.
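
Since the -d option points to an SQLite database file, the findings can be inspected with Python's built-in sqlite3 module. The sketch below only lists the table names, because the table layout rosinenpicker creates is not described here; the function name list_tables is hypothetical, and matches.db is the default file name mentioned earlier.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in an SQLite database file."""
    # Note: sqlite3.connect creates an empty database if the file does not exist.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumes rosinenpicker has already been run in the current directory.
    print(list_tables("matches.db"))
```

Once the table names are known, their contents can be read with ordinary SELECT statements.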

Conclusion

You're all set to explore and extract valuable information with rosinenpicker. Happy information hunting!
