Palimpzest is a system that enables anyone to process AI-powered analytical queries simply by defining them in a declarative language.

Project description

Palimpzest (PZ)

Getting started

You can install the Palimpzest package and CLI on your machine by cloning this repository and running:

$ git clone git@github.com:mikecafarella/palimpzest.git
$ cd palimpzest
$ pip install .

Palimpzest CLI

Installing Palimpzest also installs its CLI tool pz, which provides basic utilities for creating and managing your own Palimpzest system. Running pz --help displays an overview of the CLI's commands:

$ pz --help
Usage: pz [OPTIONS] COMMAND [ARGS]...

  The CLI tool for Palimpzest.

Options:
  --help  Show this message and exit.

Commands:
  help (h)                        Print the help message for PZ.
  init (i)                        Initialize data directory for PZ.
  ls-data (ls,lsdata)             Print a table listing the datasets
                                  registered with PZ.
  register-data (r,reg,register)  Register a data file or data directory with
                                  PZ.
  rm-data (rm,rmdata)             Remove a dataset that was registered with
                                  PZ.

Users can initialize their own system by running pz init. This will create Palimpzest's working directory in ~/.palimpzest:

$ pz init
Palimpzest system initialized in: /Users/matthewrusso/.palimpzest

If we list the datasets registered with Palimpzest, we'll see that there are currently none:

$ pz ls
+------+------+------+
| Name | Type | Path |
+------+------+------+
+------+------+------+

Total datasets: 0

Registering Datasets

To add (or "register") a dataset with Palimpzest, we can use the pz register-data command (also aliased as pz reg) to specify that a file or directory at a given --path should be registered as a dataset with the specified --name:

$ pz reg --path README.md --name rdme
Registered rdme

If we list Palimpzest's datasets again we will see that README.md has been registered under the dataset named rdme:

$ pz ls
+------+------+------------------------------------------+
| Name | Type |                   Path                   |
+------+------+------------------------------------------+
| rdme | file | /Users/matthewrusso/palimpzest/README.md |
+------+------+------------------------------------------+

Total datasets: 1

To remove a dataset from Palimpzest, simply use the pz rm-data command (also aliased as pz rm) and specify the --name of the dataset you would like to remove:

$ pz rm --name rdme
Deleted rdme

Finally, listing our datasets once more will show that the dataset has been deleted:

$ pz ls
+------+------+------+
| Name | Type | Path |
+------+------+------+
+------+------+------+

Total datasets: 0

Cache Management

Palimpzest caches intermediate results by default. It can be useful to remove them from the cache when evaluating the performance improvement(s) of code changes. We provide a utility command pz clear-cache (also aliased as pz clr) to clear the cache:

$ pz clr
Cache cleared

Config Management

You may wish to work with multiple configurations of Palimpzest in order to, e.g., evaluate the difference in performance between various LLM services for your data extraction task. To see the config Palimpzest is currently using, you can run the pz print-config command (also aliased as pz config):

$ pz config
--- default ---
filecachedir: /some/local/filepath
llmservice: openai
name: default
parallel: false

By default, Palimpzest uses the configuration named default. As shown above, if you run a script using Palimpzest out-of-the-box, it will use OpenAI endpoints for all of its API calls.
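
Because each config is a flat key-value file, it is easy to inspect outside the CLI. The sketch below is illustrative only (it is not a Palimpzest API, and it assumes the flat key: value layout that pz config prints above):

```python
from pathlib import Path

def read_flat_config(path):
    """Parse a flat 'key: value' config file into a dict.

    Illustrative only: real Palimpzest configs are YAML, and this
    sketch handles just the flat layout shown by `pz config`.
    """
    cfg = {}
    for line in Path(path).read_text().splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            cfg[key.strip()] = value.strip()
    return cfg
```

For anything beyond this flat layout, a proper YAML parser (e.g. PyYAML's safe_load) would be the right tool.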

Now, suppose you want to try together.ai for your API calls. You can do this by creating a new config with the pz create-config command (also aliased as pz cc):

$ pz cc --name together-conf --llmservice together --parallel True --set
Created and set config: together-conf

The --name parameter is required and specifies the unique name for your config. The --llmservice and --parallel options specify the service to use and whether or not to process files in parallel. Finally, if the --set flag is present, Palimpzest will update its current config to point to the newly created config.

We can confirm that Palimpzest checked out our new config by running pz config:

$ pz config
--- together-conf ---
filecachedir: /some/local/filepath
llmservice: together
name: together-conf
parallel: true

You can switch which config you are using at any time by using the pz set-config command (also aliased as pz set):

$ pz set --name default
Set config: default

$ pz config
--- default ---
filecachedir: /some/local/filepath
llmservice: openai
name: default
parallel: false

$ pz set --name together-conf
Set config: together-conf

$ pz config
--- together-conf ---
filecachedir: /some/local/filepath
llmservice: together
name: together-conf
parallel: true

Finally, you can delete a config with the pz rm-config command (also aliased as pz rmc):

$ pz rmc --name together-conf
Deleted config: together-conf

Note that you cannot delete the default config, and if you delete the config that you currently have set, Palimpzest will set the current config to be default.

Configuring for Parallel Execution

There are a few things you need to do in order to use remote parallel services.

If you want to use parallel LLM execution on together.ai, set llmservice: together and parallel: True in your config file (by default, Palimpzest uses ~/.palimpzest/config_default.yaml).

If you want to use parallel PDF processing at modal.com, you have to:

  1. Set pdfprocessing: modal in the config.yaml file.
  2. Run modal deploy src/palimpzest/tools/allenpdf.py. This deploys the Modal function remotely so Palimpzest can invoke it. The function is probably already deployed, but re-running the command is harmless, and it is required whenever the server-side function in that file changes.
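
Putting both settings together, a config file with remote parallel services enabled might look like the following (the filecachedir path is illustrative, and pdfprocessing is only needed for Modal-based PDF processing):

```yaml
# ~/.palimpzest/config_default.yaml
filecachedir: /some/local/filepath
llmservice: together
name: default
parallel: true
pdfprocessing: modal
```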

Python Demo

Below are simple instructions to run pz on a test data set of enron emails that is included with the system:

  • Initialize the configuration by running pz init.

  • Add the Enron email dataset with: pz reg --path testdata/enron-tiny --name enron-tiny, then run it through the test program with: python tests/simpleDemo.py --task enron --datasetid enron-tiny

  • Add the test paper set with: pz reg --path testdata/pdfs-tiny --name pdfs-tiny, then run it through the test program with: python tests/simpleDemo.py --task paper --datasetid pdfs-tiny

  • Palimpzest defaults to using OpenAI, so you'll need to export your API key in the OPENAI_API_KEY environment variable.

Download files

Download the file for your platform.

Source Distribution

  • palimpzest-0.1.14.tar.gz (95.3 kB, uploaded via twine/5.1.0, CPython/3.12.3)

Built Distribution

  • palimpzest-0.1.14-py3-none-any.whl (106.6 kB, uploaded via twine/5.1.0, CPython/3.12.3)

File hashes

Hashes for palimpzest-0.1.14.tar.gz:

  • SHA256: 4a1b624a65e5e44863b9c064ad599a115357abd790d3a6ea1f18392f8091cfe3
  • MD5: ba9cf0a321e5c3317616ae0592e3cf2c
  • BLAKE2b-256: 3f00d8eabf133d40a5f4ff3685aa1d8faa0c4c68eb81c34f1e307838f5f18a66

Hashes for palimpzest-0.1.14-py3-none-any.whl:

  • SHA256: 941ed61be9a0bfee1302244309a3e82e54cd815a7e03eb1e0bb4c6c1e43eddc4
  • MD5: c0f5ee230aaf014e2b5c16c0d262e79d
  • BLAKE2b-256: fa597da24428ad578f69ceaf7fb6ec016c07263f0763021f944cfad00b62a9e9
