LLM integrations for DocumentCloud

llm-documentcloud

Installation

Install this plugin in the same environment as LLM.

llm install llm-documentcloud

Usage

Use the dc: fragment to load documents hosted on DocumentCloud.

# run a basic prompt
llm -f dc:71072 'Summarize this document'

# extract tabular data
llm -f dc:25507045 'Extract the tables in this document as CSV'

Documents can be fetched by ID alone, by ID and slug, or by full URL. The following are equivalent:

llm -f dc:25507045 'Extract the tables in this document as CSV'
llm -f dc:25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico 'Extract the tables in this document as CSV'
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/ 'Extract the tables in this document as CSV'

In each case, a DocumentCloud API client will fetch the document's full text and store it as a fragment for llm.
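
To illustrate why the three forms above are interchangeable, here is a minimal Python sketch of how the text after dc: could be reduced to a numeric document ID. The function name extract_document_id is hypothetical and this is not the plugin's actual code; it only assumes that the ID is the leading run of digits in the last path segment.

```python
import re
from urllib.parse import urlparse


def extract_document_id(argument: str) -> int:
    """Return the numeric ID from an ID, an ID-slug, or a full document URL.

    Hypothetical helper for illustration only, not part of llm-documentcloud.
    """
    if argument.startswith(("http://", "https://")):
        # Keep only the last non-empty path segment, e.g. "25507045-some-slug".
        segments = [s for s in urlparse(argument).path.split("/") if s]
        if not segments:
            raise ValueError(f"no document in URL: {argument!r}")
        argument = segments[-1]

    # The ID is the leading digits, optionally followed by "-slug".
    match = re.match(r"(\d+)(?:-|$)", argument)
    if match is None:
        raise ValueError(f"not a DocumentCloud identifier: {argument!r}")
    return int(match.group(1))


slug = "25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico"
url = f"https://www.documentcloud.org/documents/{slug}/"
ids = {extract_document_id(value) for value in ("25507045", slug, url)}
print(ids)
```

All three inputs collapse to the same ID, which is why any of them can be passed to -f.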

Using file attachments instead of text

DocumentCloud stores each document in several forms: the original PDF file, its extracted text, and an image of each page. You can pass any of these to llm by adding a mode query parameter to the document URL:

# use the original PDF as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=pdf'

# use each page image as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images'

# this is the same, since "grid" is the mode name used on the DocumentCloud frontend
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=grid'

# these are all equivalent and will extract full text
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document'
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=text'
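
The mode aliases shown above can be summarized as a small lookup. The following Python sketch is an assumption about how such a mapping might look, not the plugin's implementation; the names MODE_ALIASES and resolve_mode are hypothetical, and only the five mode values listed in this section are covered.

```python
from urllib.parse import parse_qs, urlparse

# Mode values from this README, mapped to the kind of content that is loaded.
# "grid" aliases "images"; "document" and "text" both mean full text.
MODE_ALIASES = {
    "document": "text",
    "text": "text",
    "pdf": "pdf",
    "images": "images",
    "grid": "images",
}


def resolve_mode(url: str) -> str:
    """Return "text", "pdf" or "images" for a DocumentCloud URL (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    # With no mode parameter, fall back to full text.
    raw = query.get("mode", ["document"])[0]
    if raw not in MODE_ALIASES:
        raise ValueError(f"unsupported mode: {raw!r}")
    return MODE_ALIASES[raw]


base = "https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/"
for suffix in ("", "?mode=document", "?mode=text", "?mode=pdf", "?mode=images", "?mode=grid"):
    print(suffix or "(no mode)", "->", resolve_mode(base + suffix))
```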

Getting specific pages

Sometimes you only want one page. DocumentCloud can link to specific pages, and those URLs can be used here:

# extract text, but only for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document#document/p2'

Note that pages are 1-indexed. You can also get images:

# attach the image for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images#document/p2'

There is no way to get a single page out of a PDF, so with mode=pdf the page is set to None and the whole PDF is attached.
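
As a rough illustration of the page rules above, this Python sketch reads the mode from the query string and the 1-indexed page from a #document/pN fragment, then discards the page for mode=pdf. The function parse_mode_and_page is hypothetical and only mirrors the behavior described in this section; the plugin's real parsing may differ.

```python
import re
from urllib.parse import parse_qs, urlparse


def parse_mode_and_page(url: str):
    """Return (mode, page) for a DocumentCloud URL; page is 1-indexed or None.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    mode = parse_qs(parsed.query).get("mode", ["document"])[0]

    # Page links look like "#document/p2"; anything else means "all pages".
    match = re.fullmatch(r"document/p(\d+)", parsed.fragment)
    page = int(match.group(1)) if match else None

    # A single page cannot be pulled out of a PDF, so the page is dropped.
    if mode == "pdf":
        page = None
    return mode, page


base = "https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/"
print(parse_mode_and_page(base + "?mode=document#document/p2"))
print(parse_mode_and_page(base + "?mode=images#document/p2"))
print(parse_mode_and_page(base + "?mode=pdf#document/p2"))
```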

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment using uv:

cd llm-documentcloud
uv sync

To install the dependencies along with the test dependencies, include the test extra:

uv sync --extra test

To run the tests:

uv run pytest
