llm-documentcloud
LLM integrations for DocumentCloud
Installation
Install this plugin in the same environment as LLM.
llm install llm-documentcloud
Usage
Use the dc: fragment prefix with llm's -f option to load documents hosted on DocumentCloud.
# run a basic prompt
llm -f dc:71072 'Summarize this document'
# extract tabular data
llm -f dc:25507045 'Extract the tables in this document as CSV'
Documents can be fetched by ID alone, by ID and slug, or by full URL. The following are equivalent:
llm -f dc:25507045 'Extract the tables in this document as CSV'
llm -f dc:25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico 'Extract the tables in this document as CSV'
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/ 'Extract the tables in this document as CSV'
In each case, a DocumentCloud API client will fetch the document's full text and store it as a fragment for llm.
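To illustrate why the three forms above are equivalent, here is a minimal sketch of how a numeric ID could be recovered from each of them. The function name `parse_document_id` is hypothetical and is not part of this plugin's API; the plugin's actual parsing may differ.

```python
import re
from urllib.parse import urlparse


def parse_document_id(argument: str) -> int:
    """Hypothetical helper: return the numeric ID from an ID, an ID-slug, or a full URL."""
    # For a full URL keep only the path; otherwise use the argument as given.
    if argument.startswith(("http://", "https://")):
        path = urlparse(argument).path
    else:
        path = argument

    # The ID (optionally followed by "-slug") is the last non-empty path segment.
    last_segment = path.rstrip("/").split("/")[-1]
    match = re.match(r"(\d+)(?:-|$)", last_segment)
    if match is None:
        raise ValueError(f"No document ID found in {argument!r}")
    return int(match.group(1))
```

Under these assumptions, all three arguments shown above resolve to the same ID, 25507045.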
Using file attachments instead of text
DocumentCloud stores each document in several forms: the original PDF file, its extracted text, and an image of each page. You can feed any of these into llm by adding a mode parameter to the URL:
# use the original PDF as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=pdf'
# use each page image as an attachment
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images'
# this is the same, since "grid" is the mode name used on the DocumentCloud frontend
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=grid'
# these are all equivalent and will extract full text
llm -f dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document'
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=text'
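The mode names above fall into three groups: pdf, images (or grid), and text (document, text, or no mode at all). A rough sketch of that mapping follows. The names `MODE_ALIASES` and `resolve_mode` are hypothetical, and raising an error on an unknown mode is a choice made for this sketch, not documented plugin behavior.

```python
from urllib.parse import parse_qs, urlparse

# Assumed grouping, based only on the mode names listed in this README.
MODE_ALIASES = {
    "pdf": "pdf",
    "images": "images",
    "grid": "images",  # "grid" is the frontend's name for the page-image view
    "document": "text",
    "text": "text",
}


def resolve_mode(url: str) -> str:
    """Hypothetical helper: return 'pdf', 'images', or 'text' for a document URL."""
    query = parse_qs(urlparse(url).query)
    # With no mode parameter, fall back to full text, as described above.
    mode = query.get("mode", ["document"])[0]
    if mode not in MODE_ALIASES:
        raise ValueError(f"Unknown mode: {mode!r}")
    return MODE_ALIASES[mode]
```

Note that the URL is passed here without the dc: prefix, which belongs to llm's fragment syntax rather than to the DocumentCloud URL itself.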
Getting specific pages
Sometimes you only want one page. DocumentCloud can link to specific pages, and those URLs can be used here:
# extract text, but only for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=document#document/p2'
Note that pages are 1-indexed. You can also get images:
# attach the image for page 2
llm -f 'dc:https://www.documentcloud.org/documents/25507045-20250118-ufc-intuit-dome-athlete-pay-and-weights-c-amico/?mode=images#document/p2'
There isn't a way to get a single page out of a PDF, so when mode=pdf is passed, the page is set to None and any page anchor in the URL has no effect.
Development
To set up this plugin locally, first check out the code. Then create a new virtual environment using uv:
cd llm-documentcloud
uv sync
To install the test dependencies along with the regular dependencies, include the test extra:
uv sync --extra test
To run the tests:
uv run pytest