datasette-extract

Import unstructured data (text and images) into structured tables

Installation

Install this plugin in the same environment as Datasette.

datasette install datasette-extract

Configuration

This plugin requires an OPENAI_API_KEY environment variable with an OpenAI API key.
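As a minimal sketch, the variable can be exported in the shell session that will start Datasette. The key value below is a placeholder, not a real key, and the final check is only a convenience for confirming that the variable is visible to child processes.

```shell
# Placeholder value: substitute your own OpenAI API key.
export OPENAI_API_KEY="sk-your-key-here"

# Confirm the variable is set before starting Datasette from this shell.
if [ -n "${OPENAI_API_KEY}" ]; then
  echo "OPENAI_API_KEY is set"
fi
```

Datasette started from the same shell session inherits the variable.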

Usage

This plugin provides the following features:

  • In the database action cog menu, select "Create table with extracted data" to create a new table populated with data extracted from text or an image
  • In the table action cog menu, select "Extract data into this table" to extract data into an existing table

When creating a table you can specify the column names and types, and provide an optional hint (like "YYYY-MM-DD" for dates) to influence how the data should be extracted.

When populating an existing table you can provide hints and select which columns should be populated.

Text input can be pasted directly into the textarea.

Drag and drop a PDF or text file onto the textarea to populate it with the contents of that file. PDF files will have their text extracted, but only if the file contains text as opposed to scanned images.

Images can be uploaded directly. These will have OCR run against them using GPT-4 Vision and then that text will be used for structured data extraction.

Permissions

Users must have the datasette-extract permission to use this tool.

In order to create tables they also need the create-table permission.

To insert rows into an existing table they need insert-row.
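One way to grant these permissions is through a permissions block in Datasette's configuration file. The sketch below writes such a file. It assumes a Datasette version that reads a top-level permissions block from datasette.yaml; the actor ID root and the file name are illustrative choices, not requirements of this plugin.

```shell
# Write a minimal datasette.yaml granting all three permissions to one actor.
# Assumption: the actor ID "root" is used purely for illustration.
cat > datasette.yaml <<'EOF'
permissions:
  datasette-extract:
    id: root
  create-table:
    id: root
  insert-row:
    id: root
EOF
```

Datasette could then be started with something like datasette data.db -c datasette.yaml --root, where data.db is a hypothetical database file and --root prints a sign-in link for the root actor.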

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd datasette-extract
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
