sqlite-comprehend

Tools for running data in a SQLite database through AWS Comprehend

See "sqlite-comprehend: run AWS entity extraction against content in a SQLite database" for background on this project.

Installation

Install this tool using pip:

pip install sqlite-comprehend

Demo

You can see examples of tables generated using this command here:

Configuration

You will need AWS credentials with the comprehend:BatchDetectEntities IAM permission.

You can configure credentials using these instructions. You can also save them to a JSON or INI configuration file and pass them to the command using -a credentials.ini, or pass them using the --access-key and --secret-key options.
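
As a rough sketch of the last option, a Python wrapper could read the keys from environment variables and build the command line. The helper name `build_entities_command` is hypothetical, and reading from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` is an assumption of this example; only the `entities` command and the `--access-key` and `--secret-key` options come from the tool itself.

```python
import os


def build_entities_command(database, table, columns, env=None):
    """Build an argv list for `sqlite-comprehend entities` (hypothetical helper)."""
    env = os.environ if env is None else env
    command = ["sqlite-comprehend", "entities", database, table, *columns]
    # Assumption: the keys live in the conventional AWS environment variables.
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        # Only add the options when both values are present.
        command += ["--access-key", access_key, "--secret-key", secret_key]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Keep in mind that values passed as command-line options may be visible to other processes on the same machine.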

Entity extraction

The sqlite-comprehend entities command runs entity extraction against every row in the specified table and saves the results to your database.

Specify the database, the table and one or more columns containing text in that table. The following runs against the text column in the pages table of the sfms.db SQLite database:

sqlite-comprehend entities sfms.db pages text

Results will be written into a pages_comprehend_entities table. Change the name of the output table by passing -o other_table_name.

You can run against a subset of rows by adding a --where clause:

sqlite-comprehend entities sfms.db pages text --where 'id < 10'

You can also use named parameters in your --where clause:

sqlite-comprehend entities sfms.db pages text --where 'id < :maxid' -p maxid 10
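
Named parameters use SQLite's `:name` syntax. To check which rows a filter would select before sending anything to AWS, a sketch like the following, using Python's standard `sqlite3` module, may help. The `preview_ids` helper and the sample rows are hypothetical, and this is not necessarily how the tool builds its own query.

```python
import sqlite3


def preview_ids(conn, table, where, params):
    """Return the ids of rows in `table` matching a WHERE clause with named parameters."""
    # The table name and WHERE clause are interpolated as-is, so only use trusted input.
    sql = f"SELECT id FROM [{table}] WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Made-up sample data in an in-memory database, standing in for sfms.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO pages (id, text) VALUES (?, ?)",
    [(i, f"page {i}") for i in (1, 5, 9, 10, 42)],
)
print(preview_ids(conn, "pages", "id < :maxid", {"maxid": 10}))  # [1, 5, 9]
```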

Only the first 5,000 characters of each row will be considered. Be sure to review Comprehend's pricing, which starts at $0.0001 per hundred characters.
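
The arithmetic behind that warning can be sketched as follows. This assumes each row is billed in 100-character units rounded up, applies the 5,000-character cap, and ignores any minimum charge per request or volume discounts, so treat the result as a ballpark figure and check the current AWS pricing. `estimate_cost` and the sample rows are hypothetical.

```python
import math
import sqlite3

MAX_CHARS = 5_000        # per-row cap described above
PRICE_PER_UNIT = 0.0001  # USD per 100 characters, the starting price quoted above


def estimate_cost(lengths):
    """Rough cost estimate in USD for rows with the given character lengths."""
    # Assumption: billing is in 100-character units, rounded up per row.
    units = sum(math.ceil(min(n, MAX_CHARS) / 100) for n in lengths)
    return units * PRICE_PER_UNIT


# Made-up sample data: one short row and one row longer than the cap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO pages (text) VALUES (?)", [("a" * 250,), ("b" * 12_000,)]
)
lengths = [
    row[0] for row in conn.execute("SELECT coalesce(length(text), 0) FROM pages")
]
print(f"${estimate_cost(lengths):.4f}")
```

Here the 250-character row counts as 3 units and the 12,000-character row is capped at 50 units, for 53 units in total.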

If your content includes HTML tags, you can strip them out before extracting entities by adding --strip-tags:

sqlite-comprehend entities sfms.db pages text --strip-tags
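
To get a feel for what tag stripping does to a row before it is analyzed, here is a minimal sketch using Python's standard `html.parser` module. It illustrates the general idea only; the tool's own implementation may differ, and `strip_tags` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text content of a document, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return `html` with tags removed and character references decoded."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_tags("<p>Visit <b>San Francisco</b> &amp; Oakland</p>"))
# Visit San Francisco & Oakland
```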

Rows that have been processed are recorded in the pages_comprehend_entities_done table. If you run the command more than once it will only process rows that have been newly added.

You can delete records from that _done table to run them again.
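
Deleting those records could look like the following sketch, which uses Python's standard `sqlite3` module. The `mark_for_reprocessing` helper and the sample ids are hypothetical. The sketch only removes rows from the `_done` table and leaves existing rows in `pages_comprehend_entities` alone, so check how a re-run treats earlier results before relying on it.

```python
import sqlite3


def mark_for_reprocessing(conn, ids, done_table="pages_comprehend_entities_done"):
    """Delete the given row ids from the _done table; return how many were removed."""
    ids = list(ids)
    placeholders = ",".join("?" for _ in ids)
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            f"DELETE FROM [{done_table}] WHERE id IN ({placeholders})", ids
        )
    return cursor.rowcount


# Made-up sample data in an in-memory database, standing in for sfms.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages_comprehend_entities_done (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO pages_comprehend_entities_done (id) VALUES (?)", [(1,), (2,), (3,)]
)
removed = mark_for_reprocessing(conn, [2, 3])
remaining = [
    row[0]
    for row in conn.execute("SELECT id FROM pages_comprehend_entities_done ORDER BY id")
]
print(removed, remaining)  # 2 [1]
```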

sqlite-comprehend entities --help

Usage: sqlite-comprehend entities [OPTIONS] DATABASE TABLE COLUMNS...

  Detect entities in columns in a table

  To extract entities from columns text1 and text2 in mytable:

      sqlite-comprehend entities my.db mytable text1 text2

  To run against just a subset of the rows in the table, add:

      --where "id < :max_id" -p max_id 50

  Results will be written to a table called mytable_comprehend_entities

  To specify a different output table, use -o custom_table_name

Options:
  --where TEXT                WHERE clause to filter table
  -p, --param <TEXT TEXT>...  Named :parameters for SQL query
  -o, --output TEXT           Custom output table
  -r, --reset                 Start from scratch, deleting previous results
  --strip-tags                Strip HTML tags before extracting entities
  --access-key TEXT           AWS access key ID
  --secret-key TEXT           AWS secret access key
  --session-token TEXT        AWS session token
  --endpoint-url TEXT         Custom endpoint URL
  -a, --auth FILENAME         Path to JSON/INI file containing credentials
  --help                      Show this message and exit.

Schema

Assuming an input table called pages, the tables created by this tool will have the following schema:

CREATE TABLE [pages] (
   [id] INTEGER PRIMARY KEY,
   [text] TEXT
);
CREATE TABLE [comprehend_entity_types] (
   [id] INTEGER PRIMARY KEY,
   [value] TEXT
);
CREATE TABLE [comprehend_entities] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [type] INTEGER REFERENCES [comprehend_entity_types]([id])
);
CREATE TABLE [pages_comprehend_entities] (
   [id] INTEGER REFERENCES [pages]([id]),
   [score] FLOAT,
   [entity] INTEGER REFERENCES [comprehend_entities]([id]),
   [begin_offset] INTEGER,
   [end_offset] INTEGER
);
CREATE UNIQUE INDEX [idx_comprehend_entity_types_value]
    ON [comprehend_entity_types] ([value]);
CREATE UNIQUE INDEX [idx_comprehend_entities_type_name]
    ON [comprehend_entities] ([type], [name]);
CREATE TABLE [pages_comprehend_entities_done] (
   [id] INTEGER PRIMARY KEY REFERENCES [pages]([id])
);
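
Because entity names and types are stored in lookup tables, reading the results back takes two joins. The following sketch counts mentions per entity above a confidence threshold. The `top_entities` helper, the 0.9 threshold and the sample rows are made up for illustration; the table and column names are taken from the schema above.

```python
import sqlite3

QUERY = """
SELECT
    comprehend_entity_types.value AS type,
    comprehend_entities.name AS name,
    count(*) AS mentions
FROM pages_comprehend_entities
JOIN comprehend_entities
    ON pages_comprehend_entities.entity = comprehend_entities.id
JOIN comprehend_entity_types
    ON comprehend_entities.type = comprehend_entity_types.id
WHERE pages_comprehend_entities.score >= :min_score
GROUP BY comprehend_entities.id
ORDER BY mentions DESC, name
"""


def top_entities(conn, min_score=0.9):
    """Return (type, name, mentions) rows for entities at or above min_score."""
    return conn.execute(QUERY, {"min_score": min_score}).fetchall()


# Made-up sample data in an in-memory database, using a subset of the schema above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comprehend_entity_types (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE comprehend_entities (id INTEGER PRIMARY KEY, name TEXT, type INTEGER);
CREATE TABLE pages_comprehend_entities (
    id INTEGER, score FLOAT, entity INTEGER, begin_offset INTEGER, end_offset INTEGER
);
INSERT INTO comprehend_entity_types VALUES (1, 'LOCATION'), (2, 'PERSON');
INSERT INTO comprehend_entities VALUES (1, 'San Francisco', 1), (2, 'Ada Lovelace', 2);
INSERT INTO pages_comprehend_entities VALUES
    (1, 0.99, 1, 0, 13), (2, 0.95, 1, 5, 18), (1, 0.97, 2, 20, 32), (2, 0.50, 2, 0, 12);
""")
for row in top_entities(conn):
    print(row)
```

With this sample data the low-scoring fourth row is filtered out, leaving two mentions of San Francisco and one of Ada Lovelace.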

Development

To contribute to this tool, first check out the code. Then create a new virtual environment:

cd sqlite-comprehend
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
