extraction toolkit

Project description

ETK: Information Extraction Toolkit

ETK is a Python library for high-precision information extraction from many document formats. It provides a flexible framework of composable extractors that lets you combine ETK's predefined extractors with custom extractors developed for your application. It supports extraction from HTML pages, text documents, CSV and Excel files, and JSON documents. ETK is open-source software released under the MIT license.
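
To make the idea of composable extractors concrete, here is a minimal sketch in plain Python. It does not use ETK's API: the `Extractor` type, `regex_extractor`, `hashtag_extractor`, and `run_extractors` are hypothetical names invented for illustration, and the email pattern is a deliberately simple assumption.

```python
import re
from typing import Callable, Dict, List

# Hypothetical type (not ETK's): an extractor maps text to extracted strings.
Extractor = Callable[[str], List[str]]


def regex_extractor(pattern: str) -> Extractor:
    """Build a 'predefined' style extractor from a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


def hashtag_extractor(text: str) -> List[str]:
    """A 'custom' extractor written for one application."""
    return [word for word in text.split() if word.startswith("#")]


def run_extractors(text: str, extractors: Dict[str, Extractor]) -> Dict[str, List[str]]:
    """Apply every named extractor to the same text and collect the results."""
    return {name: extract(text) for name, extract in extractors.items()}


results = run_extractors(
    "Contact info@example.org about #etk",
    {
        "email": regex_extractor(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # simplistic pattern
        "hashtag": hashtag_extractor,
    },
)
print(results)
```

The point of the sketch is only that predefined and custom extractors share one interface, so they can be mixed freely in a single pipeline.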

Documentation

Read the documentation here

Features

  • Extraction from HTML, text, CSV, Excel, JSON
  • High-precision predefined extractors for common entities (dates, phones, email, cities, ...)
  • Extraction of microdata, schema.org and RDFa markup
  • Integration with spaCy for text processing
  • Automatic identification and extraction of HTML tables containing data
  • Automatic identification and extraction of time series
  • Semi-automatic generation of Web wrappers
  • Scalable execution and management of extraction pipelines
  • Automatic provenance recording

Releases

Installation

Operating system: macOS / OS X, Linux, Windows
Python version: Python 3.6+
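
The requirements above can be checked before installing. The following is a small sketch that uses only the standard library; `environment_ok` is a hypothetical helper, not part of ETK, and it assumes that `platform.system()` reports `Darwin` on macOS.

```python
import platform
import sys

# Values taken from the requirements listed above.
SUPPORTED_SYSTEMS = {"Darwin", "Linux", "Windows"}  # "Darwin" is macOS
MIN_PYTHON = (3, 6)


def environment_ok(version=None, system=None) -> bool:
    """Return True if the Python version and OS match the stated requirements."""
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    system = platform.system() if system is None else system
    return version >= MIN_PYTHON and system in SUPPORTED_SYSTEMS


print("Environment supported:", environment_ok())
```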

Install using pip

pip install etk

OR

You can also install ETK manually. Clone or fork this repository, open a terminal window, and run the following commands in the directory where you downloaded ETK:

python3 -m venv etk2_env
source etk2_env/bin/activate
pip install -e .

Load the spaCy models

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg  # optional

Note: If the above commands fail with an SSL error, run this:

python -m spacy download en_core_web_sm-2.0.0 --direct
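
After downloading, you may want to confirm that the models are visible to the active environment. This sketch relies on the fact that spaCy models are installed as ordinary Python packages; `missing_packages` is a hypothetical helper, not an ETK or spaCy function.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# en_core_web_lg is optional, so only the small model is checked here.
missing = missing_packages(["spacy", "en_core_web_sm"])
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("spaCy and en_core_web_sm are available.")
```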

To deactivate this virtual environment

deactivate

Run Tests

python -m unittest discover

Run ETK CLI

ETK must be installed as a Python package.

python -m etk <command> [options]

For example:

python -m etk regex_extractor "a.*c" "abcd"
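
To see what the pattern in this example matches, the same regular expression can be tried with Python's standard `re` module. This only illustrates the pattern itself; how ETK's `regex_extractor` command formats its output is not shown here and may differ.

```python
import re

pattern = r"a.*c"
text = "abcd"

match = re.search(pattern, text)
if match:
    # ".*" is greedy, but the match must end at the last "c", so "d" is excluded.
    print(match.group(0), match.span())  # abc (0, 3)
```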

Docker

Build image

docker build -t etk:dev .

Run container

docker run -it etk:dev /bin/bash

Mount a local volume for testing

docker run -it -v $(pwd):/app/etk etk:dev /bin/bash
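
The `$(pwd)` substitution depends on the shell in use. As an alternative, the command can be assembled in Python. This is a sketch: `docker_run_command` is a hypothetical helper, and the image tag and container path are simply copied from the commands above.

```python
from pathlib import Path
from typing import List, Optional


def docker_run_command(
    image: str = "etk:dev",
    host_dir: Optional[Path] = None,
    container_dir: str = "/app/etk",
) -> List[str]:
    """Build the argument list for mounting host_dir into the container."""
    host = (host_dir or Path.cwd()).resolve()
    return ["docker", "run", "-it", "-v", f"{host}:{container_dir}", image, "/bin/bash"]


print(" ".join(docker_run_command()))
```

To actually start the container, pass the list to `subprocess.run(docker_run_command(), check=True)`.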


Download files

Source Distribution

etk-2.2.8.tar.gz (150.9 kB)

Uploaded Source

Built Distribution

etk-2.2.8-py3-none-any.whl (203.1 kB)

Uploaded Python 3

File details

Details for the file etk-2.2.8.tar.gz.

File metadata

  • Download URL: etk-2.2.8.tar.gz
  • Size: 150.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.12

File hashes

Hashes for etk-2.2.8.tar.gz
Algorithm Hash digest
SHA256 2bd9cb0ae08a5908dd016af9450ee848905ab9e859fb495581d203873588bab1
MD5 7d13448dd50f9cfb6850aadba8889835
BLAKE2b-256 96e15c49f2bba4132cb703b71a767a911b9a05765c6f52b0e2391ee341175ad1

File details

Details for the file etk-2.2.8-py3-none-any.whl.

File metadata

  • Download URL: etk-2.2.8-py3-none-any.whl
  • Size: 203.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.12

File hashes

Hashes for etk-2.2.8-py3-none-any.whl
Algorithm Hash digest
SHA256 317dd4bd2e4edcbbcd4e63fb8d4a05e7e112cbb50381d12ecab5bad943c891ef
MD5 9dc2ae0d3c2fe3e0d2498a23ac64866c
BLAKE2b-256 2e58c4a5a30bbd710b3a44f6f266c86bebf71a8010041a72e05e5659cd458a97
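
A downloaded file can be checked against the SHA256 digests listed above with the standard `hashlib` module. This is a minimal sketch: `sha256_of` and `verify_download` are hypothetical helper names, and the expected digests are copied from the hash tables in this section.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "etk-2.2.8.tar.gz": "2bd9cb0ae08a5908dd016af9450ee848905ab9e859fb495581d203873588bab1",
    "etk-2.2.8-py3-none-any.whl": "317dd4bd2e4edcbbcd4e63fb8d4a05e7e112cbb50381d12ecab5bad943c891ef",
}


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path) -> bool:
    """Return True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, `verify_download("etk-2.2.8.tar.gz")` should return `True` for an intact copy of the source distribution.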
