ETK: Information Extraction Toolkit

ETK is a Python library for high-precision information extraction from many document formats. It provides a flexible framework of composable extractors, so you can combine the predefined extractors that ship with ETK with custom extractors you develop for your own application. It supports extraction from HTML pages, text documents, CSV and Excel files, and JSON documents. ETK is open-source software, released under the MIT license.

Features

  • Extraction from HTML, text, CSV, Excel, JSON
  • High-precision predefined extractors for common entities (dates, phones, email, cities, ...)
  • Extraction of microdata, schema.org and RDFa markup
  • Integration with spaCy for text processing
  • Automatic identification and extraction of HTML tables containing data
  • Automatic identification and extraction of time series
  • Semi-automatic generation of Web wrappers
  • Scalable execution and management of extraction pipelines
  • Automatic provenance recording
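
To make the idea of composable extractors with provenance more concrete, here is a minimal, self-contained sketch in plain Python. It does not use ETK's API: the names `Extraction`, `regex_extractor` and `run_pipeline` are hypothetical and exist only for this illustration, and the two regular expressions are deliberately simplistic.

```python
import re
from typing import Callable, List, NamedTuple


class Extraction(NamedTuple):
    """One extracted value plus where it came from (hypothetical type)."""
    value: str       # the extracted text
    extractor: str   # name of the extractor that produced it
    start: int       # character offset where the match starts
    end: int         # character offset just past the match


# An extractor is any callable that maps text to a list of extractions.
Extractor = Callable[[str], List[Extraction]]


def regex_extractor(name: str, pattern: str) -> Extractor:
    """Build an extractor from a regular expression (hypothetical helper)."""
    compiled = re.compile(pattern)

    def extract(text: str) -> List[Extraction]:
        return [Extraction(m.group(0), name, m.start(), m.end())
                for m in compiled.finditer(text)]

    return extract


def run_pipeline(text: str, extractors: List[Extractor]) -> List[Extraction]:
    """Apply each extractor to the text and collect the results in order."""
    results: List[Extraction] = []
    for extractor in extractors:
        results.extend(extractor(text))
    return results


# Two toy extractors; real email and date extraction needs far more care.
email = regex_extractor("email", r"[\w.+-]+@[\w-]+\.[\w.-]+")
year = regex_extractor("year", r"\b(?:19|20)\d{2}\b")

sample = "Contact jane@example.org about the 2019 report."
for item in run_pipeline(sample, [email, year]):
    print(item.extractor, item.value, item.start, item.end)
```

Because every extractor shares the same call signature, a custom extractor can be added to the list without changing the pipeline, and each result records which extractor produced it and at which offsets.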

Installation

Operating system: macOS / OS X, Linux, Windows
Python version: Python 3.6+
  1. Create a virtual environment (highly recommended)
python3 -m venv etk2_env
source etk2_env/bin/activate
  2. Install using pip
pip install etk

You can also install ETK manually. Clone or fork this repository, open a terminal window, change to the directory where you downloaded ETK, and run:

pip install -e .

Download the spaCy models

python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
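
After these steps, a quick check can confirm that the interpreter and packages are in place. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of ETK. It assumes the spaCy models are installed as importable packages named `en_core_web_sm` and `en_core_web_lg`, which is how `spacy download` installs them.

```python
import importlib.util
import sys

# Packages and spaCy models named in the installation steps.
REQUIRED_MODULES = ["etk", "spacy", "en_core_web_sm", "en_core_web_lg"]


def check_environment(min_version=(3, 6), modules=REQUIRED_MODULES):
    """Return a dict mapping each requirement to True (met) or False (missing)."""
    status = {"python>=%d.%d" % min_version: sys.version_info >= min_version}
    for name in modules:
        # find_spec locates a top-level module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print("%-16s %s" % (requirement, "ok" if ok else "MISSING"))
```

Run it inside the activated virtual environment; anything reported as MISSING points to the installation step that needs repeating.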

To deactivate this virtual environment

deactivate

Run Tests

python -m unittest discover

Run ETK CLI

ETK must be installed as a Python package before you can use the CLI.

python -m etk <command> [options]

For example:

python -m etk regex_extractor "a.*c" "abcd"
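
The output format of the CLI is not shown here. As a rough illustration of what the pattern in this example matches, the sketch below applies the same pattern and text with Python's standard `re` module. It assumes standard Python regular-expression semantics and is not ETK's implementation.

```python
import re

pattern = "a.*c"
text = "abcd"

# Find the first occurrence of the pattern anywhere in the text.
match = re.search(pattern, text)
if match:
    # group(0) is the matched text; span() is its (start, end) offsets.
    print(match.group(0), match.span())  # prints: abc (0, 3)
```

The greedy `.*` first consumes the rest of the string and then backtracks until a `c` can follow, so the match stops at `abc` and excludes the trailing `d`.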

Docker

Build image

docker build -t etk:dev .

Run container

docker run -it etk:dev /bin/bash

Mount a local volume for testing

docker run -it -v "$(pwd)":/app/etk etk:dev /bin/bash

Download files

Source distribution: etk-2.1.5.tar.gz (146.0 kB)

Built distribution: etk-2.1.5-py3-none-any.whl (190.5 kB, Python 3)
