Python binding for nativeextractor
NativeExtractor module for Python
This is the official Python binding for the NativeExtractor project.
Installation
Requirements
- Python >=2.7 (Python 3 is highly recommended)
- pip
- build-essential (gcc, make)
- libglib2.0, libglib2.0-dev
- libpythonX-dev
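Before building, it can help to confirm that the command-line tools from the list above are available. The following is a minimal sketch using only the Python standard library; the helper names `check_build_tools` and `python_version_ok` are hypothetical and not part of pynativeextractor. It only looks for executables on `PATH` and does not verify that the glib or Python development headers are installed.

```python
import shutil
import sys


def check_build_tools(tools=("gcc", "make", "pip")):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


def python_version_ok(minimum=(2, 7)):
    """Return True if the running interpreter is at least the given (major, minor)."""
    return tuple(sys.version_info[:2]) >= tuple(minimum)


# Report which tools were found; a False entry means the tool is missing from PATH.
for name, found in check_build_tools().items():
    print("{}: {}".format(name, "found" if found else "MISSING"))
print("Python version OK: {}".format(python_version_ok()))
```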
We recommend using a virtual environment.
virtualenv myproject
source myproject/bin/activate
or
python -m venv myproject
source myproject/bin/activate
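The `python -m venv myproject` command can also be run programmatically through the standard-library `venv` module (Python 3 only). This is a small sketch; the wrapper name `create_env` is hypothetical, and the environment still has to be activated from the shell as shown above.

```python
import os
import venv


def create_env(path, with_pip=True):
    """Create a virtual environment at `path` and return its absolute path."""
    venv.create(path, with_pip=with_pip)
    return os.path.abspath(path)


# Example call (commented out so that running this file has no side effects):
# create_env("myproject")
```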
Instant PyPI solution
pip install pynativeextractor
Manual
- Clone the repo:
git clone --recurse-submodules https://github.com/SpongeData-cz/pynativeextractor.git
- Install via pip or pip3:
pip install -e ./pynativeextractor/
Typical usage
import os
from pynativeextractor.extractor import BufferStream, Extractor, DEFAULT_MINERS_PATH

# Construct a new Extractor instance
ex = Extractor()
# Add the miner named match_url from web_entities.so; it matches all URLs
ex.add_miner_so(os.path.join(DEFAULT_MINERS_PATH, 'web_entities.so'), 'match_url')

text = "https://spongedata.cz"

# Create a stream from the text buffer (for files, use FileStream instead; it uses mmap internally)
with BufferStream(text) as bf:
    # Initialize the occurrences list as an empty list
    occurrences = []
    # Attach the stream to the extractor
    with ex.set_stream(bf):
        # Mine all occurrences of URLs
        while not ex.eof():
            # Accumulate the occurrences found so far
            occurrences += ex.next()

print(occurrences)  # Prints [{'label': 'URL', 'value': 'https://spongedata.cz', 'pos': 0, 'len': 13, 'prob': 1.0}]
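The extractor returns occurrences as a list of plain dictionaries, so they can be post-processed without the library. The sketch below groups values by label and optionally filters on the `prob` field. The helper `group_by_label` is hypothetical, and the sample data simply copies the output shown in the usage example rather than coming from a real run.

```python
from collections import defaultdict


def group_by_label(occurrences, min_prob=0.0):
    """Group occurrence values by label, skipping entries with prob below min_prob."""
    grouped = defaultdict(list)
    for occ in occurrences:
        if occ["prob"] >= min_prob:
            grouped[occ["label"]].append(occ["value"])
    return dict(grouped)


# Sample data copied from the printed output of the usage example (not a live run).
sample = [
    {"label": "URL", "value": "https://spongedata.cz", "pos": 0, "len": 13, "prob": 1.0},
]

print(group_by_label(sample))  # {'URL': ['https://spongedata.cz']}
```

Raising `min_prob` is a simple way to drop low-confidence matches when a miner reports probabilities below 1.0.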
Download files
Source Distribution
pynativeextractor-10.0.12.tar.gz (41.4 kB)
Hashes for pynativeextractor-10.0.12.tar.gz
Algorithm | Hash digest
---|---
SHA256 | eb6d9bc85bd74d46bf2c0393d1f2ddbf94b41f3c94548b5f40df8bedb65789fd
MD5 | a40a10cb26e4df22fe3a6c91d63dfcc9
BLAKE2b-256 | ae617be4bb317ee6434504f3b34d823f7339e525e5fc6a74bdee0514ceb0182f