Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.
DataFog Instructor SDK
DataFog Instructor is a Python SDK for named entity recognition (NER) using Ollama as the LLM backend. It provides an easy-to-use interface for detecting and classifying entities in text.
Installation
To install the DataFog Instructor SDK, you can use pip:
pip install datafog-instructor
For development purposes, including testing and documentation tools:
pip install "datafog-instructor[dev,docs]"
Quick Start
Here's a simple example to get you started with DataFog Instructor:
from datafog_instructor import DataFog
# Initialize DataFog with default settings
datafog = DataFog()
# Detect entities in text
text = "Cisco acquires Hess for $20 billion"
result = datafog.detect_entities(text)
# Print results
for entity in result.entities:
    print(f"Text: {entity.text}, Type: {entity.type.value}")
Configuration
You can customize the DataFog instance using environment variables:
- DATAFOG_LLM_BACKEND: Currently only supports "ollama"
- DATAFOG_LLM_ENDPOINT: The host URL for the Ollama service (default: "http://localhost:11434")
- DATAFOG_LLM_MODEL: The model to use for entity detection (default: "phi3")
Example with custom settings:
import os
os.environ['DATAFOG_LLM_ENDPOINT'] = 'http://custom-ollama-host:11434'
os.environ['DATAFOG_LLM_MODEL'] = 'custom-model'
from datafog_instructor import DataFog
datafog = DataFog()
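To make the defaults concrete, here is a minimal sketch of resolving these three variables with the documented fallback values. The `Settings` dataclass and `load_settings` helper are hypothetical names written for illustration; they are not part of the SDK, and the SDK's own resolution logic may differ. Treating "ollama" as the default backend is also an assumption, since the list above only says it is the sole supported value.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    backend: str
    endpoint: str
    model: str


def load_settings(env=None):
    """Read the DATAFOG_LLM_* values, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = Settings(
        # Assumption: "ollama" is used when the variable is unset.
        backend=env.get("DATAFOG_LLM_BACKEND", "ollama"),
        endpoint=env.get("DATAFOG_LLM_ENDPOINT", "http://localhost:11434"),
        model=env.get("DATAFOG_LLM_MODEL", "phi3"),
    )
    # Only "ollama" is listed as supported, so reject anything else early.
    if settings.backend != "ollama":
        raise ValueError(f"Unsupported backend: {settings.backend!r}")
    return settings
```

Calling `load_settings()` with no arguments reads the real process environment; passing a plain dict makes the behavior easy to check without touching `os.environ`.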
Features
Detect Entities
Use the detect_entities method to identify and classify named entities in a given text:
text = "Apple Inc. reported $100 billion in revenue for Q4 2023"
result = datafog.detect_entities(text)
for entity in result.entities:
    print(f"Text: {entity.text}, Type: {entity.type.value}")
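Because each detected entity exposes `text` and `type.value`, one possible way to use the results for redaction is to replace every detected string with a placeholder. The sketch below is not part of the SDK: `redact` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real detection results so the example runs without an Ollama server. The type names "ORG" and "MONEY" are assumed values for `type.value`.

```python
from types import SimpleNamespace


def redact(text, entities):
    """Replace each entity's text with a [TYPE] placeholder."""
    # Replace longer strings first so a short entity cannot split a longer one.
    for entity in sorted(entities, key=lambda e: len(e.text), reverse=True):
        if entity.text:  # skip empty strings, which str.replace would match everywhere
            text = text.replace(entity.text, f"[{entity.type.value}]")
    return text


# Stand-in entities mirroring the attributes used above (text, type.value).
entities = [
    SimpleNamespace(text="Apple Inc.", type=SimpleNamespace(value="ORG")),
    SimpleNamespace(text="$100 billion", type=SimpleNamespace(value="MONEY")),
]

redacted = redact("Apple Inc. reported $100 billion in revenue for Q4 2023", entities)
print(redacted)  # [ORG] reported [MONEY] in revenue for Q4 2023
```

This string-based approach redacts every occurrence of a detected string, which is usually what you want for PII but can over-match on very short entities.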
Manage Entity Types
You can add or remove entity types dynamically:
# Add a new entity type
datafog.add_entity_type("CUSTOM", "Custom Entity")
# Remove an entity type
datafog.remove_entity_type("CUSTOM")
# Get all entity types
entity_types = datafog.get_entity_types()
print(entity_types)
Default Entity Types
The SDK ships with a set of predefined entity types, grouped as follows:
- Organization Information: ORG, PERSON, TRANSACTION_TYPE, DEAL_STRUCTURE, FINANCIAL_INFO, PRODUCT, LOCATION, DATE, INDUSTRY, ROLE, REGULATORY, SENSITIVE_INFO, CONTACT, ID, STRATEGY, COMPANY, MONEY
- Personal Information: EMAIL, PHONE, SSN, CREDIT_CARD, IP_ADDRESS, URL, AGE, NATIONALITY, JOB_TITLE, EDUCATION
- Location Information: ADDRESS, CITY, STATE, ZIP, COUNTRY, REGION
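The groups above can be used to narrow results to one category. The following sketch keeps only entities from the Personal Information group. It assumes that `entity.type.value` holds the type name (for example "EMAIL"), which this page does not confirm; `personal_entities` is a hypothetical helper, and the sample entities are stand-ins rather than real SDK output.

```python
from types import SimpleNamespace

# Type names copied from the "Personal Information" group above.
PERSONAL_TYPES = frozenset({
    "EMAIL", "PHONE", "SSN", "CREDIT_CARD", "IP_ADDRESS",
    "URL", "AGE", "NATIONALITY", "JOB_TITLE", "EDUCATION",
})


def personal_entities(entities):
    """Keep only entities whose type name is in PERSONAL_TYPES."""
    # Assumption: entity.type.value is the type name, e.g. "EMAIL".
    return [e for e in entities if e.type.value in PERSONAL_TYPES]


# Stand-in entities for illustration only.
sample = [
    SimpleNamespace(text="jane@example.com", type=SimpleNamespace(value="EMAIL")),
    SimpleNamespace(text="Cisco", type=SimpleNamespace(value="ORG")),
    SimpleNamespace(text="555-0100", type=SimpleNamespace(value="PHONE")),
]

personal = personal_entities(sample)
print([e.text for e in personal])  # ['jane@example.com', '555-0100']
```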
Error Handling
If the SDK cannot process the model's response, or the response has an unexpected format, it raises a ValueError with details about the error.
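A caller can catch that exception and decide how to proceed. The sketch below shows one option, logging the failure and returning None. `detect_or_none` is a hypothetical wrapper, not an SDK function, and `FailingDetector` is a stand-in that simulates a malformed response so the example runs without Ollama; in real use you would pass a `DataFog` instance instead.

```python
import logging

logger = logging.getLogger(__name__)


def detect_or_none(detector, text):
    """Call detector.detect_entities(text); return None if it raises ValueError."""
    try:
        return detector.detect_entities(text)
    except ValueError as exc:
        # The exception message carries the details about the failure.
        logger.warning("Entity detection failed: %s", exc)
        return None


class FailingDetector:
    """Stand-in that simulates an unexpected response format."""

    def detect_entities(self, text):
        raise ValueError("unexpected response format")


result = detect_or_none(FailingDetector(), "Cisco acquires Hess for $20 billion")
print(result)  # None
```

Returning None keeps a batch job running past one bad document; re-raising may be the better choice when a missed detection must not go unnoticed.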
Development and Testing
For development purposes, you can install additional dependencies:
pip install "datafog-instructor[dev]"
This includes tools like pytest, black, flake8, and mypy for testing and code quality.
Documentation
To build the documentation locally:
pip install "datafog-instructor[docs]"
cd docs
make html
The documentation will be available in the docs/_build/html directory.
Contributing
Contributions to the DataFog Instructor SDK are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.
Support
If you encounter any problems or have any questions, please open an issue on the GitHub repository or join our Discord community at https://discord.gg/bzDth394R4.
Links
- Homepage: https://datafog.ai
- Documentation: https://docs.datafog.ai
- Twitter: https://twitter.com/datafoginc
- GitHub: https://github.com/datafog/datafog-instructor
Source Distribution
File details
Details for the file datafog_instructor-0.1.0b8.tar.gz.
File metadata
- Download URL: datafog_instructor-0.1.0b8.tar.gz
- Upload date:
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.11.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | e1e7107a9c01b9a49f97a77cd5334079e250d7d9b18438026d4f917e13cd8dd1
MD5 | d4e6761447f65acc3b09f4b8d1b3a3d8
BLAKE2b-256 | 5a838d3378949b3ed5462c4dddc5286f3dd372c22607ebd69068b0610423944d