Tonic Textual
Tonic Textual SDK for Python
AI-ready data, with privacy at the core. Unblock AI initiatives by maximizing your free-text assets through realistic data de-identification and high-quality data extraction.
Explore the docs »
Get an API Key · Report Bug · Request Feature
Prerequisites
- Get a free API Key at Textual
- Install the package from PyPI:

  ```bash
  pip install tonic-textual
  ```

- Your API Key can be passed as an argument directly into SDK calls, or you can save it to your environment:

  ```bash
  export TONIC_TEXTUAL_API_KEY=<API Key>
  ```
Getting Started
This library supports two workflows: NER (along with entity tokenization and synthesis), and data extraction from unstructured files such as PDFs and Office documents (docx, xlsx).

Each workflow has its own client, and both clients support the same set of constructor arguments.
```python
from tonic_textual.redact_api import TextualNer
from tonic_textual.parse_api import TextualParse

textual_ner = TextualNer()
textual_parse = TextualParse()
```
Both clients support the following optional arguments:

- `base_url` - The URL of the server hosting Tonic Textual. Defaults to https://textual.tonic.ai
- `api_key` - Your API key. If not specified, you must set `TONIC_TEXTUAL_API_KEY` in your environment.
- `verify` - Whether SSL certificate verification is performed. Enabled by default.
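For example, a minimal sketch of passing these arguments explicitly (the values shown are placeholders, not real endpoints or keys):

```python
from tonic_textual.redact_api import TextualNer

# Placeholder values for illustration only.
textual_ner = TextualNer(
    base_url="https://textual.tonic.ai",  # default server URL
    api_key="<API Key>",                  # or rely on TONIC_TEXTUAL_API_KEY
    verify=True,                          # SSL certificate verification
)
```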
NER Usage
Textual can identify entities within free text. It works on both raw text and on content found within files such as pdf, docx, xlsx, image, txt, and csv files.

Free text
```python
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.")
```

The `raw_redaction` response looks like the following:
```json
{
    "original_text": "My name is John and I live in Atlanta.",
    "redacted_text": "My name is [NAME_GIVEN_dySb5] and I live in [LOCATION_CITY_FgBgz8WW].",
    "usage": 9,
    "de_identify_results": [
        {
            "start": 11,
            "end": 15,
            "new_start": 11,
            "new_end": 29,
            "label": "NAME_GIVEN",
            "text": "John",
            "score": 0.9,
            "language": "en",
            "new_text": "[NAME_GIVEN_dySb5]"
        },
        {
            "start": 30,
            "end": 37,
            "new_start": 44,
            "new_end": 68,
            "label": "LOCATION_CITY",
            "text": "Atlanta",
            "score": 0.9,
            "language": "en",
            "new_text": "[LOCATION_CITY_FgBgz8WW]"
        }
    ]
}
```
The `redacted_text` property provides the new text, with identified entities replaced with tokenized values. Each identified entity is listed in the `de_identify_results` array.
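As a minimal sketch of working with the response (assuming the response object exposes the JSON fields shown above; the exact accessor style may differ):

```python
# Print the tokenized text.
print(raw_redaction.redacted_text)

# Inspect each detected entity (dict-style access assumed here).
for entity in raw_redaction.de_identify_results:
    print(entity["label"], entity["text"], "->", entity["new_text"])
```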
Entities can also be synthesized rather than tokenized. To synthesize specific entities, use the optional `generator_config` argument.
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.", generator_config={'LOCATION_CITY':'Synthesis', 'NAME_GIVEN':'Synthesis'})
This generates a new `redacted_text` value in the response with synthetic entities. For example, it could look like:

> My name is Alfonzo and I live in Wilkinsburg.
Files
Textual can also identify, tokenize, and synthesize text within files such as PDF and DOCX. The result is a new file with specified entities either tokenized or synthesized.
To generate a redacted file:

```python
with open('file.pdf', 'rb') as f:
    # Kick off redaction of the uploaded file.
    ref_id = textual_ner.start_file_redact(f, 'file.pdf')

with open('redacted_file.pdf', 'wb') as of:
    # Download the redacted file contents and save them locally.
    file_bytes = textual_ner.download_redacted_file(ref_id)
    of.write(file_bytes)
```
The `download_redacted_file` method takes similar arguments to the `redact()` method and supports a `generator_config` parameter to adjust which entities are tokenized and synthesized.
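For example, a sketch of synthesizing given names in the downloaded file (the entity choice here is illustrative):

```python
# Synthesize given names; other detected entities are tokenized as usual.
file_bytes = textual_ner.download_redacted_file(
    ref_id,
    generator_config={"NAME_GIVEN": "Synthesis"},
)
```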
Consistency
When entities are tokenized, the tokenized values we generate are unique to the original value, and a given value always maps to the same unique token. Tokens can be mapped back to their original value via the `unredact` function call.

Synthetic entities are consistent as well. A given entity, such as 'Atlanta', will always map to the same fake city. Synthetic values can potentially collide and are not reversible.

To change the underlying mapping of both tokens and synthetic values, pass the optional `random_seed` parameter to the `redact()` function call.
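A minimal sketch of seeding the mapping (the seed value is arbitrary):

```python
# The same seed produces the same token and synthetic mappings across calls.
raw_redaction = textual_ner.redact(
    "My name is John and I live in Atlanta.",
    random_seed=42,
)
```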
For more examples, please refer to the Documentation
Parse Usage
Textual supports the extraction of text and other content from files. Textual currently supports:

- pdf
- png, tif, jpg
- txt, csv, tsv, and other plaintext formats
- docx, xlsx
Textual takes these unstructured files and converts them to a structured representation in JSON.
The JSON output has file-specific pieces; for example, table and KVP detection is performed on PDFs and images. All file types support the following JSON properties:
```json
{
    "fileType": "<file type>",
    "content": {
        "text": "<Markdown file content>",
        "hash": "<hashed file content>",
        "entities": [ // Entry for each entity in the file
            {
                "start": <start location>,
                "end": <end location>,
                "label": "<value type>",
                "text": "<value text>",
                "score": <confidence score>
            }
        ]
    },
    "schemaVersion": <integer schema version>
}
```
PDFs and images additionally have properties for `tables` and `kvps`. Docx files have support for `headers`, `footers`, and `endnotes`, and xlsx files break content down on a per-sheet basis.

For a detailed breakdown of the JSON schema for each file type, please refer to our documentation here.
To parse a single file, you can use our SDK:

```python
with open('invoice.pdf', 'rb') as f:
    # Parse the raw bytes; the second argument is the file name.
    parsed_file = textual_parse.parse_file(f.read(), 'invoice.pdf')
```
The `parsed_file` is a `FileParseResult` type and has various helper methods to retrieve content from the document (see the sketch after this list):

- `get_markdown(generator_config={})` retrieves the document as markdown. The markdown can optionally be tokenized/synthesized by passing a list of entities to `generator_config`.
- `get_chunks(generator_config={}, metadata_entities=[])` chunks the file into a form suitable for vector DB ingestion. Chunks can be tokenized/synthesized, and they can additionally be enriched with entity-level metadata by providing a list of entities. The entity list should contain entities relevant to the questions asked of the RAG system; e.g., if you are building a RAG system for front-line customer support reps, you might include 'PRODUCT' and 'ORGANIZATION' as metadata entities.
In addition to processing files from your local system, you can reference files directly in S3. The `parse_s3_file` function call behaves the same as `parse_file`, but requires `bucket` and `key` arguments to specify your specific file in S3. It uses boto3 to retrieve files from S3.
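A hedged sketch (the bucket and key are placeholders; boto3 must be able to find AWS credentials in your environment):

```python
# Placeholder bucket/key; parse a file stored in S3.
parsed_file = textual_parse.parse_s3_file(
    bucket="my-bucket",
    key="path/to/invoice.pdf",
)
```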
For more examples, please refer to the Documentation
UI Automation
The Textual UI supports file redaction and parsing. It provides an experience for users to orchestrate jobs and process files at scale. It supports integrations with various bucket solutions such as S3, as well as systems like SharePoint and Databricks Unity Catalog volumes. Actions such as building smart pipelines (for parsing) and Dataset collections (for file redaction) can be completed via the SDK.
For more examples, please refer to the Documentation
Bug Reports and Feature Requests
Bugs and feature requests can be submitted via the open issues. We try to be responsive here, so any filed issues should expect a prompt response from the Textual team.
Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
License
Distributed under the MIT License. See LICENSE.txt for more information.
Contact
Tonic AI - @tonicfakedata - support@tonic.ai
Project Link: Textual