PDF anonymizer/synthesizer for Cradl
Image/PDF synthesizer for Cradl
Disclaimer
This code does not guarantee that images/PDFs will be successfully synthesized. Use at your own risk.
Installation
$ pip install lucidtech-synthetic
Make sure to have the following software installed on your system before using the CLI:
- ghostscript
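A quick way to confirm the prerequisite before running the CLI is to look for the Ghostscript executable on your PATH. The sketch below is only an illustration: the helper name `find_executable` is hypothetical, and the executable names are the usual ones (`gs` on Linux/macOS, `gswin64c` or `gswin32c` on Windows), not something this package defines.

```python
import shutil

# Usual Ghostscript executable names: "gs" on Linux/macOS,
# "gswin64c" / "gswin32c" on Windows (assumed, not defined by this package).
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_executable(names):
    """Return the path of the first executable in `names` found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    gs_path = find_executable(GHOSTSCRIPT_NAMES)
    print(gs_path or "ghostscript not found on PATH")
```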
Basic Usage
Docker
We recommend disabling networking and mounting /path/to/src_dir as read-only, as shown below:
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic image /root/src_dir /root/dst_dir /usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf 6-36
CLI
synthetic pdf /path/to/src_dir /path/to/dst_dir
synthetic image /path/to/src_dir /path/to/dst_dir /usr/share/fonts/ubuntu/Ubuntu-B.ttf 6-36
/path/to/src_dir is the input directory and should contain your images/PDFs and their JSON ground truths.
/path/to/dst_dir is the output directory where the synthesized images/PDFs and JSON ground truths will be written.
Here is an example of the directory layout for /path/to/src_dir:
/path/to/src_dir
├── a.pdf|jpeg
├── a.json
├── b.pdf|jpeg
└── b.json
The output directory will follow the same layout but with modified images/PDFs and JSON ground truths:
/path/to/dst_dir
├── a.pdf|jpeg
├── a.json
├── b.pdf|jpeg
└── b.json
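Since every document is expected to sit next to a JSON ground truth with the same base name, it can help to check the input directory before a run. The following is a minimal sketch, not part of the package: the function name `find_unpaired` is hypothetical, and the set of document extensions is an assumption taken from the "pdf|jpeg" layout shown above.

```python
from pathlib import Path

# Assumption: documents use these extensions, as in the "pdf|jpeg" layout above.
DOCUMENT_SUFFIXES = {".pdf", ".jpeg"}


def find_unpaired(src_dir):
    """Return (documents without a JSON file, JSON files without a document).

    Both results are sorted lists of base names (file names without extension).
    """
    files = [p for p in Path(src_dir).iterdir() if p.is_file()]
    doc_stems = {p.stem for p in files if p.suffix.lower() in DOCUMENT_SUFFIXES}
    json_stems = {p.stem for p in files if p.suffix.lower() == ".json"}
    return sorted(doc_stems - json_stems), sorted(json_stems - doc_stems)
```

Calling `find_unpaired("/path/to/src_dir")` on a directory laid out as above should return two empty lists; any names it reports point at a document or ground truth that is missing its counterpart.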
Using a custom Synthesizer
The following examples use custom PDF synthesizers, but custom image synthesizers work similarly.
CLI
synthetic pdf /path/to/src_dir /path/to/dst_dir --synthesizer-class path.to.python.Class
Make sure that the parent directory of path.to.python.Class is in your PYTHONPATH.
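To check that a dotted class path is importable with your current PYTHONPATH, you can try resolving it yourself before passing it to --synthesizer-class. This is a generic sketch using the standard `importlib` module; the helper name `load_class` is hypothetical, and this is not necessarily how the `synthetic` CLI resolves the class internally.

```python
import importlib


def load_class(dotted_path):
    """Import `package.module.ClassName` and return the class object.

    Hypothetical helper for checking PYTHONPATH; not part of the synthetic CLI.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {dotted_path!r}")
    # Raises ModuleNotFoundError if the module is not reachable via sys.path.
    module = importlib.import_module(module_name)
    # Raises AttributeError if the module does not define the class.
    return getattr(module, class_name)
```

If `load_class("path.to.python.Class")` raises `ModuleNotFoundError`, the directory containing the top-level package is most likely missing from PYTHONPATH.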
Example using one of the example synthesizers in the examples directory:
synthetic pdf /path/to/src_dir /path/to/dst_dir --synthesizer-class examples.exclude-words.synthesizer.ExcludeWordsSynthesizer
Docker
docker run --network none -v /path/to/synthesizer:/root/synthesizer -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir --synthesizer-class mypythonfile.ExcludeWordsSynthesizer
Note that the Python module must be mounted into the Docker container at /root/synthesizer for this to work. The example above assumes that your custom synthesizer has the following directory structure:
/path/to/synthesizer
└── mypythonfile.py
Example using one of the example synthesizers in the examples directory. The examples directory should already exist in the image, so nothing additional needs to be mounted:
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir --synthesizer-class examples.exclude-words.synthesizer.ExcludeWordsSynthesizer
Help
All commands support the --help flag, which describes the purpose of the command and the arguments it accepts.
$ synthetic --help
Known Issues
Image Synthesizer
- Synthesized text does not follow the rotation of the document in the image if the document is rotated
- Bounding boxes are needed in the ground truth
PDF Synthesizer
- Does not synthesize images inside PDFs
- Replaced strings are sometimes not hexadecimal-encoded even when they are expected to be
- Text that appears as single characters with custom spacing in a PDF will often yield poor results
File details
Details for the file lucidtech-synthetic-0.4.0.tar.gz.
File metadata
- Download URL: lucidtech-synthetic-0.4.0.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | f26534c8e6ee574782e780c3677fb34ab192da1faae5a571e60298826b95f578
MD5 | 354356c3f624234a6e9d9ada678e34f5
BLAKE2b-256 | e73be073e641d41a0fd4ddf1868134d26bf8206dc20fc1f84eac6b5f7d29ad7c
File details
Details for the file lucidtech_synthetic-0.4.0-py2.py3-none-any.whl.
File metadata
- Download URL: lucidtech_synthetic-0.4.0-py2.py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5233867c614bdd61080b8822cb6e0dff187368fd9f0bd8b7ff4576a12d51462a
MD5 | be01037ae40d9daf8e5056b63522def9
BLAKE2b-256 | b947baab1843b2cb383e06ae04302cbfe615292f22aecd5dda9f49d17dce3f34