PDF anonymizer/synthesizer for Cradl
Image/PDF synthesizer for Cradl
Disclaimer
This code does not guarantee that images/PDFs will be successfully synthesized. Use at your own risk.
Installation
$ pip install lucidtech-synthetic
Make sure to have the following software installed on your system before using the CLI:
- ghostscript
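A quick way to confirm the Ghostscript prerequisite before running the CLI is to look for its executable on PATH. The sketch below is only an illustration: the helper name find_ghostscript is hypothetical, and the executable names (gs on Linux/macOS, gswin64c or gswin32c on Windows) are the usual Ghostscript defaults rather than something this package documents.

```python
import shutil

# Usual Ghostscript executable names (assumption: default install names).
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found is None:
        print("Ghostscript not found on PATH; install it before running synthetic")
    else:
        print(f"Ghostscript found at {found}")
```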
Basic Usage
Docker
We recommend disabling networking and setting /path/to/src_dir to read-only as shown below:
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic image /root/src_dir /root/dst_dir /usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf 6-36
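When the Docker commands above are launched from a script, it can help to assemble the argument list in one place so the isolation flags are never forgotten. This is a minimal sketch, not part of the package: build_docker_command is a hypothetical helper, the flags and container paths are copied from the commands above, and -it is left out on the assumption that a scripted run has no terminal attached.

```python
def build_docker_command(mode, src_dir, dst_dir, extra_args=()):
    """Assemble the docker invocation as an argument list for subprocess.run."""
    if mode not in ("pdf", "image"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        "docker", "run",
        "--network", "none",                   # no network access in the container
        "-v", f"{src_dir}:/root/src_dir:ro",   # input mounted read-only
        "-v", f"{dst_dir}:/root/dst_dir",      # output mounted writable
        "lucidtechai/synthetic", mode,
        "/root/src_dir", "/root/dst_dir",
        *extra_args,                           # e.g. font path and size range for images
    ]


if __name__ == "__main__":
    cmd = build_docker_command(
        "image", "/path/to/src_dir", "/path/to/dst_dir",
        ["/usr/share/fonts/truetype/dejavu/DejaVuSansMono-Bold.ttf", "6-36"],
    )
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting problems when directory names contain spaces.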
CLI
synthetic pdf /path/to/src_dir /path/to/dst_dir
synthetic image /path/to/src_dir /path/to/dst_dir /usr/share/fonts/ubuntu/Ubuntu-B.ttf 6-36
/path/to/src_dir is the input directory and should contain your images/PDFs and JSON ground truths.
/path/to/dst_dir is the output directory where the synthesized images/PDFs and JSON ground truths will be written.
Here is an example of the directory layout for /path/to/src_dir:
/path/to/src_dir
├── a.pdf|jpeg
├── a.json
├── b.pdf|jpeg
└── b.json
The output directory will follow the same layout but with modified images/PDFs and JSON ground truths:
/path/to/dst_dir
├── a.pdf|jpeg
├── a.json
├── b.pdf|jpeg
└── b.json
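Since each document in the layout above is paired with a JSON ground truth of the same base name, it can be useful to check the input directory for missing pairs before starting a run. The sketch below is an illustration under assumptions: find_missing_ground_truths is a hypothetical helper, and the suffix set only covers the extensions shown in the layout (.pdf and .jpeg); adjust it if your files use other extensions.

```python
from pathlib import Path

# Extensions taken from the example layout (assumption: adjust as needed).
DOCUMENT_SUFFIXES = {".pdf", ".jpeg"}


def find_missing_ground_truths(src_dir):
    """Return names of documents in src_dir that have no sibling .json file."""
    missing = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.suffix.lower() not in DOCUMENT_SUFFIXES:
            continue
        # a.pdf is expected to be paired with a.json in the same directory.
        if not path.with_suffix(".json").is_file():
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    import sys

    for name in find_missing_ground_truths(sys.argv[1]):
        print(f"no JSON ground truth for {name}")
```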
Using a custom Synthesizer
The following examples use a custom PDF synthesizer; custom image synthesizers work the same way.
CLI
synthetic pdf /path/to/src_dir /path/to/dst_dir --synthesizer-class path.to.python.Class
Make sure that the parent directory of path.to.python.Class is in your PYTHONPATH.
Example using one of the example synthesizers in the examples directory:
synthetic pdf /path/to/src_dir /path/to/dst_dir --synthesizer-class examples.exclude-words.synthesizer.ExcludeWordsSynthesizer
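The PYTHONPATH requirement follows from how a dotted class path is usually resolved: everything before the last dot is imported as a module, and the last component is looked up on that module. The sketch below illustrates that general mechanism only; load_class is a hypothetical helper, and the CLI's actual loader may differ.

```python
import importlib


def load_class(dotted_path):
    """Import the module part of 'pkg.module.Class' and return the named attribute."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {dotted_path!r}")
    # import_module searches sys.path, which includes entries from PYTHONPATH.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


if __name__ == "__main__":
    # Standard-library class used purely to demonstrate the lookup.
    print(load_class("collections.OrderedDict"))
```

If the top-level package cannot be found on sys.path, import_module raises ModuleNotFoundError, which is the failure to expect when PYTHONPATH is not set correctly.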
Docker
docker run --network none -v /path/to/synthesizer:/root/synthesizer -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir --synthesizer-class mypythonfile.ExcludeWordsSynthesizer
Note that the Python module must be mounted into the Docker container at /root/synthesizer for this to work. The example above assumes that your custom synthesizer directory is laid out as follows:
/path/to/synthesizer
└── mypythonfile.py
Example using one of the example synthesizers in the examples directory. The examples directory should already exist in the image, so nothing additional needs to be mounted:
docker run --network none -v /path/to/src_dir:/root/src_dir:ro -v /path/to/dst_dir:/root/dst_dir -it lucidtechai/synthetic pdf /root/src_dir /root/dst_dir --synthesizer-class examples.exclude-words.synthesizer.ExcludeWordsSynthesizer
Help
All commands support the --help flag, which describes the purpose of the command and the arguments it accepts.
$ synthetic --help
Known Issues
Image Synthesizer
- Synthesized text does not follow the document's rotation when the document in the image is rotated
- Ground truths must include bounding boxes
PDF Synthesizer
- Does not synthesize images inside PDF
- Replaced strings are sometimes not hexadecimal encoded even when expected to be
- Text appearing as single characters with custom spacing in PDF will often yield poor results