Simple wrapper for tabula-java, read tables from PDF into DataFrame
tabula-py
tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF.
You can read tables from a PDF and convert them into pandas DataFrames. tabula-py also lets you convert a PDF file into a CSV, TSV, or JSON file.
You can see the example notebook and try it on Google Colab. We also highly recommend reading the documentation, especially the FAQ.
Requirements
- Java 8+
- Python 3.5+
OS
I have confirmed that it works on macOS and Ubuntu, and some users report that it also works on Windows 10. See the documentation for detailed installation instructions for Windows 10.
Usage
- Documentation
- The FAQ is helpful if you run into an issue
- Example notebook on Google Colaboratory
Install
Ensure you have a Java runtime installed and that it is on your PATH.
pip install tabula-py
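Since tabula-py relies on a Java runtime being reachable through PATH, it can help to check that before the first call. The following is a minimal sketch using only the Python standard library; `describe_java` is a hypothetical helper name, not part of tabula-py.

```python
import shutil
import subprocess


def describe_java():
    """Report whether a `java` executable is reachable through PATH.

    Hypothetical helper for illustration; tabula-py does not provide it.
    """
    java_path = shutil.which("java")
    if java_path is None:
        return "java not found on PATH; install a Java 8+ runtime first"
    try:
        # `java -version` conventionally prints its banner to stderr.
        result = subprocess.run(
            [java_path, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError as exc:
        return "java found at {} but could not be run: {}".format(java_path, exc)
    banner = (result.stderr or result.stdout).strip()
    return "java found at {}: {}".format(java_path, banner)
```

Calling `print(describe_java())` before using tabula-py shows either the Java version banner or a hint that the runtime is missing from PATH.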
Example
tabula-py lets you extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them as a CSV, TSV, or JSON file.
import tabula
# Read a PDF into a list of DataFrames
df = tabula.read_pdf("test.pdf", pages="all")
# Read a remote PDF into a list of DataFrames
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# Convert a PDF into a CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages="all")
# Convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format="csv", pages="all")
See the example notebook for more detail. I also recommend reading the tutorial article written by @aegis4048.
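Because `tabula.convert_into` takes an explicit output path, a small wrapper can derive that path from the input file name. This is a sketch under stated assumptions: `output_path_for` and `convert_pdf` are hypothetical helper names, and the list of formats is simply the three named in this README.

```python
from pathlib import Path

# Output formats named in this README.
SUPPORTED_FORMATS = ("csv", "tsv", "json")


def output_path_for(pdf_path, output_format="csv"):
    """Hypothetical helper: build an output path next to the input PDF."""
    fmt = output_format.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError("unsupported output format: {!r}".format(output_format))
    return str(Path(pdf_path).with_suffix("." + fmt))


def convert_pdf(pdf_path, output_format="csv", pages="all"):
    """Hypothetical wrapper: convert one PDF and return the output path."""
    # Imported lazily so output_path_for works without tabula-py or Java.
    import tabula

    fmt = output_format.lower()
    out = output_path_for(pdf_path, fmt)
    tabula.convert_into(pdf_path, out, output_format=fmt, pages=pages)
    return out
```

With tabula-py and Java installed, `convert_pdf("test.pdf", "json")` would write its result to `test.json` next to the input file.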
Contributing
Interested in helping out? I'd love to have your help!
You can help by:
- Reporting a bug.
- Adding or editing documentation.
- Contributing code via a Pull Request. See also the contribution guide.
- Writing a blog post or spreading the word about tabula-py to people who might benefit from using it.
Contributors
- @lahoffm
- @jakekara
- @lcd1232
- @kirkholloway
- @CurtLH
- @nikhilgk
- @krassowski
- @alexandreio
- @rmnevesLH
- @red-bin
- @Gallaecio
- @bpben
- @Bueddl
- @cjotade
Other ways to support
You can also support our continued work on tabula-py with a donation on Patreon.
Hashes for tabula_py-2.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0a6626ab091c3cd965cdf52c3af7ba9d69bb3bc9290f2276d0d19a65ba8b50cd
MD5 | c03d06c741d08045d436856c51201aa3
BLAKE2b-256 | 53a466add528eca00398af98f181772006750019eb9f2d68c7c6fdd53ba661c5