Read PDF and DOCX files and return the text they contain.

Project description

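While running some course code recently, I found that the function for reading PDF files no longer worked. I tracked down working code for reading PDFs and, to make later study easier, wrapped the PDF and DOCX reading methods into the pdfdocx package.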

pdfdocx

The package provides just two simple reading functions:

read_pdf(file)
read_docx(file)

file is the path to the file; each function returns the text contained in that file.

Installation

pip install pdfdocx

Usage

Read a PDF file

from pdfdocx import read_pdf
p_text = read_pdf('test/data.pdf')
print(p_text)

Run

This is the content of the PDF file
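Text extracted from PDFs sometimes contains compatibility characters that look identical to ordinary ones, for example the Kangxi radical U+2F83 in place of the ideograph U+81EA, and these can break later searching or word counting. The sketch below is one possible clean-up step, not part of pdfdocx: the helper name normalize_extracted_text is hypothetical, and it assumes NFKC normalization is acceptable for your data.

```python
import unicodedata


def normalize_extracted_text(text):
    """Normalize compatibility characters and tidy whitespace.

    Hypothetical helper, not part of pdfdocx. It expects the plain
    string returned by a reader such as read_pdf.
    """
    # NFKC maps compatibility characters (Kangxi radicals, fullwidth
    # letters, ligatures) to their ordinary equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace inside each line and trim the ends.
    lines = [" ".join(line.split()) for line in normalized.splitlines()]
    return "\n".join(lines).strip()
```

It would be applied to the returned string, for example normalize_extracted_text(p_text). Note that NFKC also rewrites fullwidth punctuation and ligatures, so skip this step if those distinctions matter in your corpus.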

Read a DOCX file

from pdfdocx import read_docx
d_text = read_docx('test/data.docx')
print(d_text)

Run

This is the content of the DOCX file
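Because both functions take a path and return text, it is easy to call the wrong one for a given file. A small wrapper that picks the reader from the file extension can guard against that. This is a minimal sketch: read_document and its readers parameter are hypothetical names introduced here, and only read_pdf and read_docx come from pdfdocx.

```python
from pathlib import Path


def read_document(path, readers=None):
    """Return the text of a .pdf or .docx file, chosen by extension.

    Hypothetical wrapper, not part of pdfdocx. `readers` maps a
    lowercase suffix to a callable taking a path string; it exists
    mainly so the function can be tried out without real files.
    """
    if readers is None:
        # Imported lazily so the module loads even without pdfdocx.
        from pdfdocx import read_pdf, read_docx
        readers = {".pdf": read_pdf, ".docx": read_docx}

    suffix = Path(path).suffix.lower()
    if suffix not in readers:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return readers[suffix](str(path))
```

With pdfdocx installed, read_document('test/data.pdf') and read_document('test/data.docx') would each be routed to the matching reader, and any other extension raises ValueError instead of failing inside the parser.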

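If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting and analyzing large volumes of text data, consider the video course "Python Web Scraping and Text Data Analysis" (《python网络爬虫与文本数据分析》). I was a liberal arts student too and started out knowing nothing; this course condenses five years of experience. I believe it is explained in plain, accessible terms o(￣︶￣)o. It covers: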

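Python basics
Web scraping
Reading data
Introduction to text analysis
Machine learning and text analysis
Applications of text analysis in economics and management research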

If you are interested, feel free to take a look at "Python Web Scraping and Text Data Analysis".

Project details

Release history

1.7 (this version): Sep 10, 2023
1.6: Aug 6, 2022
1.2: Mar 5, 2022
1.0: May 14, 2021
0.1: Apr 18, 2020

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

pdfdocx-1.7-py3-none-any.whl (3.8 kB)

Uploaded Sep 10, 2023 Python 3

Hashes for pdfdocx-1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e5e325d99177de54ff9eeef7e0be6cd4de5e12b9496c327670e559775f7119cc`
MD5	`a0d052e6f067e27ef1eb736188d8cfdc`
BLAKE2b-256	`80ecaebe998b9d19edcab9cb2f03cf427a73dc4018b5927d8035f64bddb9b343`