Extract a Chinese corpus from Wikipedia
zword
Install as follows (Python 3 is required):
pip install zword
If you run into problems, please open an issue at gitee.com/znlp/zword/issues.
Extracting a Chinese corpus from Wikipedia
Wikipedia dumps can be downloaded from: dumps.wikimedia.org/zhwiki
The page lists many links; download, for example, https://dumps.wikimedia.org/zhwiki/20200701/zhwiki-20200701-pages-articles.xml.bz2
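The example link above follows a regular pattern, so the URL for another dump date can be assembled in code. The sketch below is only an illustration: the helper name `dump_url` is hypothetical, and the URL layout is inferred from the single example link, so check the dump index page before relying on it.

```python
def dump_url(date: str) -> str:
    """Build a zhwiki pages-articles dump URL for a YYYYMMDD dump date.

    Assumption: the layout is inferred from the one example URL in this
    README; other dates may not exist on the server.
    """
    if len(date) != 8 or not (date.isascii() and date.isdigit()):
        raise ValueError("date must look like YYYYMMDD, e.g. 20200701")
    name = f"zhwiki-{date}-pages-articles.xml.bz2"
    return f"https://dumps.wikimedia.org/zhwiki/{date}/{name}"


print(dump_url("20200701"))
```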
After downloading, run a command like the following to extract the Chinese corpus:
wiki_txt /share/wiki/zhwiki-20200701-pages-articles.xml.bz2
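Tip: the Wikipedia dump is very large, but the command above also runs on a partially downloaded file (it ends with an error, but still produces partial output, which is convenient during development).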
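Why a partial download still yields output can be illustrated with Python's standard `bz2` module: a bz2 stream is decoded block by block, so complete blocks at the start of a truncated file remain readable. This is a minimal sketch, not how `wiki_txt` is implemented; the function name `iter_lines_until_truncation` and the sample data are made up for the demonstration.

```python
import bz2
import io


def iter_lines_until_truncation(fileobj):
    """Yield text lines from a bz2 stream, stopping at a truncation point.

    The bz2 module raises EOFError when the compressed data ends before
    the end-of-stream marker; lines decoded before that are still yielded.
    """
    with bz2.open(fileobj, "rb") as f:
        try:
            for raw in f:
                yield raw.decode("utf-8").rstrip("\n")
        except EOFError:
            return  # truncated input: keep what was already decoded


# Made-up sample data, large enough to span several bz2 blocks.
sample = "".join(f"第{i}行\n" for i in range(100000)).encode("utf-8")
compressed = bz2.compress(sample, compresslevel=1)
truncated = compressed[: len(compressed) // 2]  # simulate a partial download

recovered = list(iter_lines_until_truncation(io.BytesIO(truncated)))
print(f"recovered {len(recovered)} of 100000 lines")
```

Unlike the command described above, this sketch swallows the error instead of reporting it, which suits quick experiments but would hide a genuinely corrupted download.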
Two files are written to the same directory as the bz2 file:
- Entry bodies: zhwiki-20200701-pages-articles.title.txt.zd
- Entry titles: zhwiki-20200701-pages-articles.txt.zd
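When scripting around the extraction step, it can help to compute the two output paths from the dump path. The sketch below assumes the naming rule seen in the example (the `.xml.bz2` suffix replaced by `.title.txt.zd` and `.txt.zd`); the helper `expected_outputs` is hypothetical and not part of zword.

```python
from pathlib import PurePosixPath
from typing import Tuple

DUMP_SUFFIX = ".xml.bz2"


def expected_outputs(dump_path: str) -> Tuple[PurePosixPath, PurePosixPath]:
    """Return the two .zd paths expected next to the given dump file.

    Assumption: the naming rule is inferred from the single example in
    this README.
    """
    path = PurePosixPath(dump_path)
    if not path.name.endswith(DUMP_SUFFIX):
        raise ValueError(f"expected a file ending in {DUMP_SUFFIX}")
    stem = path.name[: -len(DUMP_SUFFIX)]
    return (
        path.with_name(stem + ".title.txt.zd"),
        path.with_name(stem + ".txt.zd"),
    )


for out in expected_outputs("/share/wiki/zhwiki-20200701-pages-articles.xml.bz2"):
    print(out)
```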
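Both files are plain text compressed with Zstandard (see "Zstandard: a new lossless compression algorithm").
They can be viewed with the zdcat command bundled with this package, for example: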
zdcat /share/wiki/zhwiki-20200701-pages-articles.title.txt.zd
In the entry bodies, each entry's title line begins with "➜ ".
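The "➜ " marker makes it straightforward to split the text back into entries. The following is a minimal sketch that works on any iterable of text lines; it assumes each title sits on its own line and that the lines after it, up to the next title, form the body. The function `iter_entries` and the sample lines are invented for illustration.

```python
TITLE_MARK = "➜ "


def iter_entries(lines):
    """Group lines into (title, body) pairs using the title marker."""
    title = None
    body = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(TITLE_MARK):
            if title is not None:
                yield title, "\n".join(body).strip()
            title = line[len(TITLE_MARK):].strip()
            body = []
        elif title is not None:
            body.append(line)
        # Lines before the first title are ignored.
    if title is not None:
        yield title, "\n".join(body).strip()


# Made-up sample lines standing in for the decompressed text.
sample_lines = [
    "➜ 数学\n",
    "数学是研究数量、结构的学科。\n",
    "\n",
    "➜ 物理学\n",
    "物理学研究物质与能量。\n",
]
for title, body in iter_entries(sample_lines):
    print(title, "=>", body)
```

Because the function accepts any iterable of strings, it could be fed from an opened zd file as well, provided that file yields text lines; that behaviour is an assumption, not something documented here.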
To read a zd file from a program, use something like the following:
from zword import zd

with zd.open(
    "/share/wiki/zhwiki-20200701-pages-articles.txt.zd"
) as f:
    for i in f:
        print(i)
Special thanks