Extract a Chinese corpus from Wikipedia

Project description

zword

Install it as follows (Python 3 is required):

pip install zword

If you run into problems, please open an issue at gitee.com/znlp/zword/issues.

Extracting a Chinese corpus from Wikipedia

Wikipedia dumps can be downloaded from: dumps.wikimedia.org/zhwiki

That page lists many files. Download, for example, https://dumps.wikimedia.org/zhwiki/20200701/zhwiki-20200701-pages-articles.xml.bz2

After downloading, run a command like the following to extract the Chinese corpus:

wiki_txt /share/wiki/zhwiki-20200701-pages-articles.xml.bz2

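Tip: the Wikipedia dump is very large, but the command above also runs on a partially downloaded file. It will report an error, but it still produces partial output, which is convenient during development.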
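Why a truncated download is still usable can be illustrated with the standard library alone. bzip2 compresses data in independent blocks, so every block that arrived completely can be decompressed. The sketch below is an illustration only and makes no claim about how wiki_txt reads the dump; the function name `read_partial_bz2` is hypothetical, and it handles a single bzip2 stream.

```python
import bz2
from typing import Iterator


def read_partial_bz2(path: str, chunk_size: int = 1 << 16) -> Iterator[bytes]:
    """Yield decompressed chunks from a .bz2 file that may be truncated."""
    decomp = bz2.BZ2Decompressor()
    with open(path, "rb") as f:
        # Stop at the end-of-stream marker or when the file runs out of bytes.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # file ended before the end-of-stream marker (truncated)
            data = decomp.decompress(chunk)
            if data:
                yield data
```

Unlike `bz2.open`, which raises `EOFError` when the compressed file ends early, this loop simply stops after the last complete block.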

Two files are written to the same directory as the bz2 file:

  • Entry body text: zhwiki-20200701-pages-articles.title.txt.zd
  • Entry titles: zhwiki-20200701-pages-articles.txt.zd

Both files are plain text compressed with Zstandard (see "Zstandard: a new lossless compression algorithm").
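In a script it can help to compute the two output paths up front. The sketch below infers the naming rule from the single example above (drop `.xml.bz2`, then append `.title.txt.zd` and `.txt.zd`); that rule is an assumption rather than documented behaviour, and `expected_outputs` is a hypothetical helper, not a zword function.

```python
from pathlib import Path
from typing import Tuple

DUMP_SUFFIX = ".xml.bz2"


def expected_outputs(dump_path: str) -> Tuple[Path, Path]:
    """Return the (.title.txt.zd, .txt.zd) paths next to the given dump file."""
    dump = Path(dump_path)
    if not dump.name.endswith(DUMP_SUFFIX):
        raise ValueError(f"expected a file ending in {DUMP_SUFFIX}: {dump.name}")
    stem = dump.name[: -len(DUMP_SUFFIX)]
    # Both outputs live in the same directory as the dump.
    return dump.with_name(stem + ".title.txt.zd"), dump.with_name(stem + ".txt.zd")
```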

They can be viewed with the zdcat command bundled with this package, for example:

zdcat /share/wiki/zhwiki-20200701-pages-articles.title.txt.zd

In the entry body text, each entry's title line begins with "➜ ".

To read a zd file from a program, use the following:

from zword import zd

# Open the Zstandard-compressed file and iterate over its contents
with zd.open(
  "/share/wiki/zhwiki-20200701-pages-articles.txt.zd"
) as f:
  for line in f:
    print(line)
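Because title lines in the body text start with "➜ ", the lines read this way can be grouped into (title, body) pairs. The sketch below assumes that the lines are `str` objects and that an entry's body runs until the next title line; neither is confirmed by the package documentation. `iter_entries` and `TITLE_PREFIX` are hypothetical names, not part of zword.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

TITLE_PREFIX = "➜ "  # marker described above for title lines


def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Group corpus lines into (title, body) pairs."""
    title: Optional[str] = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(TITLE_PREFIX):
            if title is not None:
                yield title, "\n".join(body).strip()
            title = line[len(TITLE_PREFIX):].strip()
            body = []
        elif title is not None:
            body.append(line)  # lines before the first title are ignored
    if title is not None:
        yield title, "\n".join(body).strip()
```

If `zd.open` yields text lines, the file object `f` from the snippet above could be passed directly, as in `for title, body in iter_entries(f): ...`.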

Special thanks

The code is adapted from the article "Obtaining and Processing a Chinese Wikipedia Corpus" on 科学空间|Scientific Spaces.

Download files

Source Distribution

zword-0.0.4.tar.gz (3.2 kB)

Uploaded Source

Built Distribution

zword-0.0.4-py3-none-any.whl (9.5 kB)

Uploaded Python 3

File details

Details for the file zword-0.0.4.tar.gz.

File metadata

  • Download URL: zword-0.0.4.tar.gz
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.8

File hashes

Hashes for zword-0.0.4.tar.gz
Algorithm    Hash digest
SHA256       b89e581321e205840149108c5d386e5f31dd3a887bc27612d9eb61a904a8d7e2
MD5          698d1966cd10fbec169fd86c4a1c2559
BLAKE2b-256  43634813740e7557d5eafe0c606b005342f42f5f16a1ae39fb8ef12bf9c5d975

File details

Details for the file zword-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: zword-0.0.4-py3-none-any.whl
  • Size: 9.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.8

File hashes

Hashes for zword-0.0.4-py3-none-any.whl
Algorithm    Hash digest
SHA256       d798de61a9f27e95b8d266b3ed98d6bf9bda8f07b281ee3d808a7df703267df2
MD5          9137948fa578a6b87d0b8b090369bd7d
BLAKE2b-256  1312e8027aa21f0479461e1d3dae8779a3661fdbe60986961a5ceafed15e8beb
