Wiki Extractor

📂 WikiExtractor 💡

A wiki preprocessing tool that dumps whole wiki content. It can:

  • clean up unused content: symbols, marks, tags
  • extract useful knowledge: synonyms, concepts, relationships, translations

Usage

How to use:

git clone this project
run main.py

Function

init

from wiki import WikiDump
wiki = WikiDump(language_source="zh_yuewiki", c2t=False)

Arguments

dump_articles(outfile, type="csv")

Arguments

  • outfile(String) : name of output file
  • type(String) : csv or text

Result

csv :
數學,"歐幾裏得,西元前三世紀的古希臘數學家,現在被認為是幾何之父,此畫為拉斐爾的作品《雅典學院》。
數學是利用符號語言研究數量、結構、變化以及空間等概念的一門學科,從某種角度看屬於形式科學的一種。數學透過抽象化和邏輯推理的使用,由計數、計算、量度和對物體形狀及運動的觀察而產生。數學家們拓展這些概念,為了公式化新的猜想以及從選定的公理及定義中建立起嚴謹推導出的定理。
......
text :
數學
歐幾裏得,西元前三世紀的古希臘數學家,現在被認為是幾何之父,此畫為拉斐爾的作品《雅典學院》。
數學是利用符號語言研究數量、結構、變化以及空間等概念的一門學科,從某種角度看屬於形式科學的一種。數學透過抽象化和邏輯推理的使用,由計數、計算、量度和對物體形狀及運動的觀察而產生。數學家們拓展這些概念,為了公式化新的猜想以及從選定的公理及定義中建立起嚴謹推導出的定理。
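
Because an article body spans several lines inside one quoted field, the csv output is easier to read back with a CSV parser than by splitting on line breaks. The sketch below is not part of this project: `read_articles` is a hypothetical helper, the sample string is made up for illustration, and it assumes each record holds exactly two fields (title, then body) quoted in the standard CSV way.

```python
import csv
import io


def read_articles(csv_text):
    """Parse dump_articles-style CSV text into (title, body) tuples.

    Assumption: every record has two fields, title first and body second.
    """
    # newline="" leaves line breaks inside quoted fields untouched.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Made-up sample in the same shape as the csv result shown above.
sample = '數學,"第一行。\n第二行。"\n物理,"物理學是一門自然科學。"\n'

articles = read_articles(sample)
for title, body in articles:
    print(title, len(body.splitlines()))
```

For a real output file, the same reader can be fed `open(path, encoding="utf-8", newline="")` instead of the in-memory string.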

dump_redirect_pair(outfile, type)

Gets all redirect pairs.

Arguments

  • outfile(String) : name of output file
  • type(String) : csv or dict

Result

csv:
origin.redirect to
鋼の錬金術師,鋼之鍊金術師
香港仔海旁道,香港仔海傍道
飛機外部燈光,航行燈
螢幕八爪娛,熒幕八爪娛
司农卿,大司農
大司农卿,大司農
司農,大司農
司农,大司農
Earth 2160,地球2160
图勒凯尔姆,图勒凯尔姆省
盖勒吉利耶,盖勒吉利耶省
......
dict:
鋼の錬金術師
鋼之鍊金術師
香港仔海旁道
香港仔海傍道
飛機外部燈光
航行燈
螢幕八爪娛
熒幕八爪娛
司农卿
大司農
大司农卿
大司農
司農
大司農
司农
大司農
Earth 2160
地球2160
图勒凯尔姆
图勒凯尔姆省
盖勒吉利耶
盖勒吉利耶省
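
Redirect pairs are a natural source of synonyms, since several source titles can point to the same target. The sketch below shows one way to load either result format and group sources by target. It is not part of this project: all function names are hypothetical, and it assumes the csv result starts with one header row and the dict result alternates source and target lines, as in the samples above.

```python
import csv
import io
from collections import defaultdict


def pairs_from_csv(csv_text, has_header=True):
    """Read (source, target) pairs from the csv result."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if has_header:
        rows = rows[1:]  # assumption: the first row is a header
    return [(row[0], row[1]) for row in rows if len(row) == 2]


def pairs_from_dict_text(text):
    """Read (source, target) pairs from the dict result.

    Assumption: non-empty lines alternate source, target, source, target.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    # Pair line 0 with 1, 2 with 3, and so on; a trailing odd line is dropped.
    return list(zip(lines[0::2], lines[1::2]))


def group_synonyms(pairs):
    """Map each redirect target to the list of titles that redirect to it."""
    groups = defaultdict(list)
    for source, target in pairs:
        groups[target].append(source)
    return dict(groups)


csv_sample = "origin.redirect to\n司农卿,大司農\n大司农卿,大司農\n司農,大司農\nEarth 2160,地球2160\n"
dict_sample = "司农卿\n大司農\n大司农卿\n大司農\n司農\n大司農\nEarth 2160\n地球2160\n"

csv_pairs = pairs_from_csv(csv_sample)
dict_pairs = pairs_from_dict_text(dict_sample)
synonyms = group_synonyms(csv_pairs)
print(synonyms)
```

Grouping by target rather than by source keeps every variant spelling of a title in one place, which is usually what a synonym lookup needs.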

dump_langlink(outfile, type):

Arguments

  • outfile(String) : name of output file
  • type(String) : csv or dict

dump_category(outfile, type="csv"):

Use this to extract the nouns that belong to specific categories.

Arguments

  • outfile(String) : name of output file
  • type(String) : csv or dict
