Simple Wiktionary scraper
pyktionary
Simple Wiktionary scraper. Get information from words in Wiktionary.
The module is at an early stage. Be advised that:
- Only the French Wiktionary is supported.
- The following sections are not scraped:
  - Prononciation
  - Anagrammes
  - Voir aussi
  - Références
  - Forme de verbe
- Any section not matching Étymologie is scraped as Définition.
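The rules above can be sketched as a small classification function. This is only an illustration of the documented behaviour, not the module's actual code: the helper name `classify_section`, the `SKIPPED_SECTIONS` set, and the use of exact title matching are all assumptions.

```python
from typing import Optional

# Section titles the list above says are not scraped.
SKIPPED_SECTIONS = {
    "Prononciation",
    "Anagrammes",
    "Voir aussi",
    "Références",
    "Forme de verbe",
}


def classify_section(title: str) -> Optional[str]:
    """Return the dict key a section would be stored under, or None if skipped.

    Hypothetical helper: exact title comparison is an assumption, the
    module may match section titles differently.
    """
    if title in SKIPPED_SECTIONS:
        return None
    if title == "Étymologie":
        return "Étymologie"
    # Every other section falls back to "Définition".
    return "Définition"
```

Under these assumptions, a section such as "Nom commun" would end up under the Définition key, while "Prononciation" would be dropped.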
What pyktionary is
A scraper that gets data on words from Wiktionary. Each section of a word is scraped as raw HTML and stored in a dict keyed by section name; see Example.
What pyktionary is not
An interface to make changes to Wiktionary. You can NOT send data to Wiktionary with this module.
What's next?
This module is at a very early stage. It only covers my specific use case, which is scraping a word's etymology and definitions from the French Wiktionary.
The module will improve over time. Priorities are for the following features and fixes:
- Scrape all sections of a word.
- Support Wiktionaries in other languages.
See the TODO for further planned work.
Usage
from pyktionary import Wiktionary
# ...
wik = Wiktionary()
word = wik.word("oui")
# ...
Example
For the word oui, the following code:
import pprint
from pyktionary import Wiktionary
wik = Wiktionary()
word = wik.word("oui")  # dict mapping section names to raw HTML
pprint.pprint(word, compact=True)
produces this output:
{
'Définition': '<ol><li>Réponsede<i><ahref="https://fr.wiktionary.org/wiki/oui#fr-interj"title="oui">oui</a></i>.Votepour.<strong>Noted’usage:</strong>L’<ahref="https://fr.wiktionary.org/wiki/article"title="article">article</a>définines’<ahref="https://fr.wiktionary.org/wiki/%C3%A9lider"title="élider">élide</a>pasdevantcemot.<ul><li><i>Uneballade,uneballade!s’écrial’ermite,celavautmieuxquetouslesocetles<b>oui</b>deFrance.</i><spanclass="sources"><spanclass="tiret">—</span>(<aclass="extiw"href="https://fr.wikipedia.org/wiki/Walter_Scott"title="w:WalterScott">Walter<spanclass="petites_capitales"style="font-variant:small-caps">Scott</span></a>,<i><aclass="extiw"href="https://fr.wikipedia.org/wiki/Ivanho%C3%A9"title="w:Ivanhoé">Ivanhoé</a></i>,traduitdel’anglaispar<aclass="extiw"href="https://fr.wikipedia.org/wiki/Alexandre_Dumas"title="w:AlexandreDumas">Alexandre<spanclass="petites_capitales"style="font-variant:small-caps">Dumas</span></a>,<aclass="extiw"href="https://fr.wikisource.org/wiki/Ivanho%C3%A9_(Scott_-_Dumas)"title="s:Ivanhoé(Scott-Dumas)">1820</a>)</span></li><li><i>Le<b>oui</b>etlenon.</i></li><li><i>Iladitce<b>oui</b>-làdeboncœur.</i></li><li><i>Ilnefautpastantdediscours,onnevousdemandequ’un<b>oui</b>ouunnon.Ditesunbon<b>oui</b>.</i></li></ul></li></ol>',
'Étymologie': '<dl><dd><spanclass="date"><i>(<spanclass="texte">1380</span>)</i></span>Del’ancienfrançais<i><spanclass="lang-fro"lang="fro"><ahref="https://fr.wiktionary.org/wiki/o%C3%AFl#fro"title="oïl">oïl</a></span></i><spanclass="date"><i>(<spanclass="texte">1080</span>)</i></span>,formecomposéede<i>o</i>«cela»<spanclass="date"><i>(<spanclass="texte">842</span>)</i></span>,ausensde«oui»(àcomparerde<i><ahref="https://fr.wiktionary.org/wiki/%C3%B2c"title="òc">òc</a></i>«oui»en<ahref="https://fr.wiktionary.org/wiki/occitan"title="occitan">occitan</a>),renforcéparlepronompersonnel<i><ahref="https://fr.wiktionary.org/wiki/il"title="il">il</a></i>(ontrouveaussi<i>o-je</i>,<i>o-tu</i>,<i>onos</i>,<i>ovos</i>).<spanid="ref-1"><small></small><sup><ahref="#reference-1">[1]</a></sup></span><spanid="ref-2"><small></small><sup><ahref="#reference-2">[2]</a></sup></span>Lesmots«oui»et«òc»sontdescalquesceltiques<supclass="reference"id="cite_ref-1"><ahref="#cite_note-1">[1]</a></sup></dd></dl>'
}
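Because the values are raw HTML, a caller who only wants the text has to strip the markup. A minimal sketch using only the standard library's `html.parser` follows. The `TextExtractor` class and the `html_to_text` helper are hypothetical names, not part of pyktionary, and the `sample` dict is a hand-written stand-in shaped like the output above rather than real scraped data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def html_to_text(fragment):
    """Return the plain text contained in an HTML fragment."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hand-written stand-in for the dict returned by wik.word(...).
sample = {
    "Définition": "<ol><li>Réponse de <i>oui</i>.</li></ol>",
    "Étymologie": "<dl><dd>De l’ancien français <i>oïl</i>.</dd></dl>",
}

# Same keys as the input, with the markup removed from each value.
plain = {section: html_to_text(html) for section, html in sample.items()}
```

With the stand-in data, `plain["Définition"]` is `"Réponse de oui."`. For heavier processing (following links, keeping list structure), a dedicated HTML library would likely be a better fit than this sketch.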
Licence
This module is licensed under the GNU GPL v3.
Hashes for pyktionary-0.5a0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0bc3def2c33254f3b6408cf27e125a284ef48b2a1525cc7c6e33af7b8f35e135
MD5 | 0b173a488c4847ab3df0d1f59427d18f
BLAKE2b-256 | 69d265c8896a83a85528b8717907da6215c3c7558b0fb408eac6c97e55551671