Python package for downloading the wiki-links corpus
PyWikiLinks
-----------
Download a corpus of links to Wikipedia, along with their anchor text
and surrounding context.
![multiple hypertext links from around the web shown in context pointing to Wikipedia articles](readme_images/wiki-link-figure.jpg)
This package lets you download and decode the wiki-links corpus. It
contains the Python 3 code needed to read and decode the Apache Thrift
`WikiLinkItem` records serialized in the dataset.
### Example from wiki-links
Below are three mentions read from the downloaded corpus. Each mention
carries a link to the Wikipedia article for the mentioned entity (the
corpus also provides a Freebase id), the anchor text of the link under
"middle", and the text before and after the link under "left" and
"right":
```bash
CONTEXT: Context(middle=b'Graphic designers', right=b'typically don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t get involved in HTML and CSS coding. Front-end developers code designs in HTML, CSS, and JavaScript . The term \xc3\xa2\xe2\x82\xac\xc5\x93web designer\xc3\xa2\xe2\x82\xac\xef\xbf\xbd means different\xc3\x82\xc2\xa0\xc3\x82\xc2\xa0', left=b'Photoshop or Fireworks and leave the HTML and CSS to others. Or you may choose to do your own coding. Line Between Design and Implementation')
ARTICLE: b'http://en.wikipedia.org/wiki/Graphic_designer'
CONTEXT: Context(middle=b'JavaScript', right=b'. The term \xc3\xa2\xe2\x82\xac\xc5\x93web designer\xc3\xa2\xe2\x82\xac\xef\xbf\xbd means different\xc3\x82\xc2\xa0\xc3\x82\xc2\xa0 things to different people, but typically it implies taking on both the graphic designer role and at least', left=b'coding. Line Between Design and Implementation Graphic designers typically don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t get involved in HTML and CSS coding. Front-end developers code designs in HTML, CSS, and')
ARTICLE: b'http://en.wikipedia.org/wiki/JavaScript'
CONTEXT: Context(middle=b'Graphic design', right=b'and programming are very different skills, and relatively few people have a natural talent for both of them. Design is mostly a right-brain, creative activity,', left=b'approach for you depends on your interests and aptitudes, your partners, and the kinds of sites you expect to build. Advantages of the Designer/Coder Split')
ARTICLE: b'http://en.wikipedia.org/wiki/Graphic_design'
```
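To see how the three fields fit together, here is a minimal sketch that rebuilds the passage around a mention. The `Context` namedtuple below is a stand-in that only mirrors the field names printed above, and `mention_in_context` is a hypothetical helper; neither is taken from the package itself.

```python
from collections import namedtuple

# Stand-in mirroring the fields printed above (assumption: not the package's own class).
Context = namedtuple("Context", ["middle", "right", "left"])


def mention_in_context(ctx, marker=b"**"):
    """Join left context, anchor text, and right context into one byte string.

    The anchor text is wrapped in `marker` so the mention stays visible.
    """
    return b" ".join([ctx.left, marker + ctx.middle + marker, ctx.right])


ctx = Context(
    middle=b"Graphic design",
    right=b"and programming are very different skills,",
    left=b"Advantages of the Designer/Coder Split",
)
print(mention_in_context(ctx))
```

The fields stay as `bytes` here on purpose, since the corpus stores raw bytes; decoding is a separate step.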
As you may have noticed, this data also contains many non-ASCII characters, which show up as escaped bytes in the text
above. Most often these are Unicode quotes or special punctuation that need to be normalized.
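The escaped sequences in the sample look like curly quotes and non-breaking spaces that were mis-decoded as Windows-1252 and then re-encoded as UTF-8 before they reached the corpus. The sketch below is one possible way to normalize them, not something the package provides: `normalize_context` and the `REPLACEMENTS` table are hypothetical, and the table only covers the sequences visible in the example above.

```python
# Garbled sequences seen in the sample output, mapped to plain ASCII.
# Assumption: this table is illustrative and far from exhaustive.
REPLACEMENTS = {
    "\u00e2\u20ac\u2122": "'",  # garbled right single quote
    "\u00e2\u20ac\u0153": '"',  # garbled left double quote
    "\u00e2\u20ac\ufffd": '"',  # garbled right double quote (last byte already lost)
    "\u00c2\u00a0": " ",        # garbled non-breaking space
}


def normalize_context(raw: bytes) -> str:
    """Decode a context field and replace known garbled punctuation."""
    text = raw.decode("utf-8", errors="replace")
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Collapse runs of whitespace left behind by the replacements.
    return " ".join(text.split())


print(normalize_context(b"typically don\xc3\xa2\xe2\x82\xac\xe2\x84\xa2t get involved"))
# typically don't get involved
```

An explicit lookup table is used here because one of the sequences already contains a replacement character, so a plain encode/decode round trip would not recover it.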
### Installation
```bash
pip3 install pywikilinks
```
### Source distribution
`pywikilinks-0.0.2.tar.gz` (11.3 kB, source, not uploaded using Trusted Publishing)

| Algorithm | Hash digest |
|---|---|
| SHA256 | f49e53a5d637bc4774206d83971b6dff7b2d1f888fe21150609612310a580db6 |
| MD5 | 1dd41dc2dec89d739d6b5066fcd362e4 |
| BLAKE2b-256 | ea9ba44c8067b4b0be2aa53528e8a996ba60eb8892deea311ad008e27c686f0a |