A multiprocessing web-scraping application that scrapes wiki pages and finds the minimum number of links between two given wiki pages.
Project description
wikilink is a multiprocessing web-scraping application that scrapes wiki pages, extracts URLs, and finds the minimum number of links between two given wiki pages.
I briefly discussed the motivation and gave an overview of the project in my blog.
The project is currently at version v0.3.0.post1; see the change log for more details on release history.
If you like this project, feel free to leave a few words of appreciation here
Usage
Install with pip
$ pip install wikilink
Database support
wikilink currently supports MySQL and PostgreSQL.
API
setup_db(db, username, password, ip="127.0.0.1", port=3306): set up the database
Args:
db(str): database engine; currently supports "mysql" and "postgresql"
username(str): database username
password(str): database password
ip(str): IP address of the database (default="127.0.0.1")
port(str): port that the database is running on (default=3306)
Returns:
None
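For instance, connecting to a local PostgreSQL instance might look like the sketch below; the credentials are placeholders, and "5432" is PostgreSQL's conventional port rather than the default of 3306 shown in the signature:
>>> from wikilink import WikiLink
>>> app = WikiLink()
>>> # placeholder credentials; 5432 is the usual PostgreSQL port
>>> app.setup_db("postgresql", "postgres", "secret", "127.0.0.1", "5432")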
min_link(source, destination, limit=6, multiprocessing=False): find the minimum number of links from the source url to the destination url within the limit
Args:
source(str): source wiki url, e.g. "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
destination(str): destination wiki url, e.g. "https://en.wikipedia.org/wiki/Lionel_Messi"
limit(int): max number of links from the source that will be considered (default=6)
multiprocessing(boolean): enable/disable multiprocessing mode (default=False)
Returns:
(int) the minimum number of links separating the source and destination urls;
returns None and prints a message if the limit is exceeded or no path is found
Raises:
DisconnectionError: error connecting to the DB
Examples
>>> from wikilink import WikiLink
>>> app = WikiLink()
>>> app.setup_db("mysql", "root", "12345", "127.0.0.1", "3306")
>>> source = "https://en.wikipedia.org/wiki/Cristiano_Ronaldo"
>>> destination = "https://en.wikipedia.org/wiki/Lionel_Messi"
>>> app.min_link(source, destination, 6)
1
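The multiprocessing flag documented above can be enabled in the same call. A minimal sketch, reusing the source and destination from the example and assuming the same database setup:
>>> # same pair as above, with the documented multiprocessing flag enabled
>>> app.min_link(source, destination, limit=6, multiprocessing=True)
1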
Contribution 
How to contribute
Please follow our contribution conventions in the contribution instructions and code of conduct.
To set up the development environment, simply run:
$ pip install -r requirements.txt
Please check out the issue file for the list of issues that require help.
Appreciation
Feel free to add your name to the list of contributors. You will automatically be inducted into the Hall of Fame as a way to show my appreciation for your contributions.
Hall of Fame
License
See the LICENSE file for license rights and limitations (Apache License 2.0).
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file wikilink-0.3.0.post1.tar.gz.
File metadata
- Download URL: wikilink-0.3.0.post1.tar.gz
- Upload date:
- Size: 18.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1f22b27d8f1e2a77aa4060491c0232df9c623ec8670de33220c70f8dc0b294e5
MD5 | 4a3847eb0e1f977770f2a591b4508420
BLAKE2b-256 | b57eb8fa75975522897b2a31dca41ca2158c4d0b5b80668afd0b63fdc092f430
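To verify a downloaded archive against the SHA256 digest listed above, a minimal Python sketch (the filename assumes the source distribution sits in the current directory; adjust the path as needed):
import hashlib

# Hypothetical local path to the downloaded source distribution.
path = "wikilink-0.3.0.post1.tar.gz"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Expected value taken from the SHA256 row of the table above.
expected = "1f22b27d8f1e2a77aa4060491c0232df9c623ec8670de33220c70f8dc0b294e5"
print("OK" if digest == expected else "MISMATCH")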
File details
Details for the file wikilink-0.3.0.post1-py3-none-any.whl.
File metadata
- Download URL: wikilink-0.3.0.post1-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.6.2 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0a4d7b74fef81a880339be27c1157bb20ea2854e656766d2ae8379339631ea94
MD5 | 308a4ecbf598115b7032e0a2dbc033ad
BLAKE2b-256 | 008dd423436ac2fcba1f715c66c426d051738d39076cd51e7218613ede6b96ad