Scrape parent-child relationships from Wikipedia infoboxes.
Genea
Pronounced "genie". Scrape parent-child relationships from Wikipedia infoboxes.
Why Infoboxes?
Infoboxes give us a digest of a particular Wikipedia page, in addition to the relational information that we'll need to build a tree.
Modified infobox as seen on the Wikipedia page for George Washington
In the image above, we can see rows of data under the "Personal Details" section; each of these rows contains a header (bolded text) and, typically, associated links.
We'll use regular expression patterns to match these headers: some provide ancestral relationships ("Parents", in this case), some provide descendant relationships ("Children"), and others provide extra links that we can walk out from ("Relatives").
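To make the header-matching idea concrete, here is a minimal sketch. It is not Genea's actual code: the `classify` function, the `rows` data, and the use of `re.search` are all assumptions made for illustration, and the rows are hand-written stand-ins for what a scraped infobox might contain.

```python
import re

# Hypothetical (header, links) rows, loosely modeled on the infobox above.
# A real run would scrape these from the Wikipedia page instead.
rows = [
    ("Born", []),
    ("Parents", ["Augustine Washington Sr.", "Mary Washington"]),
    ("Children", ["John Parke Custis"]),
    ("Relatives", ["Washington family"]),
]


def classify(rows, pre=None, post=None, extra=None):
    """Sort linked page titles into buckets by which pattern the header matches."""
    patterns = {"ancestors": pre, "descendants": post, "extra": extra}
    found = {name: [] for name in patterns}
    for header, links in rows:
        for name, pattern in patterns.items():
            # Assumption: patterns are applied with re.search, so "^Parent"
            # matches any header that starts with "Parent".
            if pattern and re.search(pattern, header):
                found[name].extend(links)
    return found


result = classify(rows, pre="^Parent", post="^Child", extra="^Relative")
print(result)
```

Rows whose header matches none of the patterns ("Born" here) are simply ignored, which is why the choice of patterns decides what kind of tree gets built.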
Let's try this out on the George Washington example.
Installation
Clone this repository to your local machine with git, then install with Python.
git clone https://github.com/shanedrabing/genea.git
cd genea
python setup.py install
Getting Started
Run the program with Python.
python genea.py "George Washington" "^Parent" "^Child"
Positional Arguments
term
: Search term. Redirects to initial Wikipedia page.
pre
: (optional, regex) If matched, will add ancestor.
post
: (optional, regex) If matched, will add descendant.
Named Arguments
-n [STEPS]
: How many steps to walk from initial page?
-e [EXTRA]
: (regex) If matched, will add additional links (no relation).
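For readers who want to see how such an interface could be declared, the sketch below mirrors the arguments listed above with `argparse`. It is an illustration, not the parser shipped in `genea.py`: the `build_parser` name, the `steps` destination, the integer type, and every default value are assumptions. The `--extra` long form is taken from the Secretariat example in this README.

```python
import argparse


def build_parser():
    """Build a parser resembling the documented command line (a sketch)."""
    parser = argparse.ArgumentParser(prog="genea.py")
    parser.add_argument("term", help="Search term. Redirects to initial Wikipedia page.")
    # Both patterns are optional positionals, so either may be left out.
    parser.add_argument("pre", nargs="?", default=None,
                        help="(optional, regex) If matched, will add ancestor.")
    parser.add_argument("post", nargs="?", default=None,
                        help="(optional, regex) If matched, will add descendant.")
    # Assumption: steps is an integer and defaults to 1.
    parser.add_argument("-n", dest="steps", type=int, default=1,
                        help="How many steps to walk from initial page?")
    parser.add_argument("-e", "--extra", default=None,
                        help="(regex) If matched, will add additional links (no relation).")
    return parser


args = build_parser().parse_args(["George Washington", "^Parent", "^Child"])
print(args.term, args.pre, args.post, args.steps, args.extra)
```

Because `pre` and `post` are filled in order, a call that supplies only one pattern assigns it to `pre` and leaves `post` unset.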
Text Output
ANCESTORS of George Washington
├── Augustine Washington Sr.
│   ├── Mildred Gale
│   │   └── Augustine Warner Jr.
│   │       └── Augustine Warner
│   └── Lawrence Washington
│       └── John Washington
│           └── Lawrence Washington
└── Mary Washington
DESCENDANTS of George Washington
└── John Parke Custis
    ├── George Washington Parke Custis
    │   ├── Mary Anna Custis Lee
    │   │   ├── Eleanor Agnes Lee
    │   │   ├── George Washington Custis Lee
    │   │   ├── William Henry Fitzhugh Lee
    │   │   ├── Robert E. Lee Jr.
    │   │   ├── Mildred Childe Lee
    │   │   ├── Anne Carter Lee
    │   │   └── Mary Custis Lee
    │   └── Maria Carter Syphax
    ├── Martha Parke Custis Peter
    ├── Elizabeth Custis Law
    └── Eleanor Parke Custis Lewis
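A tree like the one above can be drawn from nested data with a short recursive function. The sketch below is one possible approach, not necessarily how Genea prints its results: the `render` function and the nested-dict representation are assumptions, and the `ancestors` data is copied by hand from the output above.

```python
def render(tree, prefix=""):
    """Return lines drawing a nested dict as a tree with box-drawing characters."""
    lines = []
    items = list(tree.items())
    for i, (name, children) in enumerate(items):
        last = i == len(items) - 1
        # The last sibling gets a corner; earlier siblings get a tee.
        lines.append(prefix + ("└── " if last else "├── ") + name)
        # Below a last sibling the column is blank; otherwise the bar continues.
        lines.extend(render(children, prefix + ("    " if last else "│   ")))
    return lines


# Hand-copied from the ANCESTORS output above; each key maps to its own parents.
ancestors = {
    "Augustine Washington Sr.": {
        "Mildred Gale": {"Augustine Warner Jr.": {"Augustine Warner": {}}},
        "Lawrence Washington": {"John Washington": {"Lawrence Washington": {}}},
    },
    "Mary Washington": {},
}

print("ANCESTORS of George Washington")
print("\n".join(render(ancestors)))
```

Passing the prefix down through the recursion is what keeps the vertical bars aligned under each open branch.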
Motivating Examples
Try out these other searches! Genea is intended to be general, meaning that any infobox labels you find can define the relationships between pages.
# how many cars succeeded the Ford Quadricycle?
python genea.py "Ford Quadricycle" "^Predecessor" "^Successor"
# what is the pedigree of Secretariat? (goes back to the 1700s!)
python genea.py "Secretariat (horse)" "^(Sire|Dam)$" --extra "sire"
# where did Windows XP come from, where did it go?
python genea.py "Windows XP" "^(Preceded by)$" "^(Succeeded by)$"
# how many child companies does Disney have?
python genea.py "Disney" "Parent" "(Divisions|Subsidiaries)"
License