Scrape parent-child relationships from Wikipedia infoboxes.

Genea

Pronounced "genie". Scrape parent-child relationships from Wikipedia infoboxes.

Why Infoboxes?

Infoboxes give us a digest of a particular Wikipedia page, in addition to the relational information that we'll need to build a tree.

[Figure: infobox_washington.png] Modified infobox as seen on the Wikipedia page for George Washington.

In the image above, we can see rows of data under the "Personal Details" section; each of these rows contains a header (bolded text) and, typically, associated links.
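To make the row structure concrete, here is a minimal sketch that pulls (header, links) pairs out of a small table using only the standard library. The `InfoboxRows` class, the `parse_rows` helper, and the `SAMPLE` markup are all hypothetical and simplified; real Wikipedia infobox HTML is messier, and this is not necessarily how Genea parses pages.

```python
from html.parser import HTMLParser


class InfoboxRows(HTMLParser):
    """Collect (header text, [link hrefs]) pairs, one per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished (header, links) pairs
        self._header = None   # header text of the row being read
        self._links = []      # hrefs found outside the header cell
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # a new row starts: forget the previous row's state
            self._header, self._links = None, []
        elif tag == "th":
            self._in_th = True
            self._header = ""
        elif tag == "a" and not self._in_th:
            href = dict(attrs).get("href")
            if href:
                self._links.append(href)

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False
        elif tag == "tr" and self._header is not None:
            # keep only rows that actually had a header cell
            self.rows.append((self._header.strip(), self._links))

    def handle_data(self, data):
        if self._in_th:
            self._header += data


# Hypothetical, hand-written markup standing in for a real infobox.
SAMPLE = """
<table class="infobox">
<tr><th>Children</th><td><a href="/wiki/John_Parke_Custis">John Parke Custis</a></td></tr>
<tr><th>Parents</th><td><a href="/wiki/Augustine_Washington">Augustine Washington</a><br>
<a href="/wiki/Mary_Ball_Washington">Mary Ball Washington</a></td></tr>
</table>
"""


def parse_rows(html):
    """Return the list of (header, links) pairs found in html."""
    parser = InfoboxRows()
    parser.feed(html)
    return parser.rows


for header, links in parse_rows(SAMPLE):
    print(header, links)
```

Each pair keeps the header text separate from the links, which is the shape needed for matching headers against patterns.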

We'll use regular expression patterns to match these headers. Some headers provide ancestral relationships ("Parents", in this case), some provide descendant relationships ("Children"), and others provide extra links that we can walk out from ("Relatives").
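The matching idea can be sketched in a few lines. The `classify` function below is a hypothetical illustration, not Genea's actual code; it assumes case-sensitive `re.search` and checks the ancestor pattern first, then the descendant pattern, then the extra pattern.

```python
import re


def classify(header, pre=None, post=None, extra=None):
    """Return 'ancestor', 'descendant', 'extra', or None for a header label."""
    if pre and re.search(pre, header):
        return "ancestor"
    if post and re.search(post, header):
        return "descendant"
    if extra and re.search(extra, header):
        return "extra"
    return None  # header is not relevant to the walk


for label in ["Parents", "Children", "Relatives", "Born"]:
    print(label, "->", classify(label, "^Parent", "^Child", "^Relative"))
```

The `^` anchor matters here: `^Parent` matches both "Parent" and "Parents" but would not match a label such as "Grandparents".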

Let's try out the above example.

Installation

Clone this repository to your local machine with git, then install with Python.

git clone https://github.com/shanedrabing/genea.git
cd genea
python setup.py install

Getting Started

Run the program with Python.

python genea.py "George Washington" "^Parent" "^Child"

Positional Arguments

  • term : Search term. Redirects to initial Wikipedia page.
  • pre : (optional, regex) Links under a matching infobox header are added as ancestors.
  • post : (optional, regex) Links under a matching infobox header are added as descendants.

Named Arguments

  • -n [STEPS] : Number of steps to walk from the initial page.
  • -e [EXTRA] : (regex) Links under a matching infobox header are added as additional links (no relation implied).
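For readers who want to see how an interface like this could be declared, here is a sketch using `argparse`. It mirrors the arguments listed above but is an assumption about the layout, not a copy of Genea's source: the `build_parser` name, the `--steps` long option, and the `None` defaults are hypothetical (only `-n`, `-e`, and `--extra` appear in this README).

```python
import argparse


def build_parser():
    """Build a parser resembling the documented command line (illustrative)."""
    parser = argparse.ArgumentParser(prog="genea.py")
    parser.add_argument("term", help="search term for the initial page")
    # nargs="?" makes a positional argument optional
    parser.add_argument("pre", nargs="?", default=None,
                        help="regex; matching headers add ancestors")
    parser.add_argument("post", nargs="?", default=None,
                        help="regex; matching headers add descendants")
    # "--steps" is an assumed long name; the real default is not documented
    parser.add_argument("-n", "--steps", type=int, default=None,
                        help="how many steps to walk from the initial page")
    parser.add_argument("-e", "--extra", default=None,
                        help="regex; matching headers add unrelated links")
    return parser


args = build_parser().parse_args(
    ["George Washington", "^Parent", "^Child", "-n", "3"]
)
print(args.term, args.pre, args.post, args.steps, args.extra)
```

Because `pre` and `post` are optional positionals, a call with only a search term and one pattern leaves `post` unset.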

Text Output

ANCESTORS of George Washington
├── Augustine Washington Sr.
│   ├── Mildred Gale
│   │   └── Augustine Warner Jr.
│   │       └── Augustine Warner
│   └── Lawrence Washington
│       └── John Washington
│           └── Lawrence Washington
└── Mary Washington

DESCENDANTS of George Washington
└── John Parke Custis
    ├── George Washington Parke Custis
    │   ├── Mary Anna Custis Lee
    │   │   ├── Eleanor Agnes Lee
    │   │   ├── George Washington Custis Lee
    │   │   ├── William Henry Fitzhugh Lee
    │   │   ├── Robert E. Lee Jr.
    │   │   ├── Mildred Childe Lee
    │   │   ├── Anne Carter Lee
    │   │   └── Mary Custis Lee
    │   └── Maria Carter Syphax
    ├── Martha Parke Custis Peter
    ├── Elizabeth Custis Law
    └── Eleanor Parke Custis Lewis
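
The layout above can be reproduced with a short recursive function. This is a sketch under the assumption that the tree is held as a nested dict of names; the `render` function and the `ancestors` structure are hypothetical, and Genea may store and print its results differently.

```python
def render(tree, prefix=""):
    """Return the display lines for a nested dict {name: {child: {...}}}."""
    lines = []
    items = list(tree.items())
    for i, (name, children) in enumerate(items):
        last = i == len(items) - 1
        # the last sibling gets a corner, the others get a tee
        lines.append(prefix + ("└── " if last else "├── ") + name)
        # children of a non-last sibling keep the vertical bar going
        lines.extend(render(children, prefix + ("    " if last else "│   ")))
    return lines


# Names copied from the ancestor tree in the sample output.
ancestors = {
    "Augustine Washington Sr.": {
        "Mildred Gale": {"Augustine Warner Jr.": {"Augustine Warner": {}}},
        "Lawrence Washington": {"John Washington": {"Lawrence Washington": {}}},
    },
    "Mary Washington": {},
}

print("ANCESTORS of George Washington")
print("\n".join(render(ancestors)))
```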

Motivating Examples

Try out these other searches! Genea is intended to be general, meaning that any infobox labels you find can define the relationships between pages.

# how many cars succeeded the Ford Quadricycle?
python genea.py "Ford Quadricycle" "^Predecessor" "^Successor"

# what is the pedigree of Secretariat? (goes back to the 1700s!)
python genea.py "Secretariat (horse)" "^(Sire|Dam)$" --extra "sire"

# where did Windows XP come from, where did it go?
python genea.py "Windows XP" "^(Preceded by)$" "^(Succeeded by)$"

# how many child companies does Disney have?
python genea.py "Disney" "Parent" "(Divisions|Subsidiaries)"
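
Before running a search, it can help to check a pattern against the labels you expect to see. The sketch below tests the Secretariat patterns against a hand-written list of labels; it assumes case-sensitive `re.search`, which may not match Genea's exact behavior, and the `labels` list is illustrative.

```python
import re

# Hand-written labels of the kind found in a racehorse infobox.
labels = ["Sire", "Dam", "Damsire", "Grandsire", "Sex"]

pre = "^(Sire|Dam)$"   # exact "Sire" or "Dam" only
extra = "sire"         # any label containing lowercase "sire"

ancestors = [label for label in labels if re.search(pre, label)]
extras = [
    label for label in labels
    if label not in ancestors and re.search(extra, label)
]

print("ancestor headers:", ancestors)
print("extra headers:", extras)
```

The `$` anchor keeps "Damsire" out of the ancestor set, while the unanchored `sire` pattern still picks it up as an extra link.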

License

MIT
