
A library to scrape content from Wikipedia categories

Project description


highway_star

Scrape biographies from Wikipedia categories and plot their life courses

The main goal of this project is to retrieve all biographies from a chosen Wikipedia category, and to plot the life courses of those people with a Sankey diagram. These data can then be analyzed for social-science purposes.
This project was made in partnership with the LEIRIS.

Installation


You can install the project via pip or any other PyPI package manager.

pip install highway-star

Note: you may need additional spaCy models for natural language processing; missing models can cause errors during execution.

Please run these commands in your console:

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

How to use


Scraping


The function below allows you to scrape biographies from every page of the given category and of all the subcategories it crawls.

from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content

content = scrap_wikipedia_structure_with_content(
    root_category="Acteur_français",
    lang="fr")

Let's break down what this function does.
Suppose you want all the biographies from the Wikipedia category Acteurs_français.
[figure: wikipedia_category]
The algorithm gets every page link in the orange rectangle, and stores the information of every subcategory in the red rectangle.
Then it repeats this process for every subcategory, until no category is left.
For example, in the subcategory Acteur_français_de_cinéma of the category Acteurs_français, there is still 1 subcategory, and many new pages to scrape, as shown in the figure just below.
[figure: wikipedia_subcategory]
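The crawling loop just described can be sketched with the public MediaWiki API. This is an illustration of the idea only, not highway_star's actual code (which scrapes the category pages directly); `category_members`, `crawl`, and the `fetch` hook are hypothetical names.

```python
def category_members(category, lang="fr", fetch=None):
    """Return (pages, subcategories) of one category via the MediaWiki API.
    `fetch` lets tests inject a fake API response instead of a real HTTP call."""
    params = {
        "action": "query", "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "max", "format": "json",
    }
    if fetch is None:                       # real HTTP call, lazily imported
        import requests
        url = "https://%s.wikipedia.org/w/api.php" % lang
        fetch = lambda p: requests.get(url, params=p).json()
    pages, subcats = [], []
    for member in fetch(params)["query"]["categorymembers"]:
        # namespace 14 = category pages; everything else is a content page
        (subcats if member["ns"] == 14 else pages).append(member["title"])
    return pages, subcats


def crawl(root_category, lang="fr", fetch=None):
    """Breadth-first walk over a category and all of its subcategories,
    collecting every page title found along the way."""
    seen, queue, all_pages = {root_category}, [root_category], []
    while queue:
        category = queue.pop(0)
        pages, subcats = category_members(category, lang, fetch)
        all_pages.extend(pages)
        for subcat in subcats:
            name = subcat.split(":", 1)[1]  # drop the namespace prefix
            if name not in seen:            # avoid revisiting categories
                seen.add(name)
                queue.append(name)
    return all_pages
```

The `seen` set is what stops the walk once "there are no categories left", even if two categories list each other.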
Then, when it reaches a page, it scrapes all the content between the tags

<span class="mw-headline" id="Biographie">Biographie</span>

and

</h2>

in order to keep only the section content, as in the example shown just below.
[figure: biography_example]
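That extraction step can be sketched with a regular expression. This is an illustration only; `extract_biography` is a hypothetical name and the package's real parsing may differ.

```python
import re

def extract_biography(page_html):
    """Sketch: grab the HTML between the 'Biographie' headline and the next
    <h2>, then strip the remaining tags and normalize whitespace."""
    match = re.search(
        r'<span class="mw-headline" id="Biographie">.*?</h2>(.*?)(?:<h2|\Z)',
        page_html, flags=re.S)
    if match is None:
        return ""                                   # page has no Biographie section
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # drop leftover tags
    return re.sub(r"\s+", " ", text).strip()        # normalize whitespace
```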

The result of this function is a Python dict.
You just have to convert this dictionary to a dataframe using pandas:

import pandas as pd
pd.DataFrame.from_dict(content)

to get an output like this:
[figure: all_scrapped]
The resulting columns are:

  • page_links : links to the pages
  • pages_names : names of the pages
  • subcategory : the category in which the page was found
  • content : the scraped biography content

Preprocessing


Once you have retrieved your data, you may need to preprocess it.

For this, there are two functions: one simple, the other more customizable.

Easy but not customizable way

from highway_star.preprocessing.biography_preprocessor import sent_to_words
sent_to_words(biographies_column=dataframe_with_biographies["biographies"])


The result is a Python list of tokenized biographies.
Just add it to your dataframe using:

content["biographies_tokenized"] = sent_to_words(biographies_column=dataframe_with_biographies["biographies"])
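For reference, the kind of tokenization sent_to_words performs can be approximated in a few lines. This is a sketch under assumptions, not the package's actual implementation.

```python
import re

def simple_sent_to_words(biographies):
    """Sketch of tokenization: lowercase each biography and keep only
    alphabetic tokens (French accented letters included)."""
    return [re.findall(r"[a-zàâäçéèêëîïôöùûüÿœæ]+", bio.lower())
            for bio in biographies]
```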

Complex but customizable way

Note: to run this function, make sure to install the following packages:

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"], 
                                   custom_stop_words = ["ajouter", "oui", "être", "avoir"],
                                   use_lemmatization=True,
                                   allowed_postags=['NOUN', 'VERB'])

This function does the tokenization, but also:

  • lets you supply custom stop words
  • filters the biographies with the stop words of spacy.load('fr_core_news_sm')
  • lets you enable or disable lemmatization
  • lets you filter the biographies by part of speech (e.g., 'NOUN', 'VERB')

The default invocation of this function is:

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"])

with the unspecified parameters set to their defaults:

  • custom_stop_words = None
  • use_lemmatization = False
  • allowed_postags = None
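The stop-word part of this behavior can be sketched in plain Python. Lemmatization and POS filtering, which the real function delegates to spaCy, are omitted here; `filter_stop_words` is a hypothetical name.

```python
def filter_stop_words(tokenized_biographies, stop_words, custom_stop_words=None):
    """Sketch of the stop-word step alone: drop every token that appears in
    either the base stop-word list or the user-supplied custom one."""
    banned = set(stop_words) | set(custom_stop_words or [])
    return [[token for token in bio if token not in banned]
            for bio in tokenized_biographies]
```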

Visualizing


The visualization is done using a Sankey diagram and the PrefixSpan algorithm.

PrefixSpan

PrefixSpan is a data-mining algorithm that retrieves the most frequent sequential patterns in a set of data.
It was introduced in 2001 by Pei, Han, et al. in Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
A Python implementation is available through the PyPI library prefixspan.
Treating the set of data as a set of biographies, it retrieves the most frequent patterns in our biographies.
We can control the length of the patterns it searches for.
The higher the pattern length, the more likely a pattern covers a biography from start to end, but the fewer patterns you will get.
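The core of the algorithm fits in a few lines. The sketch below is a didactic pure-Python re-implementation; the prefixspan library is the faster, more complete version the package actually uses.

```python
def prefixspan(db, minsup, maxlen=3):
    """Minimal PrefixSpan: mine all sequential patterns occurring in at
    least `minsup` sequences of `db`, up to length `maxlen`.
    Returns a list of (support, pattern) tuples."""
    results = []

    def mine(pattern, matches):
        # matches = (sequence index, position where the search resumes)
        occurs = {}
        for i, pos in matches:
            seen = set()
            for j in range(pos, len(db[i])):
                item = db[i][j]
                if item not in seen:          # count each sequence only once
                    seen.add(item)
                    occurs.setdefault(item, []).append((i, j + 1))
        for item, new_matches in occurs.items():
            if len(new_matches) >= minsup:    # extend only frequent prefixes
                new_pattern = pattern + [item]
                results.append((len(new_matches), new_pattern))
                if len(new_pattern) < maxlen:
                    mine(new_pattern, new_matches)

    mine([], [(i, 0) for i in range(len(db))])
    return results
```

Each recursive call projects the database onto the current prefix, which is what makes the search efficient compared to enumerating all subsequences.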

Sankey Diagram

Sankey diagrams are great data-visualization tools for plotting relational data.
[figure: sankey]
A JavaScript implementation is available in Highcharts.

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"],
                                 prefixspan_minlen=15,
                                 prefixspan_topk=100)

This call will find the top 100 patterns of minimum length 15.
The default invocation of this function is:

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"])

with :

  • prefixspan_minlen = 10
  • prefixspan_topk = 50

The output of this function is PrefixSpan output already preprocessed for the Sankey diagram.
It counts the number of times each couple of adjacent items occurs.
E.g., in:

born Alabama write song buy house
born Alabama buy house
born Europe write song buy house

the counts are:

  • born - Alabama = 2
  • buy - house = 3
  • write - song = 2

Note that:

  • Alabama - house

is not a valid couple, because the two items are not next to each other.
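That adjacent-pair counting can be sketched as follows (`count_adjacent_pairs` is an illustrative name, not part of the package):

```python
from collections import Counter

def count_adjacent_pairs(sequences):
    """Count how often each ordered pair of ADJACENT items occurs across
    the sequences: exactly the (source, target, weight) data a Sankey
    diagram needs."""
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):  # only neighbouring items
            counts[(left, right)] += 1
    return counts
```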

Then, execute :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan,
                                      js_filename="women",
                                      html_filename="women",
                                      title="Life course of Women French Actress")

Where :

  • sankey_data_from_prefixspan : the output of the previous function give_sankey_data_from_prefixspan
  • js_filename : name of the JavaScript file
  • html_filename : name of the HTML file
  • title : title of the chart

The default invocation is:

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan)

Where :

  • js_filename = "data"
  • html_filename = "page"
  • title = None

This will save two files locally: an HTML file and a JavaScript file.
The data from give_sankey_data_from_prefixspan are stored in the JavaScript file.
Just open the HTML file to discover your plot.
[figure: perso_sankey]

