
A library to scrape content from Wikipedia categories

highway_star

Scrape biographies from Wikipedia categories and plot their life courses

The main goal of this project is to retrieve all biographies from a desired Wikipedia category and to plot the life courses of those people with a Sankey diagram. These data can then be analyzed for social research.
This project was made in partnership with the LEIRIS.

Installation


You can install the project via pip, or any other PyPI package manager.

pip install highway-star

Note : you may need additional spacy models for Natural Language Processing; without them, some functions may raise errors during execution.

Please run the following commands in your console.

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr
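
To check that the model installed correctly, you can try loading it from Python; a quick sanity check, assuming spacy itself is already installed:

import spacy
spacy.load("fr_core_news_sm")  # raises OSError if the model is missing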

How to use


Scraping


The function below scrapes biographies from every page of the given category and of all the subcategories it crawls.

from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content

content = scrap_wikipedia_structure_with_content(
    root_category="Acteur_français",
    lang="fr")

Let's break down what this function does.
Suppose you want all biographies from the Wikipedia category Acteurs_français.
[Figure: a Wikipedia category page, with page links (orange rectangle) and subcategories (red rectangle) highlighted]
The algorithm will get every page link in the orange rectangle, and will store the information of every subcategory in the red rectangle.
Then it repeats this process for every subcategory, until there are no categories left.
For example, in the subcategory Acteur_français_de_cinéma of the category Acteurs_français, there is still one subcategory and many new pages to scrape, as shown in the figure just below.
[Figure: a Wikipedia subcategory page, with one remaining subcategory and new page links]
Then, when it reaches a page, it scrapes all the content between the tag

<span class="mw-headline" id="Biographie">Biographie</span>

and

</h2>

in order to select only the biography section, as shown in the image just below.
[Figure: the biography section of a Wikipedia page]
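
For intuition, here is a minimal sketch of that section-extraction idea, assuming requests and beautifulsoup4 are installed; it illustrates the mechanism only and is not the library's actual implementation:

import requests
from bs4 import BeautifulSoup

def extract_biography(page_url):
    # Collect the text between the "Biographie" headline and the next <h2>.
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    headline = soup.find("span", {"class": "mw-headline", "id": "Biographie"})
    if headline is None:
        return None  # page has no "Biographie" section
    paragraphs = []
    for sibling in headline.find_parent("h2").find_next_siblings():
        if sibling.name == "h2":  # the next section starts here
            break
        paragraphs.append(sibling.get_text(" ", strip=True))
    return " ".join(paragraphs)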

The result of this function is a Python dict.
You just have to convert this dictionary to a DataFrame using pandas :

import pandas as pd
pd.DataFrame.from_dict(content)

to get an output like this:
[Figure: the resulting DataFrame of scraped pages]
Note the columns you get here :

  • page_links : links to the pages
  • pages_names : names of the pages
  • subcategory : the category where the page was found
  • content : the content of the biography that was scraped
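
For reference, the returned dict maps each of these column names to a list with one entry per scraped page; a hypothetical illustration (all values invented):

content = {
    "page_links": ["https://fr.wikipedia.org/wiki/..."],
    "pages_names": ["..."],
    "subcategory": ["Acteur_français_de_cinéma"],
    "content": ["... biography text ..."],
}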

Preprocessing


Once you have retrieved your data, you may need to preprocess it.

To do that, the package provides two functions: a simple one, and a more complex but customizable one.

Easy but not custom way

from highway_star.preprocessing.biography_preprocessor import sent_to_words
sent_to_words(biographies_column=dataframe_with_biographies["biographies"])


The result is a Python list of tokenized biographies.
Just add it to your DataFrame using

content["biographies_tokenized"] = sent_to_words(biographies_column=dataframe_with_biographies["biographies"])

Complex but custom way

Note : to run this function, make sure to install the following packages

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"],
                                   custom_stop_words=["ajouter", "oui", "être", "avoir"],
                                   use_lemmatization=True,
                                   allowed_postags=['NOUN', 'VERB'])

This function does the tokenization, but also :

  • allows you to choose custom stop words
  • filters biographies with the stop words of spacy's fr_core_news_sm model
  • allows you to enable or disable lemmatization
  • allows you to filter biographies by parts of speech (e.g., 'NOUN', 'VERB')

The default invocation of this function is

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"])

with the unfilled parameters set to their defaults :

  • custom_stop_words = None
  • use_lemmatization = False
  • allowed_postags = None
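
To make the lemmatization and part-of-speech options concrete, here is a minimal sketch of what such filtering looks like with spacy directly; this illustrates the mechanism only and is not the library's actual code:

import spacy

nlp = spacy.load("fr_core_news_sm")

def lemmatize_and_filter(text, allowed_postags=("NOUN", "VERB")):
    # Keep the lemmas of tokens whose part of speech is allowed,
    # dropping spacy's built-in French stop words.
    doc = nlp(text)
    return [token.lemma_ for token in doc
            if token.pos_ in allowed_postags and not token.is_stop]

lemmatize_and_filter("Il a tourné plusieurs films à Paris.")
# e.g. ['tourner', 'film'] (exact output depends on the model version)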

Visualizing


The visualization is done using a Sankey diagram and the PrefixSpan algorithm.

Prefixspan

PrefixSpan is a data-mining algorithm that retrieves the most frequent sequential patterns in a set of sequences.
It was introduced in 2001 by Pei, Han, et al. in PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
It can be used in Python via the PyPI library prefixspan.
Treating our set of data as a set of biographies, it will retrieve the most frequent patterns in our biographies.
We can control the length of the patterns it searches for.
The longer the patterns, the more likely they cover whole biographies from start to end, but the fewer patterns you will get.
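
As a small illustration, the prefixspan library can be run directly on toy sequences (the sequences below are invented for the example):

from prefixspan import PrefixSpan

db = [
    ["born", "alabama", "write", "song", "buy", "house"],
    ["born", "alabama", "buy", "house"],
    ["born", "europe", "write", "song", "buy", "house"],
]
ps = PrefixSpan(db)
ps.minlen = 2      # only keep patterns with at least 2 items
print(ps.topk(5))  # the 5 most frequent patterns, as (support, pattern) pairs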

Sankey Diagram

Sankey diagrams are great data-visualization tools for plotting relational data.
[Figure: example of a Sankey diagram]
A JavaScript implementation can be found in Highcharts.

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"],
                                 prefixspan_minlen=15,
                                 prefixspan_topk=100)

This call will find the top 100 patterns of minimum length 15.
The default invocation of this function is :

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"])

with :

  • prefixspan_minlen = 10
  • prefixspan_topk = 50

The output of this function is PrefixSpan output already preprocessed for the Sankey diagram.
It counts the number of relations each couple of adjacent items has.
For example, in :

born Alabama write song buy house
born Alabama buy house
born Europe write song buy house

the couple counts are :

  • born - Alabama = 2
  • buy - house = 3
  • write - song = 2

Note that :

  • Alabama - house

is not a valid couple, because the two items are never next to each other.
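
The counting itself can be pictured as follows; a minimal sketch of the adjacent-couple logic described above, not the library's actual code:

from collections import Counter

patterns = [
    ["born", "alabama", "write", "song", "buy", "house"],
    ["born", "alabama", "buy", "house"],
    ["born", "europe", "write", "song", "buy", "house"],
]
pairs = Counter()
for pattern in patterns:
    pairs.update(zip(pattern, pattern[1:]))  # only adjacent couples count

pairs[("born", "alabama")]   # 2
pairs[("buy", "house")]      # 3
pairs[("alabama", "house")]  # 0: never adjacent, so not a valid couple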

Then, execute :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan,
                                      js_filename="women",
                                      html_filename="women",
                                      title="Life course of French actresses")

Where :

  • sankey_data_from_prefixspan : the output of the previous function give_sankey_data_from_prefixspan
  • js_filename : name of the JavaScript file
  • html_filename : name of the HTML file
  • title : title of the chart

The default invocation is :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan)

Where :

  • js_filename = "data"
  • html_filename = "page"
  • title = None

This will save two files locally: an HTML file and a JavaScript file.
The data from give_sankey_data_from_prefixspan is stored in the JavaScript file.
You just have to open the HTML file to discover your plot.
[Figure: example of the resulting Sankey diagram]
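
Putting it all together, the whole pipeline described above reads as follows; the category, file names, and title are just examples, and we assume the tokenizer is applied to the scraped content column shown earlier:

import pandas as pd
from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content
from highway_star.preprocessing.biography_preprocessor import sent_to_words
from highway_star.visualizing.visualizer import (
    give_sankey_data_from_prefixspan,
    sankey_diagram_with_prefixspan_output,
)

# 1. Scrape all biographies of a category and its subcategories.
content = scrap_wikipedia_structure_with_content(root_category="Acteur_français", lang="fr")
df = pd.DataFrame.from_dict(content)

# 2. Tokenize the biographies.
df["content_tokenized"] = sent_to_words(biographies_column=df["content"])

# 3. Mine frequent patterns and render the Sankey diagram.
sankey_data = give_sankey_data_from_prefixspan(df["content_tokenized"])
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data,
                                      js_filename="actors",
                                      html_filename="actors",
                                      title="Life course of French actors")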
