A library to scrape content from Wikipedia categories
highway_star
Scrape biographies from Wikipedia categories and plot their life courses.
The main goal of this project is to retrieve all biographies from a desired Wikipedia category and to plot the life courses of those persons
with a Sankey diagram. These data could then be analyzed for social-science purposes.
This project was made in partnership with the LEIRIS.
Installation
You can install the project via pip, or any other PyPI package manager.
pip install highway-star
Note: you may need additional spaCy models for natural language processing; missing them can cause errors during execution.
Please run these commands in your console, or from a Python script:
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr
How to use
Scraping
The function below allows you to scrape biographies from every page of the given category and of all the subcategories it crawls.
from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content
content = scrap_wikipedia_structure_with_content(
root_category="Acteur_français",
lang="fr")
Let's decompose what this function does.
Suppose you want all biographies from the Wikipedia category Acteurs_français.
The algorithm will get every page link (the orange rectangle in the figure) and will store the information of every subcategory (the red rectangle).
It then repeats this process for every subcategory, until there is no category left.
For example, in the subcategory Acteur_français_de_cinéma of the category Acteurs_français,
we still have one subcategory, and many new pages to scrape, as shown in the figure just below.
Then, when it reaches a page, it scrapes all the content between the tags
<span class="mw-headline" id="Biographie">Biographie</span>
and
</h2>
in order to select only the biography section, as in the image just below.
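As an illustrative sketch (not the library's actual code), the extraction step described above can be done with a regular expression that captures everything between the "Biographie" headline and the next heading. The sample HTML below is a made-up fragment:

```python
import re

# Made-up sample of a Wikipedia page fragment with a Biographie section.
html = """
<h2><span class="mw-headline" id="Biographie">Biographie</span></h2>
<p>Né à Paris, il débute au théâtre.</p>
<h2><span class="mw-headline" id="Filmographie">Filmographie</span></h2>
"""

# Capture everything between the Biographie headline and the next heading.
match = re.search(
    r'id="Biographie">Biographie</span></h2>(.*?)<h2>',
    html,
    re.DOTALL,
)
biography_html = match.group(1).strip() if match else ""
# Strip the remaining tags to keep only the plain text.
biography_text = re.sub(r"<[^>]+>", "", biography_html).strip()
print(biography_text)  # Né à Paris, il débute au théâtre.
```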
The result of this function is a Python dict.
You just have to convert this dictionary to a dataframe using pandas:
import pandas as pd
pd.DataFrame.from_dict(content)
to get an output like the one shown here, with the following columns:
- page_links : links to the pages
- pages_names : names of the pages
- subcategory : the category where the page was found
- content : the content of the biography that was scraped
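For instance, with hypothetical sample values (the keys are the documented ones, the row values are invented for illustration), the conversion looks like this:

```python
import pandas as pd

# Hypothetical one-row sample of the dict returned by
# scrap_wikipedia_structure_with_content (values invented).
content = {
    "page_links": ["/wiki/Jean_Gabin"],
    "pages_names": ["Jean Gabin"],
    "subcategory": ["Acteur_français_de_cinéma"],
    "content": ["Jean Gabin est un acteur français..."],
}

df = pd.DataFrame.from_dict(content)
print(df.columns.tolist())
# ['page_links', 'pages_names', 'subcategory', 'content']
```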
Preprocessing
Once you have retrieved your data, you may need to preprocess it.
For that, there are two functions: one simple, the other more complex.
Easy but not customizable way
from highway_star.preprocessing.biography_preprocessor import sent_to_words
sent_to_words(biographies_column=dataframe_with_biographies["biographies"])
The result is a Python list of tokenized biographies.
Just add it to your dataframe using:
content["biographies_tokenized"] = sent_to_words(biographies_column=dataframe_with_biographies["biographies"])
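Tokenization of this kind can be sketched with the standard library alone. This is an illustrative stand-in for `sent_to_words`, not the library's implementation; the regex and its accented-character class are assumptions:

```python
import re

def tokenize(biographies):
    """Lowercase each biography and split it into word tokens
    (illustrative stand-in for highway_star's sent_to_words)."""
    return [
        re.findall(r"[a-zàâäçéèêëîïôöûùüÿœ]+", bio.lower())
        for bio in biographies
    ]

# Digits and punctuation are dropped by the pattern above.
print(tokenize(["Né en 1950, il joue au théâtre."]))
# [['né', 'en', 'il', 'joue', 'au', 'théâtre']]
```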
Complex but customizable way
Note: to run this function, make sure to install the following packages:
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr
from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"],
custom_stop_words = ["ajouter", "oui", "être", "avoir"],
use_lemmatization=True,
allowed_postags=['NOUN', 'VERB'])
This function does the tokenization, but also:
- allows you to choose custom stop words
- filters biographies using the stop words of the package spacy.load('fr_core_news_sm')
- allows you to enable or disable lemmatization
- allows you to filter biographies by parts of speech (e.g., 'NOUN', 'VERB')
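The stop-word-removal part of the list above can be sketched without spaCy. This is a simplified stand-in, not the real `remove_stop_words_from_biographies` (which also handles lemmatization and POS filtering); the built-in stop-word set here is an assumption:

```python
def remove_stop_words(tokenized_biographies, custom_stop_words=None):
    """Drop stop words from already-tokenized biographies
    (simplified stand-in for remove_stop_words_from_biographies)."""
    # Tiny assumed default stop-word list; spaCy's French model has many more.
    stop_words = {"le", "la", "les", "de", "du", "des", "et", "en", "au"}
    if custom_stop_words:
        stop_words |= set(custom_stop_words)
    return [
        [token for token in bio if token not in stop_words]
        for bio in tokenized_biographies
    ]

tokens = [["né", "en", "1950", "il", "joue", "au", "théâtre"]]
print(remove_stop_words(tokens, custom_stop_words=["il"]))
# [['né', '1950', 'joue', 'théâtre']]
```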
The default instantiation of this function is:
from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"])
with the unfilled parameters set to their defaults:
- custom_stop_words = None
- use_lemmatization = False
- allowed_postags = None
Visualizing
The visualization is done using a Sankey diagram and the PrefixSpan algorithm.
Prefixspan
PrefixSpan is a data-mining algorithm that retrieves the most frequent sequential patterns in a set of data.
It was introduced in 2001 by Pei, Han, et al. in Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
It can be used in Python via the PyPI library prefixspan.
Since our set of data is a set of biographies, it will retrieve the most frequent patterns in our biographies.
We can control the length of the patterns it searches for.
The higher the pattern length, the more likely the patterns cover biographies from start to end, but the fewer patterns you may find.
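To make "most frequent sequential patterns" concrete, here is a toy stand-alone illustration (not the prefixspan library itself) that ranks length-2 ordered sub-patterns by how many sequences contain them:

```python
from collections import Counter
from itertools import combinations

# Toy database of sequences (invented for illustration).
db = [
    ["a", "b", "c", "d"],
    ["a", "c", "d"],
    ["b", "a", "d"],
]

counts = Counter()
for seq in db:
    # Every ordered pair of positions is a length-2 subsequence;
    # count each distinct pattern once per sequence.
    counts.update(set(combinations(seq, 2)))

# ('a', 'd') occurs in all three sequences, so it is the top pattern.
print(counts.most_common(1))  # [(('a', 'd'), 3)]
```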
Sankey Diagram
Sankey diagrams are great data-visualization tools for plotting relational data.
A JavaScript implementation is available in Highcharts.
from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"],
prefixspan_minlen=15,
prefixspan_topk=100)
This call finds the top 100 patterns of length at least 15.
The basic implementation of this function is:
from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"])
with:
- prefixspan_minlen = 10
- prefixspan_topk = 50
The output of this function is PrefixSpan output already preprocessed for the Sankey diagram:
it counts how many times each couple of adjacent items occurs.
For example, given the patterns:
born Alabama write song buy house
born Alabama buy house
born Europe write song buy house
the counts are:
- born - Alabama = 2
- buy - house = 3
- write - song = 2
Note that Alabama - house is not a valid couple, because the two items are not next to each other.
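The couple counting described above can be sketched as follows (assumed behaviour, not the library's exact code): only items that sit next to each other in a pattern form a link.

```python
from collections import Counter

# The three example patterns from the text above.
patterns = [
    ["born", "Alabama", "write", "song", "buy", "house"],
    ["born", "Alabama", "buy", "house"],
    ["born", "Europe", "write", "song", "buy", "house"],
]

links = Counter()
for pattern in patterns:
    # zip pairs each item with its immediate successor.
    links.update(zip(pattern, pattern[1:]))

print(links[("born", "Alabama")])   # 2
print(links[("buy", "house")])      # 3
print(links[("Alabama", "house")])  # 0, the items are never adjacent
```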
Then, execute:
from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan,
js_filename="women",
html_filename="women",
title="Life course of Women French Actress")
Where:
- sankey_data_from_prefixspan : the output of the previous function give_sankey_data_from_prefixspan
- js_filename : name of the JavaScript file
- html_filename : name of the HTML file
- title : title of the chart
The default implementation is:
from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan)
Where:
- js_filename = "data"
- html_filename = "page"
- title = None
This will save two files locally: an HTML file and a JavaScript file.
The data from give_sankey_data_from_prefixspan is stored in the JavaScript file.
You just have to open the HTML file to discover your plot.
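The split between the two files can be sketched like this. The variable name `data`, the file name, and the link-triple format are assumptions about the generated output, not the library's documented format:

```python
import json

# Assumed shape of the Sankey link data: [source, target, weight] triples.
sankey_links = [["born", "Alabama", 2], ["buy", "house", 3]]

# Serialize the links into a small JavaScript file that the generated
# HTML page can load with a <script src="data.js"> tag.
with open("data.js", "w", encoding="utf-8") as f:
    f.write("var data = " + json.dumps(sankey_links) + ";")

print(open("data.js", encoding="utf-8").read())
```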