
A library to scrape content from Wikipedia categories

highway_star

Scrape biographies from Wikipedia categories and plot their life courses

The main goal of this project is to retrieve all biographies from a desired Wikipedia category and to plot the life courses of those people with a Sankey diagram. These data can then be analyzed for social research.
This project was made in partnership with the LEIRIS.

Installation


You can install the project via pip, or any other PyPI package manager.

pip install highway-star

Note : you may need additional spacy models for Natural Language Processing; without them, some functions may raise errors during execution.

Please run the following commands in your console.

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr
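
To check that the model installed correctly, you can try loading it from Python; a quick sanity check, assuming spacy itself is already installed:

import spacy
spacy.load("fr_core_news_sm")  # raises OSError if the model is missing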

How to use


Scraping


The function below scrapes biographies from every page of the given category and of all the subcategories it crawls.

from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content

content = scrap_wikipedia_structure_with_content(
    root_category="Acteur_français",
    lang="fr")

Let's break down what this function does.
Suppose you want all biographies from the Wikipedia category Acteurs_français.
[Figure: a Wikipedia category page, with page links (orange rectangle) and subcategories (red rectangle) highlighted]
The algorithm will get every page link in the orange rectangle, and will store the information of every subcategory in the red rectangle.
Then it repeats this process for every subcategory, until there are no categories left.
For example, in the subcategory Acteur_français_de_cinéma of the category Acteurs_français, there is still one subcategory and many new pages to scrape, as shown in the figure just below.
[Figure: a Wikipedia subcategory page, with one remaining subcategory and new page links]
Then, when it reaches a page, it scrapes all the content between the tag

<span class="mw-headline" id="Biographie">Biographie</span>

and

</h2>

in order to select only the biography section, as shown in the image just below.
[Figure: the biography section of a Wikipedia page]
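
For intuition, here is a minimal sketch of that section-extraction idea, assuming requests and beautifulsoup4 are installed; it illustrates the mechanism only and is not the library's actual implementation:

import requests
from bs4 import BeautifulSoup

def extract_biography(page_url):
    # Collect the text between the "Biographie" headline and the next <h2>.
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    headline = soup.find("span", {"class": "mw-headline", "id": "Biographie"})
    if headline is None:
        return None  # page has no "Biographie" section
    paragraphs = []
    for sibling in headline.find_parent("h2").find_next_siblings():
        if sibling.name == "h2":  # the next section starts here
            break
        paragraphs.append(sibling.get_text(" ", strip=True))
    return " ".join(paragraphs)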

The result of this function is a Python dict.
You just have to convert this dictionary to a DataFrame using pandas :

import pandas as pd
pd.DataFrame.from_dict(content)

to get an output like this:
[Figure: the resulting DataFrame of scraped pages]
Note the columns you get here :

  • page_links : links to the pages
  • pages_names : names of the pages
  • subcategory : the category where the page was found
  • content : the content of the biography that was scraped
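
For reference, the returned dict maps each of these column names to a list with one entry per scraped page; a hypothetical illustration (all values invented):

content = {
    "page_links": ["https://fr.wikipedia.org/wiki/..."],
    "pages_names": ["..."],
    "subcategory": ["Acteur_français_de_cinéma"],
    "content": ["... biography text ..."],
}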

Preprocessing


Once you have retrieved your data, you may need to preprocess it.

To do that, the package provides two functions: a simple one, and a more complex but customizable one.

Easy but not custom way

from highway_star.preprocessing.biography_preprocessor import sent_to_words
sent_to_words(biographies_column=dataframe_with_biographies["biographies"])


The result is a Python list of tokenized biographies.
Just add it to your DataFrame using

content["biographies_tokenized"] = sent_to_words(biographies_column=dataframe_with_biographies["biographies"])

Complex but custom way

Note : to run this function, make sure to install the following packages

pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.0.0/fr_core_news_sm-2.0.0.tar.gz#egg=fr_core_news_sm==2.0.0
python -m spacy download fr

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"],
                                   custom_stop_words=["ajouter", "oui", "être", "avoir"],
                                   use_lemmatization=True,
                                   allowed_postags=['NOUN', 'VERB'])

This function does the tokenization, but also :

  • allows you to choose custom stop words
  • filters biographies with the stop words of spacy's fr_core_news_sm model
  • allows you to enable or disable lemmatization
  • allows you to filter biographies by parts of speech (e.g., 'NOUN', 'VERB')

The default invocation of this function is

from highway_star.preprocessing.biography_preprocessor import remove_stop_words_from_biographies
remove_stop_words_from_biographies(biographies_column=dataframe_with_biographies["biographies"])

with the unfilled parameters set to their defaults :

  • custom_stop_words = None
  • use_lemmatization = False
  • allowed_postags = None
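
To make the lemmatization and part-of-speech options concrete, here is a minimal sketch of what such filtering looks like with spacy directly; this illustrates the mechanism only and is not the library's actual code:

import spacy

nlp = spacy.load("fr_core_news_sm")

def lemmatize_and_filter(text, allowed_postags=("NOUN", "VERB")):
    # Keep the lemmas of tokens whose part of speech is allowed,
    # dropping spacy's built-in French stop words.
    doc = nlp(text)
    return [token.lemma_ for token in doc
            if token.pos_ in allowed_postags and not token.is_stop]

lemmatize_and_filter("Il a tourné plusieurs films à Paris.")
# e.g. ['tourner', 'film'] (exact output depends on the model version)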

Visualizing


The visualization is done using a Sankey diagram and the PrefixSpan algorithm.

Prefixspan

PrefixSpan is a data-mining algorithm that retrieves the most frequent sequential patterns in a set of sequences.
It was introduced in 2001 by Pei, Han, et al. in PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth.
It can be used in Python via the PyPI library prefixspan.
Treating our set of data as a set of biographies, it will retrieve the most frequent patterns in our biographies.
We can control the length of the patterns it searches for.
The longer the patterns, the more likely they cover whole biographies from start to end, but the fewer patterns you will get.
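
As a small illustration, the prefixspan library can be run directly on toy sequences (the sequences below are invented for the example):

from prefixspan import PrefixSpan

db = [
    ["born", "alabama", "write", "song", "buy", "house"],
    ["born", "alabama", "buy", "house"],
    ["born", "europe", "write", "song", "buy", "house"],
]
ps = PrefixSpan(db)
ps.minlen = 2      # only keep patterns with at least 2 items
print(ps.topk(5))  # the 5 most frequent patterns, as (support, pattern) pairs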

Sankey Diagram

Sankey diagrams are great data-visualization tools for plotting relational data.
[Figure: example of a Sankey diagram]
A JavaScript implementation can be found in Highcharts.

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"],
                                 prefixspan_minlen=15,
                                 prefixspan_topk=100)

This call will find the top 100 patterns of minimum length 15.
The default invocation of this function is :

from highway_star.visualizing.visualizer import give_sankey_data_from_prefixspan
give_sankey_data_from_prefixspan(dataframe_with_biographies["content_tokenized"])

with :

  • prefixspan_minlen = 10
  • prefixspan_topk = 50

The output of this function is PrefixSpan output already preprocessed for the Sankey diagram.
It counts the number of relations each couple of adjacent items has.
For example, in :

born Alabama write song buy house
born Alabama buy house
born Europe write song buy house

the couple counts are :

  • born - Alabama = 2
  • buy - house = 3
  • write - song = 2

Note that :

  • Alabama - house

is not a valid couple, because the two items are never next to each other.
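
The counting itself can be pictured as follows; a minimal sketch of the adjacent-couple logic described above, not the library's actual code:

from collections import Counter

patterns = [
    ["born", "alabama", "write", "song", "buy", "house"],
    ["born", "alabama", "buy", "house"],
    ["born", "europe", "write", "song", "buy", "house"],
]
pairs = Counter()
for pattern in patterns:
    pairs.update(zip(pattern, pattern[1:]))  # only adjacent couples count

pairs[("born", "alabama")]   # 2
pairs[("buy", "house")]      # 3
pairs[("alabama", "house")]  # 0: never adjacent, so not a valid couple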

Then, execute :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan,
                                      js_filename="women",
                                      html_filename="women",
                                      title="Life course of French actresses")

Where :

  • sankey_data_from_prefixspan : the output of the previous function give_sankey_data_from_prefixspan
  • js_filename : name of the JavaScript file
  • html_filename : name of the HTML file
  • title : title of the chart

The default invocation is :

from highway_star.visualizing.visualizer import sankey_diagram_with_prefixspan_output
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data_from_prefixspan)

Where :

  • js_filename = "data"
  • html_filename = "page"
  • title = None

This will save two files locally: an HTML file and a JavaScript file.
The data from give_sankey_data_from_prefixspan is stored in the JavaScript file.
You just have to open the HTML file to discover your plot.
[Figure: example of the resulting Sankey diagram]
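
Putting it all together, the whole pipeline described above reads as follows; the category, file names, and title are just examples, and we assume the tokenizer is applied to the scraped content column shown earlier:

import pandas as pd
from highway_star.scrapping.wikipedia_scraper import scrap_wikipedia_structure_with_content
from highway_star.preprocessing.biography_preprocessor import sent_to_words
from highway_star.visualizing.visualizer import (
    give_sankey_data_from_prefixspan,
    sankey_diagram_with_prefixspan_output,
)

# 1. Scrape all biographies of a category and its subcategories.
content = scrap_wikipedia_structure_with_content(root_category="Acteur_français", lang="fr")
df = pd.DataFrame.from_dict(content)

# 2. Tokenize the biographies.
df["content_tokenized"] = sent_to_words(biographies_column=df["content"])

# 3. Mine frequent patterns and render the Sankey diagram.
sankey_data = give_sankey_data_from_prefixspan(df["content_tokenized"])
sankey_diagram_with_prefixspan_output(sankey_data_from_prefixspan=sankey_data,
                                      js_filename="actors",
                                      html_filename="actors",
                                      title="Life course of French actors")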
