Minet

A webmining CLI tool & library for Python.

Minet features:

  • Multithreaded HTML fetching
  • Multiprocessing text content extraction
  • Facebook's share count fetching
  • Custom scraping script?

Installation

minet can be installed using pip:

pip install minet

You can also create a Minet executable.

Commands

fetch

Fetches the HTML content of every url listed in a (potentially huge) csv file. It works in a multithreaded & lazy way (the csv is not loaded into memory) and resumes where the previous execution stopped.

minet fetch COLUMN FILE

Additional options:

  • -s STORAGE_LOCATION : location where the (temporary) HTML files are stored. Defaults to ./data.
  • -id COLUMN_NAME : name of the url ID column, if present in the csv FILE. Used to name the HTML files. If not specified, UUIDs are generated.
  • --monitoring_file FILE_NAME : location of the monitoring file used to save progress. Defaults to ./data/monitoring.csv.
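
The lazy, resumable behaviour described above can be sketched roughly as follows. This is not minet's actual implementation: `fetch_all`, `download`, `chunks` and the two-column monitoring format are hypothetical names and choices made for illustration, and the sketch assumes that the monitoring file simply lists the urls already processed.

```python
import csv
import os
import uuid
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from urllib.request import urlopen


def download(url):
    """Return the raw body of `url` (hypothetical helper, standard library only)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def iter_pending(csv_path, column, done):
    """Lazily yield the urls of `column`, skipping those already recorded in `done`."""
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row[column] not in done:
                yield row[column]


def chunks(iterable, size):
    """Yield lists of at most `size` items so only one chunk is in memory at a time."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fetch_all(csv_path, column, storage="./data", fetcher=download, threads=4):
    """Fetch every pending url and record progress in an assumed monitoring file."""
    os.makedirs(storage, exist_ok=True)
    monitoring = os.path.join(storage, "monitoring.csv")

    # Urls recorded by a previous run are skipped: this is the "resume" part.
    done = set()
    if os.path.exists(monitoring):
        with open(monitoring, newline="") as f:
            done = {row[0] for row in csv.reader(f) if row}

    def job(url):
        return url, fetcher(url)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        with open(monitoring, "a", newline="") as f:
            log = csv.writer(f)
            for chunk in chunks(iter_pending(csv_path, column, done), threads * 10):
                for url, body in pool.map(job, chunk):
                    # No ID column in this sketch, so a UUID names the file.
                    name = uuid.uuid4().hex + ".html"
                    with open(os.path.join(storage, name), "wb") as out:
                        out.write(body)
                    log.writerow([url, name])
                    f.flush()
```

Reading the csv through a generator and handing the thread pool one bounded chunk at a time keeps memory use independent of the file size, while appending to the monitoring file after each download lets an interrupted run pick up where it left off.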

Example

Imagine you have a urls.csv file containing, in a column called 'url', the urls you want to extract data from. Just use this command:

minet fetch url urls.csv

That's it, your HTML files are stored in ./data/htmlfiles, ready for text content extraction for instance.
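
As an illustration of that follow-up step, here is a minimal sketch of multiprocessing text extraction over the stored HTML files. It relies only on the Python standard library and is an assumption about how such a step could look, not minet's own extraction code: `TextExtractor`, `extract_text`, `extract_file` and `extract_directory` are hypothetical names.

```python
import glob
import os
from html.parser import HTMLParser
from multiprocessing import Pool


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, with text nodes joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_file(path):
    """Read one HTML file and return (path, extracted text)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return path, extract_text(f.read())


def extract_directory(directory, processes=2):
    """Extract text from every .html file in `directory` using a process pool."""
    paths = sorted(glob.glob(os.path.join(directory, "*.html")))
    with Pool(processes) as pool:
        return dict(pool.imap_unordered(extract_file, paths))
```

In a script, `extract_directory("./data/htmlfiles")` would be called from under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn worker processes.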


facebook

Quickly fetches the (rounded) Facebook share count of each url in a given csv, without needing an API or an access token (and thus without rate limiting). Works in a multithreaded & lazy way (the csv is not loaded into memory).

The share count of a url is the sum of:

  • the number of likes of the url
  • the number of shares of the url
  • the number of likes & comments on stories about this url

minet facebook COLUMN FILE

Additional options:

  • -o OUTPUT : location of the output csv (the source csv FILE with an additional facebook_share_count column). Defaults to stdout.
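
The shape of that output (the source rows plus one extra column, written lazily) can be sketched as below. The sketch does not contact Facebook: `lookup` is a hypothetical stand-in for whatever returns the share count of a url, and `add_column` and `write_with_share_count` are illustrative names rather than part of minet.

```python
import csv
import sys


def add_column(rows, name, compute):
    """Lazily yield each input row extended with a `name` column."""
    for row in rows:
        yield {**row, name: compute(row)}


def write_with_share_count(source, column, lookup, output=None):
    """Copy the csv in `source` to `output`, adding a facebook_share_count column."""
    output = sys.stdout if output is None else output  # stdout by default
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["facebook_share_count"]
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    # Rows are read, extended and written one at a time, never all at once.
    rows = add_column(reader, "facebook_share_count", lambda r: lookup(r[column]))
    for row in rows:
        writer.writerow(row)
```

Because each row is written as soon as it is computed, the output can be piped to another program or redirected to a file without holding the whole csv in memory.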

Example

Let's say you have a urls.csv file with - in a 'url' column - the urls you want the share count of.

Just use this command:

minet facebook url urls.csv -o urls_with_fb_data.csv

As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
