Minet

A webmining CLI tool & library for Python.

Minet features:

  • Multithreaded HTML fetching
  • Multiprocessing text content extraction
  • Facebook's share count fetching
  • Custom scraping script?

Installation

minet can be installed using pip:

pip install minet

You can also create a Minet executable.

Commands

fetch

Fetches the HTML content of every url listed in a given csv, however large. It works in a multithreaded & lazy way (the csv is not loaded into memory) and resumes where the previous execution stopped.

minet fetch COLUMN FILE

Additional options:

  • -s STORAGE_LOCATION specifies the location where the (temporary) HTML files are stored. Defaults to ./data.
  • -id COLUMN_NAME specifies the name of the url ID column, if present in the csv FILE. Its values are used to name the HTML files. If not specified, UUIDs are generated.
  • --monitoring_file FILE_NAME specifies the location of the monitoring file used to save progress. Defaults to ./data/monitoring.csv.
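
The behaviour described above (lazy reading, a thread pool, and a monitoring file that lets a run resume) can be sketched with the standard library alone. This is not minet's actual implementation: `fetch_all`, `download`, the `fetcher` and `batch_size` parameters, and the one-url-per-line monitoring format are all assumptions made for illustration.

```python
import csv
import uuid
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path
from urllib.request import urlopen


def download(url):
    """Return the raw body of `url` (hypothetical helper, not minet's)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_all(csv_path, column, storage="./data", id_column=None,
              fetcher=download, threads=4, batch_size=100):
    """Fetch every url in `column`, skipping urls logged by a previous run."""
    storage = Path(storage)
    html_dir = storage / "htmlfiles"
    html_dir.mkdir(parents=True, exist_ok=True)
    monitoring = storage / "monitoring.csv"

    # Assumed monitoring format: one already-fetched url per line.
    done = set()
    if monitoring.exists():
        with monitoring.open(newline="") as f:
            done = {row[0] for row in csv.reader(f) if row}

    def job(row):
        url = row[column]
        # Name the file after the ID column if given, else a fresh UUID.
        name = row[id_column] if id_column else str(uuid.uuid4())
        (html_dir / f"{name}.html").write_bytes(fetcher(url))
        return url

    with open(csv_path, newline="") as src, \
            monitoring.open("a", newline="") as log, \
            ThreadPoolExecutor(max_workers=threads) as pool:
        # Generator: rows are pulled from the csv only when needed.
        rows = (r for r in csv.DictReader(src) if r[column] not in done)
        writer = csv.writer(log)
        while True:
            # Only `batch_size` rows are held in memory at a time.
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            for url in pool.map(job, batch):
                writer.writerow([url])
                log.flush()  # progress survives an interruption
```

Batching is used because `ThreadPoolExecutor.map` submits its whole input up front, which would otherwise read the entire csv at once. Note that the set of already-fetched urls is still kept in memory in this sketch.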

Example

Imagine you have a urls.csv file containing, in a column called 'url', the urls you want to extract data from. Just use this command:

minet fetch url urls.csv

That's it: your HTML files are stored in ./data/htmlfiles, ready for text content extraction, for instance.


facebook

Quickly fetches the (rounded*) Facebook share count of each url in a given csv, without needing an API or an access token (and thus with no rate limitation). Works in a multithreaded & lazy way (the csv is not loaded into memory).

The share count of a url is the sum of:

  • the number of likes of the url
  • the number of shares of the url
  • the number of likes & comments on stories about this url

minet facebook COLUMN FILE

Additional options:

  • -o OUTPUT specifies the location of the output csv (the source csv FILE with an additional facebook_share_count column). Defaults to stdout.
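
As a rough illustration of the output described above, the sketch below streams a csv row by row and appends a facebook_share_count column. It is not minet's code: `total_share_count`, `add_share_counts` and the `lookup` callable (standing in for however the count is actually obtained) are hypothetical names, and rounding is left out.

```python
import csv


def total_share_count(likes, shares, story_interactions):
    """Sum the three components that make up the share count of a url."""
    return likes + shares + story_interactions


def add_share_counts(src, dst, column, lookup):
    """Copy csv rows from `src` to `dst`, adding a facebook_share_count column.

    `lookup` is a hypothetical callable mapping a url to its share count.
    """
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames or []) + ["facebook_share_count"]
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:  # one row at a time, so the csv is never fully loaded
        row["facebook_share_count"] = lookup(row[column])
        writer.writerow(row)
```

Passing `sys.stdout` as `dst` would mirror the default of writing to stdout, while an opened file would mirror the -o option.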

Example

Let's say you have a urls.csv file with - in a 'url' column - the urls you want the share count of.

Just use this command:

minet facebook url urls.csv -o urls_with_fb_data.csv

As a result, you get a urls_with_fb_data.csv file with a facebook_share_count column.
