
Process all the RSS and Atom feeds from the Small Web feeds list, validate them, and generate statistics, with more to come.

Project description

small-web-dataset

The Small Web Dataset is a command line tool used to generate a dataset by aggregating all the data from the Kagi Small Web index.

What is the Small Web? The Small Web is the web of independent websites that are not part of the big tech platforms. Here are some more references about the concept [1][2][3][4][5].

There are different purposes for this tool and the dataset it creates:

  1. help analyze the Kagi Small Web index, to detect and eventually remove the sites that don't comply with the policy of the index
  2. create a dataset of all the sites that compose the index. This dataset is a very specialized subset of websites created and maintained by independent people, mostly old school bloggers. It can be used for different specialized ML training tasks, for example to train a classifier that distinguishes Small Web sites from Big Web sites.

Install

To install the command line tool from source, you simply have to run:

git clone https://github.com/fgiasson/small-web-dataset.git
cd small-web-dataset

make build
make install-local-build

This will clone the repository, build the command line tool and install it in your local Python environment.
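
Since the package is also published on PyPI, you can alternatively install the latest release with pip:

pip install small-web-dataset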

Configure

You have to make these environment variables available in your environment:

Variable     Description
FEEDS_PATH   The path where you want to save all the feeds on your local file system
DB_PATH      The path where you want to save the SQLite dataset on your local file system
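
For example, in a POSIX shell you could export them before running the tool (the paths below are placeholders):

# where the downloaded feed files will be cached
export FEEDS_PATH=/path/to/feeds

# where the SQLite dataset will be written
export DB_PATH=/path/to/dataset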

How to use

You can make sure that the command line tool is installed, and check which version you have, by running:

small-web-dataset version

You can get the help documentation by running:

small-web-dataset --help

You can check the current configuration options for the tool in the current environment by running:

small-web-dataset config

To create the dataset, you simply have to run the following command:

small-web-dataset sync-feeds

This command will do three things:

  1. it will download all the RSS and Atom feeds from the Kagi Small Web index into the FEEDS_PATH folder
  2. it will read all the local feed files and import them into a local SQLite database in the DB_PATH folder
  3. it will infer the core language of each feed from the language used to write its articles, and it will add this information to the database
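
Once the sync completes, you can inspect the resulting database with the sqlite3 command line shell. A minimal sketch, assuming the tool created a database file named feeds.db under DB_PATH (the actual filename is chosen by the tool and may differ):

# list the tables created by the tool (feeds.db is a hypothetical filename)
sqlite3 "$DB_PATH/feeds.db" ".tables"

# print the schema to discover the available columns
sqlite3 "$DB_PATH/feeds.db" ".schema"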

Optionally, if you already have a local cache of the feeds and you only want to update or recreate the database, you simply have to specify the DDMMYYYY folder of the feeds you want to process:

small-web-dataset sync-feeds 18092023
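
Assuming the tool stores each run of downloaded feeds in a DDMMYYYY subfolder of FEEDS_PATH, as the command above suggests, you can list the snapshots available in your local cache with:

# list the date-stamped feed snapshots in the local cache
ls "$FEEDS_PATH"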

Download files

Download the file for your platform.

Source Distribution

small-web-dataset-0.0.2.tar.gz (24.4 kB)


Built Distribution


small_web_dataset-0.0.2-py3-none-any.whl (22.8 kB)


File details

Details for the file small-web-dataset-0.0.2.tar.gz.

File metadata

  • Download URL: small-web-dataset-0.0.2.tar.gz
  • Upload date:
  • Size: 24.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for small-web-dataset-0.0.2.tar.gz
Algorithm    Hash digest
SHA256       e5c79d638372c337bbbbde7d034aa09bd720faf448ec7bcc62f68f1d7304bbe5
MD5          ebba96ed0b974578579ba776a70a29e0
BLAKE2b-256  7c4981b99320e19085f7a9289d63956b3211fb7935a064d3d257c7e22ed03c34
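
You can verify a downloaded archive against the digests above with standard tools. For example, on Linux:

# compute the SHA256 digest of the source distribution and compare it
# with the value listed above
sha256sum small-web-dataset-0.0.2.tar.gz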


File details

Details for the file small_web_dataset-0.0.2-py3-none-any.whl.


File hashes

Hashes for small_web_dataset-0.0.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       195d98d8faee21e1aeaaca9f322f220ab27cfbd7fd7ade941f77d646ec5d26ae
MD5          535fb428d836086ff19014f4ade4c7be
BLAKE2b-256  afc46af04dd709cb2d1bda993e50f0592cdb43b55274d52439fc65172be8be82

