
Swiftea Crawler


Description

Swiftea-Crawler is an open source web crawler for the Swiftea search engine.

Currently, it can:

  • Visit websites
    • check robots.txt (see the sketch after this list)
    • detect the page encoding
  • Parse them
    • extract data
      • title
      • description
      • ...
    • extract words
      • filter stopwords
  • Index them
    • in the database
    • in the inverted index
  • Archive log files in a zip file
    • avoid http/https duplicates
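
To illustrate the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module; it is not necessarily how Swiftea-Crawler implements it:

# Minimal robots.txt check with the standard library (illustrative only;
# Swiftea-Crawler's actual implementation may differ).
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://example.example/robots.txt')  # hypothetical host
parser.read()

# True if the given user agent may fetch the page:
print(parser.can_fetch('*', 'http://example.example/some-page'))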

The domain crawler focuses on the links that belong to the given domain name. The level option of the domain crawler defines how deep the crawl goes. For example, level 2 means the crawler will crawl all the links of the domain, plus the links that the pages of this domain lead to.

The domain crawler can use a MongoDB database to store the inverted index.
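
For reference, here is a minimal sketch of storing an inverted-index entry in MongoDB with pymongo, using the MONGODB_* values from crawler-config.json; the database and collection names are hypothetical and the actual schema may differ:

# Hypothetical sketch: connect to MongoDB and upsert an inverted-index entry.
from pymongo import MongoClient

client = MongoClient('mongodb://user:password@localhost:27017/')  # MONGODB_CON_STRING
collection = client['swiftea']['inverted_index']  # hypothetical names
collection.update_one(
    {'word': 'example'},
    {'$set': {'documents': {'42': 0.5}}},  # doc id -> score, illustrative
    upsert=True,
)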

Install and usage

Run crawler

Create a crawler-config.json file and fill it in:

{
  "DIR_DATA": "data",

  "DB_HOST": "",
  "DB_USER": "",
  "DB_PASSWORD": "",
  "DB_NAME": "",
  "TABLE_NAMES": ["website", "suggestion"],
  "DIR_INDEX": "ii/",
  "FTP_HOST": "",
  "FTP_USER": "",
  "FTP_PASSWORD": "",
  "FTP_PORT": 21,
  "FTP_DATA": "/www/data/",
  "FTP_INDEX": "/www/data/inverted_index",

  "HOST": "",

  "MONGODB_PASSWORD": "",
  "MONGODB_USER": "",
  "MONGODB_CON_STRING": ""
}
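
As a quick sanity check, you can load and inspect the configuration with the standard json module (assuming the file sits in the working directory):

# Load crawler-config.json and inspect a value (sanity check only).
import json

with open('crawler-config.json') as config_file:
    config = json.load(config_file)

print(config['DIR_DATA'])  # 'data'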

Then start the crawler:

from crawler import main

# infinite crawling:
crawler = main(l1=50, l2=10, dir_data='data1')

# domain crawling:
crawler = main(url='http://example.example', level=0, target_level=1, dir_data='data1')
crawler = main(url='http://some.thing', level=1, target_level=3, use_mongodb=True)

crawler.start()

Setup

virtualenv -p /usr/bin/python3 crawler-env
source crawler-env/bin/activate
pip install -r requirements.txt

Run tests

Using only pytest:

python setup.py test

With coverage:

coverage run setup.py test
coverage report
coverage html
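
By default, coverage html writes the HTML report to the htmlcov/ directory; open htmlcov/index.html in a browser to view it.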

Build documentation

You must install the python3-sphinx package.

cd docs
make html
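
With the default Sphinx Makefile, the generated HTML ends up in docs/_build/html.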

Run linter

Install prospector, then:

prospector --output-format json > prospector_output.json

Deploy

Create these directories on the FTP server:

  • /www/data/badwords
  • /www/data/stopwords
  • /www/data/inverted_index

Upload the word lists as /www/[type]/[lang].[type].txt, where [type] is badwords or stopwords and [lang] is a language code such as fr or en.

Create the database with sql/swiftea_mysql_db.sql.
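
For example, assuming a reachable MySQL server and the credentials from crawler-config.json, the schema can be imported with:

mysql -u <DB_USER> -p <DB_NAME> < sql/swiftea_mysql_db.sql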

How it works

If the files below don't exist, the crawler will download them from our server:

  • data/stopwords/fr.stopwords.txt
  • data/stopwords/en.stopwords.txt
  • data/badwords/fr.badwords.txt
  • data/badwords/en.badwords.txt
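
Here is a minimal sketch of how such a list is typically consumed, assuming one word per line; Swiftea-Crawler's actual loading code may differ:

# Load a stopword list (assumed format: one word per line) and filter words.
with open('data/stopwords/en.stopwords.txt') as stopwords_file:
    stopwords = set(stopwords_file.read().split())

words = ['an', 'open', 'source', 'web', 'crawler']
keywords = [word for word in words if word not in stopwords]
print(keywords)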

In crawler-config.json, if FTP_INDEX is "", the inverted index will be saved in DIR_INDEX but not sent to the FTP server.

Database:

The DatabaseSwiftea object can:

  • send documents
  • get the id of a document by its URL
  • delete a document
  • select the suggestions
  • check if a document exists
  • check for http/https duplicates (see the sketch after this list)
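
To illustrate the http/https duplicate check, here is a hedged sketch that normalizes the scheme before comparing URLs; the actual DatabaseSwiftea logic may differ:

# Illustrative only: treat the http and https versions of a URL as the
# same document by normalizing the scheme before comparison.
def normalize(url):
    if url.startswith('https://'):
        return 'http://' + url[len('https://'):]
    return url

seen = {normalize('https://example.example/page')}
print(normalize('http://example.example/page') in seen)  # True: duplicate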

Limits

When the crawler is stopped (Ctrl+C), it will not resume from the interrupted URL on restart.

Version

The current version is 1.1.3.

Tech

Swiftea-Crawler relies on a number of open source projects to work properly.

Contributing

Want to contribute? Great!

Fork the repository. Then, run:

git clone git@github.com:<username>/Crawler.git
cd Crawler

Then do your work and commit your changes. Finally, open a pull request.

Commit conventions:

General

  • Use the present tense
  • Use the imperative mood

Examples

  • Add something: "Add feature ..."
  • Update: "Update ..."
  • Improve something: "Improve ..."
  • Change something: "Change ..."
  • Fix something: "Fix ..."
  • Fix an issue: "Fix #123456" or "Close #123456"

License

GNU GENERAL PUBLIC LICENSE (v3)

Free Software, Hell Yeah!
