Swiftea's Open Source Web Crawler
Description
Swiftea-Crawler is an open-source web crawler for the Swiftea search engine.
Currently, it can:
- Visit websites
  - check robots.txt
  - detect the encoding
- Parse them
  - extract data (title, description, ...)
  - extract words
  - filter stopwords
- Index them
  - in a database
  - in an inverted index
- Archive log files in a zip file
- Avoid duplicates (http and https)
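The visiting and indexing steps above can be sketched roughly as follows; the function names and the in-memory index are illustrative, not Swiftea-Crawler's actual API:

```python
# Minimal sketch of the crawl pipeline: check robots.txt, then index
# the extracted words. Names here are illustrative only.
from urllib import robotparser

def is_allowed(url, robots_url, agent="SwifteaBot"):
    """Check a site's robots.txt before visiting a page."""
    parser = robotparser.RobotFileParser(robots_url)
    parser.read()
    return parser.can_fetch(agent, url)

def index_words(doc_id, words, inverted_index, stopwords):
    """Filter stopwords, then record each word -> set of document ids."""
    for word in words:
        if word not in stopwords:
            inverted_index.setdefault(word, set()).add(doc_id)
```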
The domain crawler focuses on the links that belong to the given domain name. The level option of the domain crawler defines how deep the crawl goes. For example, level 2 means the crawler will crawl all the links of the domain, plus the links that all pages in this domain lead to.
The domain crawler can use a MongoDB database to store the inverted index.
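The level semantics can be pictured as a depth-limited breadth-first traversal. This sketch uses a toy link graph instead of real HTTP requests; all names are illustrative, not the crawler's real API:

```python
from collections import deque

def crawl_domain(start, get_links, target_level):
    """Visit pages breadth-first, up to target_level links away from start."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level < target_level:
            # Only enqueue links while we are below the target level.
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return order
```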
Install and usage
Setup
virtualenv -p /usr/bin/python3 crawler-env
source crawler-env/bin/activate
pip install -r requirements.txt
export PYTHONPATH=crawler
If the files below don't exist, the crawler will download them from our server:
- data/stopwords/fr.stopwords.txt
- data/stopwords/en.stopwords.txt
- data/badwords/fr.badwords.txt
- data/badwords/en.badwords.txt
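The download-if-missing behaviour could look like this minimal sketch; the base URL and the helper name are assumptions, not the crawler's real implementation:

```python
import os
from urllib.request import urlretrieve

# NOTE: base_url is a placeholder, not the real Swiftea server address.
def ensure_data_file(path, base_url="http://example.com/data"):
    """Download a word list from the server if it is missing locally."""
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urlretrieve(base_url + "/" + os.path.basename(path), path)
    return path
```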
Run tests
Using only pytest:
python setup.py test
With coverage:
coverage run setup.py test
coverage report
coverage html
Run crawler
from crawler.main import main
# infinite crawling:
main(loop_1=50, loop_2=10, dir_data='data1')
# domain crawling:
main(url='http://example.example', level=0, target_level=1, dir_data='data1')
main(url='http://some.thing', level=1, target_level=3, use_mongodb=True)
Build documentation
You must install the python3-sphinx package.
cd docs
make html
Run linter
Install prospector, then:
prospector > prospector_output.json
Deploy
Create directories in ftp server:
- /www/data/badwords
- /www/data/stopwords
- /www/data/inverted_index
Upload the lists of words: /www/[type]/[lang].[type].txt.
Create the database with sql/swiftea_mysql_db.sql.
How does it work?
Database:
The DatabaseSwiftea object can:
- send documents
- get the id of a document by its url
- delete a document
- select suggestions
- check if a document exists
- check for http and https duplicates
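The http/https duplicate check can be sketched as follows; this is an illustrative standalone function, not DatabaseSwiftea's real method:

```python
def https_duplicate(url, known_urls):
    """Return the stored variant of url that differs only by scheme, if any."""
    if url.startswith("https://"):
        twin = "http://" + url[len("https://"):]
    elif url.startswith("http://"):
        twin = "https://" + url[len("http://"):]
    else:
        return None
    # A hit means the same page is already indexed under the other scheme.
    return twin if twin in known_urls else None
```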
Limits
When stopping the crawler (ctrl+C), it will not restart with the interrupted url.
There are some small bugs in the file data/links/links.json: some items are missing the file value.
Version
The current version is 1.1.2.
Tech
Swiftea's Crawler uses a number of open source projects to work properly.
Contributing
Want to contribute? Great!
Fork the repository. Then, run:
git clone git@github.com:<username>/Crawler.git
cd Crawler
Then, do your work and commit your changes. Finally, make a pull request.
Commit conventions:
General
- Use the present tense
- Use the imperative mood
Examples
- Add something: "Add feature ..."
- Update: "Update ..."
- Improve something: "Improve ..."
- Change something: "Change ..."
- Fix something: "Fix ..."
- Fix an issue: "Fix #123456" or "Close #123456"
License
GNU GENERAL PUBLIC LICENSE (v3)
Free Software, Hell Yeah!
Hashes for swiftea_crawler-1.1.2-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | b9e8f8a0b47b6c0c7833ec677725cc72635890d721b2f0f3fbefe4b06a5207d5
MD5 | 136c609d336ea16ea47296199dd4bdf0
BLAKE2b-256 | a86b53bd78db1c17d678abcffaec6dfd7405b3b7d03d1c05593eb23e94c309bf