Scrape all products and their related suppliers listed on Alibaba, based on keywords provided by the user, and save the results to a database (MySQL/SQLite).
Project description
Alibaba-CLI-Scraper
A Python package that provides a dedicated CLI for scraping data from Alibaba.com. The purpose of this project is to extract products and their related supplier information from Alibaba.com and store it in a local database (SQLite or MySQL). The project uses asynchronous requests to handle large numbers of requests efficiently and lets users run the scraper and manage the database through a user-friendly command-line interface (CLI).
Features:
- Asynchronous Scraping: Uses Playwright's asynchronous API to handle many result pages efficiently.
- Database Integration: Stores scraped data in a database (SQLite or MySQL) for structured persistence.
- User-Friendly CLI: Provides easy-to-use commands for running the scraper and managing the database.
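To illustrate why the asynchronous mode helps when many result pages are requested, here is a minimal sketch that uses only asyncio. It is not this package's implementation: fetch_page is a hypothetical stand-in for a real Playwright page load, and the concurrency limit of 5 is an assumption chosen for the example.

```python
import asyncio


async def fetch_page(page_number: int) -> str:
    """Hypothetical stand-in for loading one result page (e.g. with Playwright)."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"<html>results page {page_number}</html>"


async def fetch_all(page_count: int, max_concurrency: int = 5) -> list[str]:
    """Fetch pages 1..page_count concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(page_number: int) -> str:
        # The semaphore keeps the number of in-flight requests bounded.
        async with semaphore:
            return await fetch_page(page_number)

    # gather() returns results in the same order as its arguments.
    return await asyncio.gather(
        *(bounded_fetch(n) for n in range(1, page_count + 1))
    )


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(15))
    print(len(pages))
```

Because the pages are awaited together rather than one after another, the total time is roughly bounded by the slowest batch instead of the sum of all page loads.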
Future Enhancements
This project has a lot of potential for growth! Here are some exciting features I'm considering for the future:
- Data Export: Add functionality to export scraped data to various formats like CSV and Excel spreadsheets for easier analysis and sharing.
- PostgreSQL Support: Expand database compatibility to include PostgreSQL, giving users more database choices.
- Retrieval Augmented Generation (RAG): Integrate a RAG system that allows users to ask natural language questions about the scraped data, making it even more powerful for insights.
Installation
Like any other Python package, this tool is best installed in an isolated environment to avoid conflicts with packages or dependencies already installed on your machine. Ideally it would be installed with pipx, but I have not found a way to support that yet. For now, create a virtual environment before installing it, using the following commands:
- Create a virtual environment:
  python -m venv scrapper
- Activate the virtual environment:
  scrapper\Scripts\activate.bat
- Install the scraper package:
  python -m pip install aba-cli-scrapper
Using the CLI Interface
This project provides a user-friendly command-line interface (CLI), built with typer, for interacting with the scraper and database.
Available Commands:
Need help? Run any command followed by --help for detailed information about its usage and options. For example, aba-run --help shows all available subcommands and how to use them.
Warnings:
1) aba-run is the base command: all the commands introduced below are subcommands and must always be preceded by aba-run.
2) This package is still under development, and the async part of the tool has many known bugs. It relies on Bright Data to improve performance when retrieving HTML result pages asynchronously, so it may be unavailable when the Bright Data API key limit has been reached. If you run into an issue with the async API, which is the default, use the sync API instead by passing the --sync-api flag; it works perfectly fine. So let's jump to the tutorial.
Practice makes perfect, doesn't it? So let's get started with an example use case. Let's assume that you want to scrape data about electric bikes from Alibaba.com.
- scraper: Initiates scraping of Alibaba.com based on the provided keywords. This command takes two required arguments and one optional argument:
  * key_words (required): The search term(s) for finding products on Alibaba. Enclose multiple keywords in quotes.
  * --page-results (required): Keywords usually match many result pages, so you must indicate how many of them you want to retrieve.
  * --html-folder (optional): Specifies the directory in which to store the raw HTML files. If omitted, a folder named after the sanitized keywords is created automatically.
Example:
aba-run scraper "electric bikes" --html-folder bike_results --page-results 15
By default, scraper uses the async API, which, as explained above, is unstable. If you want to use the sync API instead, run:
aba-run scraper "electric bikes" --html-folder bike_results --page-results 15 --sync-api
And voilà!
If the --html-folder option is not provided, a folder named after the sanitized keywords is created automatically; in this example the results folder would be named electric_bikes.
Once the command completes, the bike_results directory (the name you provided) will have been created and should contain all the HTML files from Alibaba.com matching your keywords.
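As an illustration of how a keyword phrase could become a default folder name, here is a minimal sketch. The function name sanitize_keywords and the exact replacement rule are assumptions made for this example; the package's actual sanitizing logic may differ.

```python
import re


def sanitize_keywords(key_words: str) -> str:
    """Turn a search phrase into a filesystem-friendly folder name (hypothetical rule)."""
    # Replace each run of characters other than letters, digits, '_' or '-'
    # with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "_", key_words.strip())
    # Drop underscores left over at either end.
    return cleaned.strip("_")


if __name__ == "__main__":
    print(sanitize_keywords("electric bikes"))  # prints "electric_bikes"
```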
Next, you must initialize a database. MySQL and SQLite are supported.
db-init: Creates a new MySQL or SQLite database. This command takes one required argument and six optional arguments (depending on the engine you choose):
* engine (required): Choose either sqlite or mysql.
* --sqlite-file (optional, SQLite only): The name of your SQLite database file (without the extension).
* --host, --port, --user, --password, --db-name (required for MySQL): Your MySQL database connection details.
* --only-with (optional, MySQL only): Use this if you want to update only some of the credentials stored in the db_credentials.json file, rather than all of them, before initializing a brand-new database.
MySQL Example:
aba-run db-init mysql --user "mysql_username" --password "mysql_password" --db-name "alibaba_products"
Assuming you have already initialized a database and want to create a new one without re-entering all your credentials, simply run:
aba-run db-init mysql --db-name "alibaba_products" --only-with
NB: This command saves your credentials in the db_credentials.json file, so when you need to update your database, simply run aba-run db-update mysql --kw-results bike_results\ to update it automatically using your saved credentials.
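The idea of keeping saved credentials and overriding only some of them can be sketched as follows. This is an assumption-laden illustration, not the package's code: the helper save_credentials and the field names user, password and db_name are hypothetical, and the real layout of db_credentials.json is not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_credentials(path: Path, **updates: str) -> dict:
    """Merge the given fields into the JSON file at path and return the result."""
    credentials = json.loads(path.read_text()) if path.exists() else {}
    # Only the supplied fields are overwritten; everything else is preserved.
    credentials.update(updates)
    path.write_text(json.dumps(credentials, indent=2))
    return credentials


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        credentials_file = Path(folder) / "db_credentials.json"
        # First run: store the full set of (hypothetical) fields.
        save_credentials(
            credentials_file,
            user="mysql_username",
            password="mysql_password",
            db_name="alibaba_products",
        )
        # Later run: change only the database name, keep the rest.
        print(save_credentials(credentials_file, db_name="another_database_name"))
```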
SQLite use case:
aba-run db-init sqlite --sqlite-file alibaba_data
As soon as your database has been initialized, you can update it with the scraped data.
db-update: Adds scraped data from the HTML files to your database. (To avoid a UNIQUE constraint error, you can't run this command twice with the same database credentials.)
This command takes two required arguments and two optional arguments:
* --db-engine (required): Select your database engine: sqlite or mysql.
* --kw-results (required): The path to the folder containing the HTML files generated by the scraper subcommand.
* --filename (required for SQLite): If you're using SQLite, provide the filename of your database, without any extension.
* --db-name (optional for MySQL): If you're using MySQL and want to push the data to a different database, provide the desired database name.
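To show why loading the same data twice into one database fails, here is a small, self-contained sketch using Python's built-in sqlite3 module. The products table and its single UNIQUE column are hypothetical; the schema the package actually creates is not shown in this document.

```python
import sqlite3


def create_schema(connection: sqlite3.Connection) -> None:
    """Create a hypothetical table with a UNIQUE column."""
    connection.execute("CREATE TABLE products (name TEXT UNIQUE)")


def import_products(connection: sqlite3.Connection, names: list[str]) -> bool:
    """Insert all names in one transaction; return False on a UNIQUE violation."""
    try:
        # The connection context manager commits on success and rolls back on error.
        with connection:
            connection.executemany(
                "INSERT INTO products (name) VALUES (?)",
                [(name,) for name in names],
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    create_schema(db)
    print(import_products(db, ["bike a", "bike b"]))  # first load succeeds
    print(import_products(db, ["bike a", "bike b"]))  # same rows again are rejected
```

The second call is rejected because every row it tries to insert already exists, which mirrors the error described above for running db-update twice against the same database.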
MySQL Example:
aba-run db-update mysql --kw-results bike_results\
NB: What if you want to change something while updating the database? Suppose you have run another scraping command and want to save that data under a different database name, without updating the credentials file or retyping all the parameters. Simply run aba-run db-update mysql --kw-results another_keyword_folder_result\ --db-name "another_database_name".
SQLite Example:
aba-run db-update sqlite --kw-results bike_results\ --filename alibaba_data
Contributions Welcome!
I believe in the power of open source! If you'd like to contribute to this project, feel free to fork the repository, make your changes, and submit a pull request. I'm always open to new ideas and improvements.
License
This project is licensed under the GNU General Public License Version 3.