Python3 library to ease Aliexpress crawling
Project description
Crawliexpress
Description
Allows to fetch various resources from Aliexpress, such as category, text search, product, feedbacks.
It does not use official API nor a headless browser, but parses page source.
Obviously, it is very vulnerable to DOM changes.
Usage
Install
pip install crawliexpress
Item
from crawliexpress import Client
client = Client("https://www.aliexpress.com")
client.get_item("4000505787173")
Feedbacks
from crawliexpress import Client
from pprint import pprint
from time import sleep
client = Client("https://www.aliexpress.com")
item = client.get_item("20000001708485")
page = 1
pages = list()
while True:
feedback_page = client.get_feedbacks(
item.product_id,
item.owner_member_id,
item.company_id,
with_picture=True,
page=page,
)
print(feedback_page.page)
if feedback_page.has_next_page() is False:
break
page += 1
sleep(1)
Category
from crawliexpress import Client
from time import sleep
client = Client(
"https://www.aliexpress.com",
# copy it from your browser cookies
"xxxx",
)
page = 1
while True:
search_page = client.get_category(205000314, "t-shirts", page=page)
print(search_page.page)
if search_page.has_next_page() is False:
break
page += 1
sleep(1)
- Cookies must be taken from your browser cookies, to avoid captcha and empty results. I usually login then copy as cURL a request made by my browser on a category or a text search. Make sure to remove the
Cookie:
prefix to keep only cookie values.
Search
from crawliexpress import Client
from time import sleep
client = Client(
"https://www.aliexpress.com",
# copy it from your browser cookies
"xxxx",
)
page = 1
while True:
search_page = client.get_search("akame ga kill", page=page)
print(search_page.page)
if search_page.has_next_page() is False:
break
page += 1
sleep(1)
- Cookies must be taken from your browser cookies, to avoid captcha and empty results. I usually login then copy as cURL a request made by my browser on a category or a text search. Make sure to remove the
Cookie:
prefix to keep only cookie values.
API
class crawliexpress.Category(client, category_id, category_name, sort_by='default')
A category
-
Parameters
-
category_id – id of the category, category id of https://www.aliexpress.com/category/205000221/t-shirts.html is 205000220
-
category_name – name of the category, category name of https://www.aliexpress.com/category/205000221/t-shirts.html is t-shirts
-
sort_by (default: best match total_tranpro_desc: number of orders) – indeed
-
class crawliexpress.Client(base_url, cookies=None)
Exposes methods to fetch various resources.
-
Parameters
-
base_url – allows to change locale (not sure about this one)
-
cookies – must be taken from your browser cookies, to avoid captcha and empty results. I usually login then copy as cURL a request made by my browser on a category or a text search. Make sure to remove the Cookie: prefix to keep only cookie values.
-
get_category(category_id, category_name, page=1, sort_by='default')
Fetches a category page
-
Parameters
-
category_id – id of the category, category id of https://www.aliexpress.com/category/205000221/t-shirts.html is 205000220
-
category_name – name of the category, category name of https://www.aliexpress.com/category/205000221/t-shirts.html is t-shirts
-
page – page number
-
sort_by (default: best match total_tranpro_desc: number of orders) – indeed
-
-
Returns
a search page
-
Return type
Crawliexpress.SearchPage
-
Raises
-
CrawliexpressException – if there was an error fetching the dataz
-
CrawliexpressCaptchaException – if there is a captcha, make sure to use valid cookies to avoid this
-
get_feedbacks(product_id, owner_member_id, company_id=None, v=2, member_type='seller', page=1, with_picture=False)
Fetches a product feedback page
-
Parameters
-
product_id – id of the product, item id of https://www.aliexpress.com/item/20000001708485.html is 20000001708485
-
owner_member_id – member id of the product owner, as stored in Crawliexpress.Item.owner_member_id
-
page – page number
-
with_picture – limit to feedbacks with a picture
-
-
Returns
a feedback page
-
Return type
Crawliexpress.FeedbackPage
-
Raises
CrawliexpressException – if there was an error fetching the dataz
get_item(item_id)
Fetches a product informations from its id
-
Parameters
item_id – id of the product to fetch, item id of https://www.aliexpress.com/item/20000001708485.html is 20000001708485
-
Returns
a product
-
Return type
Crawliexpress.Item
-
Raises
CrawliexpressException – if there was an error fetching the dataz
get_search(text, page=1, sort_by='default')
Fetches a search page
-
Parameters
-
text – text search
-
page – page number
-
sort_by (default: best match total_tranpro_desc: number of orders) – indeed
-
-
Returns
a search page
-
Return type
Crawliexpress.SearchPage
-
Raises
-
CrawliexpressException – if there was an error fetching the dataz
-
CrawliexpressCaptchaException – if there is a captcha, make sure to use valid cookies to avoid this
-
exception crawliexpress.CrawliexpressCaptchaException()
exception crawliexpress.CrawliexpressException()
class crawliexpress.Feedback()
A user feedback
comment( = None)
Review
country( = None)
Country code
datetime( = None)
Raw datetime from DOM
images( = None)
List of image links
profile( = None)
Profile link
rating( = None)
Rating out of 100
user( = None)
Name
class crawliexpress.FeedbackPage()
A feedback page
feedbacks( = None)
List of Crawliexpress.Feedback objects
has_next_page()
Returns true if there is a following page, useful for crawling
-
Return type
bool
known_pages( = None)
Sibling pages
page( = None)
Page number
class crawliexpress.Search(client, text, sort_by='default')
A search
-
Parameters
-
text – text search
-
sort_by (default: best match total_tranpro_desc: number of orders) – indeed
-
class crawliexpress.SearchPage()
A search page
has_next_page()
Returns true if there is a following page, useful for crawling
-
Return type
bool
items( = None)
List of products, raw from JS parsing
page( = None)
page number
result_count( = None)
Number of result for the whole search
size_per_page( = None)
Number of result per page
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for crawliexpress-0.1.6-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | b5741f7179fec6c68023a9431aa38cc51efa1000e87b93dd9ccb4b2d60fc54d1 |
|
MD5 | 7c6ae227fd2c629dae6b4ddcbec6c870 |
|
BLAKE2b-256 | d706cbfdbdb34c0819debc9ff8c148abfa6af0674d6c6dc28a914f02b0fb04dd |