
A simple, powerful Twitter crawler with support for collecting users, posts, comments, and more


easy_twitter_crawler

A Twitter crawler that collects users, posts, and comments; I hope it proves useful. If you would like to contribute useful code snippets, please send the code and a description to xinkonghan@gmail.com. The code style follows my own preferences, so please point out any shortcomings.

Installation

pip install easy-twitter-crawler

Features

  • search_crawler — keyword search (Top, People, Latest, Videos, and Photos tabs; filter conditions supported)
  • user_crawler — user collection (profile, posts, replies)
  • common_crawler — general collection (posts, comments)

Quick start

Set the proxy and cookie (a cookie is required for keyword search, user posts, user replies, followers, following, and comments)

proxy = {
    'http': 'http://127.0.0.1:10808',
    'https': 'http://127.0.0.1:10808'
}
cookie = ''
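set_cookie expects a dict, so the raw cookie string copied from the browser is first converted with cookie_to_dic. A minimal sketch of what such a conversion does (cookie_str_to_dict below is a hypothetical stand-in, not the library function):

```python
def cookie_str_to_dict(cookie_str: str) -> dict:
    """Split a raw 'k1=v1; k2=v2' browser cookie string into a dict."""
    return {
        pair.split('=', 1)[0].strip(): pair.split('=', 1)[1].strip()
        for pair in cookie_str.split(';')
        if '=' in pair
    }

print(cookie_str_to_dict('auth_token=abc123; ct0=def456'))
# → {'auth_token': 'abc123', 'ct0': 'def456'}
```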

Keyword search example (collect 10 items for a keyword with filter conditions)

from easy_spider_tool import cookie_to_dic, format_json
from easy_twitter_crawler import set_proxy, set_cookie, search_crawler, TwitterFilter

key_word = 'elonmusk'

twitter_filter = TwitterFilter(key_word)
twitter_filter.word_category(lang='en')
twitter_filter.account_category(filter_from='', to='', at='')
twitter_filter.filter_category(only_replies=None, only_links=None, exclude_replies=None, exclude_links=None)
twitter_filter.interact_category(min_replies='', min_faves='', min_retweets='')
twitter_filter.date_category(since='', until='')
key_word = twitter_filter.filter_join()

set_proxy(proxy)
set_cookie(cookie_to_dic(cookie))

for info in search_crawler(
        key_word,
        data_type='Top',
        count=10,
):
    set_proxy(proxy)
    set_cookie(cookie_to_dic(cookie))
    print(format_json(info))

Keyword search parameters

| Field | Type | Required | Description |
|---|---|---|---|
| key_word | string | | Keyword to search for |
| data_type | string | | Tab to collect from (Top, People, Latest, Videos, Photos) |
| count | int | | Number of items to collect (-1: none, the default; 0: all; >0: exactly that many) |
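The count convention above is shared by all three crawlers and can be pictured as a slice over the result stream. apply_count is a hypothetical helper for illustration, not part of the library:

```python
from itertools import islice

def apply_count(items, count):
    """Apply the crawler's count convention: -1 = skip, 0 = all, >0 = first n."""
    if count == -1:
        return iter(())          # collect nothing (the default)
    if count == 0:
        return iter(items)       # collect everything
    return islice(items, count)  # collect exactly `count` items

print(list(apply_count(range(5), 3)))   # → [0, 1, 2]
print(list(apply_count(range(5), -1)))  # → []
```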

Keyword filter parameters (mirrors Twitter's advanced search; separate multiple values for one parameter with spaces)

| Category | Field | Type | Required | Description |
|---|---|---|---|---|
| word_category | exact | string | | Exact phrase |
| word_category | filter_any | string | | Any of these words (multiple allowed) |
| word_category | exclude | string | | Exclude these words (multiple allowed), e.g. dog cat |
| word_category | tab | string | | These hashtags (multiple allowed) |
| word_category | lang | string | | Language (see the language table at the end) |
| account_category | filter_from | string | | From these accounts (multiple allowed) |
| account_category | to | string | | To these accounts (multiple allowed) |
| account_category | at | string | | Mentioning these accounts (multiple allowed) |
| filter_category | only_replies | bool | | Replies only |
| filter_category | only_links | bool | | Links only |
| filter_category | exclude_replies | bool | | Exclude replies |
| filter_category | exclude_links | bool | | Exclude links |
| interact_category | min_replies | int | | Minimum number of replies |
| interact_category | min_faves | int | | Minimum number of likes |
| interact_category | min_retweets | int | | Minimum number of retweets |
| date_category | since | string | | Start date ('2023-07-20') |
| date_category | until | string | | End date ('2023-08-20') |
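filter_join presumably assembles these fields into Twitter's advanced-search operator syntax (from:, to:, lang:, since:, until:, min_replies:, and so on). A rough sketch of that assembly covering a few of the fields; the library's actual output format is an assumption here:

```python
def build_search_query(keyword, lang='', filter_from='', to='',
                       min_replies='', since='', until=''):
    """Append Twitter advanced-search operators to a keyword."""
    parts = [keyword]
    if lang:
        parts.append(f'lang:{lang}')
    # Space-separated account lists each expand into one operator per account.
    parts += [f'from:{a}' for a in filter_from.split()]
    parts += [f'to:{a}' for a in to.split()]
    if min_replies:
        parts.append(f'min_replies:{min_replies}')
    if since:
        parts.append(f'since:{since}')
    if until:
        parts.append(f'until:{until}')
    return ' '.join(parts)

print(build_search_query('elonmusk', lang='en', since='2023-07-20'))
# → elonmusk lang:en since:2023-07-20
```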

User crawler example (collect the user's profile plus 10 posts, 10 replies, 10 followers, and 10 following)

from easy_spider_tool import cookie_to_dic, format_json
from easy_twitter_crawler import set_proxy, set_cookie, user_crawler

set_proxy(proxy)
set_cookie(cookie_to_dic(cookie))

for info in user_crawler(
        'elonmusk',
        article_count=10,
        reply_count=10,
        following_count=10,
        followers_count=10,
        # start_time='2023-07-20 00:00:00',
        # end_time='2023-07-27 00:00:00',
):
    set_proxy(proxy)
    set_cookie(cookie_to_dic(cookie))
    print(format_json(info))
    print(f"Posts: {len(info.get('article', []))}")
    print(f"Followers: {len(info.get('followers', []))}")
    print(f"Following: {len(info.get('following', []))}")
    print(f"Replies: {len(info.get('reply', []))}")

User crawler parameters

| Field | Type | Required | Description |
|---|---|---|---|
| user_id | string | | Username (the elonmusk in https://twitter.com/elonmusk) |
| article_count | int | | Number of posts to collect (-1: none, the default; 0: all; >0: exactly that many) |
| reply_count | int | | Number of replies to collect (same convention) |
| following_count | int | | Number of following to collect (same convention) |
| followers_count | int | | Number of followers to collect (same convention) |
| start_time | string | | Start of the time window (only applies when collecting posts or replies) |
| end_time | string | | End of the time window (only applies when collecting posts or replies) |
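start_time and end_time clip posts and replies to a time window in the 'YYYY-MM-DD HH:MM:SS' format shown in the example above. A sketch of such a window check (in_window is a hypothetical helper, not part of the library):

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

def in_window(created_at: str, start_time: str, end_time: str) -> bool:
    """Return True when a post timestamp falls inside [start_time, end_time]."""
    ts = datetime.strptime(created_at, FMT)
    return (datetime.strptime(start_time, FMT) <= ts
            <= datetime.strptime(end_time, FMT))

print(in_window('2023-07-25 12:00:00',
                '2023-07-20 00:00:00', '2023-07-27 00:00:00'))  # → True
```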

General crawler example (collect a post by its known post ID)

from easy_spider_tool import cookie_to_dic, format_json
from easy_twitter_crawler import set_proxy, set_cookie, common_crawler

set_proxy(proxy)
set_cookie(cookie_to_dic(cookie))

for info in common_crawler(
        '1684447438864785409',
        data_type='article',
):
    set_proxy(proxy)
    set_cookie(cookie_to_dic(cookie))
    print(format_json(info))

General crawler example (collect 10 comments under a known post ID)

from easy_spider_tool import cookie_to_dic, format_json
from easy_twitter_crawler import set_proxy, set_cookie, common_crawler

set_proxy(proxy)
set_cookie(cookie_to_dic(cookie))

for info in common_crawler(
        '1684447438864785409',
        data_type='comment',
        comment_count=10,
):
    set_proxy(proxy)
    set_cookie(cookie_to_dic(cookie))
    print(format_json(info))

General crawler parameters

| Field | Type | Required | Description |
|---|---|---|---|
| task_id | string | | Post ID (the 1690164670441586688 in https://twitter.com/elonmusk/status/1690164670441586688) |
| data_type | string | | Collection type (post: article, comment: comment) |
| comment_count | int | | Number of comments to collect (only applies when data_type is comment; -1: none, the default; 0: all; >0: exactly that many) |
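The task_id is simply the numeric segment after /status/ in a post URL; a small helper to extract it (extract_task_id is illustrative, not part of the library):

```python
import re

def extract_task_id(status_url: str) -> str:
    """Pull the numeric status ID out of a twitter.com/<user>/status/<id> URL."""
    match = re.search(r'/status/(\d+)', status_url)
    if not match:
        raise ValueError(f'no status id in {status_url!r}')
    return match.group(1)

print(extract_task_id('https://twitter.com/elonmusk/status/1690164670441586688'))
# → 1690164670441586688
```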

Language table

To be added when time permits.

Links

GitHub: https://github.com/hanxinkong/easy_twitter_crawler

Documentation: https://easy_twitter_crawler.xink.top/

Contributors

