goose3

HTML Content / Article Extractor, web scraping for Python 3

Intro

Goose was originally an article extractor written in Java that was later (August 2011) converted to a Scala project.

This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in article
Meta Description
Meta tags

The Python version was originally written by:

Xavier Grangier

Licensing

If you find Goose useful or have issues please drop me a line. I’d love to hear how you’re using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license; see the LICENSE file for more details.

On-line Documentation

More in-depth documentation is available online at Read the Docs.

Setup

To install using pip with support for all languages (this installs additional dependencies):

pip install goose3[all]

To install the minimal version:

pip install goose3

To install just the dependencies for a single language:

pip install goose3[chinese]
pip install goose3[arabic]
pip install goose3[japanese]

To install from source:

mkvirtualenv --no-site-packages goose3
git clone https://github.com/goose3/goose3.git
cd goose3
pip install -r ./requirements/python
python setup.py install
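
After any of the installs above, it can be useful to confirm which version actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of goose3.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the goose3 version, or None if the install did not succeed.
print(installed_version("goose3"))
```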

Take it for a spin

>>> from goose3 import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi"
>>> article.top_image.src
'http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg'
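
The attributes used in the session above can be gathered into a plain dict for logging or storage. This is only a sketch: `summarize` is a hypothetical helper, and it assumes `top_image` may be None when no image was fetched, so it guards that access. A stand-in object is used so the example runs without network access.

```python
from types import SimpleNamespace


def summarize(article, preview_chars=150):
    """Collect the commonly used fields of an extracted article into a dict.

    `article` is any object exposing the attributes used in the session
    above (title, meta_description, cleaned_text, top_image).
    """
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        "preview": article.cleaned_text[:preview_chars],
        # Assumption: top_image may be None when no image was fetched.
        "image_src": top_image.src if top_image is not None else None,
    }


# Stand-in object with made-up short values; not real Goose output.
fake = SimpleNamespace(
    title="Occupy London loses eviction fight",
    meta_description="Occupy London protesters lost their court bid.",
    cleaned_text="(CNN) - Occupy London protesters who have been camped outside",
    top_image=None,
)
print(summarize(fake, preview_chars=20))
```

With a real article, `summarize(g.extract(url=url))` would produce the same keys.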

Configuration

There are two ways to pass configuration to Goose. The first is to pass a Configuration() object; the second is to pass a configuration dict.

For instance, to change the user agent used by Goose, pass:

>>> g = Goose({'browser_user_agent': 'Mozilla'})

Switching parsers: Goose can be used with either the lxml HTML parser or the lxml soup parser. The HTML parser is used by default. To use the soup parser, select it in the configuration dict:

>>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})

Goose can also be made more lenient about network exceptions. To stop it from raising network exceptions, set the strict configuration setting to False:

>>> g = Goose({'strict': False})

To turn on image fetching, set the enable_image_fetching configuration property to True:

>>> g = Goose({'enable_image_fetching': True})
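
The examples above all use the dict form. The Configuration() object form might look like the sketch below. It rests on two assumptions that are not stated in this document: that the class can be imported from `goose3.configuration`, and that each setting is an attribute named like the corresponding dict key. `goose_from_configuration` and `SETTINGS` are hypothetical names.

```python
# The settings used in the dict examples above, gathered in one place.
SETTINGS = {
    "browser_user_agent": "Mozilla",
    "strict": False,
    "enable_image_fetching": True,
}


def goose_from_configuration(settings):
    """Build a Goose instance from a Configuration object rather than a dict.

    Assumption: Configuration lives in goose3.configuration and exposes each
    setting as an attribute named like the corresponding dict key.
    """
    # Imported lazily so this module loads even where goose3 is not installed.
    from goose3 import Goose
    from goose3.configuration import Configuration

    config = Configuration()
    for name, value in settings.items():
        # Fail loudly on a misspelled setting instead of silently ignoring it.
        if not hasattr(config, name):
            raise AttributeError(f"unknown setting: {name}")
        setattr(config, name, value)
    return Goose(config)
```

Calling `goose_from_configuration(SETTINGS)` should then behave like `Goose(SETTINGS)`, with the difference that an unknown key raises an error.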

Goose is now language aware

For example, scraping a Spanish content page with correct meta language tags:

>>> from goose3 import Goose
>>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
'Las listas de espera se agravan'
>>> article.cleaned_text[:150]
'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay más ciudad'

Some pages don’t have correct meta language tags; in that case you can force the language through the configuration:

>>> from goose3 import Goose
>>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
>>> g = Goose({'use_meta_language': False, 'target_language':'es'})
>>> article = g.extract(url=url)
>>> article.cleaned_text[:150]
'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kilómetros de Lyon, a Izaskun Lesaka y '

Passing {'use_meta_language': False, 'target_language': 'es'} forces Goose to treat the page as Spanish, regardless of its meta tags.
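
The two settings always travel together when forcing a language, so a small helper can build the dict. This is an illustrative sketch; `language_config` is a hypothetical name, and only the two keys shown above are used.

```python
def language_config(language_code=None):
    """Return a Goose configuration dict for language handling.

    With no code, the page's own meta language tags are trusted; with a
    code such as 'es', the meta tags are ignored and that language is forced.
    """
    if language_code is None:
        return {"use_meta_language": True}
    return {"use_meta_language": False, "target_language": language_code}


print(language_config("es"))
# → {'use_meta_language': False, 'target_language': 'es'}
```

The result can be passed straight to the constructor, as in `Goose(language_config('es'))`.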

Video extraction

>>> import goose3
>>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
>>> g = goose3.Goose({'target_language':'fr'})
>>> article = g.extract(url=url)
>>> article.movies
[<goose.videos.videos.Video object at 0x25f60d0>]
>>> article.movies[0].src
'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
>>> article.movies[0].embed_code
'<iframe src="http://sa.kewego.com/embed/vp/?language_code=fr&amp;playerKey=1764a824c13c&amp;configKey=dcc707ec373f&amp;suffix=&amp;sig=9bc77afb496s&amp;autostart=false" frameborder="0" scrolling="no" width="476" height="357"/>'
>>> article.movies[0].embed_type
'iframe'
>>> article.movies[0].width
'476'
>>> article.movies[0].height
'357'
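
Since `width` and `height` come back as strings in the session above, it may be convenient to normalise the movie list before using it. The sketch below assumes only the attributes shown above; `describe_movies` is a hypothetical helper, and the stand-in object lets it run without network access.

```python
from types import SimpleNamespace


def describe_movies(movies):
    """Turn Video-like objects into plain dicts with integer dimensions."""
    described = []
    for movie in movies:
        described.append({
            "src": movie.src,
            "embed_type": movie.embed_type,
            # width/height appear as strings above; treat empty values as unknown.
            "width": int(movie.width) if movie.width else None,
            "height": int(movie.height) if movie.height else None,
        })
    return described


# Stand-in mirroring the attributes shown above, with a shortened src.
sample = [
    SimpleNamespace(
        src="http://sa.kewego.com/embed/vp/",
        embed_type="iframe",
        width="476",
        height="357",
    )
]
print(describe_movies(sample))
```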

Goose in Chinese

Some users want to use Goose for Chinese content. Chinese word segmentation is considerably harder than for Western languages, so Chinese needs a dedicated stop-word analyser, which must be passed to Goose in the configuration.

>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
香港行政长官梁振英在各方压力下就其大宅的违章建筑（僭建）问题到立法会接受质询，并向香港民众道歉。

梁振英在星期二（12月10日）的答问大会开始之际在其演说中道歉，但强调他在违章建筑问题上没有隐瞒的意图和动机。

一些亲北京阵营议员欢迎梁振英道歉，且认为应能获得香港民众接受，但这些议员也质问梁振英有

Goose in Arabic

In order to use Goose in Arabic you have to use the StopWordsArabic class.

>>> from goose3 import Goose
>>> from goose3.text import StopWordsArabic
>>> url = 'http://arabic.cnn.com/2013/middle_east/8/3/syria.clashes/index.html'
>>> g = Goose({'stopwords_class': StopWordsArabic})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
دمشق، سوريا (CNN) - أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسل

Goose in Korean

In order to use Goose in Korean you have to use the StopWordsKorean class.

>>> from goose3 import Goose
>>> from goose3.text import StopWordsKorean
>>> url = 'http://news.donga.com/3/all/20131023/58406128/1'
>>> g = Goose({'stopwords_class': StopWordsKorean})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
경기도 용인에 자리 잡은 민간 시험인증 전문기업 ㈜디지털이엠씨(www.digitalemc.com).
14년째 세계 각국의 통신·안전·전파 규격 시험과 인증 한 우물만 파고 있는 이 회사 박채규 대표가 만나기로 한 주인공이다.
그는 전기전자·무선통신·자동차 전장품 분야에

Goose in Japanese

In order to use Goose in Japanese you have to use the StopWordsJapanese class.

>>> from goose3 import Goose
>>> from goose3.text import StopWordsJapanese
>>> url = 'https://www.cnn.co.jp/usa/35237967.html'
>>> g = Goose({'stopwords_class': StopWordsJapanese})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
イリーナ・ザルツカさん（２３）。今年８月、ノースカロライナ州シャーロットのライトレール列車に乗っていた際に刺されて死亡した/Iryna Zarutska/Instagram

（ＣＮＮ） 米ノースカロライナ州シャーロット中心部から数キロ離れたスケイリーバーク駅。駅に到着した深夜の列車に乗り込んだとき
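
The four language sections above follow the same pattern, so the choice of stop-word class can be driven by a lookup table. This is a sketch under assumptions: the two-letter codes used as keys are chosen here for illustration, and `stopwords_config` is a hypothetical helper. The class names are the ones imported from `goose3.text` in the examples above.

```python
import importlib

# Class names taken from the examples above; the language codes are assumptions.
STOPWORDS_CLASS_NAMES = {
    "zh": "StopWordsChinese",
    "ar": "StopWordsArabic",
    "ko": "StopWordsKorean",
    "ja": "StopWordsJapanese",
}


def stopwords_config(language_code):
    """Return a Goose config dict selecting the stop-word class for a language.

    Languages without a dedicated class get an empty dict (Goose defaults).
    """
    class_name = STOPWORDS_CLASS_NAMES.get(language_code)
    if class_name is None:
        return {}
    # Imported lazily: goose3 (and its language extras) is only needed here.
    text_module = importlib.import_module("goose3.text")
    return {"stopwords_class": getattr(text_module, class_name)}
```

For example, `Goose(stopwords_config('ko'))` would be equivalent to the Korean session above, provided goose3 is installed.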

TODO

HTML5 video tag extraction

Release history

Version	Release date
3.1.22	Jul 23, 2026 (this version)
3.1.21	Nov 30, 2025
3.1.20	Sep 17, 2025
3.1.19	Jan 19, 2024
3.1.18	Dec 27, 2023
3.1.17	Jul 4, 2023
3.1.16	Jun 7, 2023
3.1.15	Jun 1, 2023
3.1.14	Apr 26, 2023
3.1.13	Feb 24, 2023
3.1.12	Sep 14, 2022
3.1.11	Jan 18, 2022
3.1.10	Nov 17, 2021
3.1.9	Apr 27, 2021
3.1.8	Feb 22, 2021
3.1.7	Feb 2, 2021
3.1.6	Oct 20, 2018
3.1.5	Sep 11, 2018
3.1.4	Aug 19, 2018
3.1.3	Jul 7, 2018
3.1.2	Jun 2, 2018
3.1.1	May 29, 2018
3.1.0	Apr 3, 2018
3.0.9	Jan 12, 2018
3.0.8	Dec 9, 2017
3.0.7	Nov 23, 2017
3.0.6	Aug 22, 2017
3.0.5	Mar 30, 2017
3.0.4	Mar 30, 2017
3.0.3	Mar 28, 2017
3.0.2	Mar 28, 2017
3.0.1	Mar 24, 2017
3.0.0	Mar 13, 2017

Download files

Source Distribution

goose3-3.1.22.tar.gz (108.2 kB view details)

Uploaded Jul 23, 2026 Source

Built Distribution

goose3-3.1.22-py3-none-any.whl (114.6 kB view details)

Uploaded Jul 23, 2026 Python 3

File details

Details for the file goose3-3.1.22.tar.gz.

File metadata

Download URL: goose3-3.1.22.tar.gz
Upload date: Jul 23, 2026
Size: 108.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for goose3-3.1.22.tar.gz
Algorithm	Hash digest
SHA256	`db53cb2ffbab3d5dfe933ded4568ba41243bf349a264aceccc01109d19e15f23`
MD5	`445a45805d03d0650fa2867c2d7b49fb`
BLAKE2b-256	`16e55d42b6704a2591d050460b93f3f9aa69ab588c18461a726c5d00278aaad8`
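
A downloaded file can be checked against the SHA256 digest in the table above with the standard library alone. The following is a minimal sketch; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the self-check uses a throwaway file so nothing needs to be downloaded.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """True if the file's SHA256 equals the published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-check on a throwaway file with a known payload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
try:
    expected = hashlib.sha256(b"example payload").hexdigest()
    print(matches_published_hash(tmp.name, expected))
finally:
    os.remove(tmp.name)
```

For the source distribution, the call would be `matches_published_hash('goose3-3.1.22.tar.gz', 'db53cb2ffbab3d5dfe933ded4568ba41243bf349a264aceccc01109d19e15f23')`, run from the directory holding the download.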

File details

Details for the file goose3-3.1.22-py3-none-any.whl.

File metadata

Download URL: goose3-3.1.22-py3-none-any.whl
Upload date: Jul 23, 2026
Size: 114.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.6

File hashes

Hashes for goose3-3.1.22-py3-none-any.whl
Algorithm	Hash digest
SHA256	`787daa1a6439222e462e1da1783601f44affa0d941aba7e2609f71e2e2eb8aa0`
MD5	`7b27f0fd57933f78b794b1d272040b54`
BLAKE2b-256	`a222052b8315064577f773ed51d6aba67cb67a718e21ac4a0c38f006741721a6`
