Extract structured data from any unstructured web page

These details have not been verified by PyPI

Project description

Extracty: Dynamic Data Extraction

Extract structured data from any unstructured web page

extracty is a Python library for extracting structured data from websites. Built on Pydantic and Instructor, it lets you define a data extraction schema dynamically and run the extraction with a single function call.

How to Run

First, install the library:

pip install extracty

How to use the library

from extracty import LLMExtractor

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
)

data = extractor.extract()

print(data.model_dump_json())
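The snippet above passes the API key as a string literal. A minimal sketch of reading it from the environment instead is shown here. It assumes the key is stored in the `OPENAI_API_KEY` variable; the helper name `load_api_key` is hypothetical and not part of extracty.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration; extracty itself is only known
    to accept the key through the api_key argument shown above.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value can then be passed as `api_key=load_api_key()` when constructing `LLMExtractor`, which keeps the key out of source control.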

More advanced extraction

Here you can specify the fields you want the model to look for.
You can also specify the type of each field for easier data handling.

from extracty import LLMExtractor

fields = {
    "field_1": str,
    "field_2": int,
    "field_3": bool,
}

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use",  # defaults to gpt-4
)

data = extractor.extract()

print(data.model_dump_json())
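To illustrate why declaring a type per field makes the data easier to handle, here is a minimal standard-library sketch that checks a record against such a fields dictionary. It is a stand-in for illustration only: extracty relies on Pydantic for its modeling, and the function name `find_type_errors` and the placeholder field names are assumptions.

```python
def find_type_errors(record: dict, fields: dict[str, type]) -> list[str]:
    """List the fields of record that are missing or have the wrong type.

    Plain-Python stand-in for the validation a Pydantic model would do;
    not extracty's actual implementation.
    """
    errors = []
    for name, expected in fields.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int, so reject True/False unless bool is declared.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


# Placeholder field names, one per declared type.
fields = {"field_1": str, "field_2": int, "field_3": bool}
good = {"field_1": "abc", "field_2": 7, "field_3": True}
bad = {"field_1": "abc", "field_2": "7"}
```

Here `find_type_errors(good, fields)` returns an empty list, while `find_type_errors(bad, fields)` reports the string where an integer was declared and the missing `field_3`.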

Example usage

Here is an example that fetches the top 5 trending GitHub repositories:

from extracty import LLMExtractor

fields = {
    "rank": int,
    "repo_name": str,
    "small_description": str,
}

extractor = LLMExtractor(
    url="https://www.github.com/trending",
    query="What are the top 5 trending repositories on GitHub?",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use",  # defaults to gpt-4
)

data = extractor.extract()

print(data.model_dump_json())

And the corresponding output:

{
  "name": "Top 5 Trending Repositories on GitHub",
  "data": [
    {
      "rank": 1,
      "repo_name": "stitionai / devika",
      "small_description": "Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Devika aims to be a competitive open-source alternative to Devin by Cognition AI."
    },
    {
      "rank": 2,
      "repo_name": "OpenDevin / OpenDevin",
      "small_description": "OpenDevin: Code Less, Make More 利用大模型，一键生成短视频"
    },
    {
      "rank": 3,
      "repo_name": "harry0703 / MoneyPrinterTurbo",
      "small_description": "Decentralized Autonomous Regulated Company (DARC), a company virtual machine that runs on any EVM-compatible blockchain, with on-chain law system, multi-level tokens and dividends mechanism."
    },
    {
      "rank": 4,
      "repo_name": "Project-DARC / DARC",
      "small_description": "A natural language interface for computers"
    },
    {
      "rank": 5,
      "repo_name": "OpenInterpreter / open-interpreter",
      "small_description": "A one stop repository for generative AI research updates, interview resources, notebooks and much more!"
    }
  ]
}
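Once the JSON string is available, it can be consumed with the standard library alone. The sketch below assumes the output keeps the shape shown above (a `name` plus a `data` list whose items carry the requested fields). The sample is shortened to two entries with placeholder descriptions, and the function name `summarize` is hypothetical.

```python
import json

# Shortened sample in the same shape as the output above; descriptions are
# replaced with a placeholder to keep the sample brief.
SAMPLE_OUTPUT = """
{
  "name": "Top 5 Trending Repositories on GitHub",
  "data": [
    {"rank": 2, "repo_name": "OpenDevin / OpenDevin", "small_description": "..."},
    {"rank": 1, "repo_name": "stitionai / devika", "small_description": "..."}
  ]
}
"""


def summarize(raw_json: str) -> list[str]:
    """Return one 'rank. owner/repo' line per entry, ordered by rank."""
    payload = json.loads(raw_json)
    rows = sorted(payload["data"], key=lambda row: row["rank"])
    # Drop the spaces around the slash so the name matches GitHub's owner/repo form.
    return [f'{row["rank"]}. {row["repo_name"].replace(" / ", "/")}' for row in rows]
```

In real use, the string passed to `summarize` would be the result of `data.model_dump_json()` rather than the sample constant.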

Contributing

If you would like to contribute to this project, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Make your changes and commit them.
4. Push your changes to your forked repository.
5. Submit a pull request to the original repository.

I appreciate your contributions to building this fun project!

TODOs

- Enhance the scraping and make it more robust.
  - Utilize async Playwright to be efficient for installation.
  - Enhance cleaning HTML content and make it more efficient.
- Enhance Pydantic modeling.
  - Enhance dynamic model creation.
  - Enhance BaseExtractor.
- Optimize performance for large-scale data extraction.

Acknowledgments

Uses OpenAI for advanced data extraction capabilities, Pydantic and Instructor for dynamic and robust data modeling, and LangChain and Playwright for efficient web interaction.

Release history

0.1.1 (Mar 29, 2024)
0.1.0 (Mar 27, 2024), this version

Download files

Source Distribution

extracty-0.1.0.tar.gz (5.6 kB view details)

Uploaded Mar 27, 2024 Source

Built Distribution

extracty-0.1.0-py3-none-any.whl (6.4 kB view details)

Uploaded Mar 27, 2024 Python 3

File details

Details for the file extracty-0.1.0.tar.gz.

File metadata

Download URL: extracty-0.1.0.tar.gz
Upload date: Mar 27, 2024
Size: 5.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0

File hashes

Hashes for extracty-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`fc519604f0bad7fce07612b076199d855de55c726b2daef4be24161e01c56391`
MD5	`8f5d917210423eddecdba855b2bac0b8`
BLAKE2b-256	`b702c9fc8ddf0434956e33299d38bc4b2fd8f04b7c0cf63eceda00e61195cb44`

File details

Details for the file extracty-0.1.0-py3-none-any.whl.

File metadata

Download URL: extracty-0.1.0-py3-none-any.whl
Upload date: Mar 27, 2024
Size: 6.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0

File hashes

Hashes for extracty-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad48a95dd4b5767187c2b2c6dd97c503c5d7c5c29c6a7227d976a6c83f91cdcf`
MD5	`44b8160ee466e14f79b83b2799e8f8f7`
BLAKE2b-256	`fd59b7b74d241004b438e06fccd4c2c284e584330bc6dbfbe9da6292f7a0e3b1`
