Extracty: Dynamic Data Extraction

Extract structured data from any unstructured web page

extracty is a Python library for extracting structured data from web pages. Built on Pydantic and Instructor, it lets you define extraction schemas dynamically and run an extraction with a single function call.

How to Run

First, install the library:

pip install extracty

How to use the library

from extracty import LLMExtractor

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
)

data = extractor.extract()

print(data.model_dump_json())

More advanced extraction

Here you can specify the fields you want the model to look for.
You can also specify the type of each field, which makes the extracted data easier to handle.

from extracty import LLMExtractor

fields = {
    "field_1": str,
    "field_2": int,
    "field_3": bool,
}

extractor = LLMExtractor(
    url="url-you-want-to-scrape",
    query="your-query",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use",  # optional; defaults to gpt-4
)

data = extractor.extract()

print(data.model_dump_json())
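
To illustrate what typed fields provide, the sketch below turns a fields dictionary like the one above into a Pydantic model using Pydantic's create_model. This is only an assumption about how such a mapping could work, not a description of extracty's internals; build_model and Record are hypothetical names introduced for this example. It assumes Pydantic v2, which the model_dump_json() calls above suggest.

```python
from pydantic import BaseModel, ValidationError, create_model

# Same shape as the `fields` argument shown above.
fields = {
    "field_1": str,
    "field_2": int,
    "field_3": bool,
}


# Hypothetical helper, not part of extracty: every entry of `fields`
# becomes a required, typed attribute on a dynamically created model.
def build_model(name: str, fields: dict[str, type]) -> type[BaseModel]:
    # (type, ...) is Pydantic's notation for "required field of this type".
    definitions = {key: (tp, ...) for key, tp in fields.items()}
    return create_model(name, **definitions)


Record = build_model("Record", fields)

# Values are validated, and coerced where Pydantic's default mode allows it.
record = Record(field_1="hello", field_2="42", field_3=True)
print(record.model_dump_json())

# A value that cannot be converted to the declared type is rejected.
try:
    Record(field_1="hello", field_2="not a number", field_3=True)
except ValidationError as exc:
    print(f"rejected: {exc.error_count()} validation error(s)")
```

Declaring types up front means that downstream code can rely on, for example, field_2 being an int rather than a string that happens to contain digits.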

Example usage

Here is an example that retrieves the top 5 trending GitHub repositories:

from extracty import LLMExtractor

fields = {
    "rank": int,
    "repo_name": str,
    "small_description": str,
}

extractor = LLMExtractor(
    url="https://www.github.com/trending",
    query="What are the top 5 trending repositories on GitHub?",
    api_key="your-openai-api-key",
    fields=fields,
    gpt_model="the-model-you-want-to-use",  # optional; defaults to gpt-4
)

data = extractor.extract()

print(data.model_dump_json())

And the corresponding output:

{
  "name": "Top 5 Trending Repositories on GitHub",
  "data": [
    {
      "rank": 1,
      "repo_name": "stitionai / devika",
      "small_description": "Devika is an Agentic AI Software Engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Devika aims to be a competitive open-source alternative to Devin by Cognition AI."
    },
    {
      "rank": 2,
      "repo_name": "OpenDevin / OpenDevin",
      "small_description": "OpenDevin: Code Less, Make More 利用大模型,一键生成短视频"
    },
    {
      "rank": 3,
      "repo_name": "harry0703 / MoneyPrinterTurbo",
      "small_description": "Decentralized Autonomous Regulated Company (DARC), a company virtual machine that runs on any EVM-compatible blockchain, with on-chain law system, multi-level tokens and dividends mechanism."
    },
    {
      "rank": 4,
      "repo_name": "Project-DARC / DARC",
      "small_description": "A natural language interface for computers"
    },
    {
      "rank": 5,
      "repo_name": "OpenInterpreter / open-interpreter",
      "small_description": "A one stop repository for generative AI research updates, interview resources, notebooks and much more!"
    }
  ]
}
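
Since the result is plain JSON, it can be consumed with the standard library alone. The sketch below assumes the output keeps the shape shown above (a "name" string and a "data" list of records). The sample string is an abbreviated copy of that output, with the descriptions replaced by a placeholder to save space.

```python
import json

# Abbreviated copy of the output above; "..." stands in for the
# full descriptions and is a placeholder, not real extracted text.
raw = """
{
  "name": "Top 5 Trending Repositories on GitHub",
  "data": [
    {"rank": 2, "repo_name": "OpenDevin / OpenDevin", "small_description": "..."},
    {"rank": 1, "repo_name": "stitionai / devika", "small_description": "..."}
  ]
}
"""

result = json.loads(raw)

# Sort by rank in case the records do not arrive in order.
rows = sorted(result["data"], key=lambda row: row["rank"])

print(result["name"])
for row in rows:
    # "owner / repo" -> "owner/repo"
    owner, repo = (part.strip() for part in row["repo_name"].split("/", 1))
    print(f'{row["rank"]}. {owner}/{repo}')
```

In a real script, raw would be the string returned by data.model_dump_json().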

Contributing

If you would like to contribute to this project, please follow these steps:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix.
  3. Make your changes and commit them.
  4. Push your changes to your forked repository.
  5. Submit a pull request to the original repository.

I appreciate your contributions to this fun project!

TODOs

  • Enhance the scraping and make it more robust.
    • Utilize async Playwright to be efficient for installation.
    • Enhance cleaning HTML content and make it more efficient.
  • Enhance Pydantic modeling.
    • Enhance dynamic model creation.
    • Enhance BaseExtractor.
  • Optimize performance for large-scale data extraction.

Acknowledgments

extracty uses OpenAI models for data extraction, Pydantic and Instructor for dynamic data modeling, and LangChain and Playwright for web interaction.


Download files

Source Distribution

extracty-0.1.0.tar.gz (5.6 kB)

Uploaded Source

Built Distribution

extracty-0.1.0-py3-none-any.whl (6.4 kB)

Uploaded Python 3

File details

Details for the file extracty-0.1.0.tar.gz.

File metadata

  • Download URL: extracty-0.1.0.tar.gz
  • Upload date:
  • Size: 5.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0

File hashes

Hashes for extracty-0.1.0.tar.gz
Algorithm Hash digest
SHA256 fc519604f0bad7fce07612b076199d855de55c726b2daef4be24161e01c56391
MD5 8f5d917210423eddecdba855b2bac0b8
BLAKE2b-256 b702c9fc8ddf0434956e33299d38bc4b2fd8f04b7c0cf63eceda00e61195cb44

File details

Details for the file extracty-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: extracty-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 6.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.8 Darwin/23.3.0

File hashes

Hashes for extracty-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ad48a95dd4b5767187c2b2c6dd97c503c5d7c5c29c6a7227d976a6c83f91cdcf
MD5 44b8160ee466e14f79b83b2799e8f8f7
BLAKE2b-256 fd59b7b74d241004b438e06fccd4c2c284e584330bc6dbfbe9da6292f7a0e3b1
