Skip to main content

Simple packego for grab data from raw text.

Project description

TextLasso 🤠

PyPI version Python 3.9+ License: MIT

TextLasso is a simple Python library for extracting structured data from raw text, with special focus on processing LLM (Large Language Model) responses. Whether you're parsing JSON buried in markdown, extracting data from XML, or need to generate structured prompts for AI models, TextLasso has you covered.

✨ Key Features

  • 🎯 Smart Text Extraction: Extract structured data from messy text with multiple fallback strategies
  • 🧹 LLM Response Cleaning: Automatically clean code blocks, markdown artifacts, and formatting
  • 🏗️ Dataclass Integration: Convert raw text directly to Python dataclasses with type validation
  • 🤖 AI Prompt Generation: Generate structured prompts with schema validation and examples
  • 📊 Multiple Formats: Support for JSON, XML, and extensible to other formats
  • 🔧 Flexible Configuration: Configurable error handling, logging, and validation modes
  • 🎨 Decorator Support: Enhance existing functions with structured output capabilities

🚀 Quick Start

Installation

pip install textlasso

Basic Usage

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass
class Person:
    name: str
    age: int
    email: Optional[str] = None
    skills: List[str] = None

# Extract from messy LLM response
llm_response = """
Here's the person data you requested:

\```json
{
    "name": "Alice Johnson",
    "age": 30,
    "email": "alice@company.com", 
    "skills": ["Python", "Machine Learning", "Data Science"]
}
\```

Hope this helps!
"""

person = extract(llm_response, Person, extract_strategy='json')
print(f"Extracted: {person.name}, {person.age} years old")
print(person)
# Extracted: Alice Johnson, 30 years old
# Person(name='Alice Johnson', age=30, email='alice@company.com', skills=['Python', 'Machine Learning', 'Data Science'])

📚 Comprehensive Examples

1. Basic Text Extraction

JSON Extraction with Fallback Strategies

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass
class Product:
    name: str
    price: float
    category: str
    in_stock: bool
    tags: Optional[List[str]] = None

# Works with clean JSON
clean_json = '{"name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": true}'

# Works with markdown-wrapped JSON
markdown_json = """
Here's your product data:
```json
{
    "name": "Wireless Headphones",
    "price": 199.99,
    "category": "Electronics", 
    "in_stock": false,
    "tags": ["wireless", "bluetooth", "noise-canceling"]
}
\```
"""

# Works with messy responses
messy_response = """
Let me extract that product information for you...

The product details are: {"name": "Smart Watch", "price": 299.99, "category": "Wearables", "in_stock": true}

Is this what you were looking for?
"""

# All of these work automatically
products = [
    extract(clean_json, Product, extract_strategy='json'),
    extract(markdown_json, Product, extract_strategy='json'), 
    extract(messy_response, Product, extract_strategy='json')
]

for product in products:
    print(f"{product.name}: ${product.price} ({'✅' if product.in_stock else '❌'})")

XML Extraction

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass 
class Address:
    street: str
    city: str
    country: str
    zip_code: Optional[str] = None
    
@dataclass
class ResponseAddress:
    address: Address

xml_data = """
<address>
    <street>123 Main St</street>
    <city>San Francisco</city>
    <country>USA</country>
    <zip_code>94102</zip_code>
</address>
"""

response_address = extract(xml_data, ResponseAddress, extract_strategy='xml')
print(f"Address: {response_address.address.street}, {response_address.address.city}, {response_address.address.country}")
# Address: 123 Main St, San Francisco, USA

2. Complex Nested Data Structures

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum

class Department(Enum):
    ENGINEERING = "engineering"
    MARKETING = "marketing" 
    SALES = "sales"
    HR = "hr"

@dataclass
class Employee:
    id: int
    name: str
    department: Department
    salary: float
    skills: List[str]
    manager_id: Optional[int] = None

@dataclass
class Company:
    name: str
    founded_year: int
    employees: List[Employee]
    headquarters: Address

complex_json = """
{
    "name": "TechCorp Inc",
    "founded_year": 2015,
    "headquarters": {
        "street": "100 Tech Plaza",
        "city": "Austin", 
        "country": "USA",
        "zip_code": "78701"
    },
    "employees": [
        {
            "id": 1,
            "name": "Sarah Chen", 
            "department": "engineering",
            "salary": 120000,
            "skills": ["Python", "React", "AWS"],
            "manager_id": null
        },
        {
            "id": 2,
            "name": "Mike Rodriguez",
            "department": "marketing", 
            "salary": 85000,
            "skills": ["SEO", "Content Strategy", "Analytics"],
            "manager_id": 1
        }
    ]
}
"""

company = extract(complex_json, Company, extract_strategy='json')
print(f"Company: {company.name} ({company.founded_year})")
print(f"HQ: {company.headquarters.city}, {company.headquarters.country}")
print(f"Employees: {len(company.employees)}")

for emp in company.employees:
    print(f"  - {emp.name} ({emp.department.value}): {', '.join(emp.skills)}")

# HQ: Austin, USA
# Employees: 2
#   - Sarah Chen (engineering): Python, React, AWS
#   - Mike Rodriguez (marketing): SEO, Content Strategy, Analytics

3. LLM Response Cleaning

from textlasso.cleaners import clear_llm_res

# Clean various LLM response formats
messy_responses = [
    "\```json\\n{\"key\": \"value\"}\\n\```",
    "\```\\n{\"key\": \"value\"}\\n\```", 
    "Here's the data: {\"key\": \"value\"} hope it helps!",
    "\```xml\\n<root><item>data</item></root>\\n\```"
]

for response in messy_responses:
    clean_json = clear_llm_res(response, extract_strategy='json')
    clean_xml = clear_llm_res(response, extract_strategy='xml')
    print(f"Original: {response}")
    print(f"JSON cleaned: {clean_json}")
    print(f"XML cleaned: {clean_xml}")
    print("---")

4. Advanced Data Extraction with Configuration

from textlasso import extract_from_dict
import logging

# Configure custom logging
logger = logging.getLogger("my_extractor")
logger.setLevel(logging.DEBUG)

@dataclass
class FlexibleData:
    required_field: str
    optional_field: Optional[str] = None
    number_field: int = 0

# Strict mode - raises errors on type mismatches
data_with_extra = {
    "required_field": "test",
    "optional_field": "optional", 
    "number_field": "123",  # String instead of int
    "extra_field": "ignored"  # Extra field
}

# Strict mode (default)
try:
    result_strict = extract_from_dict(
        data_with_extra, 
        FlexibleData,
        strict_mode=True,
        ignore_extra_fields=True,
        logger=logger
    )
    print("Strict mode result:", result_strict)
except Exception as e:
    print("Strict mode error:", e)

# Flexible mode - attempts conversion
result_flexible = extract_from_dict(
    data_with_extra,
    FlexibleData, 
    strict_mode=False,
    ignore_extra_fields=True,
    logger=logger
)
print("Flexible mode result:", result_flexible)

5. Structured Prompt Generation

Basic Prompt Generation

from textlasso import generate_structured_prompt

@dataclass
class UserFeedback:
    rating: int  # 1-5
    comment: str
    category: str
    recommended: bool
    issues: Optional[List[str]] = None

# Generate a structured prompt
prompt = generate_structured_prompt(
    prompt="Analyze this customer review and extract structured feedback",
    schema=UserFeedback,
    strategy="json",
    include_schema_description=True,
    example_count=2
)

print(prompt)
# Output:
# Analyze this customer review and extract structured feedback

# ## OUTPUT FORMAT REQUIREMENTS

# You must respond with a valid JSON object that follows this exact structure:

# ### Schema: UserFeedback
# - **rating**: int (required)
# - **comment**: str (required)
# - **category**: str (required)
# - **recommended**: bool (required)
# - **issues**: Array of str (optional)


# ### JSON Format Rules:
# - Use proper JSON syntax with double quotes for strings
# - Include all required fields
# - Use null for optional fields that are not provided
# - Arrays should contain objects matching the specified structure
# - Numbers should not be quoted
# - Booleans should be true/false (not quoted)


# ## EXAMPLES

# Here are 2 examples of the expected JSON format:

# ### Example 1:
# ```json
# {
#   "rating": 1,
#   "comment": "example_comment_1",
#   "category": "example_category_1",
#   "recommended": true,
#   "issues": [
#     "example_issues_item_1",
#     "example_issues_item_2"
#   ]
# }
# ```

# ### Example 2:
# ```json
# {
#   "rating": 2,
#   "comment": "example_comment_2",
#   "category": "example_category_2",
#   "recommended": false,
#   "issues": [
#     "example_issues_item_1",
#     "example_issues_item_2",
#     "example_issues_item_3"
#   ]
# }
# ```

# Remember: Your response must be valid JSON that matches the specified structure exactly.

Using the Decorator for Function Enhancement

If you have a prompt returning functions, you can use the @structured_output decorator to automatically enhance your prompts with structure requirements.

from dataclasses import dataclass
from typing import Optional, List

from textlasso import structured_output

@dataclass
class NewsArticle:
    title: str
    summary: str
    category: str
    sentiment: str
    key_points: List[str]
    publication_date: Optional[str] = None

# decorate prompt-returning function
@structured_output(schema=NewsArticle, strategy="xml", example_count=1)
def create_article_analysis_prompt(article_text: str) -> str:
    return f"""
    Analyze the following news article and extract key information:
    
    Article: {article_text}
    
    Please provide a comprehensive analysis focusing on the main themes,
    sentiment, and key takeaways.
    """

# The decorator automatically enhances your prompt with structure requirements
article_text = "Breaking: New AI breakthrough announced by researchers..."
enhanced_prompt = create_article_analysis_prompt(article_text)

# This prompt now includes schema definitions, examples, and format requirements
print("Enhanced prompt: ", enhanced_prompt)

# Enhanced prompt:  
#     Analyze the following news article and extract key information:
    
#     Article: Breaking: New AI breakthrough announced by researchers...
    
#     Please provide a comprehensive analysis focusing on the main themes,
#     sentiment, and key takeaways.
    


# ## OUTPUT FORMAT REQUIREMENTS

# You must respond with a valid XML object that follows this exact structure:

# ### Schema: NewsArticle
# - **title**: str (required)
# - **summary**: str (required)
# - **category**: str (required)
# - **sentiment**: str (required)
# - **key_points**: Array of str (required)
# - **publication_date**: str (optional)


# ### XML Format Rules:
# - Use proper XML syntax with opening and closing tags
# - Root element should match the main dataclass name
# - Use snake_case for element names
# - For arrays, repeat the element name for each item
# - Use self-closing tags for null/empty optional fields
# - Include all required fields as elements

6. Real-World Use Cases

Processing Survey Responses

@dataclass
class SurveyResponse:
    respondent_id: str
    age_group: str
    satisfaction_rating: int
    feedback: str
    would_recommend: bool
    improvement_areas: List[str]

# Simulating LLM processing of survey data
llm_survey_output = """
Based on the survey response, here's the extracted data:

\```json
{
    "respondent_id": "RESP_001",
    "age_group": "25-34", 
    "satisfaction_rating": 4,
    "feedback": "Great service overall, but could improve response time",
    "would_recommend": true,
    "improvement_areas": ["response_time", "pricing"]
}
\```

This response indicates positive sentiment with specific improvement suggestions.
"""

survey = extract(llm_survey_output, SurveyResponse, extract_strategy='json')
print(survey)
# SurveyResponse(respondent_id='RESP_001', age_group='25-34', satisfaction_rating=4, feedback='Great service overall, but could improve response time', would_recommend=True, improvement_areas=['response_time', 'pricing'])

E-commerce Product Extraction

@dataclass
class ProductReview:
    product_id: str
    reviewer_name: str
    rating: int
    review_text: str
    verified_purchase: bool
    helpful_votes: int
    review_date: str

@structured_output(schema=ProductReview, strategy="xml")
def create_review_extraction_prompt(raw_review: str) -> str:
    return f"""
    Extract structured information from this product review:
    
    {raw_review}
    
    Pay attention to implicit ratings, sentiment, and any verification indicators.
    """

raw_review = """
★★★★☆ Amazing headphones! by John D. (Verified Purchase) - March 15, 2024
These headphones exceeded my expectations. Great sound quality and comfortable fit.
Battery life could be better but overall very satisfied. Would definitely buy again!
👍 47 people found this helpful
"""

extraction_prompt = create_review_extraction_prompt(raw_review)
# Send this prompt to your LLM, then extract the response:
# review = extract(llm_response, ProductReview, extract_strategy='xml')

🔧 Configuration Options

Extraction Configuration

from textlasso import extract_from_dict
import logging

# Configure extraction behavior
result = extract_from_dict(
    data_dict=your_data,
    target_class=YourDataClass,
    strict_mode=False,          # Allow type conversions
    ignore_extra_fields=True,   # Ignore unknown fields
    logger=custom_logger,       # Custom logging
    log_level=logging.DEBUG     # Detailed logging
)

Prompt Generation Configuration

from textlasso import generate_structured_prompt

prompt = generate_structured_prompt(
    prompt="Your base prompt",
    schema=YourSchema,
    strategy="json",                    # or "xml"
    include_schema_description=True,    # Include field descriptions
    example_count=3                     # Number of examples (1-3)
)

📖 API Reference

Core Functions

extract(text, target_class, extract_strategy='json')

Extract structured data from text.

Parameters:

  • text (str): Raw text containing data to extract
  • target_class (type): Dataclass to convert data into
  • extract_strategy (Literal['json', 'xml']): Extraction strategy

Returns: Instance of target_class

extract_from_dict(data_dict, target_class, **options)

Convert dictionary to dataclass with advanced options.

generate_structured_prompt(prompt, schema, strategy, **options)

Generate enhanced prompts with structure requirements.

Decorators

@structured_output(schema, strategy='json', **options)

Enhance prompt functions with structured output requirements.

@chain_prompts(*prompt_funcs, separator='\n\n---\n\n')

Chain multiple prompt functions together.

@prompt_cache(maxsize=128)

Cache prompt results for better performance.

Utilities

clear_llm_res(text, extract_strategy)

Clean LLM responses by removing code blocks and formatting.

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Run tests: pytest
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built for the AI/LLM community
  • Inspired by the need for robust text processing in AI applications
  • Special thanks to all contributors and users

📞 Support


TextLasso - Wrangle your text data with ease! 🤠

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textlasso-0.1.1.tar.gz (26.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

textlasso-0.1.1-py3-none-any.whl (22.2 kB view details)

Uploaded Python 3

File details

Details for the file textlasso-0.1.1.tar.gz.

File metadata

  • Download URL: textlasso-0.1.1.tar.gz
  • Upload date:
  • Size: 26.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.2

File hashes

Hashes for textlasso-0.1.1.tar.gz
Algorithm Hash digest
SHA256 77b06cedb9f5086ae3aa6b156eb2d98eff8c435c1a324f38a1efcdc3ef6b21f9
MD5 306a3d42202788337310902c87d69197
BLAKE2b-256 81dbbf5cac53971506f9a597d2fdeac37dc2afe2955121f0745cac03de9ce544

See more details on using hashes here.

File details

Details for the file textlasso-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: textlasso-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 22.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.5.2

File hashes

Hashes for textlasso-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 d6d24937ca6bf7be66bf819ce7defe868df001e4310cb7220803f5779be81e5b
MD5 c7558007fb3878745644fa6cb7ce8660
BLAKE2b-256 4ef1423eff05839b439d6ac53827991485f645eb5df26a942baf2daeac40f8f1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page