Simple packego for grab data from raw text.

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
Intended Audience
- Developers
Programming Language

Project description

TextLasso 🤠

TextLasso is a simple Python library for extracting structured data from raw text, with special focus on processing LLM (Large Language Model) responses. Whether you're parsing JSON buried in markdown, extracting data from XML, or need to generate structured prompts for AI models, TextLasso has you covered.

✨ Key Features

🎯 Smart Text Extraction: Extract structured data from messy text with multiple fallback strategies
🧹 LLM Response Cleaning: Automatically clean code blocks, markdown artifacts, and formatting
🏗️ Dataclass Integration: Convert raw text directly to Python dataclasses with type validation
🤖 AI Prompt Generation: Generate structured prompts with schema validation and examples
📊 Multiple Formats: Support for JSON, XML, and extensible to other formats
🔧 Flexible Configuration: Configurable error handling, logging, and validation modes
🎨 Decorator Support: Enhance existing functions with structured output capabilities

🚀 Quick Start

Installation

pip install textlasso

Basic Usage

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass
class Person:
    name: str
    age: int
    email: Optional[str] = None
    skills: List[str] = None

# Extract from messy LLM response
llm_response = """
Here's the person data you requested:

\```json
{
    "name": "Alice Johnson",
    "age": 30,
    "email": "alice@company.com", 
    "skills": ["Python", "Machine Learning", "Data Science"]
}
\```

Hope this helps!
"""

person = extract(llm_response, Person, extract_strategy='json')
print(f"Extracted: {person.name}, {person.age} years old")
print(person)
# Extracted: Alice Johnson, 30 years old
# Person(name='Alice Johnson', age=30, email='alice@company.com', skills=['Python', 'Machine Learning', 'Data Science'])

📚 Comprehensive Examples

1. Basic Text Extraction

JSON Extraction with Fallback Strategies

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass
class Product:
    name: str
    price: float
    category: str
    in_stock: bool
    tags: Optional[List[str]] = None

# Works with clean JSON
clean_json = '{"name": "Laptop", "price": 999.99, "category": "Electronics", "in_stock": true}'

# Works with markdown-wrapped JSON
markdown_json = """
Here's your product data:
```json
{
    "name": "Wireless Headphones",
    "price": 199.99,
    "category": "Electronics", 
    "in_stock": false,
    "tags": ["wireless", "bluetooth", "noise-canceling"]
}
\```
"""

# Works with messy responses
messy_response = """
Let me extract that product information for you...

The product details are: {"name": "Smart Watch", "price": 299.99, "category": "Wearables", "in_stock": true}

Is this what you were looking for?
"""

# All of these work automatically
products = [
    extract(clean_json, Product, extract_strategy='json'),
    extract(markdown_json, Product, extract_strategy='json'), 
    extract(messy_response, Product, extract_strategy='json')
]

for product in products:
    print(f"{product.name}: ${product.price} ({'✅' if product.in_stock else '❌'})")

XML Extraction

from dataclasses import dataclass
from typing import List, Optional
from textlasso import extract

@dataclass 
class Address:
    street: str
    city: str
    country: str
    zip_code: Optional[str] = None
    
@dataclass
class ResponseAddress:
    address: Address

xml_data = """
<address>
    <street>123 Main St</street>
    <city>San Francisco</city>
    <country>USA</country>
    <zip_code>94102</zip_code>
</address>
"""

response_address = extract(xml_data, ResponseAddress, extract_strategy='xml')
print(f"Address: {response_address.address.street}, {response_address.address.city}, {response_address.address.country}")
# Address: 123 Main St, San Francisco, USA

2. Complex Nested Data Structures

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum

class Department(Enum):
    ENGINEERING = "engineering"
    MARKETING = "marketing" 
    SALES = "sales"
    HR = "hr"

@dataclass
class Employee:
    id: int
    name: str
    department: Department
    salary: float
    skills: List[str]
    manager_id: Optional[int] = None

@dataclass
class Company:
    name: str
    founded_year: int
    employees: List[Employee]
    headquarters: Address

complex_json = """
{
    "name": "TechCorp Inc",
    "founded_year": 2015,
    "headquarters": {
        "street": "100 Tech Plaza",
        "city": "Austin", 
        "country": "USA",
        "zip_code": "78701"
    },
    "employees": [
        {
            "id": 1,
            "name": "Sarah Chen", 
            "department": "engineering",
            "salary": 120000,
            "skills": ["Python", "React", "AWS"],
            "manager_id": null
        },
        {
            "id": 2,
            "name": "Mike Rodriguez",
            "department": "marketing", 
            "salary": 85000,
            "skills": ["SEO", "Content Strategy", "Analytics"],
            "manager_id": 1
        }
    ]
}
"""

company = extract(complex_json, Company, extract_strategy='json')
print(f"Company: {company.name} ({company.founded_year})")
print(f"HQ: {company.headquarters.city}, {company.headquarters.country}")
print(f"Employees: {len(company.employees)}")

for emp in company.employees:
    print(f"  - {emp.name} ({emp.department.value}): {', '.join(emp.skills)}")

# HQ: Austin, USA
# Employees: 2
#   - Sarah Chen (engineering): Python, React, AWS
#   - Mike Rodriguez (marketing): SEO, Content Strategy, Analytics

3. LLM Response Cleaning

from textlasso.cleaners import clear_llm_res

# Clean various LLM response formats
messy_responses = [
    "\```json\\n{\"key\": \"value\"}\\n\```",
    "\```\\n{\"key\": \"value\"}\\n\```", 
    "Here's the data: {\"key\": \"value\"} hope it helps!",
    "\```xml\\n<root><item>data</item></root>\\n\```"
]

for response in messy_responses:
    clean_json = clear_llm_res(response, extract_strategy='json')
    clean_xml = clear_llm_res(response, extract_strategy='xml')
    print(f"Original: {response}")
    print(f"JSON cleaned: {clean_json}")
    print(f"XML cleaned: {clean_xml}")
    print("---")

4. Advanced Data Extraction with Configuration

from textlasso import extract_from_dict
import logging

# Configure custom logging
logger = logging.getLogger("my_extractor")
logger.setLevel(logging.DEBUG)

@dataclass
class FlexibleData:
    required_field: str
    optional_field: Optional[str] = None
    number_field: int = 0

# Strict mode - raises errors on type mismatches
data_with_extra = {
    "required_field": "test",
    "optional_field": "optional", 
    "number_field": "123",  # String instead of int
    "extra_field": "ignored"  # Extra field
}

# Strict mode (default)
try:
    result_strict = extract_from_dict(
        data_with_extra, 
        FlexibleData,
        strict_mode=True,
        ignore_extra_fields=True,
        logger=logger
    )
    print("Strict mode result:", result_strict)
except Exception as e:
    print("Strict mode error:", e)

# Flexible mode - attempts conversion
result_flexible = extract_from_dict(
    data_with_extra,
    FlexibleData, 
    strict_mode=False,
    ignore_extra_fields=True,
    logger=logger
)
print("Flexible mode result:", result_flexible)

5. Structured Prompt Generation

Basic Prompt Generation

from textlasso import generate_structured_prompt

@dataclass
class UserFeedback:
    rating: int  # 1-5
    comment: str
    category: str
    recommended: bool
    issues: Optional[List[str]] = None

# Generate a structured prompt
prompt = generate_structured_prompt(
    prompt="Analyze this customer review and extract structured feedback",
    schema=UserFeedback,
    strategy="json",
    include_schema_description=True,
    example_count=2
)

print(prompt)
# Output:
# Analyze this customer review and extract structured feedback

# ## OUTPUT FORMAT REQUIREMENTS

# You must respond with a valid JSON object that follows this exact structure:

# ### Schema: UserFeedback
# - **rating**: int (required)
# - **comment**: str (required)
# - **category**: str (required)
# - **recommended**: bool (required)
# - **issues**: Array of str (optional)


# ### JSON Format Rules:
# - Use proper JSON syntax with double quotes for strings
# - Include all required fields
# - Use null for optional fields that are not provided
# - Arrays should contain objects matching the specified structure
# - Numbers should not be quoted
# - Booleans should be true/false (not quoted)


# ## EXAMPLES

# Here are 2 examples of the expected JSON format:

# ### Example 1:
# ```json
# {
#   "rating": 1,
#   "comment": "example_comment_1",
#   "category": "example_category_1",
#   "recommended": true,
#   "issues": [
#     "example_issues_item_1",
#     "example_issues_item_2"
#   ]
# }
# ```

# ### Example 2:
# ```json
# {
#   "rating": 2,
#   "comment": "example_comment_2",
#   "category": "example_category_2",
#   "recommended": false,
#   "issues": [
#     "example_issues_item_1",
#     "example_issues_item_2",
#     "example_issues_item_3"
#   ]
# }
# ```

# Remember: Your response must be valid JSON that matches the specified structure exactly.

Using the Decorator for Function Enhancement

If you have a prompt returning functions, you can use the @structured_output decorator to automatically enhance your prompts with structure requirements.

from dataclasses import dataclass
from typing import Optional, List

from textlasso import structured_output

@dataclass
class NewsArticle:
    title: str
    summary: str
    category: str
    sentiment: str
    key_points: List[str]
    publication_date: Optional[str] = None

# decorate prompt-returning function
@structured_output(schema=NewsArticle, strategy="xml", example_count=1)
def create_article_analysis_prompt(article_text: str) -> str:
    return f"""
    Analyze the following news article and extract key information:
    
    Article: {article_text}
    
    Please provide a comprehensive analysis focusing on the main themes,
    sentiment, and key takeaways.
    """

# The decorator automatically enhances your prompt with structure requirements
article_text = "Breaking: New AI breakthrough announced by researchers..."
enhanced_prompt = create_article_analysis_prompt(article_text)

# This prompt now includes schema definitions, examples, and format requirements
print("Enhanced prompt: ", enhanced_prompt)

# Enhanced prompt:  
#     Analyze the following news article and extract key information:
    
#     Article: Breaking: New AI breakthrough announced by researchers...
    
#     Please provide a comprehensive analysis focusing on the main themes,
#     sentiment, and key takeaways.
    


# ## OUTPUT FORMAT REQUIREMENTS

# You must respond with a valid XML object that follows this exact structure:

# ### Schema: NewsArticle
# - **title**: str (required)
# - **summary**: str (required)
# - **category**: str (required)
# - **sentiment**: str (required)
# - **key_points**: Array of str (required)
# - **publication_date**: str (optional)


# ### XML Format Rules:
# - Use proper XML syntax with opening and closing tags
# - Root element should match the main dataclass name
# - Use snake_case for element names
# - For arrays, repeat the element name for each item
# - Use self-closing tags for null/empty optional fields
# - Include all required fields as elements

6. Real-World Use Cases

Processing Survey Responses

@dataclass
class SurveyResponse:
    respondent_id: str
    age_group: str
    satisfaction_rating: int
    feedback: str
    would_recommend: bool
    improvement_areas: List[str]

# Simulating LLM processing of survey data
llm_survey_output = """
Based on the survey response, here's the extracted data:

\```json
{
    "respondent_id": "RESP_001",
    "age_group": "25-34", 
    "satisfaction_rating": 4,
    "feedback": "Great service overall, but could improve response time",
    "would_recommend": true,
    "improvement_areas": ["response_time", "pricing"]
}
\```

This response indicates positive sentiment with specific improvement suggestions.
"""

survey = extract(llm_survey_output, SurveyResponse, extract_strategy='json')
print(survey)
# SurveyResponse(respondent_id='RESP_001', age_group='25-34', satisfaction_rating=4, feedback='Great service overall, but could improve response time', would_recommend=True, improvement_areas=['response_time', 'pricing'])

E-commerce Product Extraction

@dataclass
class ProductReview:
    product_id: str
    reviewer_name: str
    rating: int
    review_text: str
    verified_purchase: bool
    helpful_votes: int
    review_date: str

@structured_output(schema=ProductReview, strategy="xml")
def create_review_extraction_prompt(raw_review: str) -> str:
    return f"""
    Extract structured information from this product review:
    
    {raw_review}
    
    Pay attention to implicit ratings, sentiment, and any verification indicators.
    """

raw_review = """
★★★★☆ Amazing headphones! by John D. (Verified Purchase) - March 15, 2024
These headphones exceeded my expectations. Great sound quality and comfortable fit.
Battery life could be better but overall very satisfied. Would definitely buy again!
👍 47 people found this helpful
"""

extraction_prompt = create_review_extraction_prompt(raw_review)
# Send this prompt to your LLM, then extract the response:
# review = extract(llm_response, ProductReview, extract_strategy='xml')

🔧 Configuration Options

Extraction Configuration

from textlasso import extract_from_dict
import logging

# Configure extraction behavior
result = extract_from_dict(
    data_dict=your_data,
    target_class=YourDataClass,
    strict_mode=False,          # Allow type conversions
    ignore_extra_fields=True,   # Ignore unknown fields
    logger=custom_logger,       # Custom logging
    log_level=logging.DEBUG     # Detailed logging
)

Prompt Generation Configuration

from textlasso import generate_structured_prompt

prompt = generate_structured_prompt(
    prompt="Your base prompt",
    schema=YourSchema,
    strategy="json",                    # or "xml"
    include_schema_description=True,    # Include field descriptions
    example_count=3                     # Number of examples (1-3)
)

📖 API Reference

Core Functions

`extract(text, target_class, extract_strategy='json')`

Extract structured data from text.

Parameters:

text (str): Raw text containing data to extract
target_class (type): Dataclass to convert data into
extract_strategy (Literal['json', 'xml']): Extraction strategy

Returns: Instance of target_class

`extract_from_dict(data_dict, target_class, **options)`

Convert dictionary to dataclass with advanced options.

`generate_structured_prompt(prompt, schema, strategy, **options)`

Generate enhanced prompts with structure requirements.

Decorators

`@structured_output(schema, strategy='json', **options)`

Enhance prompt functions with structured output requirements.

`@chain_prompts(*prompt_funcs, separator='\n\n---\n\n')`

Chain multiple prompt functions together.

`@prompt_cache(maxsize=128)`

Cache prompt results for better performance.

Utilities

`clear_llm_res(text, extract_strategy)`

Clean LLM responses by removing code blocks and formatting.

🤝 Contributing

We welcome contributions! Here's how to get started:

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests
Run tests: pytest
Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built for the AI/LLM community
Inspired by the need for robust text processing in AI applications
Special thanks to all contributors and users

📞 Support

📧 Email: aziznadirov@yahoo.com
🐛 Issues: GitHub Issues

TextLasso - Wrangle your text data with ease! 🤠

Project details

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
Intended Audience
- Developers
Programming Language

Release history Release notifications | RSS feed

0.1.3

Jul 31, 2025

0.1.2

Jul 26, 2025

0.1.1

Jul 26, 2025

This version

0.1.0

Jul 26, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

textlasso-0.1.0.tar.gz (23.4 kB view details)

Uploaded Jul 26, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

textlasso-0.1.0-py3-none-any.whl (19.5 kB view details)

Uploaded Jul 26, 2025 Python 3

File details

Details for the file textlasso-0.1.0.tar.gz.

File metadata

Download URL: textlasso-0.1.0.tar.gz
Upload date: Jul 26, 2025
Size: 23.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.2

File hashes

Hashes for textlasso-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f9e817a6e9d0dbe683d1625f8ad4ec285ac50dff33195f9c8970512b24413098`
MD5	`7cc559f2f935cb82ddf48142c46bf093`
BLAKE2b-256	`0a6e0568571d96d857cdfb87e6286f4b8258c0af651b0a2826dceff2960ed560`

See more details on using hashes here.

File details

Details for the file textlasso-0.1.0-py3-none-any.whl.

File metadata

Download URL: textlasso-0.1.0-py3-none-any.whl
Upload date: Jul 26, 2025
Size: 19.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.2

File hashes

Hashes for textlasso-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`749f643e639302e39997980e83acc87c12d24778b77aef8bb335e0b839ba45a7`
MD5	`0dc1f7b60fdf77872981450b60a63438`
BLAKE2b-256	`4d7960201ab2a0ac17808dbd891aebf7936002833bd808ef4f4e4c05ed92b3d8`

See more details on using hashes here.

textlasso 0.1.0

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

TextLasso 🤠

✨ Key Features

🚀 Quick Start

Installation

Basic Usage

📚 Comprehensive Examples

1. Basic Text Extraction

JSON Extraction with Fallback Strategies

XML Extraction

2. Complex Nested Data Structures

3. LLM Response Cleaning

4. Advanced Data Extraction with Configuration

5. Structured Prompt Generation

Basic Prompt Generation

Using the Decorator for Function Enhancement

6. Real-World Use Cases

Processing Survey Responses

E-commerce Product Extraction

🔧 Configuration Options

Extraction Configuration

Prompt Generation Configuration

📖 API Reference

Core Functions

extract(text, target_class, extract_strategy='json')

extract_from_dict(data_dict, target_class, **options)

generate_structured_prompt(prompt, schema, strategy, **options)

Decorators

@structured_output(schema, strategy='json', **options)

@chain_prompts(*prompt_funcs, separator='\n\n---\n\n')

@prompt_cache(maxsize=128)

Utilities

clear_llm_res(text, extract_strategy)

🤝 Contributing

📄 License

🙏 Acknowledgments

📞 Support

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`extract(text, target_class, extract_strategy='json')`

`extract_from_dict(data_dict, target_class, **options)`

`generate_structured_prompt(prompt, schema, strategy, **options)`

`@structured_output(schema, strategy='json', **options)`

`@chain_prompts(*prompt_funcs, separator='\n\n---\n\n')`

`@prompt_cache(maxsize=128)`

`clear_llm_res(text, extract_strategy)`