Skip to main content

SQLDataModel is a lightweight dataframe library designed for efficient data extraction, transformation, and loading (ETL) across various sources and destinations, providing an efficient alternative to common setups like pandas, numpy, and sqlalchemy while also providing additional features without the overhead of external dependencies.

Project description

SQLDataModel Logo

SQLDataModel: fast & lightweight source agnostic data model

SQLDataModel Home PyPI License PyPI Version Docs Status Downloads

SQLDataModel is a fast & lightweight data model with no additional dependencies for quickly fetching and storing your tabular data to and from the most commonly used databases & data sources in a couple lines of code. It's as easy as ETL:

from SQLDataModel import SQLDataModel

# Extract your data:
sdm = SQLDataModel.from_sql("your_table", cx_Oracle.Connection)

# Transform it:
for row in sdm.iter_rows():
    print(row)

# Load it wherever you need to!
sdm.to_sql("new_table", psycopg2.Connection)

Made for those times when you just want to use raw SQL on your dataframe, or need to move data around but the full Pandas, Numpy, SQLAlchemy installation is just overkill. SQLDataModel includes all the most commonly used features, including additional ones like pretty printing your table, at 1/1000 the size, 0.03MB vs 30MB


Installation

Use the package manager pip to install SQLDataModel.

$ pip install SQLDataModel

Then import the main class SQLDataModel into your local project, see usage below or go straight to the project docs.


Quick Example

A SQLDataModel can be created from any number of sources, as a quick demo lets create one using a Wikipedia page:

>>> from SQLDataModel import SQLDataModel
>>> 
>>> url = 'https://en.wikipedia.org/wiki/1998_FIFA_World_Cup'
>>> 
>>> sdm = SQLDataModel.from_html(url, table_identifier=94)   
>>> 
>>> sdm[:4, ['R', 'Team', 'W', 'Pts.']]
┌──────┬─────────────┬──────┬──────┐
 R     Team         W     Pts. 
├──────┼─────────────┼──────┼──────┤
 1     France       6     19   
 2     Brazil       4     13   
 3     Croatia      5     15   
 4     Netherlands  3     12   
└──────┴─────────────┴──────┴──────┘
[4 rows x 4 columns]

SQLDataModel provides a quick and easy way to import, view, transform and export your data in multiple formats and sources, providing the full power of executing raw SQL against your model in the process.


Usage

SQLDataModel is just that, A data model leveraging the mighty power of in-memory sqlite3 to perform fast and light-weight transformations allowing you to easily move and manipulate data from source to destination regardless of where, or in what format, the data is:

sdm_graph

If you need to extract tabular data from one of these formats, transform or simply just move it into another format, then SQLDataModel can make your life easier. Here's a few examples how:

From Website to Markdown

Say we find some cool data online, perhaps some planetary data, and we want to go and get it for our own purposes:

from SQLDataModel import SQLDataModel

# Target url with some planetary data we can extract
url = 'https://antet.github.io/sdm-planets'

# Create a model from the first table element found on the web page
sdm = SQLDataModel.from_html(url, table_identifier=1)

# Add some color to it
sdm.set_display_color('#A6D7E8')

# Lets see!
print(sdm)

SQLDataModel's default output is pretty printed and formatted to fit within the current terminal's width and height, since we added some color here's the result:

sdm_planets

Now that we have our data as a SQLDataModel, we can do any number of things with it using the provided methods or using your own SQL and returning the query results as a new model! Lets find out, all extreme temperatures and pressure aside, if it would be easier to fly given the planet's gravity relative to Earth:

# Extract: Slice by rows and columns
sdm = sdm[:,['Planet','Gravity']] # or sdm[:,:2]

# Transform: Create a new column based on existing values
sdm['Flyable?'] = sdm['Gravity'].apply(lambda x: str(x < 1.0))

# Filter: Keep only the 'Flyable' planets
sdm = sdm[sdm['Flyable?'] == 'True']

# Load: Let's turn our data into markdown!
sdm.to_markdown('Planet-Flying.MD')

Here's the raw markdown of the new file we created Planet-Flying.MD:

| Planet  | Gravity | Flyable? |
|:--------|--------:|:---------|
| Mercury |    0.38 | True     |
| Venus   |    0.91 | True     |
| Mars    |    0.38 | True     |
| Saturn  |    0.92 | True     |
| Uranus  |    0.89 | True     |

Notice that the output from the to_markdown() method also aligned our columns based on the data type, and padded the values so that even the raw markdown is pretty printed! While we used the from_html() and to_markdown() methods for this demo, we could just as easily have created the same table in any number of formats. Once we have our data as a SQLDataModel, it would've been just as easy to have the data in LaTeX:

# Let's convert it to a LaTeX table instead
sdm_latex = sdm.to_latex()

# Here's the output
print(sdm_latex)

As with our markdown output, the columns are correctly aligned and pretty printed:

\begin{tabular}{|l|r|l|}
\hline
    Planet  & Gravity & Flyable  \\
\hline
    Mercury &    0.38 & True     \\
    Venus   &    0.91 & True     \\
    Mars    &    0.38 & True     \\
    Saturn  &    0.92 & True     \\
    Uranus  &    0.89 & True     \\
\hline
\end{tabular}

In fact, using the to_html() method is how the table from the beginning of this demo was created! Click here for an example of how the styling and formatting applied to SQLDataModel gets exported along with it when using to_html().

SQL on your Pandas DataFrame

I can't tell you how many times I've found myself searching for information on how to do this or that operation in pandas and wished I could just quickly do it in SQL instead. Enter SQLDataModel:

import pandas as pd
from SQLDataModel import SQLDataModel

# Titanic dataset
df = pd.read_csv('titanic.csv')

# Transformations you don't want to do in pandas if you already know SQL
sql_query = """
select 
    Pclass, Sex, count(*) as 'Survived' 
from sdm where 
    Survived = 1 
group by 
    Pclass, Sex 
order by 
    count(*) desc
"""

# Extract: Create SQLDataModel from the df
sdm = SQLDataModel.from_pandas(df)

# Transform: Do them in SQLDataModel
sdm = sdm.execute_fetch(sql_query)

# Load: Then hand it back to pandas!
df = sdm.to_pandas()

Here we're using SQLDataModel to avoid performing the complex pandas operations required for aggregation if we already know SQL. Here's the output of the sdm we used to do the operations in and the df:

SQLDataModel:                         pandas:
┌───┬────────┬────────┬──────────┐    
│   │ Pclass │ Sex    │ Survived │       Pclass     Sex  Survived
├───┼────────┼────────┼──────────┤    0       1  female        91
│ 0 │      1 │ female │       91 │    1       3  female        72
│ 1 │      3 │ female │       72 │    2       2  female        70
│ 2 │      2 │ female │       70 │    3       3    male        47
│ 3 │      3 │ male   │       47 │    4       1    male        45
│ 4 │      1 │ male   │       45 │    5       2    male        17
│ 5 │      2 │ male   │       17 │    
└───┴────────┴────────┴──────────┘    
[6 rows x 3 columns]               

In this example our source and destination formats were both pd.DataFrame objects, however pandas is not required to use, nor is it a dependency of, SQLDataModel. It is only required if you're using the from_pandas() or to_pandas() methods.

From SQL to HTML table

Say we have a table located on a remote PostgreSQL server that we want to put on our website, normally we could pip install SQLAlchemy, psycopg2, pandas including whatever dependencies they come with, like numpy. This time all we need is psycopg2 for the PostgreSQL driver:

import psycopg2
import datetime
from SQLDataModel import SQLDataModel

# Setup the connection
psql_conn = psycopg2.connect(...)

# Grab a table with missions to Saturn
sdm = SQLDataModel.from_sql('saturn_missions', psql_conn) # or SQL statement

# Filter to only 'Future' missions
sdm = sdm[sdm['Status'] == 'Future']

# Create new column with today's date so it ages better!
sdm['Updated'] = datetime.date.today()

# Send our table to a new html file, this time without the index
saturn_html = sdm.to_html('Future-Saturn.html', include_index=False)

Here's a snippet from the Future-Saturn.html file generated by the to_html() method:

<!-- Metadata Removed -->
<table>
    <tr>
        <th>Mission</th>
        <th>Status</th>
        <th>Launch</th>
        <th>Destination</th>
        <th>Updated</th>
    </tr>
    <tr>
        <td>New Frontiers 4</td>
        <td>Future</td>
        <td>2028</td>
        <td>Surface of Titan</td>
        <td>2024-03-21</td>
    </tr>
    <tr>
        <td>Enceladus Orbilander</td>
        <td>Future</td>
        <td>2038</td>
        <td>Surface of Enceladus</td>
        <td>2024-03-21</td>
    </tr>
</table>
<!-- SQLDataModel css styles removed -->

The Workflow

For all the SQLDataModel examples, the same basic workflow and pattern is present:

from SQLDataModel import SQLDataModel

# Extract: Create the model from a source
sdm = SQLDataModel.from_data(...)

# Transform: Manipulate the data if needed
sdm['New Column'] = sdm['Column A'].apply(func)

# Load: Move it to a destination format
sdm.to_text('table.txt') # to_csv, to_json, to_latex, to_markdown, to_html, ..., etc.

Regardless of where the data originated or where it ends up, SQLDataModel's best use-case is to be the light-weight intermediary that's agnostic to the original source, or the final destination, of the data.


Data Formats

SQLDataModel seamlessly interacts with a wide range of data formats providing a versatile platform for data extraction, conversion, and writing. Supported formats include:

  • CSV: Extract from and write to comma separated value, .csv, files.
  • HTML: Extract from web and write to and from .html files including formatted string literals.
  • JSON: Extract from and write to .json files, JSON-like objects, or JSON formatted sring literals.
  • LaTeX: Extract from and write to .tex files, LaTeX formatted string literals.
  • Markdown: Extract from and write to .MD files, Markdown formatted string literals.
  • Numpy: Convert to and from numpy.ndarray objects, numpy required.
  • Pandas: Convert to and from pandas.DataFrame objects, pandas required.
  • Parquet: Extract from and write to .parquet files, pyarrow required.
  • Pickle: Extract from and write to .pkl files, package uses .sdm extension when pickling for SQLDataModel metadata.
  • SQL: Extract from and write to the following popular SQL databases:
    • SQLite: Using the built-in sqlite3 module.
    • PostgreSQL: Using the psycopg2 package.
    • SQL Server: Using the pyodbc package.
    • Oracle: Using the cx_Oracle package.
    • Teradata: Using the teradatasql package.
  • Text: Write to and from .txt files including other SQLDataModel string representations.
  • TSV or delimited: Write to and from files delimited by:
    • \t: Tab separated values or .tsv files.
    • \s: Single space or whitespace separated values.
    • ;: Semicolon separated values.
    • |: Pipe separated values.
    • :: Colon separated values.
    • ,: Comma separated values or .csv files.
  • Python objects:
    • dictionaries: Convert to and from collections of python dict objects.
    • lists: Convert to and from collections of python list objects.
    • tuples: Convert to and from collections of python tuple objects.
    • namedtuples: Convert to and from collections of namedtuples objects.

Note that SQLDataModel does not install any additional dependencies by default. This is done to keep the package as light-weight and small as possible. This means that to use package dependent methods like to_parquet() or the inverse from_parquet() the pyarrow package is required. The same goes for other package dependent methods like those converting to and from pandas and numpy objects.


Documentation

SQLDataModel's documentation can be found at https://sqldatamodel.readthedocs.io containing detailed descriptions for the key modules in the package. These are listed below as links to their respective sections in the docs:

However, to skip over the less relevant modules and jump straight to the meat of the package, the SQLDataModel module, click here.


Motivation

While there are packages/dependencies out there that can accomplish some of the same tasks as SQLDataModel, they're either missing key features or end up being overkill for common tasks like grabbing and converting tables from source A to destination B, or they don't quite handle the full process and require additional dependencies to make it all work. When you find yourself doing the same thing over and over again, eventually you sit down and write a package to do it for you.


Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.


License

MIT

Thank you!
Ante Tonkovic-Capin

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SQLDataModel-0.2.1.tar.gz (117.3 kB view hashes)

Uploaded Source

Built Distribution

SQLDataModel-0.2.1-py3-none-any.whl (187.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page