A Stata emulator for Python/pandas
pdexplorer is a Stata emulator for Python/pandas.
Installation
pdexplorer is available on PyPI. Run pip to install:
pip install pdexplorer
Usage
pdexplorer can be run in three modes:
1. Stata-like Emulation
from pdexplorer import *
This import adds Stata-like commands into the Python namespace. For example,
webuse('auto')
reg('mpg price')
2. Pure Stata Emulation
from pdexplorer import do
do() # Launches a Stata emulator that can run normal Stata commands
Now you can run regular Stata commands e.g.,
webuse auto
reg mpg price
do() also supports running the contents of a do-file e.g.,
do('working.do')
Under the hood, the Stata emulator translates pure Stata commands into their Pythonic equivalents. For example, reg mpg price becomes reg('mpg price').
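As a rough illustration of that translation step, the sketch below turns a simple Stata command line into the text of the matching Python call. The helper translate_stata_line is hypothetical and is not pdexplorer's actual parser; it only handles a one-word command followed by its arguments.

```python
def translate_stata_line(line: str) -> str:
    """Return the Python call text for a simple one-word Stata command.

    Hypothetical helper for illustration only; not part of pdexplorer.
    """
    stripped = line.strip()
    if not stripped:
        return ""
    # The first token is the command; everything after it is passed as one string.
    command, _, rest = stripped.partition(" ")
    rest = rest.strip()
    if not rest:
        return f"{command}()"
    return f"{command}({rest!r})"


print(translate_stata_line("webuse auto"))
print(translate_stata_line("reg mpg price"))
```

The whole argument string is passed through unchanged, which matches the Stata-like emulation mode where each command takes a single Stata-style string.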
3. Inline Stata
For example,
from pdexplorer import do, current
do(inline="""
webuse auto
reg mpg price
""") # Runs the Stata commands given in the inline string
print(current.df) # access DataFrame object in Python
The rest of this documentation shows examples using Stata-like emulation, but these commands can all be run using pure Stata emulation as well.
How pdexplorer differs from Stata
- pdexplorer uses Python libraries under the hood. (The result of a command reflects the output of those libraries and may differ slightly from equivalent Stata output.)
- There is no support for mata. Under the hood, pdexplorer is just the Python data stack.
- The API for producing charts is based on Altair, not Stata.
- pdexplorer adds commands for machine learning (using sklearn, PyTorch, and huggingface)
Philosophy
Stata is great for its conciseness and readability. But Python/pandas is free and easier to integrate with other applications. For example, you can build a web server in Python, but not Stata; you can run Python in AWS SageMaker, but not Stata. pdexplorer enables Stata to be easily integrated into the Python ecosystem.
In contrast to raw Python/pandas, Stata syntax achieves succinctness by:
- Using spaces and commas rather than parentheses, brackets, curly braces, and quotes (where possible)
- Specifying a set of concise commands on the "current" dataset rather than cluttering the namespace with multiple datasets
- Being verbose by default i.e., displaying output that represents the results of the command
- Having sensible defaults that cover the majority of use cases and demonstrate common usage
- Allowing for namespace abbreviations for both commands and variable names
- Employing two types of column names: Variable names are concise and used for programming. Variable labels are verbose and used for presentation.
- Packages are imported lazily e.g., import torch is executed only when it's first used by a command. This ensures that from pdexplorer import * runs quickly.
Examples
Load Stata dataset and perform exploratory data analysis
webuse('auto')
li() # List the contents of the data
See https://www.stata.com/manuals/dwebuse.pdf
Summarize Data by Subgroups
webuse('auto')
with by('foreign'):
    summarize('mpg weight')
See https://www.stata.com/manuals/rsummarize.pdf
Ordinary Least Squares (OLS) Regression
webuse('auto')
regress('mpg weight foreign')
ereturnlist()
Return Values
In the last example, note the use of ereturnlist(), corresponding to the Stata command ereturn list. A Python object may also be available as the command's return value. For example,
webuse('auto')
results = regress('mpg weight foreign')
Here, results is a RegressionResultsWrapper object from the statsmodels package.
Similarly,
results = regress('mpg weight foreign', library='scikit-learn')
Now, results is a LinearRegression object from the scikit-learn package.
Finally,
results = regress('mpg weight foreign', library='pytorch')
Here, results is a torch.nn.Linear object from the PyTorch package.
LLM Fine Tuning Examples
Sentiment Analysis using HuggingFace
import pandas as pd # explicit import rather than relying on the star import to expose pd
from pdexplorer import *
from pdexplorer.tests.fixtures import yelp_reviews
df = pd.DataFrame.from_records(yelp_reviews)
use(df) # load examples into pdexplorer
fthuggingface("stars text", task="sentiment-analysis", model_name="distilbert-base-uncased") # slow
askhuggingface("I absolutely loved Burgerosity!", task="sentiment-analysis")
Next Word Prediction using HuggingFace
import pandas as pd # explicit import rather than relying on the star import to expose pd
from pdexplorer import *
from pdexplorer.tests.fixtures import eli5
df = pd.DataFrame.from_records(eli5)
use(df) # load examples into pdexplorer
fthuggingface("text", task="text-generation", model_name="distilgpt2") # slow
askhuggingface("A poem about Mickey Mouse in iambic pentameter:\n", task="text-generation")
Next Word Prediction using OpenAI (gpt-3.5-turbo)
from pdexplorer import *
from pdexplorer.tests.test_ftgpt import df
use(df)
ftgpt("assistant user system") # slow; requires OPENAI_API_KEY environment variable
askgpt("A poem about Mickey Mouse in iambic pentameter:\n")
Syntax summary
With few exceptions, the basic Stata language syntax (as documented here) is
[by varlist:] command [subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]
where square brackets distinguish optional qualifiers and options from required ones. In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range denotes an observation range, weight denotes a weighting expression, and options denotes a list of options.
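To make the diagram concrete, here is a minimal sketch that splits one Stata command line into the parts named above. parse_stata is a hypothetical helper, not pdexplorer's parser. It assumes a single-line command, ignores the =exp and weight parts, and treats the first comma as the start of the options, so expressions that contain commas or the bare words if and in are out of scope.

```python
import re

# Optional "by varlist:" prefix, then the command name, then everything else.
_COMMAND = re.compile(
    r"\s*(?:by\s+(?P<by>[^:]+):\s*)?(?P<command>\w+)\s*(?P<rest>.*)"
)
# varlist, then an optional "if exp", then an optional "in range".
_QUALIFIERS = re.compile(
    r"^(?P<varlist>.*?)"
    r"(?:\s*\bif\s+(?P<if_exp>.*?))?"
    r"(?:\s*\bin\s+(?P<in_range>\S+))?\s*$"
)


def parse_stata(line):
    """Split a Stata command line into a dict of its syntax elements."""
    body, _, options = line.partition(",")
    head = _COMMAND.match(body)
    if head is None:
        raise ValueError(f"no command found in {line!r}")
    tail = _QUALIFIERS.match(head["rest"].strip())
    return {
        "by": head["by"].strip() if head["by"] else None,
        "command": head["command"],
        "varlist": tail["varlist"].strip(),
        "if": tail["if_exp"],
        "in": tail["in_range"],
        "options": options.strip(),
    }


print(parse_stata("by foreign: summarize mpg weight if price > 5000 in 1/20, detail"))
```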
The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then gave the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option.
In pdexplorer, this gets translated to
with by('varlist'):
    command("[subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]", *args, **kwargs)
where *args and **kwargs represent additional arguments that are available in a pdexplorer command but not in the equivalent Stata command.
Sometimes, Stata commands are two words. In such cases, the pdexplorer command is a concatenation of the two words. For example,
label data "label"
becomes
labeldata("label")
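The concatenation rule can be sketched as a lookup against a set of known two-word commands. Both the TWO_WORD_COMMANDS set and the command_name helper are hypothetical and only illustrate the naming convention; the set is not a list of what pdexplorer supports.

```python
# Hypothetical set of two-word Stata commands, for illustration only.
TWO_WORD_COMMANDS = {("label", "data"), ("label", "variable")}


def command_name(stata_line):
    """Return (python_command_name, remaining_arguments) for a Stata line."""
    tokens = stata_line.split(maxsplit=2)
    if len(tokens) >= 2 and (tokens[0], tokens[1]) in TWO_WORD_COMMANDS:
        rest = tokens[2] if len(tokens) == 3 else ""
        return tokens[0] + tokens[1], rest
    # One-word command: the first token is the name.
    name, _, rest = stata_line.strip().partition(" ")
    return name, rest.strip()


print(command_name('label data "label"'))
print(command_name("reg mpg price"))
```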
Module Dependencies
File location | Description | Dependencies |
---|---|---|
/*.py | commands that are native to Stata related to data wrangling or statistics | pandas |
/_altair_mapper.py | commands that are native to Altair for charting | altair |
shortcuts/*.py | shortcut commands related to data wrangling, statistics, or charting | all of the above |
finance/*.py | commands that are specific to financial applications | all of the above |
ml/*.py | commands that use machine learning techniques (and are outside the scope of Stata) | scikit-learn |
nn/*.py | commands that use neural networks (primarily built using PyTorch) | PyTorch |
data/*.py | python scripts that collect data from various sources | Data suppliers |
experimental/*.py | commands that are currently under development and not yet stable | N/A |
Command Dependencies
pdexplorer command | package dependency |
---|---|
cf | ydata-profiling or sweetviz |
browse | dtale |
regress | statsmodels or scikit-learn or PyTorch |
Charts
pdexplorer departs somewhat from Stata in producing charts. Rather than emulating Stata's chart syntax, pdexplorer uses Altair with some syntactic sugar.
Take the example "Simple Scatter Plot with Tooltips":
import webbrowser
import altair as alt
from vega_datasets import data
source = data.cars()
chart = alt.Chart(source).mark_circle(size=60).encode(
    x="Horsepower",
    y="Miles_per_Gallon",
    color='Origin',
    tooltip=["Name", "Origin", "Horsepower", "Miles_per_Gallon"],
)
chart.save('mychart.html')
webbrowser.open('mychart.html')
In pdexplorer, this becomes:
from pdexplorer import *
webuse("cars", "vega")
circlechart("miles_per_gallon horsepower, \
color(origin) tooltip(name origin horsepower miles_per_gallon)",
size=60,
)
In other words, pdexplorer supports a varlist parameter for y/x encodings. Additional encodings are specified via the Stata options syntax.
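The mapping from a chart string to encodings can be sketched as follows. parse_chart_spec is a hypothetical helper that assumes at least two variables, with the last one taken as x and the earlier ones as y, and options written as name(values); it is not how pdexplorer parses the string internally.

```python
import re


def parse_chart_spec(spec):
    """Split 'yvars xvar, option(values) ...' into a dict of encodings."""
    varlist, _, options = spec.partition(",")
    names = varlist.split()
    if len(names) < 2:
        raise ValueError("this sketch expects at least one y variable and one x variable")
    encodings = {"y": names[:-1], "x": names[-1]}
    # Each option looks like name(value1 value2 ...).
    for key, values in re.findall(r"(\w+)\(([^)]*)\)", options):
        encodings[key] = values.split()
    return encodings


spec = ("miles_per_gallon horsepower, "
        "color(origin) tooltip(name origin horsepower miles_per_gallon)")
print(parse_chart_spec(spec))
```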
In the above example, circlechart automatically displays the chart in a web browser. However, sometimes it's necessary to add features to the alt.Chart Altair object. To use the alt.Chart object, we can use circlechart_ instead of circlechart. For example,
import webbrowser
from pdexplorer import *
webuse("cars", "vega")
altair_chart_object = circlechart_(
"miles_per_gallon horsepower, color(origin) \
tooltip(name origin horsepower miles_per_gallon)",
size=60,
)
# Altair's configure_* methods return a new chart, so keep the result
altair_chart_object = altair_chart_object.configure_legend(
    strokeColor='gray',
    fillColor='#EEEEEE',
    padding=10,
    cornerRadius=10,
    orient='top-right'
) # See https://altair-viz.github.io/user_guide/configuration.html#legend-configuration
altair_chart_object.save('mychart.html')
webbrowser.open('mychart.html')
Instead of saving the chart explicitly, we can also write
altair_chart_object()
The () at the end opens the chart in a web browser. This call syntax is not available in alt.Chart itself; pdexplorer monkey-patches it into the class for convenience.
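The general pattern of adding a call method to an existing class looks like the sketch below. The Chart class is a small stand-in written for this example, not alt.Chart, and the _show function with its open_browser parameter is an assumption about how such a patch could be written rather than pdexplorer's actual code.

```python
import tempfile
import webbrowser
from pathlib import Path


class Chart:
    """Stand-in for a chart class that can only save itself to HTML."""

    def __init__(self, title):
        self.title = title

    def save(self, path):
        Path(path).write_text(f"<html><body>{self.title}</body></html>")


def _show(self, open_browser=True):
    """Save the chart to a temporary HTML file and optionally open it."""
    path = Path(tempfile.mkdtemp()) / "chart.html"
    self.save(path)
    if open_browser:
        webbrowser.open(path.as_uri())
    return path


# Monkey patch: every Chart instance becomes callable.
Chart.__call__ = _show

saved_path = Chart("demo")(open_browser=False)
print(saved_path)
```

Assigning to the class rather than to an instance matters here, because Python looks up __call__ on the type when an object is called.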
Similarly, the following two statements are equivalent:
circlechart_("miles_per_gallon horsepower")() # the object-oriented style
circlechart("miles_per_gallon horsepower") # the imperative style
In this example, circlechart_ itself simply returns an alt.Chart object and () indicates that the chart should be displayed in a web browser.
Since we can access the alt.Chart object, we can also specify encodings explicitly using Altair's encode method e.g.,
from pdexplorer import *
webuse("cars", "vega") # note that names are forced to be lower case by default
circlechart_(size=60).encode(
    x="horsepower",
    y="miles_per_gallon",
    color="origin",
    tooltip=["name", "origin", "horsepower", "miles_per_gallon"],
)() # () indicates that the chart is complete and should be opened in a web browser
Since pdexplorer charts are just alt.Chart objects, Layered and Multi-View Charts are also supported e.g.,
from pdexplorer import *
webuse("cars", "vega")
chartA = circlechart_("miles_per_gallon weight_in_lbs")
chartB = circlechart_("horsepower weight_in_lbs")
(chartA & chartB)() # Vertically concatenate chartA and chartB
Finally, pdexplorer also offers syntactic sugar for charting multiple x/y variables. More specifically, the previous block can be written as
from pdexplorer import *
webuse("cars", "vega")
circlechart("miles_per_gallon horsepower weight_in_lbs", stacked=True)
Note that Stata's varlist interpretation is used here by default i.e., var1 var2 var3 is assumed to represent yvar1 yvar2 xvar. We can change this interpretation with the optional argument yX.
from pdexplorer import *
webuse("cars", "vega")
circlechart("miles_per_gallon horsepower weight_in_lbs", yX=True, stacked=True)
Now var1 var2 var3 is assumed to represent yvar xvar1 xvar2 as it would be for the regress command.
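The two interpretations of a varlist can be written down in a few lines. split_varlist is a hypothetical helper that mirrors the yX argument described above; it is shown only to make the rule explicit.

```python
def split_varlist(varlist, yX=False):
    """Return (y_variables, x_variables) for a space-separated varlist.

    yX=False: all but the last name are y variables (yvar1 yvar2 xvar).
    yX=True:  the first name is y and the rest are x variables (yvar xvar1 xvar2).
    """
    names = varlist.split()
    if yX:
        return names[:1], names[1:]
    return names[:-1], names[-1:]


print(split_varlist("miles_per_gallon horsepower weight_in_lbs"))
print(split_varlist("miles_per_gallon horsepower weight_in_lbs", yX=True))
```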
The Stata default is to layer all variables onto a single chart. The stacked=True option allows the graphs to be shown on separate grids. If stacked=False, the charts are all shown on the same grid i.e.,
from pdexplorer import *
webuse("cars", "vega")
circlechart("miles_per_gallon horsepower weight_in_lbs") # stacked=False is the default option
Note that pdexplorer uses Altair's transform_fold method under the hood. For further customization, the Altair methods can be used explicitly e.g.,
import altair as alt
from pdexplorer import *
webuse("cars", "vega")
circlechart_().transform_fold(
["Miles_per_Gallon", "Horsepower"] # Note that circlechart_ variable labels are accessible
).encode(
y="value:Q", x=alt.X("weight_in_lbs", title="Weight_in_lbs"), color="key:N"
)()
alt.Chart.transform_fold is Altair's version of pandas.melt. So another option is to first reshape the data using pandas and then use Altair for charting.
import altair as alt
from pdexplorer import *
webuse("cars", "vega")
melt("miles_per_gallon horsepower, keep(weight_in_lbs)")
alt.Chart(current.df_labeled).mark_circle().encode(
y="value:Q", x='Weight_in_lbs', color="variable:N"
)()
Note that the Altair documentation suggests the latter approach in most cases where a data transformation is required.
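For readers unfamiliar with the wide-to-long reshape that both transform_fold and pandas.melt perform, the sketch below does the same thing on plain dictionaries. The melt_rows function is written for this example and is unrelated to pdexplorer's melt command; it uses the column names variable and value, which are the pandas.melt defaults. The sample rows are made-up numbers.

```python
def melt_rows(rows, value_vars, id_vars):
    """Reshape wide rows into long rows with 'variable' and 'value' columns."""
    long_rows = []
    # Emit all rows for the first value variable, then the next, as pandas.melt does.
    for name in value_vars:
        for row in rows:
            record = {key: row[key] for key in id_vars}
            record["variable"] = name
            record["value"] = row[name]
            long_rows.append(record)
    return long_rows


# Made-up sample data for illustration.
wide = [
    {"weight_in_lbs": 3000, "miles_per_gallon": 20.0, "horsepower": 130},
    {"weight_in_lbs": 2000, "miles_per_gallon": 30.0, "horsepower": 90},
]
for record in melt_rows(wide, ["miles_per_gallon", "horsepower"], ["weight_in_lbs"]):
    print(record)
```

Each wide row contributes one long row per value variable, which is why a single color="variable:N" encoding can then distinguish the series.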
Abbreviations
As mentioned earlier, Stata supports name abbreviations for both variable names and command names. In pdexplorer, all the following regression statements are equivalent:
from pdexplorer import *
webuse("auto")
reg('price mpg weight')
regr('price mpg weight')
regre('price mpg weight')
regres('price mpg weight')
regress('price mpg weight')
reg('pr mpg wei')
Similarly, for charting,
from pdexplorer import *
webuse("cars", "vega")
circle("miles_per_gallon horsepower")
circlec("miles_per_gallon horsepower")
circlech("miles_per_gallon horsepower")
circlecha("miles_per_gallon horsepower")
circlechar("miles_per_gallon horsepower")
circlechart("miles_per_gallon horsepower")
circle("miles horse")
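Abbreviations like these are typically resolved by unique-prefix matching. The sketch below shows one way to do that for variable names. expand_abbreviation is a hypothetical helper, and raising an error on an ambiguous prefix is an assumption of this example rather than documented pdexplorer behaviour.

```python
def expand_abbreviation(abbreviation, names):
    """Return the single name that the abbreviation is a prefix of."""
    if abbreviation in names:
        return abbreviation  # an exact match always wins
    matches = [name for name in names if name.startswith(abbreviation)]
    if not matches:
        raise ValueError(f"{abbreviation!r} matches no name")
    if len(matches) > 1:
        raise ValueError(f"{abbreviation!r} is ambiguous: {matches}")
    return matches[0]


columns = ["price", "mpg", "weight", "foreign"]
print([expand_abbreviation(a, columns) for a in "pr mpg wei".split()])
```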