A Stata emulator for Python/pandas

pdexplorer is a Stata emulator for Python/pandas.

Installation

pdexplorer is available on PyPI. Run pip to install:

pip install pdexplorer

Usage

pdexplorer can be run in three modes:

1. Stata-like Emulation

from pdexplorer import *

This import adds Stata-like commands into the Python namespace. For example,

webuse('auto')
reg('mpg price')

2. Pure Stata Emulation

from pdexplorer import do
do() # Launches a Stata emulator that can run normal Stata commands

Now you can run regular Stata commands e.g.,

webuse auto
reg mpg price

do() also supports running the contents of a do-file, e.g.,

do('working.do')

Under the hood, the Stata emulator translates pure Stata commands into their Pythonic equivalents. For example, reg mpg price becomes reg('mpg price').
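As a rough illustration of that translation step, the sketch below turns a one-line Stata command into the text of the equivalent Python call. to_python_call is a hypothetical helper written for this example, not part of pdexplorer, and it assumes the command name is the first whitespace-delimited word.

```python
def to_python_call(stata_line: str) -> str:
    """Return Python source text for a one-line Stata command (illustrative only)."""
    parts = stata_line.split(None, 1)  # command name, then everything else
    if not parts:
        return ""  # blank line: nothing to translate
    command = parts[0]
    rest = parts[1].strip() if len(parts) > 1 else ""
    if not rest:
        return f"{command}()"
    # repr() quotes the argument text so it becomes a Python string literal
    return f"{command}({rest!r})"


print(to_python_call("reg mpg price"))
print(to_python_call("webuse auto"))
```

The real emulator presumably also executes the resulting call; this sketch stops at producing its text.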

3. Inline Stata
For example,

from pdexplorer import do, current
do(inline="""
webuse auto
reg mpg price
""") # Runs the Stata commands in the inline string
print(current.df) # access DataFrame object in Python

The rest of this documentation shows examples using Stata-like emulation, but these commands can all be run using pure Stata emulation as well.

How pdexplorer differs from Stata

  • pdexplorer uses Python libraries under the hood. (The result of a command reflects the output of those libraries and may differ slightly from equivalent Stata output.)
  • There is no support for mata. Under the hood, pdexplorer is just the Python data stack.

Philosophy

Stata is great for its conciseness and readability. But Python/pandas is free and easier to integrate with other applications. For example, you can build a web server in Python, but not in Stata; you can run Python in AWS SageMaker, but not Stata.

pdexplorer enables Stata to be easily integrated into the Python ecosystem.

In contrast to raw Python/pandas, Stata syntax achieves succinctness by:

  • Using spaces and commas rather than parentheses, brackets, curly braces, and quotes (where possible)
  • Specifying a set of concise commands on the "current" dataset rather than cluttering the namespace with multiple datasets
  • Being verbose by default, i.e., displaying output that represents the results of the command
  • Having sensible defaults that cover the majority of use cases and demonstrate common usage
  • Allowing for namespace abbreviations for both commands and variable names
  • Employing two types of column names: variable names are concise and used for programming; variable labels are verbose and used for presentation.
  • In pdexplorer itself, packages are imported lazily, e.g., statsmodels is imported only when a command first uses it. This ensures that from pdexplorer import * runs quickly.
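The lazy-import idea in the last bullet can be sketched with the standard library alone. This shows the general technique, not pdexplorer's actual code: mean_of is a hypothetical command, and the standard-library statistics module stands in for a heavy dependency such as statsmodels.

```python
import importlib
import sys


def mean_of(values):
    """Hypothetical command that loads its dependency on first use."""
    # The import happens inside the function, so importing the module that
    # defines mean_of does not pay the cost of loading the dependency.
    stats = importlib.import_module("statistics")
    return stats.mean(values)


print(mean_of([1, 2, 3, 4]))
print("statistics" in sys.modules)  # the dependency is loaded once mean_of has run
```

Repeated calls stay cheap because Python caches loaded modules in sys.modules.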

Examples

Load Stata dataset and perform exploratory data analysis

webuse('auto')
li() # List the contents of the data

See https://www.stata.com/manuals/dwebuse.pdf

Summarize Data by Subgroups

webuse('auto')
with by('foreign'):
    summarize('mpg weight')

See https://www.stata.com/manuals/rsummarize.pdf

Ordinary Least Squares (OLS) Regression

webuse('auto')
regress('mpg weight foreign')
ereturnlist()

Return Values

In the last example, note the use of ereturnlist(), corresponding to the Stata command ereturn list. A Python object may also be available as the command's return value. For example,

webuse('auto')
results = regress('mpg weight foreign')

Here, results is a RegressionResultsWrapper object from the statsmodels package.

Syntax summary

With few exceptions, the basic Stata language syntax (as documented here) is

[by varlist:] command [subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]

where square brackets distinguish optional qualifiers and options from required ones. In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range denotes an observation range, weight denotes a weighting expression, and options denotes a list of options.
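To make the pieces of this diagram concrete, the sketch below splits a command line into its by prefix, command, varlist, if expression, in range, and options. parse_stata is a hypothetical helper written for this example. It is deliberately simplified: it ignores =exp and weight, and it assumes the first comma starts the options, so expressions containing commas are not handled.

```python
import re


def parse_stata(line: str) -> dict:
    """Split a Stata command line into its main syntax elements (simplified)."""
    parts = {"by": None, "command": None, "varlist": [],
             "if": None, "in": None, "options": None}

    # Everything after the first comma is treated as the option list.
    body, sep, options = line.partition(",")
    if sep:
        parts["options"] = options.strip()

    # Optional "by varlist:" prefix.
    by_match = re.match(r"\s*by\s+([^:]+):", body)
    if by_match:
        parts["by"] = by_match.group(1).split()
        body = body[by_match.end():]

    # Optional "in range" qualifier at the end of the body.
    in_match = re.search(r"\s+in\s+(\S+)\s*$", body)
    if in_match:
        parts["in"] = in_match.group(1)
        body = body[:in_match.start()]

    # Optional "if exp" qualifier; the expression runs to the end of the body.
    if_match = re.search(r"\s+if\s+(.+)$", body)
    if if_match:
        parts["if"] = if_match.group(1).strip()
        body = body[:if_match.start()]

    # What remains is the command followed by its varlist.
    words = body.split()
    if words:
        parts["command"], parts["varlist"] = words[0], words[1:]
    return parts


print(parse_stata("by foreign: summarize mpg weight if price > 5000, detail"))
print(parse_stata("list make mpg in 1/5"))
```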

The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then gave the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option.

In pdexplorer, this gets translated to

with by('varlist'):
    command("[subcommand] [varlist] [=exp] [if exp] [in range] [weight] [, options]", *args, **kwargs)

where *args and **kwargs represent additional arguments that might be available in a pdexplorer command but not in the equivalent Stata command. (This is rarely used.)

Sometimes, Stata commands are two words. In such cases, the pdexplorer command is a concatenation of the two words. For example,

label data "label"

becomes

labeldata("label")
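The concatenation rule can be sketched as a small name-mapping step. The helper below is hypothetical, and its set of two-word commands contains only the two pairs mentioned in this document (label data and ereturn list); a real translator would need the full list.

```python
# Two-word Stata commands named in this document; assumed incomplete.
TWO_WORD_COMMANDS = {("label", "data"), ("ereturn", "list")}


def python_command_name(stata_line: str):
    """Return (python_function_name, remaining_argument_text)."""
    words = stata_line.split()
    if not words:
        return "", ""
    if len(words) >= 2 and (words[0], words[1]) in TWO_WORD_COMMANDS:
        # Join the two words into one function name.
        name = words[0] + words[1]
        rest = stata_line.split(None, 2)[2] if len(words) > 2 else ""
    else:
        name = words[0]
        rest = stata_line.split(None, 1)[1] if len(words) > 1 else ""
    return name, rest.strip()


print(python_command_name('label data "label"'))
print(python_command_name("ereturn list"))
print(python_command_name("regress mpg weight foreign"))
```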

Command Dependencies

pdexplorer command    Package dependency
cf                    ydata-profiling or sweetviz
browse                xlwings
regress               statsmodels

Python-In-Excel Support

pdexplorer can be used with Excel's new Python-in-Excel feature.

Usage:

insert_pdexplorer my_excel_sheet.xlsm

This script inserts (or overwrites) a worksheet called _pdexplorer in my_excel_sheet.xlsm. The worksheet contains the core modules of pdexplorer.
