A Stata emulator for Python/pandas

"...succinctness is power... we take the trouble to develop high-level languages... so that we can say (and more importantly, think) in 10 lines of a high-level language what would require 1000 lines of machine language..." -Paul Graham, Succinctness is Power

pdexplorer

pdexplorer is a Stata emulator for Python/pandas. In contrast to pandas, Stata syntax achieves succinctness by:

  • Using spaces and commas rather than parentheses, brackets, curly braces, and quotes (where possible)
  • Specifying a set of concise commands on the "current" dataset rather than cluttering the namespace with multiple datasets
  • Being verbose by default, i.e., displaying output that represents the results of the command
  • Having sensible defaults that cover the majority of use cases and demonstrate common usage
  • Allowing for namespace abbreviations for both commands and variable names
  • Employing two types of column names: variable names are concise and used for programming, while variable labels are verbose and used for presentation.
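
The abbreviation rule in the list above can be sketched in a few lines of plain Python. This is only an illustration, assuming an abbreviation is accepted when it is either an exact name or a prefix of exactly one name. The function `resolve_abbreviation` and the column names are hypothetical and are not taken from pdexplorer.

```python
def resolve_abbreviation(abbrev, names):
    """Return the single name in `names` that `abbrev` refers to.

    An exact match always wins; otherwise `abbrev` must be a prefix
    of exactly one name.
    """
    if abbrev in names:
        return abbrev
    matches = [name for name in names if name.startswith(abbrev)]
    if not matches:
        raise KeyError(f"{abbrev!r} not found")
    if len(matches) > 1:
        raise KeyError(f"{abbrev!r} is ambiguous: {', '.join(matches)}")
    return matches[0]


# Hypothetical column names, for illustration only.
columns = ["make", "price", "mpg", "weight", "length"]
print(resolve_abbreviation("pr", columns))
print(resolve_abbreviation("mpg", columns))
```

With these columns, `"m"` would be rejected as ambiguous because it is a prefix of both `make` and `mpg`.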

My Story

I used Stata for 7 years for both data exploration and programming. After that, I used Python/pandas for 3 years and found that pandas is just too verbose and "explicit" for rapid data exploration. So I started working on this project on September 3, 2023.

Why not use Stata instead of pandas?

Stata is great, but Python/pandas is free and easier to integrate with other applications. For example, you can build a web server in Python, but not in Stata; you can run Python in AWS SageMaker, but not Stata.

Additionally, even for devout Stata users, there is utility in being able to run Stata commands through a Python stack for comparison purposes.

How pdexplorer fulfills the Zen of Python (compared to pandas)

PASS

  • Beautiful is better than ugly.
  • Simple is better than complex.
  • Flat is better than nested.
  • Readability counts.
  • Although practicality beats purity.
  • There should be one-- and preferably only one --obvious way to do it.
  • Now is better than never.

FAIL

  • Explicit is better than implicit.
  • In the face of ambiguity, refuse the temptation to guess.

How pdexplorer differs from Stata

  • Commands are implemented as Python functions and hence require at least one set of parentheses
  • pdexplorer uses Python libraries under the hood. The result of a command reflects the output of those libraries, even when they differ from Stata.
  • There is no support for Mata. Under the hood, pdexplorer is just the Python data stack.

References

DELETEME

  • DataFrame singleton
  • concise language for data wrangling
  • easy export to Excel
  • smf/patsy syntax best for regressions
  • plots using Stata data table metadata, e.g., variable labels

General form of Stata command:

command(varlist, expression, if, in, weight, options, by)

Syntax summary

With few exceptions, the basic Stata language syntax (as documented in the Stata manual) is

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [, options]

where square brackets distinguish optional qualifiers and options from required ones. In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range denotes an observation range, weight denotes a weighting expression, and options denotes a list of options.

The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then gave the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option.
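
For comparison, the by-group behavior described above can be approximated in plain pandas. This is a minimal sketch that does not use pdexplorer. The dataset and its column names (`foreign`, `price`) are made up, and the mean stands in for whatever command is being repeated.

```python
import pandas as pd

# Illustrative data only; the column names are assumptions.
df = pd.DataFrame(
    {
        "foreign": [0, 0, 1, 1, 1],
        "price": [4000, 6000, 5000, 7000, 9000],
    }
)

# Route 1: let pandas split the data and apply the command per group,
# roughly what Stata's `by foreign:` prefix does.
grouped = df.groupby("foreign")["price"].mean()

# Route 2: form a separate dataset for each group and run the command
# on each one, as in the description above.
separate = {
    int(value): float(df.loc[df["foreign"] == value, "price"].mean())
    for value in sorted(df["foreign"].unique())
}

print(grouped.to_dict())
print(separate)
```

Both routes give the same per-group result. Unlike Stata's `by`, `groupby` does not require the data to be sorted first.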

In pdexplorer, the Stata syntax above gets translated to

with by('varlist'):
    command("[varlist] [=exp] [if exp] [in range] [weight] [, options]", *args, **kwargs)

where *args and **kwargs represent additional arguments that are available in a pdexplorer command but not in the equivalent Stata command.
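
To make the single-string argument concrete, here is a rough sketch of how such a string could be split into the parts named in the syntax diagram. It is not pdexplorer's actual parser: `parse_command` and `_take` are hypothetical names, and the sketch assumes that options follow the first comma, that a weight is a trailing bracketed clause, and that `if` and `in` never appear inside string literals.

```python
import re


def _take(pattern, text):
    """Cut the trailing piece matched by `pattern`; return (rest, piece)."""
    match = re.search(pattern, text)
    if match is None:
        return text, None
    return text[: match.start()], match.group(1).strip()


def parse_command(text):
    """Split '[varlist] [=exp] [if exp] [in range] [weight] [, options]'."""
    # Simplification: everything after the first comma is treated as options,
    # so commas inside expressions are not supported.
    text, _, options = text.partition(",")
    # Peel the qualifiers off from right to left.
    text, weight = _take(r"\[([^\]]*)\]\s*$", text)
    text, in_range = _take(r"(?:^|\s)in\s+(.*)$", text)
    text, if_exp = _take(r"(?:^|\s)if\s+(.*)$", text)
    text, exp = _take(r"=(.*)$", text)
    return {
        "varlist": text.split(),
        "exp": exp,
        "if": if_exp,
        "in": in_range,
        "weight": weight,
        "options": options.strip() or None,
    }


print(parse_command("price if foreign == 1 in 1/10 [aweight=weight], detail"))
print(parse_command("newvar = price * 2 if mpg > 20"))
```

Peeling from the right means the `=exp` step only runs after any `if` clause has been removed, so a `==` inside the condition is not mistaken for an assignment.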

Sometimes, Stata commands are two words. In such cases, the pdexplorer command is a concatenation of the two words. For example,

label data "label"

becomes

labeldata("label")
