survey_stats: a simple and powerful package for data processing and statistics

survey_stats Package

survey_stats is a simple, light, and useful package for data processing and statistics in empirical studies. It tries to follow research logic rather than data logic. You can also find it on GitHub.

User Guide

This package is under development, but it already covers the most important tools needed for research: weighting non-random samples, preparing pivot tables for a sample, defining new variables with specific formulas or rules, filtering data, and creating dummy variables. It also reads and writes files such as csv, text, excel, access, ..., and can save and load data with the pickle package, which is much faster than reading and writing those files. In the latest version, large files can be read and written without loading them entirely into RAM.

In addition to the tools developed specifically for this package, modules from two well-known packages, 'scikit-learn' and 'statsmodels', are being made available with the logic of this package. So far, the tools exclusive to this package are ols and tree-based regression. Logistic and multinomial logistic regressions are also available through the two packages above.

Install

pip install survey-stats

Data Structure

from survey_stats.data_process import Data_Types, Data
values = {
    'name': {1: 'Cyrus', 2: 'Mandana', 3: 'Atossa'},
    'age': {1: 32, 2: 65, 3: 40},
    'sex': {1: 'male', 2: 'female', 3: 'female'}
}
data = Data(Data_Types.cross, values)
# or
data = Data('cross', values)
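The `values` argument above is a plain nested dict, keyed first by variable name and then by index. The sketch below uses only built-in Python and does not call survey_stats; it just shows how that layout maps to rows. The helper `row` is hypothetical and is not part of the package.

```python
# Layout taken from the example above: {variable: {index: value}}.
values = {
    'name': {1: 'Cyrus', 2: 'Mandana', 3: 'Atossa'},
    'age': {1: 32, 2: 65, 3: 40},
    'sex': {1: 'male', 2: 'female', 3: 'female'},
}

def row(values, i):
    """Collect the value of every variable for index i (hypothetical helper)."""
    return {var: col[i] for var, col in values.items() if i in col}

variables = list(values)          # variable (column) names
index = sorted(values['name'])    # index (row) labels
second = row(values, 2)           # all variables for index 2
```

Storing data by variable first makes it cheap to select or drop whole variables, while a row view can still be rebuilt on demand.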

Pandas

Since the Pandas package is familiar to data analysts, it is worth noting that the data structure in this package is very close to the one in Pandas, and the two can easily be converted to each other. However, this package tries to use research logic instead of data logic, which is simpler and easier for researchers to understand.

pandas.DataFrame --> survey_stats.Data

from survey_stats.data_process import Data
# df = a DataFrame of Pandas
data = Data(values=df.to_dict())

survey_stats.Data --> pandas.DataFrame

import pandas as pd
# data = an object of Data
df = pd.DataFrame(data.values)

Modules Structure

survey_stats
|______data_process
-------|____Data
-------|____TimeSeries(Data)
-------|____Sample
-------|____DataBase
|______basic_model
-------|____Model
-------|____Formula
-------|____Formulas
|______date
-------|____Date
|______functions
|______linear_regressions
-------|____ols
------------|____Model
------------|____Equation
|______classification
-------|____tree_based_regression
------------|____Model
------------|____Equation
|______statsmodels
-------|____logit
------------|____Model
------------|____Equation
-------|____multinominal_logit
------------|____Model
------------|____Equation
|______sklearn
-------|____ols
------------|____Model
------------|____Equation
-------|____logit
------------|____Model
------------|____Equation
-------|____multinominal_logit
------------|____Model
------------|____Equation

data_process

Data

Some methods of Data:

dtype
set_dtype
to_str
variables
items
index
fix_index
set_index
set_names
select_variables
select_index
drop
drop_index
add_a_dummy
add_dummies
dropna
drop_all_na
value_to_nan
to_numpy
add_data
transpose
count
add_trend
fillna
fill
sort
add_a_variable
to_timeseries
line_plot
add_index
add_a_group_variable
load and dump
read_text, to_text, and add_to_text
read_csv, to_csv, and add_to_csv
read_xls and to_xls
read_excel and to_excel
read_access and to_access

# data = an object of Data
print(data)
variables_name = data.variables()
index = data.index()
data.set_index('id', drop_var=False)
data.set_names(['w1','w2'], ['weights1', 'weights2'])
new_data = data.select_variables(['w1','w2'])
new_data = data.select_index(range(50,100))
data.drop(['year'])
# one dummy variable from a list of conditions
dummy = data.add_a_dummy([['height', '>', 160], ['height', '<=', 180]])
# several dummy variables, one per list of conditions
dummy = data.add_dummies([
    [('height', '>', 160), ('height', '<=', 180)],
    [('weight', '>', 60), ('weight', '<=', 80)]
    ])
data.dropna(['height', 'weight'])
num = data.to_numpy()
# data_new = another object of Data, already defined
data.add_data(data_new)
data_t = data.transpose()
data.to_csv('data2.csv')
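To make the condition lists passed to add_a_dummy more concrete, here is a minimal sketch in plain Python. It assumes that the conditions in one list are combined with "and", which the text above does not state explicitly. The function `dummy` and the table `OPS` are hypothetical and are not the package's implementation.

```python
import operator

# Comparison operators as they are written in the condition tuples.
OPS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
       '<=': operator.le, '=': operator.eq, '!=': operator.ne}

def dummy(values, conditions):
    """Return {index: 1 or 0}; 1 when every (variable, op, value) condition holds."""
    first_var = conditions[0][0]
    result = {}
    for i in values[first_var]:
        ok = all(OPS[op](values[var][i], limit) for var, op, limit in conditions)
        result[i] = int(ok)
    return result

values = {'height': {1: 150, 2: 170, 3: 185}}
# 1 for heights in the interval (160, 180], 0 otherwise.
mid_height = dummy(values, [('height', '>', 160), ('height', '<=', 180)])
```

This sketch does not handle missing values; a real implementation would have to decide whether a missing observation yields 0 or stays missing.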

Sample

A Sample is a subset of a Data.

Some methods of Sample:

get_data
split
get_weights
group
Stats: weight, sum, average, var, std, distribution, median, mode, correl, min, max, percentile, gini

from survey_stats.data_process import Data, Sample
# a sample made of the rows of data with index 0, 5, 6 and 10
s = Sample(data, [0,5,6,10])
data_s = s.get_data()
# main_sample = a Sample that has already been defined
train_sample, test_sample = main_sample.split(0.7,['train', 'test'], 'start')
# weighting: one list of conditions per group, and the total of each group
cond = [
    [('sex','=', 'female'),('age','<=',30)],
    [('sex','=', 'female'),('age','>',30)],
    [('sex','=', 'male'),('age','<=',30)],
    [('sex','=', 'male'),('age','>',30)]
    ]
totals = [
    50,
    150,
    45,
    160
    ]

sample = Sample(data, data.index())
sample.get_weights(cond, totals)
print(sample.data)
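The weighting step above pairs each group of conditions with a known total. One common way to turn such totals into weights is cell weighting (post-stratification): every observation in a group gets the group total divided by the number of sampled observations in that group. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a description of what get_weights actually computes, and `cell_weights` is a hypothetical name.

```python
import operator

OPS = {'=': operator.eq, '<=': operator.le, '>': operator.gt}

def cell_weights(values, cond, totals):
    """Give every row in group k the weight totals[k] / (rows in group k)."""
    index = list(next(iter(values.values())))
    weights = {}
    for conditions, total in zip(cond, totals):
        members = [i for i in index
                   if all(OPS[op](values[var][i], v) for var, op, v in conditions)]
        for i in members:
            weights[i] = total / len(members)
    return weights

values = {
    'sex': {1: 'female', 2: 'female', 3: 'male', 4: 'male', 5: 'male'},
    'age': {1: 25, 2: 40, 3: 28, 4: 35, 5: 50},
}
cond = [
    [('sex', '=', 'female'), ('age', '<=', 30)],
    [('sex', '=', 'female'), ('age', '>', 30)],
    [('sex', '=', 'male'), ('age', '<=', 30)],
    [('sex', '=', 'male'), ('age', '>', 30)],
]
totals = [50, 150, 45, 160]
weights = cell_weights(values, cond, totals)
```

With these weights, the weighted size of each group equals its total, so the weighted sample reproduces the known group totals.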

TimeSeries

A TimeSeries is a special type of Data whose index is a 'jalali' date.

methods:

type_of_dates
complete_dates
reset_date_type
to_monthly
to_daily
to_weekly
to_annual
to_growth
to_moving_average
to_lead
to_lag
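As an illustration of what transformations such as to_lag, to_growth and to_moving_average usually mean, here is a minimal sketch on a single series stored as an ordered dict of date to value. The function names, the 'YYYY-MM' date strings, the percent definition of growth and the trailing window are all assumptions made for this example; the package's own definitions may differ.

```python
def lag(series, k=1):
    """Value observed k periods earlier; None where no earlier value exists."""
    dates = list(series)
    return {d: (series[dates[t - k]] if t >= k else None)
            for t, d in enumerate(dates)}

def growth(series):
    """Percent change relative to the previous period."""
    lagged = lag(series, 1)
    return {d: (None if lagged[d] in (None, 0)
                else 100 * (series[d] / lagged[d] - 1))
            for d in series}

def moving_average(series, window=3):
    """Trailing average over the current and the previous window-1 periods."""
    dates = list(series)
    out = {}
    for t, d in enumerate(dates):
        if t + 1 < window:
            out[d] = None
        else:
            recent = [series[dates[s]] for s in range(t - window + 1, t + 1)]
            out[d] = sum(recent) / window
    return out

sales = {'1400-01': 100, '1400-02': 110, '1400-03': 121, '1400-04': 121}
lagged = lag(sales)
g = growth(sales)
ma = moving_average(sales, window=3)
```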

DataBase

A DataBase is a dict of Data objects: {'name': Data, ...}

methods:

dump
load
table_list
variable_list
query

basic_model

Formula

A Formula is an expression of mathematical operators and functions that can be calculated on a data.

for example:

- formula: age + age**2 - exp(height/weight) + log(year)

operators: all operators of Python.

- '+', '-', '*', '/', '//', '**', '%', '==', '!=', '>', '<', '>=', '<=', 'and', 'or', 'not', 'is', 'is not', 'in', 'not in'.

functions: all functions of the 'math' module.

- 'acos', 'acosh', 'asin', 'asinh', 'atan', 'atan2', 'atanh',
'ceil', 'comb', 'copysign', 'cos', 'cosh', 'degrees', 'dist',
'e', 'erf', 'erfc', 'exp', 'expm1', 'fabs', 'factorial', 'floor',
'fmod', 'frexp', 'fsum', 'gamma', 'gcd', 'hypot', 'inf', 'isclose',
'isfinite', 'isinf', 'isnan', 'isqrt', 'lcm', 'ldexp', 'lgamma',
'log', 'log10', 'log1p', 'log2', 'modf', 'nan', 'nextafter',
'perm', 'pi', 'pow', 'prod', 'radians', 'remainder', 'sin',
'sinh', 'sqrt', 'tan', 'tanh', 'tau', 'trunc', 'ulp'

methods:

split
filter
calculate

from survey_stats.basic_model import Formula
f1 = Formula('p=a')
f2 = Formula('3*x**2+p*x+x')
f3 = Formula('log(year)')
# calculate
data_new = f1.calculate(data)
# split
splits = f2.split() #-> ['3*x**2', 'p*x', 'x'] as Formulas
data_new = f2.split().calculate_all(data)
# filter
f = Formula('year>1397')
after1397 = f.filter(1,data)
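The idea of a formula evaluated row by row, with the functions of the math module available by name, can be sketched with the built-in eval. This is only an illustration under the assumption that data is stored as {variable: {index: value}}; it is not how the Formula class is implemented, and `calculate` and `keep_rows` are hypothetical helpers. Since eval executes arbitrary expressions, a sketch like this should only be used with trusted formulas.

```python
import math

# Names a formula may use: the public members of the math module.
MATH_NAMES = {k: v for k, v in vars(math).items() if not k.startswith('_')}

def calculate(formula, values):
    """Evaluate `formula` once per index, with variables bound to that row."""
    code = compile(formula, '<formula>', 'eval')
    index = list(next(iter(values.values())))
    result = {}
    for i in index:
        row = {var: col[i] for var, col in values.items()}
        # Variables of the row take precedence over math names.
        result[i] = eval(code, {'__builtins__': {}}, {**MATH_NAMES, **row})
    return result

def keep_rows(formula, values):
    """Keep only the rows where `formula` evaluates to a true value."""
    mask = calculate(formula, values)
    return {var: {i: v for i, v in col.items() if mask[i]}
            for var, col in values.items()}

values = {'year': {1: 1395, 2: 1398, 3: 1400}, 'x': {1: 1.0, 2: 2.0, 3: 3.0}}
poly = calculate('3*x**2 + x', values)
log_year = calculate('log(year)', values)
after1397 = keep_rows('year > 1397', values)
```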

Formulas

A list of Formula objects.

methods:

calculate_all

linear_regressions.ols

A linear regression consists of an equation in which numerical independent variables are combined linearly with each other. Categorical variables are converted to dummy variables and then used as numerical variables in the model. We use the Formula and Formulas classes to construct these variables. Simple regression is a linear regression with a numerical dependent variable that is estimated by the least squares method. In logistic regression, the dependent variable is a binary variable: either a numerical variable consisting of zeros and ones, or a categorical variable with only two values.

Model

methods:

estimate
estimate_skip_collinear
estimate_most_significant

Equation

methods:

anova
table
dump
load
wald_test
forecast

from survey_stats.linear_regressions import simple_regression, logistic_regression
model1 = simple_regression.Model('y', '1 + x1 + x2 + x2**2')
model2 = logistic_regression.Model('y', '1 + x1 + x2 + x2**2')
# the samples s_train and s_test have already been defined.
eq1 = model1.estimate(s_train)
print(eq1)
data_f = eq1.forecast(s_test)
print(eq1.goodness_of_fit(s_test))
eq1.save('test')
# or, instead of estimating a model, you can load a previously estimated model and use it to predict.
# Equation is assumed here to live in the same module as Model.
eq2 = simple_regression.Equation.load('test')
eq2.goodness_of_fit(s_test)
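For readers who want to see what "estimated by the least squares method" means for a formula such as '1 + x1 + x2 + x2**2', the sketch below solves the normal equations (X'X) b = X'y in plain Python. It is a textbook illustration without weights or standard errors, not the package's estimator, and the function `ols` here is a hypothetical stand-alone helper.

```python
def ols(y, columns):
    """Least-squares coefficients b solving (X'X) b = X'y.

    `columns` is a list of regressors, each a list as long as y;
    include a column of ones to estimate an intercept.
    """
    k, n = len(columns), len(y)
    # Normal equations as an augmented matrix [X'X | X'y].
    a = [[sum(columns[r][t] * columns[c][t] for t in range(n)) for c in range(k)]
         + [sum(columns[r][t] * y[t] for t in range(n))]
         for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError('regressors are collinear')
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[r][k] for r in range(k)]

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1 + 2 * v + 0.5 * v ** 2 for v in x]    # exact quadratic, no noise
# Regressors for the formula '1 + x + x**2'.
coef = ols(y, [[1.0] * len(x), x, [v ** 2 for v in x]])
```

Because the data are generated without noise, the estimated coefficients recover the intercept, the linear term and the squared term up to rounding error. The collinearity check is the situation that a method like estimate_skip_collinear presumably has to deal with.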

classification.tree_based_regression

Decision trees are one of the modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Decision trees are among the most popular machine learning algorithms because of their intelligibility and simplicity. Since regressions with discrete dependent variables also exist, such as logistic regressions, this package includes both regression trees and classification trees in the tree_based_regression module. Currently, the sklearn package is commonly used for regression trees and classification trees. However, that package does not work with categorical variables, which is very restrictive for survey research because many of the variables in such research are categorical. It also does not take the weights of the observations into account in its calculations. This package has therefore been developed to go significantly beyond it in these respects.

Model

methods:

estimate

Equation

methods:

forecast
goodness_of_fit
first_nodes
plot
dump
load

from survey_stats import tree_based_regression
# dep_var, indep_vars and method have already been defined.
model = tree_based_regression.Model(dep_var, indep_vars, min_sample=25, method=method)
eq = model.estimate(s_total, True, False)
print(eq)
print(eq.full_str)
# the sample s_test has already been defined.
forecast_test_leaf = eq.forecast(s_test, name='sample', output ='leaf')
forecast_test_dist = eq.forecast(s_test, name='sample', output ='dist')
forecast_test_point = eq.forecast(s_test, name='sample', output ='point')
# the sample s_total has already been defined.
eq.goodness_of_fit(s_total)
eq.save('total')
# or, instead of estimating a model, you can load a previously estimated model and use it to predict.
# Equation is assumed here to live in the same module as Model.
eq2 = tree_based_regression.Equation.load('total')
eq2.goodness_of_fit(s_total)
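The two features stressed above, categorical variables and observation weights, can be illustrated with one step of a regression tree: choosing how to split the categories of a variable into two groups so that the weighted sum of squared errors is smallest. This sketch is an assumption about the general technique and not the algorithm used by tree_based_regression; both function names are hypothetical.

```python
from itertools import combinations

def weighted_sse(ys, ws):
    """Weighted sum of squared deviations from the weighted mean."""
    total = sum(ws)
    mean = sum(w * y for y, w in zip(ys, ws)) / total
    return sum(w * (y - mean) ** 2 for y, w in zip(ys, ws))

def best_categorical_split(cats, ys, ws):
    """Try every two-way grouping of the categories; return (sse, left_group)."""
    levels = sorted(set(cats))
    best = None
    for size in range(1, len(levels) // 2 + 1):
        for left in combinations(levels, size):
            l = [(y, w) for c, y, w in zip(cats, ys, ws) if c in left]
            r = [(y, w) for c, y, w in zip(cats, ys, ws) if c not in left]
            sse = weighted_sse(*zip(*l)) + weighted_sse(*zip(*r))
            if best is None or sse < best[0]:
                best = (sse, set(left))
    return best

cats = ['a', 'a', 'b', 'b', 'c', 'c']      # categorical explanatory variable
ys = [1.0, 1.2, 5.0, 5.2, 1.1, 0.9]        # continuous target
ws = [1, 2, 1, 1, 3, 1]                    # observation weights
sse, left = best_categorical_split(cats, ys, ws)
```

Enumerating all groupings is only practical for variables with a few categories; it is used here because it keeps the example short.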

sklearn.logit

Model

methods:

estimate
estimate_skip_collinear

Equation

methods:

dump
load
forecast

sklearn.multinominal_logit

Model

methods:

estimate
estimate_skip_collinear

Equation

methods:

dump
load
forecast

sklearn.ols

Model

methods:

estimate
estimate_skip_collinear
estimate_most_significant

Equation

methods:

anova
table
dump
load
wald_test
forecast

statsmodels.logit

Model

methods:

estimate

Equation

methods:

dump
load
forecast
table

statsmodels.multinominal_logit

Model

methods:

estimate

Equation

methods:

dump
load
forecast
table

Release history

1.1.7 (this version): Mar 9, 2024
1.1.6: Feb 6, 2024
1.1.5: Feb 6, 2024
1.1.4: Feb 5, 2024
1.1.3: Jan 28, 2024
1.1.2: Dec 31, 2023
1.1.1: Dec 27, 2023
1.0.11: Mar 1, 2023
1.0.10: Jan 24, 2023
1.0.9: Jan 23, 2023
1.0.8: Oct 25, 2022
1.0.7: Oct 7, 2022
1.0.6: Sep 2, 2022
1.0.5: Jul 5, 2022
1.0.4: Jun 3, 2022
1.0.3: Jun 2, 2022
1.0.2: Feb 15, 2022
1.0.1: Feb 4, 2022
1.0.0: Jan 31, 2022

Download files

Source Distribution

survey_stats-1.1.7.tar.gz (74.2 kB)

Uploaded Mar 9, 2024 Source

Built Distribution

survey_stats-1.1.7-py3-none-any.whl (74.5 kB)

Uploaded Mar 9, 2024 Python 3

Hashes for survey_stats-1.1.7.tar.gz
Algorithm	Hash digest
SHA256	`b4b3de9eb9056fef2d63473b65a3b3d86ea8c34a0ead6a1b12d733943f0f076f`
MD5	`52990b383d8842d2f30a52b5b585cb0c`
BLAKE2b-256	`09b9f921bd612865e63e0d1b2cbba8d618bf65272aad99807c999d81c3e5b293`

Hashes for survey_stats-1.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c0ade1e2ca5470b7cdbb64810a638d0a8cbc26b16e00268d3520f3d599926179`
MD5	`94fd2c39b29cc53a851add862c867d5b`
BLAKE2b-256	`2d3c54808c3e1a13f622e8d1ee11d7b0f4dec2be494bffee850f28b2243b5646`
