
Subsample-based Model and Research Toolkit.


Clubear

Clubear is an open-source Python 3 package for interactive analysis of massive data. Its key feature is that it lets users conduct convenient, interactive statistical analysis of massive data on a single conventional computer. Clubear therefore provides a cost-effective solution for mining large-scale datasets. In addition, the package integrates many commonly used statistical and graphical tools, which cover most commonly encountered data analysis tasks.

Every component (class) within clubear has a 'demo()' function, which prints a typical usage example for that class. Users can call 'demo()' to quickly get instructions. For example, if we have an object 'sf':

sf=cb.shuffle('airline.csv')

we can get the instructions related to it by:

sf.demo()

Then, a typical case will be printed for reference.

How to use clubear

1. Install the package

pip install clubear

2. Import the package

We recommend using clubear with Jupyter Notebook, which gives the best user experience.

import clubear as cb

3. (Optional) Shuffle a dataset

For example, shuffle a CSV file 'airline.csv'. The resulting file is 'airline.csv.shuffle'.

sf=cb.shuffle('airline.csv')
sf.dc()
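To make the idea of shuffling concrete, here is a minimal standard-library sketch that writes a row-shuffled copy of a CSV file while keeping the header line first. It is not clubear's implementation: 'shuffle_csv' is a hypothetical helper, and it loads every row into memory, so it only suits files that fit in RAM. Presumably the point of shuffling is that consecutive blocks of the shuffled file then behave like random subsamples.

```python
import csv
import random


def shuffle_csv(src_path, dst_path, seed=None):
    """Write a row-shuffled copy of a CSV file, keeping the header first.

    Hypothetical helper for illustration only; it holds all rows in memory.
    """
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)   # first line holds the column names
        rows = list(reader)     # remaining lines are data rows

    random.Random(seed).shuffle(rows)  # in-place random permutation

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)            # number of data rows written
```

For example, shuffle_csv('airline.csv', 'airline.csv.shuffle') would mirror the file naming used above.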

4. Use 'pump' to process data

Build a bridge between the dataset and the computer's memory using a 'pump'. Clubear supports several types of data sources for initializing the pump:

(1). CSV file

mypathfile='airline.csv.shuffle'
pm=cb.pump(type=1, pathfile=mypathfile)

(2). CSV file with a codebook (a dictionary that contains specifications for variables)

mycodebook = {
    # specify quantitative variables
    'qlist': ['ActualElapsedTime', 'ArrDelay', 'ArrTime', 'CRSArrTime', 'CRSDepTime', 'CRSElapsedTime',
              'Cancelled', 'DayOfWeek', 'DayofMonth', 'DepDelay', 'DepTime', 'Distance', 'Diverted',
              'FlightNum', 'Month', 'Year', '_INTERCEPT_'],
    # specify variables that can be removed
    'drop': ['TailNum', 'Origin', 'Dest'],
    # for any qualitative variable, users can further specify the scale levels
    'scale_level': {'UniqueCarrier': ['EV', 'OO', '9E']}
}
mypathfile='2008.csv.shuffle'
pm=cb.pump(type=2, pathfile=mypathfile, codebook=mycodebook)
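Because a codebook is an ordinary dictionary, it can be sanity-checked before it is handed to the pump. The sketch below is a hypothetical helper, not part of clubear; the rule that a variable should appear in only one of 'qlist', 'drop' and 'scale_level' is an assumption made for illustration.

```python
def check_codebook(codebook):
    """Return a list of problems found in a codebook-style dictionary.

    Hypothetical helper; the overlap rules are assumptions for illustration,
    not checks that clubear is known to perform.
    """
    problems = []
    qlist = set(codebook.get("qlist", []))
    drop = set(codebook.get("drop", []))
    levels = set(codebook.get("scale_level", {}))

    # A variable listed as quantitative should not also be dropped.
    for name in sorted(qlist & drop):
        problems.append(f"{name} is in both qlist and drop")

    # A variable with scale levels is qualitative, so it should not be
    # listed as quantitative or dropped.
    for name in sorted(levels & (qlist | drop)):
        problems.append(f"{name} has scale levels but is also in qlist or drop")
    return problems
```

An empty list means no overlaps were found. Calling check_codebook(mycodebook) on the dictionary above would be one way to use it.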

(3). SQL databases

mysql_info = {
'host':'localhost', # MySQL server host
'user':'root', # MySQL username
'passwd':'root', # MySQL password
'port':3306, # MySQL server port
'db':'test_db', # name of database
'tablename':'flight_2008' # name of table
}
pm=cb.pump(type=3, sql_info=mysql_info)
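Hard-coding a password in a notebook is easy to leak, so one option is to build the connection dictionary from environment variables. This is a sketch under assumptions: the CLUBEAR_DB_* variable names and the 'sql_info_from_env' helper are hypothetical, and only the dictionary keys come from the example above.

```python
import os


def sql_info_from_env(env=os.environ):
    """Build a connection dictionary from environment variables.

    The CLUBEAR_DB_* names are hypothetical; the keys mirror 'mysql_info'.
    """
    return {
        "host": env.get("CLUBEAR_DB_HOST", "localhost"),   # optional
        "user": env["CLUBEAR_DB_USER"],                    # KeyError if unset
        "passwd": env["CLUBEAR_DB_PASSWD"],                # KeyError if unset
        "port": int(env.get("CLUBEAR_DB_PORT", "3306")),   # optional
        "db": env["CLUBEAR_DB_NAME"],                      # KeyError if unset
        "tablename": env["CLUBEAR_DB_TABLE"],              # KeyError if unset
    }
```

The returned dictionary has the same shape as 'mysql_info', so it could be passed as the 'sql_info' argument in the same way.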

Then, based on the pump, one can quickly access a part of the dataset as:

pm.subsize=10000
pm.go()
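Conceptually, accessing a part of the dataset means reading a bounded block of rows instead of the whole file. The following standard-library sketch illustrates that idea for a CSV file. 'read_block' and its 'skip' parameter are hypothetical, and this is not a description of how 'pm.go()' works internally.

```python
import csv
from itertools import islice


def read_block(path, subsize, skip=0):
    """Return (header, rows) with at most `subsize` data rows from a CSV file.

    Hypothetical helper; `skip` is the number of data rows to pass over first.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)                              # column names
        rows = list(islice(reader, skip, skip + subsize))  # bounded read
    return header, rows
```

Only the requested rows are held in memory, which is why a block size such as 10000 stays cheap even when the file is far larger.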

Or view the most frequently used descriptive statistics as:

ck=cb.check(pm).stats()

5. Find quantitative variables

Extract variables whose 'mp' values are less than 5%:

ck[ck.mp<5].index

Pass this list to 'pm' through its 'qlist' attribute:

pm.qlist=[...]

6. Discover levels for qualitative variables

Detect all levels using the 'table' function:

tb=cb.check(pm).table(niter=100)

Detect all levels for a specified variable, e.g., 'AirTime':

tb=cb.check(pm).table('AirTime',tv=True)
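Discovering the levels of a qualitative variable amounts to building a frequency table of its distinct values. The sketch below does this for one CSV column with the standard library. 'count_levels' and 'max_rows' are hypothetical names, and the code is an illustration of the idea, not of clubear's 'table' function.

```python
import csv
from collections import Counter
from itertools import islice


def count_levels(path, column, max_rows=None):
    """Count how often each distinct value of `column` occurs in a CSV file.

    Hypothetical helper; rows are streamed one at a time, so memory use grows
    with the number of distinct levels rather than the number of rows.
    """
    counts = Counter()
    with open(path, newline="") as f:
        # max_rows=None reads the whole file; an integer stops early.
        for row in islice(csv.DictReader(f), max_rows):
            counts[row[column]] += 1
    return counts
```

Calling counts.most_common() on the result lists the levels from most to least frequent.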

7. Make statistical graphics

Clubear provides a number of graphical tools. For example, boxplot:

tk = cb.tank(pm)
pt=cb.plot(tk).box(x="Year",y="ArrDelay")

Histogram:

pt=cb.plot(tk).hist('Distance')

Barplot:

pt=cb.plot(tk).mu('ArrDelay','CRSDepTime')


