
A Python package for data analysis.

Project description

Data Analysis Library Shark

This repository contains detailed notes on Shark, a new data analysis library.

Objectives

Shark:

  • Has a DataFrame class with data stored in numpy arrays
  • Selects subsets of data with the brackets operator
  • Uses special methods defined in the Python data model
  • Has a nicely formatted display of the DataFrame in the notebook
  • Implements aggregation methods such as sum, min, max, mean, and median
  • Implements non-aggregation methods such as isna, unique, rename, and drop
  • Groups by one or two columns
  • Has methods specific to string columns
  • Reads in data from a comma-separated value file

Functionalities of Shark

1. DataFrame constructor input types

Our DataFrame class is constructed with a single parameter, data.

Specifically, the input must satisfy the following rules:

  • Shark will raise a TypeError if data is not a dictionary
  • Shark will raise a TypeError if the keys of data are not strings
  • Shark will raise a TypeError if the values of data are not numpy arrays
  • Shark will raise a ValueError if the values of data are not 1-dimensional
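
The four rules above can be sketched as a standalone check. This is a minimal illustration, not Shark's actual source; the function name check_input_types and the error messages are assumptions made for this example.

```python
import numpy as np


def check_input_types(data):
    """Validate constructor input following the four rules listed above.

    Hypothetical helper: the name and messages are illustrative only.
    """
    if not isinstance(data, dict):
        raise TypeError("`data` must be a dictionary")
    for name, values in data.items():
        if not isinstance(name, str):
            raise TypeError("keys of `data` must be strings")
        if not isinstance(values, np.ndarray):
            raise TypeError("values of `data` must be numpy arrays")
        if values.ndim != 1:
            raise ValueError("values of `data` must be 1-dimensional")


# Valid input passes silently.
check_input_types({"a": np.array([1, 2, 3])})

# A two-dimensional value is rejected with a ValueError.
try:
    check_input_types({"a": np.array([[1, 2], [3, 4]])})
except ValueError as err:
    print(err)
```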

2. Array lengths

We are now guaranteed that data is a dictionary of strings mapped to one-dimensional arrays. Each column of data in our DataFrame must have the same number of elements.
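
One possible way to enforce the equal-length requirement is shown below. The helper name check_array_lengths is hypothetical, and the choice of ValueError is an assumption, since the text above does not say which exception Shark raises.

```python
import numpy as np


def check_array_lengths(data):
    """Raise if the columns do not all hold the same number of elements."""
    lengths = {len(values) for values in data.values()}
    if len(lengths) > 1:
        # Assumption: the exception type is not named in the description.
        raise ValueError("all arrays must have the same length")


# Both columns have three elements, so this passes silently.
check_array_lengths({"a": np.array([1, 2, 3]), "b": np.array([4.0, 5.0, 6.0])})
```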

3. Convert unicode arrays to object

Whenever you create a numpy array of Python strings, numpy defaults the data type of that array to unicode. Take a look at the following simple numpy array created from strings. Its data type, found in the dtype attribute, is shown to be 'U' plus the length of the longest string.

>>> import numpy as np
>>> a = np.array(['cat', 'dog', 'snake'])
>>> a.dtype
dtype('<U5')

Unicode arrays are more difficult to manipulate and don't have the flexibility that we desire. So, if our user passes us a Unicode array, we will convert it to a data type called 'object'. This is a flexible type and will help us later when creating methods just for string columns. Technically, this data type allows any Python objects within the array.
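
The conversion described above might look like the following sketch. The function name convert_unicode_to_object is an assumption; the numpy calls used (dtype.kind and astype) are standard.

```python
import numpy as np


def convert_unicode_to_object(data):
    """Return a new dict in which unicode ('U') arrays are cast to object.

    Hypothetical helper for illustration; other arrays are passed through.
    """
    converted = {}
    for name, values in data.items():
        if values.dtype.kind == "U":
            converted[name] = values.astype("O")
        else:
            converted[name] = values
    return converted


data = {"animal": np.array(["cat", "dog", "snake"]), "legs": np.array([4, 4, 0])}
converted = convert_unicode_to_object(data)
print(converted["animal"].dtype)     # object
print(converted["legs"].dtype.kind)  # i
```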

4. Find the number of rows in the DataFrame with the len function

As with a pandas DataFrame, passing a Shark DataFrame to the built-in len function returns the number of rows.

5. Return columns as a list

df.columns will return a list of the column names.

6. Set new column names

We can assign all new column names to our DataFrame by setting the columns property equal to a list. The example below shows how you would set new columns for a 3-column DataFrame.

df.columns = ['state', 'age', 'fruit']

Shark will also raise errors if the new column names are invalid.

  • Shark will raise a TypeError if the object used to set new columns is not a list
  • Shark will raise a ValueError if the number of column names in the list does not match the current DataFrame
  • Shark will raise a TypeError if any of the columns are not strings
  • Shark will raise a ValueError if any of the column names are duplicated in the list
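
The checks above can be sketched as a plain function, applied in the same order as the list. The name validate_new_columns and its n_existing parameter are hypothetical and are not taken from Shark's source.

```python
def validate_new_columns(new_columns, n_existing):
    """Check a proposed list of column names against the rules listed above.

    Hypothetical helper: `n_existing` is the current number of columns.
    """
    if not isinstance(new_columns, list):
        raise TypeError("new columns must be given as a list")
    if len(new_columns) != n_existing:
        raise ValueError("number of names must match the number of columns")
    for name in new_columns:
        if not isinstance(name, str):
            raise TypeError("all column names must be strings")
    if len(set(new_columns)) != len(new_columns):
        raise ValueError("column names must not contain duplicates")


# Three valid names for a 3-column DataFrame pass silently.
validate_new_columns(["state", "age", "fruit"], n_existing=3)
```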

7. The shape property

The shape property will return a two-item tuple of the number of rows and columns.
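
The len support, the columns property, and the shape property described in the sections above can be illustrated with a small stand-in class. MiniFrame and its _data attribute are assumptions made for this sketch and do not reflect how Shark is actually written.

```python
import numpy as np


class MiniFrame:
    """Hypothetical stand-in for Shark's DataFrame, for illustration only."""

    def __init__(self, data):
        self._data = data  # dict mapping column name -> 1-D numpy array

    def __len__(self):
        # All columns share one length, so any column gives the row count.
        return len(next(iter(self._data.values())))

    @property
    def columns(self):
        # Column names in insertion order.
        return list(self._data)

    @property
    def shape(self):
        # (number of rows, number of columns)
        return len(self), len(self._data)


frame = MiniFrame({"a": np.array([1, 2, 3]), "b": np.array([4.0, 5.0, 6.0])})
print(len(frame), frame.columns, frame.shape)
```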

8. Visual HTML representation in the notebook with the _repr_html_ method

The _repr_html_ method is made available to developers by IPython so that your objects can have nicely formatted HTML displays within Jupyter Notebooks. The IPython documentation describes this method along with other similar methods for different representations.
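
As a rough sketch of the idea, a class only needs to return an HTML string from _repr_html_ for the notebook to render it. The HtmlFrame class below is hypothetical and its table layout is an assumption; Shark's real output may look different.

```python
from html import escape


class HtmlFrame:
    """Hypothetical class showing the _repr_html_ hook, not Shark's code."""

    def __init__(self, data):
        self._data = data  # dict mapping column name -> sequence of values

    def _repr_html_(self):
        # One header cell per column, then one table row per data row.
        header = "".join(f"<th>{escape(name)}</th>" for name in self._data)
        rows = []
        for values in zip(*self._data.values()):
            cells = "".join(f"<td>{escape(str(v))}</td>" for v in values)
            rows.append(f"<tr>{cells}</tr>")
        body = "".join(rows)
        return (
            f"<table><thead><tr>{header}</tr></thead>"
            f"<tbody>{body}</tbody></table>"
        )


html_text = HtmlFrame({"a": [1, 2], "b": ["x", "<y>"]})._repr_html_()
print(html_text)
```

Escaping the cell values keeps text such as "<y>" from being interpreted as markup.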

9. The values property

values is a property that returns a single array of all the columns of data.

10. The dtypes property

The dtypes property will return a two-column DataFrame with the column names in the first column and their data type as a string in the other. The two columns are named 'Column Name' and 'Data Type'.

11. Select a single column with the brackets

In shark, you can select a single column with df['colname'].

12. Select multiple columns with a list

Shark will also be able to select multiple columns if given a list within the brackets. For example, df[['colname1', 'colname2']] will return a two column DataFrame.

13. Boolean Selection with a DataFrame

In Shark, you can filter for specific rows of a DataFrame by passing a single-column boolean DataFrame to the brackets. For instance, the following will select only the rows where column a is greater than 10.

>>> s = df['a'] > 10
>>> df[s]
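
Under the hood, this kind of filtering can be done with numpy boolean masking applied to every column. The sketch below works on a plain dictionary of arrays rather than a Shark DataFrame, so the variable names are illustrative only.

```python
import numpy as np

data = {
    "a": np.array([5, 12, 30]),
    "b": np.array(["x", "y", "z"], dtype="O"),
}

# Boolean mask with one entry per row.
mask = data["a"] > 10

# Apply the same mask to every column to keep only the matching rows.
selected = {name: values[mask] for name, values in data.items()}
print(selected["a"])  # [12 30]
print(selected["b"])  # ['y' 'z']
```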

14. Check for simultaneous selection of rows and columns

15. Select a single cell of data

Shark can select a single cell of data with df[rs, cs]. We will assume rs is an integer and cs is either an integer or a string.

16. Simultaneously select rows as booleans, lists, or slices

Shark can select rows and columns simultaneously with df[rs, cs]. We will allow rs to be either a single-column boolean DataFrame, a list of integers, or a slice.

17. Simultaneous selection with multiple columns as a list

18. Simultaneous selection with column slices

Shark will allow columns to be sliced with either strings or integers. The following selections will be acceptable.

19. Tab Completion for column names

20. Create a new column or overwrite an old column

21. head and tail methods

The head and tail methods each accept a single integer parameter n, which defaults to 5.

22. Generic aggregation methods

Shark implements several methods that perform an aggregation. These methods all return a single value for each column. The following aggregation methods are defined.

  • min
  • max
  • mean
  • median
  • sum
  • var
  • std
  • all
  • any
  • argmax - index of the maximum
  • argmin - index of the minimum
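
Because every method in this list reduces a column to one value, a single generic helper can serve them all. The following is a sketch under that assumption; the aggregate function is hypothetical and operates on a plain dictionary of numpy arrays.

```python
import numpy as np


def aggregate(data, func):
    """Apply a numpy reduction to every column, giving one value per column.

    Hypothetical helper for illustration, not Shark's implementation.
    """
    return {name: func(values) for name, values in data.items()}


data = {"a": np.array([1.0, 4.0, 2.0]), "b": np.array([10, 30, 20])}
print(aggregate(data, np.max))     # largest value in each column
print(aggregate(data, np.median))  # median of each column
print(aggregate(data, np.argmax))  # position of the maximum in each column
```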

23. isna method

The isna method will return a DataFrame with the same shape as the original but with a boolean in place of every value, indicating whether that value is missing. np.isnan is used for the test, except for string columns, where a vectorized equality comparison to None is used instead.
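
The per-column test could be sketched as follows. The function name column_isna is an assumption; it handles object (string) columns and numeric columns separately, as described above.

```python
import numpy as np


def column_isna(values):
    """Return a boolean array marking the missing entries of one column.

    Hypothetical helper for illustration.
    """
    if values.dtype.kind == "O":
        # Elementwise comparison to None; `values is None` would not vectorize.
        return values == None  # noqa: E711
    return np.isnan(values)


numbers = np.array([1.0, np.nan, 3.0])
strings = np.array(["a", None, "c"], dtype="O")
print(column_isna(numbers))  # [False  True False]
print(column_isna(strings))  # [False  True False]
```

The number of non-missing values in a column then follows from the mask, for example (~column_isna(numbers)).sum().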

24. count method

The count method returns a single-row DataFrame with the number of non-missing values for each column. You will want to use the result of isna.

25. unique method

This method will return the unique values for each column in the DataFrame. Specifically, it will return a list of one-column DataFrames of unique values in each column. If there is a single column, it returns just that DataFrame rather than a list.

26. nunique method

The nunique method returns a single-row DataFrame with the number of unique values for each column.
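
Both unique and nunique can be built on np.unique, as in this sketch on a plain dictionary of arrays. It assumes the columns contain no missing values, because np.unique sorts its input and cannot compare None with strings.

```python
import numpy as np

data = {
    "fruit": np.array(["apple", "pear", "apple"], dtype="O"),
    "n": np.array([1, 1, 2]),
}

# Sorted unique values per column.
uniques = {name: np.unique(values) for name, values in data.items()}

# Number of unique values per column.
nunique = {name: len(values) for name, values in uniques.items()}
print(uniques["fruit"])  # ['apple' 'pear']
print(nunique)           # {'fruit': 2, 'n': 2}
```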

27. value_counts method

28. Normalize options for value_counts

29. rename method

The rename method renames one or more columns. It accepts a dictionary of old column names mapped to new column names and returns a DataFrame. Shark will raise a TypeError if columns is not a dictionary.
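
A minimal sketch of this behaviour on a plain dictionary of columns is shown below. The function rename_columns is hypothetical, and keeping unmapped names unchanged is an assumption about how partial mappings are treated.

```python
def rename_columns(data, columns):
    """Return a new dict of columns with names replaced according to `columns`.

    Hypothetical helper; names missing from the mapping are kept unchanged.
    """
    if not isinstance(columns, dict):
        raise TypeError("`columns` must be a dictionary")
    return {columns.get(name, name): values for name, values in data.items()}


data = {"a": [1, 2], "b": [3, 4]}
renamed = rename_columns(data, {"a": "alpha"})
print(list(renamed))  # ['alpha', 'b']
```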

30. drop method

The drop method accepts a single string or a list of column names as strings and returns a DataFrame without those columns. Shark will raise a TypeError if a string or list is not provided.

31. Non-aggregation methods

There are several non-aggregation methods that function similarly. All of the following non-aggregation methods return a DataFrame that is the same shape as the original.

  • abs
  • cummin
  • cummax
  • cumsum
  • clip
  • round
  • copy
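
For a single numeric column, most of these methods map onto existing numpy functions. The pairing below is a sketch of one plausible mapping, not a statement of how Shark implements them; each result has the same length as the input.

```python
import numpy as np

values = np.array([3.0, -1.5, 2.0, -4.0])

results = {
    "abs": np.abs(values),
    "cumsum": np.cumsum(values),
    "cummin": np.minimum.accumulate(values),  # running minimum
    "cummax": np.maximum.accumulate(values),  # running maximum
    "clip": np.clip(values, -2, 2),           # limit values to [-2, 2]
    "round": np.round(values, 0),
    "copy": values.copy(),
}

for name, out in results.items():
    print(name, out)
```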

32. Arithmetic and Comparison Operators

All the common arithmetic and comparison operators will be made available to our DataFrame. For example, df + 5 uses the plus operator to add 5 to each element of the DataFrame. Take a look at some of the following examples:

df + 5
df - 5
df > 5
df != 5
5 + df
5 < df
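
Expressions like these work through the special methods of the Python data model. The OpFrame class below is a hypothetical sketch covering only the operators used in the examples. Note that 5 + df calls __radd__, while 5 < df falls back to the reflected comparison, __gt__.

```python
import operator

import numpy as np


class OpFrame:
    """Hypothetical class showing operator overloading, not Shark's code."""

    def __init__(self, data):
        self._data = data  # dict mapping column name -> 1-D numpy array

    def _apply(self, op, other):
        # Apply the operator elementwise to every column.
        return OpFrame({n: op(v, other) for n, v in self._data.items()})

    def __add__(self, other):   # df + 5
        return self._apply(operator.add, other)

    def __radd__(self, other):  # 5 + df (addition is commutative)
        return self._apply(operator.add, other)

    def __sub__(self, other):   # df - 5
        return self._apply(operator.sub, other)

    def __gt__(self, other):    # df > 5, and also 5 < df
        return self._apply(operator.gt, other)

    def __ne__(self, other):    # df != 5
        return self._apply(operator.ne, other)


df = OpFrame({"a": np.array([1, 6, 10])})
print((df + 5)._data["a"])  # [ 6 11 15]
print((5 < df)._data["a"])  # [False  True  True]
```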

33. sort_values method

This method will sort the rows of the DataFrame by one or more columns. The parameter by can be either a single column name as a string or a list of column names as strings. The DataFrame will be sorted by this column or columns.
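
One way to sort by one or several columns is to compute a row order with np.argsort or np.lexsort and then apply it to every column. This sketch assumes ascending order and numeric columns; the function below is hypothetical and works on a plain dictionary of arrays.

```python
import numpy as np


def sort_values(data, by):
    """Return a new dict of columns with rows sorted by `by` (ascending).

    Hypothetical helper for illustration, not Shark's implementation.
    """
    if isinstance(by, str):
        order = np.argsort(data[by], kind="stable")
    elif isinstance(by, list):
        # np.lexsort treats the LAST key as the primary one, so reverse `by`.
        order = np.lexsort(tuple(data[name] for name in reversed(by)))
    else:
        raise TypeError("`by` must be a string or a list of strings")
    return {name: values[order] for name, values in data.items()}


data = {"g": np.array([2, 1, 2, 1]), "v": np.array([0.5, 0.9, 0.1, 0.3])}
by_one = sort_values(data, "v")
by_two = sort_values(data, ["g", "v"])
print(by_two["g"], by_two["v"])
```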

34. pivot_table method

35. Reading simple CSVs

The read_csv function will read in simple comma-separated value files (CSVs) and return a DataFrame.
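
A simple reader of this kind can be sketched with the standard csv module. Everything here is an assumption for illustration: the function name read_simple_csv, the rule for inferring column types (int, then float, then object), and the use of a dictionary of arrays in place of a Shark DataFrame.

```python
import csv
import os
import tempfile

import numpy as np


def _to_array(raw):
    """Convert a list of strings to int, then float, then object dtype."""
    for convert in (int, float):
        try:
            return np.array([convert(item) for item in raw])
        except ValueError:
            continue
    return np.array(raw, dtype="O")


def read_simple_csv(path):
    """Read a header row plus data rows into a dict of numpy arrays."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines
    return {
        name: _to_array([row[i] for row in rows])
        for i, name in enumerate(header)
    }


# Usage: write a small file, read it back, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("name,age,score\nann,31,4.5\nbob,27,3.0\n")
data = read_simple_csv(tmp.name)
os.remove(tmp.name)
print({name: values.dtype.kind for name, values in data.items()})
```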
