dataframe_column_identifier

latest version: 0.0.5

What is this?

A lightweight package that finds the columns of a Pandas DataFrame by their values.

Installing

pip install dataframe-column-identifier==0.0.5

Importing

from dataframe_column_identifier import DataFrameColumnIdentifier

KBest - Feature Selection Usage Example

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from dataframe_column_identifier import DataFrameColumnIdentifier

print(X_train.shape)
(1460, 282)

print(X_test.shape)
(1459, 282)

dfci = DataFrameColumnIdentifier()
kbest = SelectKBest(score_func=mutual_info_regression, k=10)
kbest.fit_transform(X_train, y_train)
kbest_get_support_output = kbest.get_support()

print(kbest_get_support_output)
array([False,  True, False,  True, False,  True, False,  True,  True,
       False, False,  True, False, False, False, False, False, False,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])

print(dfci.select_columns_KBest(X_train, kbest_get_support_output, verbose=1))
[
  '1stFlrSF',
  'ExterQual_TA',
  'GarageArea',
  'GarageCars',
  'GarageYrBlt',
  'GrLivArea',
  'MSSubClass',
  'OverallQual',
  'TotalBsmtSF',
  'YearBuilt'
]

X_train = dfci.transform(X_train)
X_test = dfci.transform(X_test)

print(X_train.shape)
(1460, 10)

print(X_test.shape)
(1459, 10)

print(X_train.head(10))
   1stFlrSF  ExterQual_TA  GarageArea  GarageCars  GarageYrBlt  GrLivArea  MSSubClass  OverallQual  TotalBsmtSF  YearBuilt
0     856.0           0.0       548.0         2.0       2003.0     1710.0        60.0          7.0        856.0     2003.0
1    1262.0           1.0       460.0         2.0       1976.0     1262.0        20.0          6.0       1262.0     1976.0
2     920.0           0.0       608.0         2.0       2001.0     1786.0        60.0          7.0        920.0     2001.0
3     961.0           1.0       642.0         3.0       1998.0     1717.0        70.0          7.0        756.0     1915.0
4    1145.0           0.0       836.0         3.0       2000.0     2198.0        60.0          8.0       1145.0     2000.0
5     796.0           1.0       480.0         2.0       1993.0     1362.0        50.0          5.0        796.0     1993.0
6    1694.0           0.0       636.0         2.0       2004.0     1694.0        20.0          8.0       1686.0     2004.0
7    1107.0           1.0       484.0         2.0       1973.0     2090.0        60.0          7.0       1107.0     1973.0
8    1022.0           1.0       468.0         2.0       1931.0     1774.0        50.0          7.0        952.0     1931.0
9    1077.0           1.0       205.0         1.0       1939.0     1077.0       190.0          5.0        991.0     1939.0

print(X_test.head(10))
   1stFlrSF  ExterQual_TA  GarageArea  GarageCars  GarageYrBlt  GrLivArea  MSSubClass  OverallQual  TotalBsmtSF  YearBuilt
0     896.0           1.0       730.0         1.0       1961.0      896.0        20.0          5.0        882.0     1961.0
1    1329.0           1.0       312.0         1.0       1958.0     1329.0        20.0          6.0       1329.0     1958.0
2     928.0           1.0       482.0         2.0       1997.0     1629.0        60.0          5.0        928.0     1997.0
3     926.0           1.0       470.0         2.0       1998.0     1604.0        60.0          6.0        926.0     1998.0
4    1280.0           0.0       506.0         2.0       1992.0     1280.0       120.0          8.0       1280.0     1992.0
5     763.0           1.0       440.0         2.0       1993.0     1655.0        60.0          6.0        763.0     1993.0
6    1187.0           1.0       420.0         2.0       1992.0     1187.0        20.0          6.0       1168.0     1992.0
7     789.0           1.0       393.0         2.0       1998.0     1465.0        60.0          6.0        789.0     1998.0
8    1341.0           1.0       506.0         2.0       1990.0     1341.0        20.0          7.0       1300.0     1990.0
9     882.0           1.0       525.0         2.0       1970.0      882.0        20.0          4.0        882.0     1970.0
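
For comparison, the step from a boolean support mask to column names can be sketched with plain pandas. This is only an illustration of the idea, not the package's implementation: the four-column frame reuses a few values from the output above, and the mask is a hand-written stand-in for what `kbest.get_support()` would return.

```python
import numpy as np
import pandas as pd

# Two rows of four columns copied from the example output above.
X = pd.DataFrame({
    "MSSubClass": [60.0, 20.0],
    "OverallQual": [7.0, 6.0],
    "YearBuilt": [2003.0, 1976.0],
    "GarageCars": [2.0, 2.0],
})

# Hand-written stand-in for kbest.get_support(): one flag per column of X.
mask = np.array([True, False, True, False])

# A boolean mask indexes the column labels directly.
selected = X.columns[mask].tolist()

# The same mask can also slice the frame itself.
X_reduced = X.loc[:, mask]

print(selected)
print(X_reduced.shape)
```

The mask must have exactly one entry per column of the frame it is applied to, which is why the same DataFrame that was passed to the selector is used here.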

Feature Selection Usage Example

import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from dataframe_column_identifier import DataFrameColumnIdentifier

print(X_train.shape)
(1460, 282)

print(X_test.shape)
(1459, 282)

dfci = DataFrameColumnIdentifier()
kbest = SelectKBest(score_func=mutual_info_regression, k=10)
kbest_selected_features = kbest.fit_transform(X_train, y_train)

print(kbest_selected_features.shape)
(1460, 10)

print(pd.DataFrame(kbest_selected_features).head(10))
        0    1       2       3       4       5       6    7      8    9
 0   60.0  7.0  2003.0   856.0   856.0  1710.0  2003.0  2.0  548.0  0.0
 1   20.0  6.0  1976.0  1262.0  1262.0  1262.0  1976.0  2.0  460.0  1.0
 2   60.0  7.0  2001.0   920.0   920.0  1786.0  2001.0  2.0  608.0  0.0
 3   70.0  7.0  1915.0   756.0   961.0  1717.0  1998.0  3.0  642.0  1.0
 4   60.0  8.0  2000.0  1145.0  1145.0  2198.0  2000.0  3.0  836.0  0.0
 5   50.0  5.0  1993.0   796.0   796.0  1362.0  1993.0  2.0  480.0  1.0
 6   20.0  8.0  2004.0  1686.0  1694.0  1694.0  2004.0  2.0  636.0  0.0
 7   60.0  7.0  1973.0  1107.0  1107.0  2090.0  1973.0  2.0  484.0  1.0
 8   50.0  7.0  1931.0   952.0  1022.0  1774.0  1931.0  2.0  468.0  1.0
 9  190.0  5.0  1939.0   991.0  1077.0  1077.0  1939.0  1.0  205.0  1.0

print(dfci.select_columns_by_values(X_train, kbest_selected_features, n_validation_rows=100, verbose=1))
[
  '1stFlrSF',
  'ExterQual_TA',
  'GarageArea',
  'GarageCars',
  'GarageYrBlt',
  'GrLivArea',
  'MSSubClass',
  'OverallQual',
  'TotalBsmtSF',
  'YearBuilt'
]

X_train = dfci.transform(X_train)
X_test = dfci.transform(X_test)

print(X_train.shape)
(1460, 10)

print(X_test.shape)
(1459, 10)

print(X_train.head(10))
   1stFlrSF  ExterQual_TA  GarageArea  GarageCars  GarageYrBlt  GrLivArea  MSSubClass  OverallQual  TotalBsmtSF  YearBuilt
0     856.0           0.0       548.0         2.0       2003.0     1710.0        60.0          7.0        856.0     2003.0
1    1262.0           1.0       460.0         2.0       1976.0     1262.0        20.0          6.0       1262.0     1976.0
2     920.0           0.0       608.0         2.0       2001.0     1786.0        60.0          7.0        920.0     2001.0
3     961.0           1.0       642.0         3.0       1998.0     1717.0        70.0          7.0        756.0     1915.0
4    1145.0           0.0       836.0         3.0       2000.0     2198.0        60.0          8.0       1145.0     2000.0
5     796.0           1.0       480.0         2.0       1993.0     1362.0        50.0          5.0        796.0     1993.0
6    1694.0           0.0       636.0         2.0       2004.0     1694.0        20.0          8.0       1686.0     2004.0
7    1107.0           1.0       484.0         2.0       1973.0     2090.0        60.0          7.0       1107.0     1973.0
8    1022.0           1.0       468.0         2.0       1931.0     1774.0        50.0          7.0        952.0     1931.0
9    1077.0           1.0       205.0         1.0       1939.0     1077.0       190.0          5.0        991.0     1939.0

print(X_test.head(10))
   1stFlrSF  ExterQual_TA  GarageArea  GarageCars  GarageYrBlt  GrLivArea  MSSubClass  OverallQual  TotalBsmtSF  YearBuilt
0     896.0           1.0       730.0         1.0       1961.0      896.0        20.0          5.0        882.0     1961.0
1    1329.0           1.0       312.0         1.0       1958.0     1329.0        20.0          6.0       1329.0     1958.0
2     928.0           1.0       482.0         2.0       1997.0     1629.0        60.0          5.0        928.0     1997.0
3     926.0           1.0       470.0         2.0       1998.0     1604.0        60.0          6.0        926.0     1998.0
4    1280.0           0.0       506.0         2.0       1992.0     1280.0       120.0          8.0       1280.0     1992.0
5     763.0           1.0       440.0         2.0       1993.0     1655.0        60.0          6.0        763.0     1993.0
6    1187.0           1.0       420.0         2.0       1992.0     1187.0        20.0          6.0       1168.0     1992.0
7     789.0           1.0       393.0         2.0       1998.0     1465.0        60.0          6.0        789.0     1998.0
8    1341.0           1.0       506.0         2.0       1990.0     1341.0        20.0          7.0       1300.0     1990.0
9     882.0           1.0       525.0         2.0       1970.0      882.0        20.0          4.0        882.0     1970.0

dataframe_column_identifier.DataFrameColumnIdentifier

Creating a new instance

dfci = DataFrameColumnIdentifier()

Methods

  • select_columns_by_values :

    Returns the names of the Pandas DataFrame columns which are selected based on a matrix of values.

    dfci.select_columns_by_values(X, X_columns_values, n_validation_rows=100, verbose=1)

    Parameters:

    • X : Pandas DataFrame

      A DataFrame containing the columns to be found (the DataFrame must hold the columns' values as well).

    • X_columns_values : numpy matrix

      The values of the columns to be found.

    • n_validation_rows : int, optional (default=1000)

      The number of rows that must match when columns are compared. If the given number is greater than the number of rows in X, the number of rows in X is used.

    • verbose : int, optional (default=0)

      It controls the verbosity when looking for the columns.

  • select_columns_KBest :

    Returns the names of the Pandas DataFrame columns which are selected based on the output of the SelectKBest.get_support method.

    dfci.select_columns_KBest(X, kbest_get_support_output, verbose=1)

    Parameters:

    • X : Pandas DataFrame

      The same DataFrame passed to the SelectKBest.fit_transform method.

    • kbest_get_support_output : boolean array

      The output of the SelectKBest.get_support method.

    • verbose : int, optional (default=0)

      It controls the verbosity when looking for the columns.

  • transform :

    Returns a new Pandas DataFrame containing only the columns that were selected by a previous select_columns_* call.

    dfci.transform(X)

    Parameters:

    • X : Pandas DataFrame

      The DataFrame to be transformed (it must contain the selected columns).
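
To illustrate why the number of validation rows matters when matching by values, here is a minimal sketch in plain pandas and NumPy. The helper `find_columns_by_values` is hypothetical and is not the package's code; it simply compares the leading rows of each value column against each DataFrame column. The data are the first four rows of three columns from the usage examples, where `1stFlrSF` and `TotalBsmtSF` are identical for three rows and only differ on the fourth.

```python
import numpy as np
import pandas as pd


def find_columns_by_values(X, values, n_validation_rows=1000):
    """Hypothetical helper: name the columns of X whose leading rows equal a column of values."""
    values = np.asarray(values)
    # Never compare more rows than are actually available.
    n = min(n_validation_rows, len(X), values.shape[0])
    found = []
    for j in range(values.shape[1]):
        target = values[:n, j]
        for name in X.columns:
            # Take the first unused column whose first n rows match exactly.
            if name not in found and np.array_equal(X[name].to_numpy()[:n], target):
                found.append(name)
                break
    return found


# First four rows of three columns from the usage examples.
X = pd.DataFrame({
    "1stFlrSF": [856.0, 1262.0, 920.0, 961.0],
    "TotalBsmtSF": [856.0, 1262.0, 920.0, 756.0],
    "GarageCars": [2.0, 2.0, 2.0, 3.0],
})
selected = X[["TotalBsmtSF", "GarageCars"]].to_numpy()

# Three rows cannot tell 1stFlrSF and TotalBsmtSF apart; four rows can.
few_rows = find_columns_by_values(X, selected, n_validation_rows=3)
all_rows = find_columns_by_values(X, selected, n_validation_rows=4)

print(few_rows)
print(all_rows)
```

With three rows the sketch picks `1stFlrSF` by mistake, and with four rows it finds `TotalBsmtSF`. Exact equality is used here, so columns containing NaN never match in this sketch.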

Attributes

  • selected_columns_ : Names of the Pandas DataFrame columns that were selected from the given values. Available after a select_columns_* method has been executed.
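
The pattern of storing the selected names and reusing them on another DataFrame can be sketched as follows. `ColumnSelectorSketch` and its `select` method are hypothetical stand-ins written for this illustration, not the package's class; only the attribute name `selected_columns_` is taken from the description above. The values are copied from the first two rows of the train and test outputs in the usage examples.

```python
import pandas as pd


class ColumnSelectorSketch:
    """Hypothetical stand-in that mimics the select-then-transform pattern."""

    def select(self, X, mask):
        # Remember the chosen names so they can be reused on other frames.
        self.selected_columns_ = [c for c, keep in zip(X.columns, mask) if keep]
        return self.selected_columns_

    def transform(self, X):
        # Keep only the remembered columns, in the remembered order.
        return X[self.selected_columns_]


train = pd.DataFrame({
    "GrLivArea": [1710.0, 1262.0],
    "MSSubClass": [60.0, 20.0],
    "YearBuilt": [2003.0, 1976.0],
})
test = pd.DataFrame({
    "GrLivArea": [896.0, 1329.0],
    "MSSubClass": [20.0, 20.0],
    "YearBuilt": [1961.0, 1958.0],
})

sketch = ColumnSelectorSketch()
sketch.select(train, [True, False, True])

# The names found on the training frame are applied unchanged to the test frame.
train_small = sketch.transform(train)
test_small = sketch.transform(test)

print(sketch.selected_columns_)
print(test_small.shape)
```

Because the selection is stored by name rather than by position, the test frame only needs to contain the selected columns for the transform step to work.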
