Feature Engineering: deriving statistical features using pandas aggregate function


When working with anonymized or machine-generated datasets, it is easy to run out of ideas for new features because it is unclear what the variables in the dataset represent. Take, for example, the following dataframe:

import pandas as pd

df = pd.read_csv("../datasets/Updated_Test.csv")
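# Take an explicit copy so that adding columns later does not trigger a SettingWithCopyWarning
data = df[df.columns[2:13]].head(10).copy()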
data
   absorbance0  absorbance1  absorbance2  absorbance3  absorbance4  absorbance5  absorbance6  absorbance7  absorbance8  absorbance9  absorbance10
0     0.517951     0.520508     0.526852     0.531611     0.536816     0.543828     0.547761     0.554379     0.565622     0.575762      0.590253
1     0.517839     0.522367     0.525186     0.534661     0.541900     0.546180     0.551687     0.556753     0.566446     0.578208      0.591039
2     0.517702     0.522018     0.527237     0.534374     0.541155     0.547152     0.549837     0.557513     0.566793     0.580574      0.592258
3     0.525008     0.527439     0.536871     0.539636     0.546555     0.553183     0.558826     0.563549     0.575675     0.587214      0.597155
4     0.520532     0.522683     0.526842     0.534634     0.539676     0.547488     0.552688     0.558355     0.568959     0.578905      0.591207
5     0.526705     0.532114     0.536730     0.541363     0.549652     0.553074     0.558868     0.564017     0.576104     0.583493      0.598938
6     0.520870     0.531424     0.534101     0.538626     0.543272     0.551315     0.555033     0.563571     0.571861     0.583686      0.596506
7     0.527200     0.525328     0.532838     0.537154     0.543959     0.549961     0.557294     0.559478     0.572084     0.584293      0.597304
8     0.525830     0.532805     0.535670     0.539697     0.546112     0.551254     0.556557     0.564790     0.575007     0.583733      0.598626
9     0.533108     0.530202     0.537924     0.540190     0.548308     0.553694     0.558700     0.562952     0.574196     0.584925      0.597308

This data is machine generated: what we’re seeing here is 10 rows of blood spectroscopy readings. (There are well over 150 absorbance columns, but since our focus is on generating features, let’s just work with these 11 columns.)
We can use the pandas aggregate function to compute statistical measures such as the mean, median and variance across the columns of each row, and add the results as new features that may improve the performance of a machine learning model trained on this data. The modeling bit aside, let’s dive straight into designing our extra features.
The statistical measures we’re going to aggregate are:

  • mean
  • median
  • sum
  • count
  • max
  • min
  • standard deviation
  • variance
  • skewness
  • kurtosis

Let’s quickly wrap them all up in a loop:

# Freeze the list of original columns so that an aggregated feature
# is never included in the calculation of another aggregated feature
features = data.columns.to_list()

# display() is available in Jupyter/IPython sessions
print("Dataframe before adding aggregate features:\n\n")
display(data.head(3))

# Compute each statistic row-wise (axis=1) over the original features only
for funct in ['mean', 'median', 'sum', 'count', 'max', 'min', 'std', 'var', 'skew', 'kurt']:
    data[f'agg_{funct}'] = data[features].agg(funct, axis=1)

print("\n\nDataframe after adding aggregate features:\n\n")
display(data.head(3))
Dataframe before adding aggregate features:

   absorbance0  absorbance1  absorbance2  absorbance3  absorbance4  absorbance5  absorbance6  absorbance7  absorbance8  absorbance9  absorbance10
0     0.517951     0.520508     0.526852     0.531611     0.536816     0.543828     0.547761     0.554379     0.565622     0.575762      0.590253
1     0.517839     0.522367     0.525186     0.534661     0.541900     0.546180     0.551687     0.556753     0.566446     0.578208      0.591039
2     0.517702     0.522018     0.527237     0.534374     0.541155     0.547152     0.549837     0.557513     0.566793     0.580574      0.592258

Dataframe after adding aggregate features:

   absorbance0  absorbance1  absorbance2  absorbance3  absorbance4  absorbance5  absorbance6  absorbance7  absorbance8  absorbance9  ...  \
0     0.517951     0.520508     0.526852     0.531611     0.536816     0.543828     0.547761     0.554379     0.565622     0.575762  ...
1     0.517839     0.522367     0.525186     0.534661     0.541900     0.546180     0.551687     0.556753     0.566446     0.578208  ...
2     0.517702     0.522018     0.527237     0.534374     0.541155     0.547152     0.549837     0.557513     0.566793     0.580574  ...

   agg_mean  agg_median   agg_sum  agg_count   agg_max   agg_min   agg_std   agg_var  agg_skew   agg_kurt
0  0.546486    0.543828  6.011343         11  0.590253  0.517951  0.023236  0.000540  0.622384  -0.477046
1  0.548388    0.546180  6.032264         11  0.591039  0.517839  0.023451  0.000550  0.465583  -0.609624
2  0.548783    0.547152  6.036613         11  0.592258  0.517702  0.023911  0.000572  0.520019  -0.569977

3 rows × 21 columns

Quick and efficient! We have added ten more features to our dataset.
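As a variation on the loop above, the aggregate function also accepts a list of function names, which computes all the statistics in one call. The snippet below is only a sketch: the `demo` dataframe holds made-up random values standing in for the absorbance columns, and the names `demo`, `funcs`, `stats` and `demo_with_stats` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the absorbance columns (values are random, not real readings)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((5, 4)),
    columns=[f"absorbance{i}" for i in range(4)],
)

features = demo.columns.to_list()
funcs = ["mean", "median", "std", "skew"]

# With a list of names and axis=1, agg returns a dataframe
# with one row per input row and one column per function
stats = demo[features].agg(funcs, axis=1).add_prefix("agg_")

# Attach all the new columns to the original features in a single step
demo_with_stats = pd.concat([demo, stats], axis=1)
print(demo_with_stats.head(3))
```

Because every statistic is computed from `demo[features]` before anything is attached, no aggregated column can leak into another one. Column dtypes may differ slightly from the loop version (for instance, a count may come back as a float).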
Other function names that can be passed to the pandas aggregate function are listed in the pandas documentation. Note, however, that not all of them return a pandas Series (a single column).
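To illustrate that last point, here is a small sketch comparing a reducing function with a non-reducing one. The tiny `demo` dataframe and its values are invented for the example; `cumsum` is used only as one instance of a function name that does not collapse each row to a single value.

```python
import pandas as pd

# Tiny made-up dataframe; the values are for illustration only
demo = pd.DataFrame({
    "absorbance0": [0.51, 0.52],
    "absorbance1": [0.53, 0.55],
    "absorbance2": [0.56, 0.59],
})

# A reduction such as 'mean' yields one value per row: a Series
row_mean = demo.agg("mean", axis=1)

# 'cumsum' is not a reduction: the result keeps the input's shape
row_cumsum = demo.agg("cumsum", axis=1)

print(type(row_mean).__name__, row_mean.shape)
print(type(row_cumsum).__name__, row_cumsum.shape)
```

Only results of the first kind can be assigned directly to a single new column, so it is worth checking the type of the result before adding it as a feature.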