Getting To Know Your Features In Seconds With RAPIDS

Running Your Data Preprocessing Pipeline End-to-End on GPU

Louise Ferbach
Towards Data Science


This blog post presents the feature selection notebook I recently published for the ongoing Mechanisms of Action (MoA) Kaggle competition.

Exploratory data analysis, and especially getting an idea of feature importance, is a crucial step in any data science problem. However, it can be a trying one if your dataset is huge, with numerous interconnections that make the analysis costly in computing time.

GPUs are great at cutting computing time… if you can use them! In practice, they are mostly dedicated to training neural networks, thanks to powerful frameworks (TensorFlow and PyTorch are the most famous) that have enabled millions of developers to unleash their power.

However, if you have some messy preprocessing to get through with pandas or scikit-learn, it can be quite heavy when your data is big, since these packages offer no option to run on GPUs. In short, you won't be able to get the most out of your GPU at this stage.

RAPIDS is a suite of packages developed by NVIDIA that aims to execute end-to-end data science and analytics pipelines entirely on GPUs.

The goal of this notebook is to perform univariate regressions of each target on every feature in the MoA competition, namely 872 × 206 = 179,632 logistic models to estimate separately.

The good news? This is possible within minutes with RAPIDS!

You may note that I never import pandas (replaced by cuDF) or scikit-learn (replaced by cuML).

This notebook is meant as a tutorial to help you get familiar with these libraries. By the end, you'll be able to select, among your features, those that appear most crucial for predicting your targets. That will let you build robust, medium-sized models with better interpretability.

If you want to skip the processing, you can jump directly to the results here.

First step: Getting familiar with cuDF

First, I import all the necessary packages and load the data.

You can see that cuDF is the exact equivalent of pandas, except that it does everything on GPU: the data is read directly into GPU memory with read_csv, and any commonplace DataFrame operation works with the same functions and syntax, as you can tell from my use of the merge method.
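
To give an idea of what this looks like in practice, here is a minimal sketch (not the notebook's exact code; the file paths assume the competition's standard Kaggle dataset layout):

    import cudf  # GPU DataFrame library, drop-in replacement for pandas

    # read_csv loads the CSVs straight into GPU memory
    features = cudf.read_csv("../input/lish-moa/train_features.csv")
    targets = cudf.read_csv("../input/lish-moa/train_targets_scored.csv")

    # Same merge syntax as in pandas, executed on the GPU
    data = features.merge(targets, on="sig_id", how="inner")
    print(data.shape)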

Second step: Getting familiar with cuML

This function is designed to return a feature importance metric.

Initially, I record the binary cross-entropy loss of the univariate logistic regressions performed, for each target, on every feature (nearly 180k of them in total). Univariate analyses are a crude but quite reliable and widely used way of estimating a feature's importance for a given target: they let you isolate its solo explanatory power for that target.

You can see that the cuML classes (here LogisticRegression), functions (here log_loss) and methods (here fit or predict_proba) are also exact equivalents of the sklearn counterparts you are probably used to.
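
A simplified sketch of that kind of helper, together with a naive loop assembling the loss matrix, could look as follows (the helper name and loop structure are illustrative rather than the notebook's actual code; the "g-"/"c-" prefixes refer to the MoA feature columns):

    import cudf
    from cuml.linear_model import LogisticRegression
    from cuml.metrics import log_loss

    def univariate_loss(feature_col, target_col):
        """Fit a single-feature logistic regression on the GPU and
        return its binary cross-entropy loss (illustrative helper)."""
        X = feature_col.values.reshape(-1, 1)   # cuDF Series -> CuPy column vector
        y = target_col.values
        model = LogisticRegression()
        model.fit(X, y)
        proba = model.predict_proba(X)          # same API as scikit-learn
        return log_loss(y, proba)

    # One univariate model per (feature, target) pair
    feature_cols = [c for c in data.columns if c.startswith("g-") or c.startswith("c-")]
    target_cols = [c for c in targets.columns if c != "sig_id"]

    losses = cudf.DataFrame(index=feature_cols)
    for t in target_cols:
        losses[t] = [univariate_loss(data[f], data[t]) for f in feature_cols]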

However, I have to scale these losses (so that they are comparable across targets, which can be more or less difficult to predict) and invert the order of the values for better interpretability: indeed, you would expect a feature's importance to decrease as the loss of predicting a given target from that feature increases. For this step, I use the cuML MinMaxScaler preprocessing class, which works just as in sklearn.
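
Continuing the sketch, and assuming a cuML version where MinMaxScaler is exposed under cuml.preprocessing (older releases had it under cuml.experimental.preprocessing):

    import cudf
    from cuml.preprocessing import MinMaxScaler

    # losses: one row per feature, one column per target
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(losses)   # each target's losses rescaled to [0, 1]

    # Invert so that a low loss becomes a high importance score,
    # and restore the feature/target labels
    importance = cudf.DataFrame(1.0 - scaled)
    importance.index = losses.index
    importance.columns = losses.columns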

Third step: Interpretation

Calculating a feature's average importance across all targets gives a global view of how crucial it is for your overall model.

However, I'm not sure you should rely entirely on this indicator when deciding which features to keep and which to dismiss: a given feature could have very high explanatory power for one specific target and be irrelevant for all the others, so even though it would receive a low average importance score, keeping it would be decisive for final model quality on that target. We will therefore keep the features that either have a high average explanatory power across all targets (mean importance threshold), or that are particularly relevant for some of them (maximal importance threshold). Feel free to play with the thresholds!
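
A sketch of that selection rule, with purely hypothetical threshold values:

    # importance: rows = features, columns = targets (from the previous sketch)
    mean_importance = importance.mean(axis=1)
    max_importance = importance.max(axis=1)

    # Hypothetical thresholds -- tune them for your own data
    MEAN_THRESHOLD = 0.6
    MAX_THRESHOLD = 0.9

    keep = (mean_importance > MEAN_THRESHOLD) | (max_importance > MAX_THRESHOLD)
    selected_features = importance[keep].index
    print(len(selected_features), "features kept out of", importance.shape[0])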

You will probably have noticed the small to_pandas() method I applied to my data: since it is stored on GPU, I need to convert it to host (CPU) memory for plotting purposes only. That is the equivalent of the .cpu() call you make on a PyTorch tensor at the end of a CUDA computation, for example.

Further analysis, based on the feature selection performed with the chosen thresholds, leads to the following conclusions:

  • the features that fulfill both criteria have, on average across all targets, 0.15 more average importance than the features that fulfill neither.
  • the features that fulfill one criterion have, on average across all targets, 0.19 more maximum importance than the features that fulfill neither.

This is quite a reassuring check on our analysis.

Fourth step: Conclusion

To get a global overview, we can simply compare heatmaps of scaled feature importance by target for the two subsets: the features that fulfill both criteria and the ones that fulfill neither. It is immediately visible that the first graph is much lighter, meaning the solo importances are higher.
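
Here is a sketch of how such a comparison could be plotted, reusing the hypothetical variables from the earlier snippets; note the to_pandas() calls moving the GPU data to host memory before plotting:

    import matplotlib.pyplot as plt
    import seaborn as sns

    both = (mean_importance > MEAN_THRESHOLD) & (max_importance > MAX_THRESHOLD)
    neither = ~((mean_importance > MEAN_THRESHOLD) | (max_importance > MAX_THRESHOLD))

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    sns.heatmap(importance[both].to_pandas(), ax=axes[0], vmin=0, vmax=1)
    axes[0].set_title("Features fulfilling both criteria")
    sns.heatmap(importance[neither].to_pandas(), ax=axes[1], vmin=0, vmax=1)
    axes[1].set_title("Features fulfilling neither criterion")
    plt.show()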

You can now rely on your selected features and use them to build a lighter model that is easier to run and much more interpretable as well!
