NLP With RAPIDS, Yes It’s Possible!

Principal Component Analysis on Sentence Embeddings

Louise Ferbach
Towards Data Science


Image by the author. Source code

In this blog post, I will show how you can perform standard data analysis on complex data, such as sentence embeddings from transformers. The full notebook can be found here.

In the Actuarial Loss Competition on Kaggle, we have to predict insurance claim costs for workers' injuries. We are given a number of classic features, such as age, salary, number of dependent children, etc., but also a short text description of the nature and circumstances of the accident, from which we have to predict the ultimate cost of the claim.

The specificity of this work is twofold: first, I performed it end-to-end on GPU with RAPIDS, whose libraries CuPy, cuDF and cuML let you reproduce your usual NumPy, pandas or scikit-learn manipulations on GPU; second, I try to extract information about a numerical target (the ultimate claim cost) based only on text data (the description of the injury).

Environment Setup

The sentence embeddings will be generated with a PyTorch Transformer, then they will be processed by Principal Component Analysis on RAPIDS.

First, make sure that you have a RAPIDS environment installed (I recommend version 0.18). I am using a ZBook Studio with an NVIDIA RTX 5000 that came with the Z by HP Data Science Software Stack, so I already had everything installed and properly set up (creating and managing environments can be a pain, I know). If you're running on a Kaggle kernel, be sure to add the RAPIDS repository and install RAPIDS with the following:
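What follows is a hedged sketch of such an install cell, based on the conda instructions for the RAPIDS 0.18 release (on Kaggle, attaching the community RAPIDS dataset and running its setup script is a common alternative); adapt the Python and CUDA toolkit versions to your environment:

```python
# Sketch of a RAPIDS 0.18 install cell in a notebook; the exact versions are
# assumptions and should match your driver and base environment.
!conda install -y -c rapidsai -c nvidia -c conda-forge \
    rapids=0.18 python=3.7 cudatoolkit=11.0
```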

Then, install the sentence-transformers library and the other requirements:
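This one is essentially a single pip call (sentence-transformers pulls in transformers and PyTorch as dependencies):

```python
# Install the sentence-transformers library (and, through it, transformers/torch).
!pip install -q sentence-transformers
```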

You’re all set!

Reading and Processing Data

Reading the data and extracting the claim description is quite straightforward. Moreover, the text data is already clean; we only have to convert it to lower case (most BERT derivatives are built for uncased text data).
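As a sketch, assuming hypothetical file and column names (adapt them to the actual competition files), this boils down to:

```python
import cudf

# Read the competition data directly into GPU dataframes.
train = cudf.read_csv("train.csv")
test = cudf.read_csv("test.csv")

# Most uncased BERT derivatives expect lower-cased text.
train["ClaimDescription"] = train["ClaimDescription"].str.lower()
test["ClaimDescription"] = test["ClaimDescription"].str.lower()
```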

Raw Embeddings from Torch Sentence Transformer

For sentence embeddings, we want to map a variable-length input text to a fixed-size dense vector. This can be done with sentence transformers: as you know, transformers are encoder-decoder architectures, where the encoder maps the input to an inner representation, which is then mapped to the output by the decoder. The embeddings will be this inner vectorized representation of the inputs, output by the encoder before passing through the decoder that adapts it to the task we are training on.

Sentence Transformers are just the usual transformers that everybody involved in NLP has heard about, that is to say, originally BERT and all its derivatives: versions for other languages (CamemBERT, etc.), robust versions (RoBERTa), and light versions (DistilBERT, DistilRoBERTa). The only difference is the tasks they have been trained on, which involve processing text at the sentence level instead of the original word level. Their uses are very wide: semantic similarity scoring, semantic search, clustering, paraphrase mining, and more.

The model I have chosen to compute my embeddings is DistilRoBERTa for paraphrase scoring.

Why did I choose the paraphrase model? Well, it's a purely personal choice; feel free to use another model. I just thought this model would be better at capturing the overall meaning of a sentence without focusing too much on the choice of one word over a synonym (that's basically what paraphrasing is), and that was my objective, since the descriptions are very brief and functional.

It is actually quite a big model, despite being a lighter version of the original RoBERTa: it still has more than 82 million trainable parameters…
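Loading it through the sentence-transformers API is a single call; the checkpoint name below is the paraphrase DistilRoBERTa model as published on the sentence-transformers model hub (rename it if the hub has moved it since):

```python
from sentence_transformers import SentenceTransformer

# Paraphrase-tuned DistilRoBERTa, loaded onto the GPU.
model = SentenceTransformer("paraphrase-distilroberta-base-v1", device="cuda")
```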

The overall process of getting sentence embeddings is actually very simple.

First, for each input sentence, you have to map words to their ids in the vocabulary used by the model, adding <START> and <END> tokens.

Then, we have to add <PAD> tokens at the end so that our input fits the model's maximum input sequence length.

In my case, the original model's maximum sequence length was 128, but the maximum length of my inputs (including the <START> and <END> tokens) was 21, so I set the maximum sequence length to 25 to be safe. That saves us a lot of unnecessary calculations and computation time.
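To make the special tokens, padding and length cap concrete, here is a small sketch; the example sentence is invented, and note that the RoBERTa tokenizer spells the special tokens <s>, </s> and <pad>:

```python
from transformers import AutoTokenizer

# Cap the sequence length: our inputs are at most 21 tokens, so 25 is safe.
model.max_seq_length = 25

# Peek at the tokenization of an invented claim description.
tok = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-distilroberta-base-v1"
)
enc = tok(
    "worker strained lower back while lifting boxes",
    padding="max_length", max_length=25, truncation=True,
)
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# -> ['<s>', ..., '</s>', '<pad>', '<pad>', ...]  padded up to length 25
```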

Then, our model maps the input ids to their corresponding token embeddings, which are the sum of word embeddings, position embeddings (are we the first, second, third word?) and token type embeddings (are we a normal word? a <START>?). Finally, we just normalize the features.

Now, for each token, we have an initial tensor representation, which we feed to the encoder (I will not go into the details here; for more on this, refer to the paper Attention Is All You Need, which introduced the encoder-decoder attention architecture).
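In practice, all of these steps are wrapped in a single encode call; a minimal sketch, where the batch size and the column name are assumptions:

```python
import cupy as cp

def embed(df):
    # cuDF column -> Python list of strings for the transformer, then back to GPU.
    sentences = df["ClaimDescription"].to_arrow().to_pylist()
    emb = model.encode(sentences, batch_size=256, show_progress_bar=True)
    return cp.asarray(emb)

train_emb = embed(train)   # shape: (n_train, 768)
test_emb = embed(test)     # shape: (n_test, 768)
```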

The encoder outputs a 768-dimensional tensor representation of any input sequence. Now, what do we do with it? First of all, we have to make sure these embeddings will be useful when fed to a regression model for predicting the ultimate claim amount. That would mean, ideally, having orthonormal axes, with coordinates carrying real explanatory power and enabling clustering of accident types or injury severity.

Sadly, this is far from being the case: the original RoBERTa paraphrase model was trained on a huge and eclectic dataset, whereas our inputs are semantically very similar, all revolving around health.

For each of the 768 dimensions of my normalized sentence embeddings, I computed the coordinate's variance across my 90,000 text samples. You can see on this histogram that most variances are very low: as a reference, the variance of a U([0,1]) distribution is 1/12 ≈ 0.0833.

My representations are therefore likely to be highly similar.
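A minimal sketch of that variance check with CuPy (here computed on the train and test embeddings together; the exact plot styling in the notebook differs):

```python
import matplotlib.pyplot as plt

all_emb = cp.concatenate([train_emb, test_emb], axis=0)
variances = cp.var(all_emb, axis=0)   # one variance per embedding dimension

plt.hist(cp.asnumpy(variances), bins=50)
plt.axvline(1 / 12, color="red", label="variance of U([0,1]) = 1/12")
plt.xlabel("per-dimension variance of the embeddings")
plt.legend()
plt.show()
```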

Processed Embeddings from PCA with RAPIDS

At this point, we are facing two problems: first, the embeddings are too big (768 components, when we have only 12 additional training features); second, we would prefer them to bring more explanatory and separating power with respect to our target, the ultimate incurred claim cost.

PCA is used in exploratory data analysis and for making predictive models. It performs dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. Then, the second component is the direction orthogonal to the first that maximizes data variance, etc.

That procedure iteratively gives orthonormal vectors in decreasing order of explained data variance. Projecting the data onto the first n components performs a dimensionality reduction into R^n while preserving as much of the data variance as possible.

Note that I have performed the principal component analysis on all claim description embeddings, from both the train and test datasets, to get a fit that suits both. The first principal component accounts for 7.6% of the total data variance, the second for 3.9%. In total, they explain 11.5% of the raw embeddings' variability.
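With cuML, the whole fit stays on the GPU; a sketch under the same assumptions as above:

```python
from cuml.decomposition import PCA

pca = PCA(n_components=2)
projections = pca.fit_transform(all_emb)   # shape: (n_samples, 2), still on GPU

# In this run, the first two components explained about 7.6% and 3.9%
# of the total variance.
print(pca.explained_variance_ratio_)
```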

Now, it was when I tried to visualize the data points' projections onto this 2-D space that I had a striking surprise:

Isn't that cute? Now, there are several interpretations to draw from this. First, we very clearly see orthogonal variability directions. Second, there seem to be two clusters of data points, partially merging at lower values on the two axes.

Target Explainability

Based on that observation, I wondered whether the two clusters could bring predictive power for the target, the ultimate incurred claim cost. Therefore, I decided to plot the projections of all train data points (we obviously don't have target values for the test data points…) while coloring them by target value.

At that point, I ran into a readability problem: target values are obviously lower-bounded (a claim cost is necessarily positive), but not upper-bounded, and some of them were so extreme that they blew up my colormap, with most values crushed to the bottom and a few of them skyrocketing to the top colors. Therefore, I decided to cut off the uppermost quintile of my data in order to get a more regular distribution.
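Here is a sketch of that plot; the target column name is an assumption, and the top 20% of claim costs are dropped before plotting:

```python
import matplotlib.pyplot as plt

n_train = len(train)
train_proj = projections[:n_train]                  # train rows come first

target = train["UltimateIncurredClaimCost"].values  # CuPy array from cuDF
keep = target < cp.percentile(target, 80)           # drop the uppermost quintile

plt.scatter(
    cp.asnumpy(train_proj[keep][:, 0]),
    cp.asnumpy(train_proj[keep][:, 1]),
    c=cp.asnumpy(target[keep]),
    s=2,
    cmap="viridis",
)
plt.colorbar(label="ultimate incurred claim cost")
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```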

Image by the author. Source code

That is a really impressive result, as it illustrates the clustering power with respect to the target provided by principal component analysis on text embeddings, without any additional feature. Here, we very clearly see that the right half of the heart accounts for higher costs, while the left half contains the lower claims.

Thanks for reading! You can find the full source Kaggle notebook here.
