Generalized Semantic Regression using Contextual Embeddings

In many applications, actuaries, data scientists, or researchers are confronted with datasets like the one shown in Table 1. While numeric variables, as in the first two columns, can be used in a model like a GLM without further adjustments, categorical variables need further treatment. Popular approaches are to perform a dummy regression or to apply techniques like target encoding. An open question remains how to treat free-text fields such as the one in the Description column. In Table 1 there is a strong indication that the text field affects the outcome column. In the given example, one can think of a legal protection insurance application where the outcome column is the number of court cases for downloading films. If the user has access to a media centre, the probability of downloading films and being sued afterwards might be higher. But maybe it is the name of the console; we don’t know. This article provides a contribution on how to treat the text field for further statistical analysis.

Table 1:
CPU | RAM | Category | Description                      | Outcome
8   | 16  | A        | The XBOX has a media center      | 6
1   | 0.5 | B        | The Atari has no media center    | 1
1   | 2   | B        | The FireStick has a media center | 7

To obtain a numeric representation of the text part, one can rely on BERT. BERT stands for Bidirectional Encoder Representations from Transformers and is a natural language processing (NLP) model developed by Google. It has gained significant popularity for its ability to understand the context and meaning of words in a sentence. The integration of BERT [1] embeddings into a Generalized Linear Model (GLM) presents an exciting opportunity to combine the power of contextual embeddings with regression analysis. GLMs provide a flexible framework for modeling the relationship between an outcome variable and covariates. By incorporating contextual embeddings from BERT, the GLM can benefit from the rich semantic information captured by BERT’s representations.

This article assumes that the reader is familiar with the transformer architecture [2], since it will not be covered in detail here.

1. Word (sentence) Embeddings

In a nutshell, an embedding is a clever way to represent a word or a sentence for use in a later estimator. We already covered a very simple tokenized representation in our article about LSTM. While a tokenizer assigns numerical values to individual tokens without considering their relationships, embeddings enable a direct connection between tokens.

Mathematically, an embedding can be represented as a mapping: let \phi: W \rightarrow \mathbb{R}^N be an embedding, where W is the word space and \mathbb{R}^N is an N-dimensional vector space.

1.1 Example

Let’s start with a simple example and consider the alphabet as our word space W=\mathbb{A}=\{a,b,c, \dots, y,z, A,B,C, \dots, Y,Z\}. A simple tokenization would assign a numerical value to each token, such as \phi_1(a)=1, \phi_1(b)=2, \dots, \phi_1(z)=26, \phi_1(A)=27, \dots. However, embeddings provide a more flexible representation that can incorporate additional information without significantly increasing memory consumption. Using a different representation, such as \phi_2(a)=(1,0), \phi_2(b)=(2,0), \dots, \phi_2(z)=(26,0), \phi_2(A)=(1,1), \dots, enables the embedding to relate lowercase letters to uppercase letters through one dimension and lowercase “a” to lowercase “b” through the other. It’s important to note that the embedding function \phi is usually learned by the model itself and fine-tuned to the specific task, making embeddings a form of automatic feature engineering tailored to the problem at hand.
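The two toy mappings from this example can be written down directly. The following is a minimal sketch; the function names `phi_1` and `phi_2` simply mirror the notation above and are not taken from any library.

```python
import string

# Word space W: the lowercase letters followed by the uppercase letters.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase


def phi_1(token: str) -> int:
    """Plain tokenization: a -> 1, ..., z -> 26, A -> 27, ..., Z -> 52."""
    return ALPHABET.index(token) + 1


def phi_2(token: str) -> tuple:
    """Two-dimensional embedding: (position in the alphabet, uppercase flag)."""
    position = string.ascii_lowercase.index(token.lower()) + 1
    return (position, int(token.isupper()))


print(phi_1("a"), phi_1("z"), phi_1("A"))  # 1 26 27
print(phi_2("a"), phi_2("b"), phi_2("A"))  # (1, 0) (2, 0) (1, 1)
```

With `phi_1`, “a” and “A” are 26 apart and nothing indicates that they are the same letter. With `phi_2`, they share the first coordinate and differ only in the second.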

Another useful feature of embeddings is their ability to capture semantic relationships between words, such as \phi(king)-\phi(man)+\phi(woman) \approx \phi(queen). This was observed in prior work [3] and has been further advanced with transformer-based models. Since [2], contextual embeddings obtained through transformers have become the state-of-the-art technique. The actual choice of the embedding function \phi may vary depending on the specific task. In practice, due to limited training data or budget restrictions, a base model is often used, and the corresponding embeddings are fine-tuned to derive a suitable \phi for the designated task. BERT (Bidirectional Encoder Representations from Transformers) is a popular choice for such a base model. BERT utilizes the encoder part of the transformer model to derive sentence embeddings and employs a stack of 12 encoders (or 24 for BERT large).
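The king/queen relationship mentioned above can be illustrated with a hand-made toy embedding. All vectors below are invented for illustration (the two axes are meant as "royalty" and "gender"); they are not taken from any trained model, and learned embeddings only satisfy such relations approximately.

```python
import math

# Hypothetical 2-d embedding with axes (royalty, gender); values are made up.
TOY_EMBEDDING = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 1.0),
    "princess": (0.8, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Word closest to phi(a) - phi(b) + phi(c), excluding the three inputs."""
    target = tuple(
        x - y + z
        for x, y, z in zip(TOY_EMBEDDING[a], TOY_EMBEDDING[b], TOY_EMBEDDING[c])
    )
    candidates = [w for w in TOY_EMBEDDING if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(TOY_EMBEDDING[w], target))


print(analogy("king", "man", "woman"))  # queen
```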

2. Generalized Semantic Regression

Moving on to the integration of contextual embeddings into a Generalized Linear Model (GLM), this section explores a way to combine the power of embeddings with regression analysis. GLMs are a class of models that assume a specific distribution for the outcome variable and relate it to the covariates; here, the sentences enter as additional covariates through their embeddings. GLMs were already covered in this article, while neural networks are covered in this post.

Let the outcome Y be a random variable with a distribution from the exponential family, i.e. the density can be written as f(y|\theta)=\exp\left(\frac{y\theta - b(\theta)}{a(\gamma)} + c(y,\gamma)\right), where \gamma\in \mathbb{R}^+ is a scaling parameter, also known as the dispersion parameter. We assume that there exists some link function g such that for outcome Y with independent covariates X and sentences S, as well as an unknown embedding \phi, \mu=E(Y|X_S)= g^{-1}(\theta). Here X_S=(X_1,\dots,X_K, L^{-1} \sum_{i=1}^L \phi_1(S_i), \dots, L^{-1} \sum_{i=1}^L \phi_N(S_i)) may include K covariates as well as L sentences embedded into an N-dimensional contextual embedding space, where \phi_j denotes the j-th component of \phi. A nice feature is that for given \phi the variance function has the form

(1)   \begin{equation*}\mathrm{Var}(Y \vert X_S)=a(\gamma) \cdot V( E[Y \vert X_S] ),\end{equation*}

meaning that the variance is a function of the mean. To model \theta we rely on a feedforward network (a more complex network could be used here as well) such that

(2)   \begin{equation*}\arraycolsep=1.4pt\def\arraystretch{2.2}\begin{array}{l}Z_m= \sigma( \alpha_{0m} + \alpha^T_{m} X_S), \quad m=1,\dots, M \\ \theta= o(\beta_{0}+\beta^T\mathbf{Z})\end{array}\end{equation*}

As an activation function we use \sigma(x)= \tanh(x), while the output function of the net corresponds to the inverse of the log link function, o(x) = g^{-1}(x)=\exp(x). It’s important to note that if \sigma is the identity function, then the entire model reduces to a regular GLM (with M=1). Hence, a neural network can be thought of as a nonlinear generalization of a classic linear model.
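To make the construction of X_S and the network in (2) concrete, here is a minimal pure-Python sketch of a single forward pass. The sentence embeddings are random stand-ins for \phi(S_i), the weights are untrained, and the helper names (`mean_embedding`, `build_x_s`, `forward`) are chosen for this illustration only.

```python
import math
import random


def mean_embedding(sentence_embeddings):
    """Component-wise average of L sentence embeddings, each of length N."""
    n_sentences = len(sentence_embeddings)
    return [sum(component) / n_sentences for component in zip(*sentence_embeddings)]


def build_x_s(covariates, sentence_embeddings):
    """X_S: the K covariates followed by the N averaged embedding components."""
    return list(covariates) + mean_embedding(sentence_embeddings)


def forward(x_s, alpha0, alpha, beta0, beta):
    """Equation (2): Z_m = tanh(alpha0[m] + alpha[m]^T x_s), output o(beta0 + beta^T Z), o = exp."""
    z = [
        math.tanh(a0 + sum(a * x for a, x in zip(a_m, x_s)))
        for a0, a_m in zip(alpha0, alpha)
    ]
    return math.exp(beta0 + sum(b * z_m for b, z_m in zip(beta, z)))


rng = random.Random(0)
K, N, M = 2, 4, 3  # illustrative sizes only

covariates = [0.5, -1.2]
# Random stand-ins for phi(S_1), phi(S_2); a real model would supply these.
sentence_embeddings = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(2)]
x_s = build_x_s(covariates, sentence_embeddings)

# Untrained, randomly initialised weights.
alpha0 = [0.0] * M
alpha = [[rng.uniform(-0.5, 0.5) for _ in range(K + N)] for _ in range(M)]
beta0 = 0.1
beta = [rng.uniform(-0.5, 0.5) for _ in range(M)]

output = forward(x_s, alpha0, alpha, beta0, beta)
print(len(x_s), output > 0)  # 6 True
```

Because the output function is the exponential, the network output is always positive, which is what a count or claim-amount model requires.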

To derive the loss function to be used for fitting the network, we rely on maximum likelihood estimation. In general, the negative log-likelihood given n observations where the outcome variables are from the exponential family is given by:

(3)   \begin{equation*}l=-\sum_{i=1}^n \left( \frac{y_i\theta_i -b(\theta_i)}{a(\gamma)} + c(y_i, \gamma) \right)\end{equation*}

In the following, we will use the negative log-likelihood as the loss function to fine-tune an uncased BERT model and the corresponding feedforward net at once. However, the concept is not restricted to BERT; any method that generates a contextual embedding can be combined with a GLM.
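As a sanity check on the loss in (3), the following sketch evaluates the exponential-family negative log-likelihood for generic b and c and compares it with the Poisson log-probabilities computed directly. The Poisson choices b(\theta)=\exp(\theta), c(y)=-\log(y!), and a(\gamma)=1 are the ones used later in the simulation; the function names and the small data vectors are made up for this illustration.

```python
import math


def exp_family_nll(y, theta, b, c, a_gamma=1.0):
    """Negative log-likelihood (3): -sum_i [(y_i*theta_i - b(theta_i)) / a(gamma) + c(y_i)]."""
    return -sum((y_i * t_i - b(t_i)) / a_gamma + c(y_i) for y_i, t_i in zip(y, theta))


def poisson_b(theta):
    """Cumulant function of the Poisson distribution."""
    return math.exp(theta)


def poisson_c(y):
    """Normalising term -log(y!), computed via the log-gamma function."""
    return -math.lgamma(y + 1)


y = [0, 2, 5]                       # made-up counts
lam = [0.5, 1.5, 4.0]               # made-up Poisson means
theta = [math.log(l) for l in lam]  # canonical parameter theta = log(lambda)

nll = exp_family_nll(y, theta, poisson_b, poisson_c)

# Cross-check against the Poisson probability mass function.
direct = -sum(
    math.log(math.exp(-l) * l**k / math.factorial(k)) for k, l in zip(y, lam)
)
print(abs(nll - direct) < 1e-9)  # True
```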

3. RiskBERT

Insurance risk models traditionally rely on structured data, such as policyholder demographics, claims history, and property characteristics, to assess risk and calculate insurance premiums. However, unstructured data, including textual information from policy documents, claims descriptions, and external data sources, can provide valuable insights for risk assessment.

Incorporating BERT embeddings into insurance risk models allows for a more nuanced analysis of this unstructured textual data, enabling a deeper understanding of the risks associated with specific policyholders, properties, or events. We will call these models RiskBERT. In insurance risk evaluation, various probability distributions are commonly used to model and assess different aspects of risk. The choice of distribution depends on the type of risk being analyzed and the characteristics of the data available. A popular choice is the Poisson distribution, commonly used to model the frequency or count of rare events, such as the number of claims in a given time period or the number of accidents. Another popular choice is the gamma distribution, used to represent aggregate claim amounts. Both distributions are members of the exponential family and can thus be combined with textual information using the method described above.

In BERT, the input text is tokenized into smaller units, such as words or subwords, and each token is assigned a unique representation. One crucial token in BERT is the [CLS] token, short for “classification.” The [CLS] token is always inserted at the beginning of the input sequence, and it plays a vital role in various NLP tasks. In RiskBERT we use the [CLS] token of BERT as the contextual embedding, because the significance of the [CLS] token lies in its ability to condense the entire input sequence into a single vector representation that encapsulates the semantic meaning of the text.

Figure 1 The entire RiskBERT structure. In this version we do not use the pooling layer (although it can be activated if necessary) and pass the last hidden state of the [CLS] token to the glmModel.

During training, BERT learns to encode contextual risk information into the [CLS] token representation by employing the attention mechanism and transformer layers.

4. Simulation

Since there is no real data for the given topic, we generate an artificial dataset based on the GLUE dataset. The GLUE dataset is a collection of diverse NLP tasks designed to evaluate the general language understanding capabilities of models. We focus in particular on the AX dataset, which is an extension of the GLUE dataset that focuses on evaluating the robustness of NLP models against adversarial examples.

To generate the contextual embeddings used in the simulation, we use the “all-MiniLM-L6-v2” model, denoted by \phi_{MiniLM}. “all-MiniLM-L6-v2” belongs to the class of sentence transformers developed by [4]; a feature of these models is that semantically similar sentences are close to each other in the embedding space. To avoid the danger that the BERT embeddings are, by chance, close to the “all-MiniLM-L6-v2” embeddings, we modify the embedding as follows: \phi_{simu}(S_{\text{premise}})= \phi_{MiniLM}(S_{\text{premise}}), \phi_{simu}(S_{\text{hypothesis}})=- \phi_{MiniLM}(S_{\text{premise}}). This modification also suits our risk application, since an adversarial expression should be associated with an adverse risk.

In the insurance context, a Poisson distribution is often used to model claim frequencies. The Poisson distribution is part of the exponential family with a(\gamma)=1, b(\theta)=\exp(\theta)=\lambda, c(y,\gamma)= -\log(y!), which leads to a specific loss function when inserted into (3).
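Written out, inserting these Poisson terms into (3) gives the following loss, using only the quantities defined above with \lambda_i=\exp(\theta_i):

```latex
\begin{equation*}
l = -\sum_{i=1}^n \left( y_i\theta_i - \exp(\theta_i) - \log(y_i!) \right)
  = \sum_{i=1}^n \left( \lambda_i - y_i \log(\lambda_i) + \log(y_i!) \right)
\end{equation*}
```

The term \log(y_i!) does not depend on the model parameters, so it can be dropped during optimization without changing the minimizer.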
Thus, to test our method, we simulate Poisson distributed outcomes with \lambda=\exp(1+ a_1 X_{1} + a_2 X_{2} + \phi_{simu}(S)^T b), where X_1 \sim \mathcal{N}(0,1), X_2 \sim \mathcal{N}(0,1), a_1=0.2, a_2=0.4. The fixed scores for the semantic embedding are drawn once with b_i \sim \mathcal{U}(0,1), while S is randomly chosen from premise or hypothesis. We sample 20000 observations and do a 70/20/10 train/validation/test split to compare 3 models:

  • GLM – A GLM with the two covariates X_1, X_2. For training we use SGD with a batch size of 500 and 100 epochs.
  • BERT frozen – A RiskBERT model as described in Section 3, but with the BERT parameters frozen; in training only the glmModel part is fitted. For training we use SGD with a batch size of 500 and 100 epochs.
  • BERT full – A RiskBERT model as described in Section 3 where all parameters, including the BERT parameters, are fitted. For training we use SGD with a batch size of 250 and 100 epochs.

The code for RiskBERT and the simulation can be found on GitHub.
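Independently of the repository, the data-generating process of this section can be sketched in a few lines of pure Python. This is not the author's code: random unit vectors stand in for the “all-MiniLM-L6-v2” premise embeddings, and the embedding dimension `N_DIM`, the number of sentence pairs `N_PAIRS`, the seed, and the Poisson sampler are assumptions made only to keep the sketch self-contained.

```python
import math
import random

rng = random.Random(42)
N_DIM = 8           # assumption: small stand-in embedding dimension
N_PAIRS = 50        # assumption: number of hypothetical premise/hypothesis pairs
N_SAMPLES = 20000   # sample size from the article
A1, A2 = 0.2, 0.4   # covariate coefficients from the article


def sample_poisson(lam):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def unit_vector(dim):
    """Random unit vector, used as a stand-in for a sentence embedding."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Stand-ins for phi_MiniLM(S_premise); the scores b are drawn once.
premise_embeddings = [unit_vector(N_DIM) for _ in range(N_PAIRS)]
b = [rng.uniform(0, 1) for _ in range(N_DIM)]


def simulate_row():
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    pair = rng.randrange(N_PAIRS)
    is_premise = rng.random() < 0.5
    # phi_simu(premise) = phi(premise), phi_simu(hypothesis) = -phi(premise)
    sign = 1.0 if is_premise else -1.0
    score = sum(sign * e * b_i for e, b_i in zip(premise_embeddings[pair], b))
    lam = math.exp(1 + A1 * x1 + A2 * x2 + score)
    return {"x1": x1, "x2": x2, "pair": pair, "is_premise": is_premise,
            "y": sample_poisson(lam)}


data = [simulate_row() for _ in range(N_SAMPLES)]
rng.shuffle(data)

# 70/20/10 split with integer arithmetic to avoid rounding surprises.
n_train = N_SAMPLES * 7 // 10
n_val = N_SAMPLES * 2 // 10
train = data[:n_train]
val = data[n_train:n_train + n_val]
test = data[n_train + n_val:]
print(len(train), len(val), len(test))  # 14000 4000 2000
```

In the actual experiment the sentence text, not the embedding, is what the RiskBERT models see; the embedding only enters through the simulated outcome.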

4.1 Results:

To compare the 3 methods we use the same train and validation sets for all methods and compare the loss.

Model       | Loss       | Validation Loss
BERT frozen | -6.168268  | -6.724352
BERT full   | -10.516733 | -9.806701

Figure 2 We can observe significant jumps in the loss function when fitting the full RiskBERT model. This is not surprising, since the number of parameters to be fitted is much larger than in the other variants. However, after 50 epochs the training stabilized and converged to the smallest loss of all compared methods. The strategy of fitting just the GLM and taking the BERT embeddings as given lies in between, outperforming a GLM that uses only the two covariates.

Figure 3 The picture on the left-hand side shows the attention for Layer 1-1 of a fully trained RiskBERT model, while the picture on the right shows a RiskBERT model where the BERT parameters were frozen. By construction (of the simulation) negations have a large effect on risk; this has been recognized by the model and is reflected in the attention mechanism.

5. Conclusion

In conclusion, the combination of BERT embeddings and a Generalized Linear Model presents a powerful framework for incorporating contextual information into regression analysis. The [CLS] token in BERT plays a crucial role in capturing the semantic meaning of the input sequence and provides a condensed representation for classification tasks. By integrating BERT embeddings into the GLM, the model can leverage contextual information and enhance the regression analysis with a richer representation of the input text. This approach opens up new opportunities for accurate and context-aware regression modeling in various natural language processing tasks.


[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv, vol. abs/1810.04805, 2019.
[2] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in NIPS, 2017.
[3] T. Mikolov, W. Yih, and G. Zweig, “Linguistic Regularities in Continuous Space Word Representations,” in North American Chapter of the Association for Computational Linguistics, 2013.
[4] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Conference on Empirical Methods in Natural Language Processing, 2019.
