Bayesian Estimate of Learning Potential

Given a dataset, can we use Bayes’ Theorem to estimate how much we expect to learn from a set of features for a classification task?

Introduction

If you have ever been given a set of features and labels and asked to build a model to predict labels from features, you have probably wondered if the task is even feasible. Sure, you may know that it is possible to train some machine learning (ML) model on the data, compute accuracy metrics, and then measure viability of the prediction task based on those metrics, but surely there should be a more model-agnostic way to determine this viability.

For example, say you are a SaaS company that creates SuperApp™.

Customers pay a yearly subscription to use the app, and at the end of each customer’s subscription, they either renew or they churn, the former meaning they keep using the app and the latter meaning they end their contract and disappear into the void. Renewing is good for your company because it means you get more money. Churning is bad because it means your customers take your money with them into the void.

So to limit the possibility of churning, you want to predict whether a customer is likely to churn well before it happens. This way you could intervene and maybe convert that “probable churn” into a “probable renew.”

In your SuperApp, you can track how often people log into the app and how many pages they visit. From this and related information, you define a holistic “Activity Score” for each customer.

From past data, you have a record of past customers who have churned or renewed along with their associated activity scores two months prior to their decision. This past information leads to a standard ML question:

Can we predict whether a customer will renew or churn based on their activity score?

We would typically make such a prediction by 1) collecting historical data, 2) doing some exploratory analysis, 3) choosing an appropriate ML model, and 4) finally training said model on the historical data (with some train/test split and cross-validation) to get a final model that can predict churn.

However, it would be worth exploring the feasibility of this prediction task well before we attempt to train a model. Understanding whether our historical data even contains a signal which allows for prediction should come before attempting to build a model.

Thus another way to phrase the question “Can we predict whether a customer will renew or churn based on their activity score?” is

How much should we expect to learn about the labels from the features?

Most of the subsequent discussion will be focused on quantitatively defining what it means to “learn” about labels from features, but one way to qualitatively answer this question is through some simple exploratory data analysis. Let’s say we have some historical data for activity scores of past churned and renewed customers. We plot the histogram distributions for the two categories of customers on the same axes. The overlaid distributions might look like

(Figure 1: Activity score distributions for churned and renewed customers. Large overlap between the distributions suggests activity score cannot tell us much about churn probability)

or they might look like

(Figure 2: Activity score distributions for churned and renewed customers. Clear separation between the distributions suggests activity score can tell us a lot about churn probability)

Intuitively we know that if we are in a situation like Figure 1, we’re in a pretty bad position to predict churn/renewal based on activity score. But Figure 2 represents exactly where we would like to be: clean distribution separation between the classes. In the case of Figure 2, we can be relatively confident that a high-accuracy ML model (assuming balanced classes) would correctly distinguish the two classes, but if we got a high-accuracy model in the case of Figure 1, we would be skeptical.

The two reactions to Figure 1 and Figure 2 are based on a qualitative inspection of the distributions. We intuitively know there is more to learn in Figure 2 than in Figure 1. Now, can we frame this intuition more quantitatively?
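
As a concrete illustration, here is a minimal sketch of how overlaid histograms like Figures 1 and 2 can be generated (the Gaussian parameters are the ones used in the “Return to Churn” section below; matplotlib is assumed):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# overlapping case (Figure 1): activity scores for churned and renewed customers
churned = rng.normal(loc=70, scale=10, size=400)
renewed = rng.normal(loc=75, scale=15, size=600)

# the separated case (Figure 2) would instead use, e.g.,
# churned = rng.normal(loc=30, scale=10, size=400)
# renewed = rng.normal(loc=80, scale=15, size=600)

plt.hist(churned, bins=30, density=True, alpha=0.5, label='churned')
plt.hist(renewed, bins=30, density=True, alpha=0.5, label='renewed')
plt.xlabel('Activity Score')
plt.ylabel('Density')
plt.legend()
plt.show()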

Formal Framing

Say that we have a data set consisting of points in feature space \(\textbf{x}_i \in \mathbb{R}^{M}\) and their corresponding classes \(y_i\in U\) where \(U\) is the set of all classes. We will one-hot-encode these classes so that \(y_{i, \alpha} \in \{0, 1\}\) and \(\alpha = 1,\ldots,C\) for \(C\) classes. The principal task of ML is to generate a function \(\hat{p}(\alpha|\textbf{x})\), i.e., an estimate¹ of the probability that a data point with feature representation \(\textbf{x}\) has class \(\alpha\). One way some ML models approach this task is to find the \(\hat{p}(\alpha|\textbf{x})\) that maximizes the probability that we get the labels from the features

\begin{equation}
\text{Probability of getting labels \({y_{i, \alpha}}\) given features \({\textbf{x}_i}\)} = \prod_{i=1}^N \prod_{\alpha=1}^C \hat{p}(\alpha|\textbf{x}_i)^{y_{i, \alpha}} \qquad (1)
\label{eq:prob_labels}
\end{equation}

Typically, in ML language we frame this probability maximization as the minimization of a “loss” which is defined as

\begin{equation}
{\cal L} = -\frac{1}{N}\sum_{\alpha=1}^C\sum_{i=1}^N y_{i, \alpha} \ln \hat{p}(\alpha|\textbf{x}_i) \qquad \text{[Categorical Cross Entropy]} \qquad (2)
\label{eq:cross_ent}
\end{equation}

The negative of the log of Eq.(1), divided by \(N\), is exactly Eq.(2), and thus “minimizing the categorical cross-entropy loss” is equivalent to “maximizing the probability that we get the labels from the features”; we can use either framing of the objective of ML. Both equations depend on \(\hat{p}(\alpha|\textbf{x}_i)\), and determining this probability requires a detour into Bayes’ Theorem.
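
As a quick sanity check of this equivalence before the detour, here is a toy numerical verification (the labels and probabilities are randomly generated and purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
N, C = 5, 3
y = np.eye(C)[rng.integers(0, C, size=N)]    # one-hot labels y_{i, alpha}
p_hat = rng.dirichlet(np.ones(C), size=N)    # candidate probabilities p_hat(alpha | x_i)

prob_of_labels = np.prod(p_hat ** y)                           # Eq.(1)
cross_entropy = -np.mean(np.sum(y * np.log(p_hat), axis=1))    # Eq.(2)

# -ln(Eq.(1))/N reproduces Eq.(2)
assert np.isclose(-np.log(prob_of_labels) / N, cross_entropy)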

Bayes’ Theorem and Kernel Density Estimation

In our framing of a general ML classification task, we want to estimate Eq.(2), i.e., the loss we expect to obtain from training a model on a data set \(\{(\textbf{x}_i, y_{i,\alpha})\}_{i=1}^N\) for \(\alpha = 1, \ldots, C\). We denote by \(\hat{\rho}(\textbf{x}|\alpha)\) the estimate of the probability density in feature space for class \(\alpha\), and by \(\hat{p}(\alpha|\textbf{x})\) the estimate of the probability (i.e., not density) of being in class \(\alpha\) given feature point \(\textbf{x}\). This latter quantity is what ML problems aim to find and what we will use Bayes’ theorem to compute.

By Bayes’ theorem, we have

\begin{equation}
\hat{p}(\alpha|\textbf{x}) = \frac{\hat{\rho}(\textbf{x}|\alpha) \hat{p}(\alpha)}{\hat{\rho}(\textbf{x})} = \frac{\hat{\rho}(\textbf{x}|\alpha) \hat{p}(\alpha)}{\sum_{\alpha'=1}^C\hat{\rho}(\textbf{x}|\alpha') \hat{p}(\alpha')}. \qquad (3)
\label{eq:palphabayes}
\end{equation}

The quantity \(\hat{p}(\alpha)\) is the data-based estimate for the probability of being in class \(\alpha\). Given how we have defined our data set, we have

\begin{equation}
\hat{p}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} y_{i, \alpha}. \qquad (4)
\label{eq:palpha_def}
\end{equation}
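
In code, Eq.(4) is just the column average of the one-hot label matrix; a minimal sketch with toy labels:

import numpy as np

y_matrix = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])  # toy one-hot labels y_{i, alpha}
p_alpha = y_matrix.mean(axis=0)                        # Eq.(4): array([0.25, 0.75])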

The quantity \(\hat{\rho}(\textbf{x}|\alpha)\) is the data-based estimate for the probability density at point \(\textbf{x}\) given that we are in class \(\alpha\). Using the explicit definition of the probability density of samples, we have

\begin{equation}
\hat{\rho}(\textbf{x}|\alpha) = \frac{1}{\sum_{i=1}^N y_{i, \alpha}} \sum_{j=1}^N \delta(\textbf{x}- \textbf{x}_j)\, \delta_{1, y_{j, \alpha}} \qquad (5)
\end{equation}

or, defining \({\cal S}_{\alpha}\) as the set of data points \(i\) in class \(\alpha\),

\begin{equation}
\hat{\rho}(\textbf{x}|\alpha) = \frac{1}{|{\cal S}_{\alpha}|} \sum_{j\in {\cal S}_{\alpha}} \delta(\textbf{x}- \textbf{x}_j), \qquad (6)
\label{eq:pxalpha}
\end{equation}

where \(\delta(X)\) is the Dirac delta function (and \(\delta_{1, y_{j, \alpha}}\) in Eq.(5) is a Kronecker delta that selects the points in class \(\alpha\)). Eq.(6) is a theoretical definition of the data-based estimate for the theoretical quantity \(\rho(\textbf{x}|\alpha)\). In practice, to compute Eq.(6) we use what is known as kernel density estimation. This involves replacing the Dirac delta function with another function \(K(X)\) with a given width \(h\). We then have

\begin{equation}
\hat{\rho}(\textbf{x}|\alpha) = \frac{1}{|{\cal S}_{\alpha}|} \sum_{j\in {\cal S}_{\alpha}} \delta(\textbf{x}- \textbf{x}_j) \quad \to \quad \frac{1}{|{\cal S}_{\alpha}|h^{M}} \sum_{j\in {\cal S}_{\alpha}} K\left(\frac{\textbf{x}- \textbf{x}_j}{h}\right) \equiv \hat{\rho}_{\text{KDE}}(\textbf{x}|\alpha), \qquad (7)
\label{eq:pxalpha2}
\end{equation}

where the “\(\to\)” signifies that the right-hand side is a numerical approximation of the left-hand side. Typically \(K(X)\) is the Gaussian function \(e^{-X^2/2}/\sqrt{2\pi}\). For our purposes, we will replace \(\hat{\rho}(\textbf{x}|\alpha)\) with \(\hat{\rho}_{\text{KDE}}(\textbf{x}|\alpha)\) (thus, in effect, approximating the estimate of a theoretical quantity) and use the kernel density estimate in our Bayes’ formulas. In particular, Eq.(3) becomes

\begin{equation}
\hat{p}(\alpha|\textbf{x}) \to \frac{\hat{\rho}_{\text{KDE}}(\textbf{x}|\alpha) \hat{p}(\alpha)}{\sum_{\alpha'=1}^C\hat{\rho}_{\text{KDE}}(\textbf{x}|\alpha') \hat{p}(\alpha')}. \qquad (8)
\label{eq:pKDEbayes}
\end{equation}

With Eq.(8) we have the final result from Bayes’ Theorem: an estimate of class probability given a feature vector. This estimate is dependent on our particular choice of a kernel function and a width, but given these model choices, we can estimate how much we expect to learn from our data.
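
Before building everything from scratch below, here is a minimal sketch of Eq.(8) for a toy one-dimensional, two-class problem, using scipy.stats.gaussian_kde (which chooses its own bandwidth via Scott’s rule, so results will differ slightly from the from-scratch version in the next section):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

x0 = rng.normal(0.0, 1.0, size=300)   # class 0 feature samples
x1 = rng.normal(2.0, 1.0, size=200)   # class 1 feature samples
p = np.array([300, 200]) / 500        # class probabilities p_hat(alpha), Eq.(4)

rho = [gaussian_kde(x0), gaussian_kde(x1)]  # rho_hat_KDE(x | alpha)

def posterior(x):
    """Eq.(8): estimated class probabilities at feature value x."""
    weighted = np.array([p[a] * rho[a](x)[0] for a in range(2)])
    return weighted / weighted.sum()

print(posterior(1.0))  # roughly even split near the midpoint of the two means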

Expected Loss and Learning Potential

With Eq.(8), we have a class probability that we can now use in the loss estimate Eq.(2). Making the substitution, we find

\begin{equation}
{\cal L} = -\sum_{\alpha=1}^C \hat{p}(\alpha)\ln \hat{p}(\alpha) \, - \frac{1}{N} \sum_{\alpha=1}^C \sum_{i=1}^N y_{i, \alpha} \ln \left[\frac{\hat{\rho}_{\text{KDE}}(\textbf{x}_i|\alpha)}{\sum_{\alpha'=1}^C\hat{\rho}_{\text{KDE}}(\textbf{x}_i|\alpha') \hat{p}(\alpha')}\right]. \qquad (9)
\label{eq:cross_ent1}
\end{equation}

The first term in Eq.(9) represents the loss we would expect from just predicting class probabilities from a frequency count of the classes in the data; we denote this term \({\cal L}_0\). If our model has managed to learn anything about class assignment from the data, then we further expect the loss to be lower than \({\cal L}_0\). In other words, we expect the second term in Eq.(9) to be negative.
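
A quick check with toy labels confirms that predicting the class frequencies \(\hat{p}(\alpha)\) for every data point yields exactly \({\cal L}_0\):

import numpy as np

y_matrix = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])  # toy one-hot labels
p_alpha = y_matrix.mean(axis=0)

# Eq.(2) with the constant prediction p_hat(alpha | x) = p_hat(alpha)
L_const = -np.mean(np.sum(y_matrix * np.log(p_alpha), axis=1))
L_0 = -np.sum(p_alpha * np.log(p_alpha))

assert np.isclose(L_const, L_0)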

With Eq.(9), we can define how much we expect to learn from a given data set. We define the learning quotient \({\cal Q}\) as

\begin{equation}
{\cal Q} \equiv \frac{{\cal L}_0-{\cal L}}{{\cal L}_0} = \frac{1}{N{\cal L}_0} \sum_{\alpha=1}^C \sum_{i=1}^N y_{i, \alpha} \ln \left[\frac{\hat{\rho}_{\text{KDE}}(\textbf{x}_i|\alpha)}{\sum_{\alpha'=1}^C\hat{\rho}_{\text{KDE}}(\textbf{x}_i|\alpha') \hat{p}(\alpha')}\right] \qquad (10)
\label{eq:learn_quotient}
\end{equation}

representing the fractional decrease in loss we expect relative to the naive class-frequency-based estimate of loss. Given that \({\cal L}\geq 0\) and \({\cal L} \leq {\cal L}_0\), we find that \({\cal Q} \in [0, 1]\). The quantity \({\cal Q}\) admits the following interpretation.

Definition of Learning Quotient \( {\cal Q} \): The learning quotient is the fraction of total information we can extract from the data set \({\cal D}= \{\{\textbf{x}_i, y_i\}; i=1, \ldots, N\}\) by representing the classes \(\{\alpha\}\) in terms of features \(\textbf{x}\).

This information-based definition comes from the form of \({\cal L}_{0}\). Given the identification \({\cal L}_{0} = -\sum_{\alpha=1}^C \hat{p}(\alpha)\ln \hat{p}(\alpha)\), we can interpret \({\cal L}_0\) as the amount of information it takes to specify the class of a data element in this data set; for a balanced binary problem, for example, \({\cal L}_0 = \ln 2 \approx 0.693\) nats. On the other hand, \({\cal L}\) is the amount of information this specification requires when we represent the data element as \(\textbf{x}\). If we find that \({\cal L} \ll {\cal L}_0\), then we can infer that the feature representation \(\textbf{x}\) almost completely determines the class \(\alpha\) and thus there is a lot we can learn about the classes from the features. But if \({\cal L} \simeq {\cal L}_0\), then we can infer the features do not add much beyond the observed class frequencies themselves.

This discussion of loss and learning potential holds regardless of the framework we use to compute \(\hat{p}(\alpha|\textbf{x})\). What is special here is that we are using Bayes’ Theorem and Kernel Density Estimation² to perform this estimation rather than a particular ML model.

Code Representation

Above we defined the learning quotient and the estimated loss through a kernel density estimate of class probabilities. We want to apply these results to our previous churn analysis to understand whether, given a distribution of activity scores for churned and renewed customers, we should expect to learn anything about class probabilities from these activity scores.

To apply our formalism, we need to translate some of our previous results into code. First, we need a way to compute the kernel density estimate of a sample of points. A simple function can be written for the one-dimensional case

import numpy as np

def kde_from_scratch(x, data_points, width):
    '''
    Computes value of KDE function at point x given sampled data and a width.

    Args:
        x (float): point at which KDE is evaluated
        data_points (numpy.array): data points
        width (float): Width for estimate

    Returns:
        rho_res (float): value of KDE at point
    '''
    rho_res = 0
    for x_ in data_points:
        u = (x - x_)/width
        # gaussian kernel K(u) = exp(-u^2/2)/sqrt(2*pi)
        rho_res += np.exp(-u**2/2)/np.sqrt(2*np.pi)

    rho_res = rho_res*(1/width)*(1/len(data_points))

    return rho_res
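
As a quick usage example (with assumed toy data), evaluating the estimate at the mean of a standard normal sample should land near the true density value \(1/\sqrt{2\pi} \approx 0.399\):

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

print(kde_from_scratch(0.0, samples, width=0.3))  # expect a value near 0.399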

For the multi-dimensional case, we have to do some additional data transformations to ensure that the independent variable \(\textbf{x}\) has dimensions consistent with the sampled data points. So we write a larger class to handle these checks (there is of course the scipy object gaussian_kde to do this, but defining the class from scratch is a nice exercise in method writing).

class KDE:
    
    def __init__(self, data_points, width=None):
        
        """
        Kernel Density Estimation (KDE) class for estimating probability density functions.

        Args:
            data_points (numpy.ndarray): An array of data points for which to estimate the density.
            width (float): The width of the kernel function. If not provided, it is determined using
                          Scott's factor for multi-dimensional data.

        Attributes:
            data_points (numpy.ndarray): The input data points for density estimation.
            num_points (int): The number of data points.
            dim (int): The dimensionality of the data points.
            width (float): The width of the kernel function.

        Methods:
            __call__(x):
                Estimate the probability density at the given point(s).
        """

        if len(data_points.shape)==1:
            self.data_points = data_points.reshape(-1,1)
        else:
            self.data_points = data_points
        self.num_points, self.dim = self.data_points.shape
        
        # scott's factor; https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html
        if width is None: 
            width = (self.num_points)**(-1/(self.dim+4))

        self.width = width        
    
    def __call__(self, x):
        
        """
        Estimate the probability density at the given point(s).

        Args:
            x (float, int, list, numpy.ndarray): The point(s) at which to estimate the density.

        Returns:
            float: The estimated probability density at the given point(s).
        """

        # converting float to matrix
        if isinstance(x, float) or isinstance(x, int):
            x = np.array([[x]])
            assert self.dim == 1

        # converting list to matrix     
        elif isinstance(x, list):
            x = np.array([x])
            assert self.dim == 1
        
        # converting one dimensional array to matrix
        if len(x.shape)==1:
            x = x.reshape(1,-1)
                    
        # ensuring same dimensionality between input point and data values
        assert x.shape[1] == self.dim
        
        # computing density estimate: a gaussian kernel of width h centered on each sample,
        # normalized by (sqrt(2*pi)*h)^dim and the number of samples
        sq_dists = np.sum((x - self.data_points)**2, axis=1)
        rho_res = np.sum(np.exp(-sq_dists/(2*self.width**2)))/(np.sqrt(2*np.pi)*self.width)**self.dim/self.num_points
        
        return rho_res

The important section in the __call__ method above is the pair of lines

# computing density estimate: a gaussian kernel of width h centered on each sample,
# normalized by (sqrt(2*pi)*h)^dim and the number of samples
sq_dists = np.sum((x - self.data_points)**2, axis=1)
rho_res = np.sum(np.exp(-sq_dists/(2*self.width**2)))/(np.sqrt(2*np.pi)*self.width)**self.dim/self.num_points

which goes from data points and an \(x\) value to the KDE distribution (defined by the data points) at that \(x\) value. The other lines just ensure that \(x\) and the data points have the proper form for computation.
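
As a sanity check on the class (a sketch with assumed toy data), the estimated density should integrate to approximately one over a wide grid:

rng = np.random.default_rng(0)
samples = rng.normal(size=400)

kde = KDE(data_points=samples, width=None)  # width defaults to Scott's factor

grid = np.linspace(-6, 6, 1201)
densities = np.array([kde(x) for x in grid])
print(densities.sum() * (grid[1] - grid[0]))  # Riemann sum; expect a value close to 1.0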

Next, we would like to compute the Bayesian estimate of the learning potential \({\cal Q}\) and the related quantities \({\cal L}\) (expected loss) and \({\cal L}_0\) (base loss). We write the function in full:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def bayesian_loss_estimate_KDE(df=None, class_label=None, X=None, y=None, width=None):
    
    """
    Computes the base loss, the data-inferred cross-entropy, and the learning potential based on probabilistic analysis.

    Args:
        df (pandas.DataFrame): The input DataFrame containing features and class labels.
        class_label (str): The name of the column in the DataFrame containing class labels.
        X (numpy.array): The feature vectors for the dataset.
        y (numpy.array): The class labels for the dataset.
        width (float): The width used for kernel density estimation.

    Returns:
        tuple: A tuple containing the following values:
            - L_0 (float): Base cross-entropy computed from class label distribution.
            - L (float): Data-inferred cross-entropy loss based on feature distributions.
            - Q (float): Learning potential, indicating the potential improvement in loss.

    This function performs probabilistic analysis to compute the minimum loss and learning potential
    based on Gaussian KDE (Kernel Density Estimation) of sampled features.

    """    
    
    # data can be input as a dataframe 
    if (df is not None) and (class_label is not None) and (X is None and y is None):   
        # feature and class definitions and one-hot encoding
        X = df.drop(columns=[class_label]).values
        y = list(df[class_label])
        y_matrix = np.array(pd.get_dummies(df[class_label]))  

    # or data can be input as X and y arrays   
    elif (X is not None) and (y is not None):
        y = y.reshape(-1, 1) # reshape the array to a column vector (required by OneHotEncoder)
        one_hot_encoded = OneHotEncoder().fit_transform(y) # fit and transform the data to one-hot encoding
        y_matrix = one_hot_encoded.toarray() # convert the one-hot encoded sparse matrix to an array

    else:
        print('Missing arguments! Must provide data frame AND class label OR X AND y values')
        return

    # getting base probabilities 
    p_alpha = np.sum(y_matrix, axis = 0)/len(y_matrix)

    # getting samples based on classes
    num_classes = y_matrix.shape[1]
    sampled_features_dict = dict()
    for j in range(num_classes):
        sampled_features_dict[j] = np.array([X[k, :].tolist() for k in range(len(y_matrix)) if y_matrix[k, j] == 1])

    # gaussian kde of sampled features
    rho_kde_dict = dict()
    for j in range(num_classes):
        rho_kde_dict[j] = KDE(data_points=sampled_features_dict[j], width=width)  

    # defining base feature probability
    prob_x = lambda x: np.sum([p_alpha[j]*rho_kde_dict[j](x) for j in range(num_classes)])  

    # getting matrix of probability ratios 
    prob_ratios = np.array([[rho_kde_dict[j](X[i])/prob_x(X[i]) for j in range(num_classes)] for i in range(len(y_matrix))])

    # avoid log(0) errors from vanishing probability ratios
    prob_ratios = prob_ratios + 1e-10
    
    # base cross entropy
    L_0 = -np.sum(p_alpha*np.log(p_alpha))

    # computing data inferred loss
    L = L_0 -np.sum((y_matrix*np.log(prob_ratios)))/len(y_matrix)

    # computing learning potential
    Q = (L_0-L)/L_0

    return L_0, L, Q

Now, we can start applying these functions to some data.

Return to Churn

With our coded functions to compute the learning potential given a dataset, we can now return to the question of predicting churn, or rather the question of whether it is even possible to predict churn from activity scores.

We return to the two example plots, with the overlapping distributions of Figure 1 on the left and the separated distributions of Figure 2 on the right.

How much could we expect to learn about churn status from activity score distributions on the left versus those on the right?

The distribution on the left came from sampling \(N_C = 400\) points from a Gaussian with mean \(\mu_C = 70\) and standard deviation \(\sigma_C = 10\) (representing churned customers), and also \(N_R = 600\) points from a Gaussian with mean \(\mu_R = 75\) and standard deviation \(\sigma_R = 15\) (representing renewed customers). For the distribution on the right, we did similar sampling with \( (\mu_C, \sigma_C, \mu_R, \sigma_R) = (30, 10, 80, 15)\).
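
As a sketch of how these numbers can be reproduced (the helper function below is written for this post’s toy setup, and the exact values will vary slightly with the random seed):

def churn_learning_potential(mu_C, sigma_C, mu_R, sigma_R, N_C=400, N_R=600):
    """Sample activity scores for churned/renewed customers and estimate (L_0, L, Q)."""
    rng = np.random.default_rng(0)
    X = np.concatenate([rng.normal(mu_C, sigma_C, N_C),
                        rng.normal(mu_R, sigma_R, N_R)]).reshape(-1, 1)
    y = np.array(['churn'] * N_C + ['renew'] * N_R)
    return bayesian_loss_estimate_KDE(X=X, y=y, width=None)

print(churn_learning_potential(70, 10, 75, 15))  # overlapping distributions (left)
print(churn_learning_potential(30, 10, 80, 15))  # separated distributions (right)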

For the left plot, we find a learning potential of \({\cal Q} \approx 0.093\). For the right plot, we find a learning potential of \({\cal Q} \approx 0.92\). Thus we see that for the left plot, knowing the activity score only provides us with a \(9\%\) reduction in the information needed to determine the classes, whereas for the right plot, the activity score is associated with a \(92\%\) reduction.

This is the quantitative representation we were looking for! Consistent with our intuition, there is much less to learn from the overlapping distributions than from the separated distributions.

Returning to our problem setup, we recall that we wanted to use historical data to determine how much we expected to learn about labels from features.

Now, we have a procedure to answer this question. To determine how much we can learn about the class labels from the features, we can

Compute the learning potential \({\cal Q}\) given the distributions in feature space for various classes and thereby determine via a Bayesian argument how much we expect these features to teach us.

Further Applications

Above we explored a relatively simple application of the Bayesian learning potential:

[Application #1] Intuition: The learning quotient Eq.(10) can tell us how much information about class distinctions is given in a feature representation of a data set.

But there are two other applications that we did not consider here but are still quite useful for machine learning problems:

[Application #2] Overtraining: Computing the Bayesian expected loss Eq.(9) and the loss of a trained model can provide a signal for when a model has been overtrained. If the model loss is much less than the Bayesian expected loss, the model has likely been over-trained.

[Application #3] Feature Importances: Comparing learning quotients Eq.(10) for various combinations and selections of features can tell us which combinations/selections are most important in making class distinctions; a sketch follows below.
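
For Application #3, a minimal (hypothetical) sketch might loop over feature subsets and compare their learning quotients, with X and y defined as in bayesian_loss_estimate_KDE above:

from itertools import combinations

def feature_subset_quotients(X, y, width=None):
    """Compute the learning quotient Q for every non-empty subset of feature columns."""
    results = {}
    for k in range(1, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            _, _, Q = bayesian_loss_estimate_KDE(X=X[:, list(subset)], y=y, width=width)
            results[subset] = Q
    return results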

These other applications are explored in the notebook [NOTEBOOK LINK].

Footnotes

  1. As in typical statistical notation, we will use “\(\hat{x}\)” to denote data-based estimates of theoretical quantities \(x.\) For example, the data-based estimate of the theoretical probability \(p(\alpha)\) (i.e., probability of having class \(\alpha\)) is \(\hat{p}(\alpha).\) ↩︎
  2. We are glossing over the fact that KDE is doing a lot here. We have to choose a kernel function and a width. Typically a Gaussian function is used, but the width is not as easy to select. scipy uses “Scott’s Rule” as a default value, which amounts to setting \(h = 1/N^{1/(M+4)}\). However, sometimes it’s better to do a grid search through values of \(h\) and select a significant inflection point in the resulting curve as \(h\). ↩︎
