True Error vs Sample Error

In machine learning and statistics, understanding the concepts of true error and sample error is crucial for evaluating the performance of models. These errors help us assess how well our models generalize from training data to unseen data. Let’s delve into these concepts and see how they differ.

True Error

True error refers to the probability that a hypothesis will misclassify a single randomly drawn sample from the entire population. The population, in this context, includes all potential data points that the model might encounter.

For a given hypothesis h(x) and the actual target function f(x), the true error can be expressed as:

T.E. = P[f(x) ≠ h(x)]

This indicates the likelihood that the model’s predictions do not match the true values.
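In practice the full population is never available, so true error can only be approximated. As an illustration, here is a minimal sketch in which the target function f is known by construction (both f and h below are toy rules assumed purely for this example), so the true error can be estimated by drawing a very large sample from the population:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy target function and an imperfect hypothesis (both assumed
# here purely for illustration)
f = lambda x: x > 0.5   # true labels
h = lambda x: x > 0.6   # model's decision rule

# Approximate the true error by drawing a very large sample from the
# population distribution (uniform on [0, 1] in this toy setup)
population = rng.uniform(0, 1, 1_000_000)
true_error = np.mean(f(population) != h(population))
print(round(true_error, 3))  # close to 0.1: f and h disagree on (0.5, 0.6]
```

The two rules disagree exactly when x falls in (0.5, 0.6], which has probability 0.1 under this population, so the estimate converges to 0.1 as the sample grows.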

Sample Error

Sample error, on the other hand, measures the proportion of misclassified examples within a specific sample. It is calculated based on the data points that were used to train or test the model. The formula for sample error is:

S.E. = Number of misclassified instances / Total number of instances

Alternatively, it can also be expressed in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):

S.E. = (FP + FN) / (TP + FP + FN + TN)

Or simply:

S.E. = 1 - Accuracy

For example, if a hypothesis misclassifies 7 out of 33 examples, the sample error would be:

S.E. = 7 / 33 ≈ 0.21
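The worked example above can be reproduced in a few lines; the sketch below uses a toy label array arranged so that exactly 7 of 33 predictions are wrong, and checks that the count-based formula and 1 - Accuracy agree:

```python
import numpy as np

# Toy labels arranged so the hypothesis misclassifies 7 of 33 examples,
# matching the worked example above
y_true = np.array([0] * 33)
y_pred = np.array([1] * 7 + [0] * 26)

misclassified = np.sum(y_true != y_pred)
sample_error = misclassified / len(y_true)
accuracy = np.mean(y_true == y_pred)

print(misclassified)           # 7
print(round(sample_error, 2))  # 0.21
print(round(1 - accuracy, 2))  # 0.21, the same value via 1 - Accuracy
```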

Bias & Variance

Bias

Bias measures the difference between the average prediction of a model and the actual value. High bias typically indicates that a model is too simplistic and is likely to underfit the data.

Bias = E[h(x)] - f(x)

Variance

Variance assesses how much the model's predictions vary for different training sets. A high-variance model is overly complex and can lead to overfitting.

Var(X) = E[(X - E[X])²]
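Both definitions can be estimated empirically by retraining the same model on many freshly drawn training sets and examining its predictions at a fixed test point. The sketch below assumes a toy setup: the target is sin(2πx), and the model is a deliberately simple straight-line fit, so we expect noticeable bias:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)  # assumed target function
x0 = 0.3                             # test point where bias/variance are measured

preds = []
for _ in range(500):
    # Fresh noisy training set on each round
    x_train = rng.uniform(0, 1, 20)
    y_train = f(x_train) + rng.normal(0, 0.2, 20)
    # Degree-1 fit: a deliberately simple (high-bias) model
    coeffs = np.polyfit(x_train, y_train, 1)
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias = preds.mean() - f(x0)  # E[h(x)] - f(x)
variance = preds.var()       # E[(h(x) - E[h(x)])^2]
print(round(bias, 3), round(variance, 4))
```

Swapping the degree-1 fit for a high-degree polynomial would shrink the bias but inflate the variance, which is the usual trade-off between the two.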

Confidence Interval

Calculating true error directly is generally infeasible, because the entire population is never available. Instead, it can be estimated using a confidence interval derived from the sample error. The process involves:

  1. Randomly drawing n samples from the population (with n > 30, so that the normal approximation is reasonable).

  2. Calculating the sample error on those samples.

The formula for estimating the true error based on the sample error is:

T.E. = S.E. ± z_s * √(S.E.(1 - S.E.) / n)

Where z_s is the z-score corresponding to the desired confidence level (for example, z_s = 1.96 for 95% confidence).
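The interval formula above translates directly into code. In this sketch (the helper name `true_error_interval` is mine, not a library function), the earlier worked example of 7 misclassifications out of 33 gives a fairly wide 95% interval, which is typical for such a small sample:

```python
import math

def true_error_interval(sample_error, n, z=1.96):
    """Normal-approximation interval for the true error, given the
    sample error on n >= 30 examples (z = 1.96 for 95% confidence)."""
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - margin, sample_error + margin

# Using the earlier worked example: 7 misclassified out of 33
low, high = true_error_interval(7 / 33, 33)
print(round(low, 3), round(high, 3))  # 0.073 0.352
```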

Example Code for Confidence Interval Estimation

Here's how you can implement the estimation of true error using a confidence interval in Python:

# Imports
import numpy as np
import scipy.stats as st

# Draw a fixed random sample so the result is reproducible
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# Normal-approximation confidence intervals for the population mean,
# at several confidence levels. The confidence level is passed
# positionally because SciPy renamed the keyword from `alpha` to
# `confidence` in version 1.9.
confidence_levels = [0.90, 0.95, 0.99, 0.995]
for level in confidence_levels:
    print(st.norm.interval(level, loc=np.mean(data), scale=st.sem(data)))

Confidence Interval Output

With the fixed seed above, the code prints one interval per confidence level. Each interval is centered on the sample mean (close to 19.5 for this data), and the intervals widen as the confidence level increases from 90% to 99.5%: greater confidence that the interval contains the true mean requires a wider range.

True Error vs Sample Error Summary

True Error:

  • Represents the probability of misclassification over the entire population.

  • Used to estimate error across the entire population.

  • Difficult to calculate directly; often estimated using confidence intervals.

  • Can be influenced by poor data collection methods or bias.

Sample Error:

  • Represents the fraction of misclassified instances within the sample.

  • Used to assess error within the sample data.

  • Easier to calculate by analyzing the sample data.

  • Can be affected by selection errors or non-response errors.

Understanding true error and sample error is essential for building robust machine learning models. By estimating these errors, you can make informed decisions about model performance and improve the predictive capabilities of your algorithms.

For more content, follow me at — https://linktr.ee/shlokkumar2303
