True Error vs Sample Error
In machine learning and statistics, understanding the concepts of true error and sample error is crucial for evaluating the performance of models. These errors help us assess how well our models generalize from training data to unseen data. Let’s delve into these concepts and see how they differ.
True Error
True error refers to the probability that a hypothesis will misclassify a single example drawn at random from the entire population. The population, in this context, includes all potential data points that the model might encounter.
For a given hypothesis h(x) and the actual target function f(x), the true error can be expressed as:
T.E. = P[f(x) ≠ h(x)]
This indicates the likelihood that the model’s predictions do not match the true values.
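Since the full population is almost never available, the true error can only be approximated. Below is a minimal sketch on a synthetic problem where both the target function f and a hypothetical (imperfect) hypothesis h are known, so the disagreement probability can be measured directly:
# Approximating the true error on a synthetic "population"
import numpy as np

rng = np.random.default_rng(0)
population = rng.uniform(-1.0, 1.0, size=1_000_000)  # stand-in for the full population

def f(x):
    # True target function: label is 1 when x > 0
    return (x > 0).astype(int)

def h(x):
    # Hypothetical learned hypothesis: threshold shifted to 0.1
    return (x > 0.1).astype(int)

# Fraction of the population on which h disagrees with f
true_error = np.mean(f(population) != h(population))
print(f"Approximate true error: {true_error:.3f}")  # about 0.05, the mass of 0 < x <= 0.1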
Sample Error
Sample error, on the other hand, measures the proportion of misclassified examples within a specific sample. It is calculated based on the data points that were used to train or test the model. The formula for sample error is:
S.E. = Number of misclassified instances / Total number of instances
Alternatively, it can be expressed in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN):
S.E. = (FP + FN) / (TP + FP + FN + TN)
Or simply:
S.E. = 1 - Accuracy
For example, if a hypothesis misclassifies 7 out of 33 examples, the sample error would be:
S.E. = 7 / 33 ≈ 0.21
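The same calculation in code, assuming hypothetical label arrays in which 7 of 33 instances are misclassified:
# Computing the sample error from true and predicted labels (hypothetical example)
import numpy as np

y_true = np.array([1] * 33)               # 33 instances, all with true label 1
y_pred = np.array([0] * 7 + [1] * 26)     # the hypothesis gets 7 of them wrong

sample_error = np.mean(y_true != y_pred)  # fraction of misclassified instances
accuracy = 1 - sample_error               # S.E. = 1 - Accuracy

print(f"Sample error: {sample_error:.2f}")  # 0.21
print(f"Accuracy:     {accuracy:.2f}")      # 0.79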
Bias & Variance
Bias
Bias measures the difference between the model's average prediction, taken over many possible training sets, and the true value. High bias typically indicates that a model is too simplistic and is likely to underfit the data.
Bias = E[h(x)] - f(x)
Variance
Variance assesses how much the model's predictions vary for different training sets. A high-variance model is overly complex and can lead to overfitting.
Var(h(x)) = E[(h(x) - E[h(x)])²]
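To make these two definitions concrete, the sketch below estimates bias and variance at a single test point by repeatedly refitting a deliberately simple model (a straight-line fit, chosen here purely for illustration) on freshly drawn training sets:
# Estimating bias and variance of a simple model at one test point
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # True target function (assumed known here only for illustration)
    return np.sin(x)

x0 = 1.0           # test point at which bias and variance are measured
predictions = []

for _ in range(500):
    # Draw a fresh training set from the population
    x_train = rng.uniform(0, np.pi, 20)
    y_train = f(x_train) + rng.normal(0, 0.3, size=20)

    # Fit a simple, high-bias model: a straight line
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    predictions.append(slope * x0 + intercept)

predictions = np.array(predictions)
bias = predictions.mean() - f(x0)                            # E[h(x0)] - f(x0)
variance = np.mean((predictions - predictions.mean()) ** 2)  # E[(h(x0) - E[h(x0)])²]

print(f"Bias at x0:     {bias:+.3f}")
print(f"Variance at x0: {variance:.3f}")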
Confidence Interval
Calculating the true error directly is rarely possible, because the full population is never available. Instead, it can be estimated using a confidence interval derived from the sample error. The process involves:
Randomly drawing n samples from the population (where n > 30, so that the normal approximation is reasonable).
Calculating the sample error for these samples.
The formula for estimating the true error based on the sample error is:
T.E. = S.E. ± z_s * √(S.E.(1 - S.E.) / n)
Where z_s is the z-score corresponding to the desired confidence level (for example, 1.96 for 95% confidence).
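A direct computation of this formula, reusing the hypothetical 7-out-of-33 sample error from earlier and a 95% confidence level:
# Confidence interval for the true error, computed from the sample error
import math
from scipy import stats

n = 33                   # number of sampled instances (n > 30)
sample_error = 7 / n     # S.E. from the earlier example

confidence = 0.95
z = stats.norm.ppf(1 - (1 - confidence) / 2)   # z-score for 95% confidence (about 1.96)

margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
lower, upper = sample_error - margin, sample_error + margin
print(f"True error estimated to lie in ({lower:.3f}, {upper:.3f}) with {confidence:.0%} confidence")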
Example Code for Confidence Interval Estimation
Here's how you can compute a normal-approximation confidence interval in Python with SciPy. The example below builds intervals around the mean of a synthetic sample, but the same call works with any point estimate and its standard error, including a sample error:
# Imports
import numpy as np
import scipy.stats as st

# Define a synthetic sample
np.random.seed(0)
data = np.random.randint(10, 30, 10000)

# Confidence levels to evaluate
confidence_levels = [0.90, 0.95, 0.99, 0.995]

for confidence in confidence_levels:
    # Normal-approximation interval around the sample mean
    lower, upper = st.norm.interval(confidence, loc=np.mean(data), scale=st.sem(data))
    print(f"{confidence * 100:g}%: ({lower:.2f}, {upper:.2f})")
Confidence Interval Output
Running this prints a confidence interval for each confidence level, similar to the following:
90%: (17.87, 19.89)
95%: (17.67, 20.09)
99%: (17.30, 20.46)
99.5%: (17.15, 20.61)
True Error vs Sample Error Summary
| True Error | Sample Error |
| --- | --- |
| Represents the probability of misclassification in the population. | Represents the fraction of misclassified instances within the sample. |
| Used to estimate errors across the entire population. | Used to assess errors within the sample data. |
| Difficult to calculate directly; often estimated using confidence intervals. | Easier to calculate by analyzing the sample data. |
| Can be influenced by poor data collection methods or bias. | Can be affected by selection errors or non-response errors. |
Understanding true error and sample error is essential for building robust machine learning models. By estimating these errors, you can make informed decisions about model performance and improve the predictive capabilities of your algorithms.
For more content, follow me at — https://linktr.ee/shlokkumar2303