Trusting the Black Box: A Practical Approach to Model Reliability
- Jan 15
Updated: Jan 22
(Version 1.00 – January 15th, 2026)
Jorge Rodriguez
When data scientists build a model, the first stop is usually the evaluation metrics: mean squared error, log-loss, F1 score, and the like. We track performance across training, validation, and test sets to check that the model isn’t just memorizing patterns but can actually generalize. On paper, that’s what makes a model look “ready” for the real world.
But the world outside of datasets is messy, and that’s where the gap begins. Even if your metrics shine, you’re often left in the dark about why the model makes the choices it does. With simple models like linear or logistic regression, you can look at coefficients and verify whether they make sense (e.g., larger floor area should raise house price). With deep neural networks, though, that reasoning disappears into the black box.
At AITIS, we see this as a major roadblock to deploying AI in critical fields. Nowhere is this clearer than in healthcare, where decisions can carry life-or-death consequences. Unlike in e-commerce, where a wrong recommendation just means showing the wrong product, in medicine, practitioners demand to know why the AI concluded what it did before they can trust it.
That’s why we have built something new: a fresh approach to interpretability for neural networks, designed to bring clarity to image classification tasks. In this article, we’ll walk through how our method enhances explainability, bridges the gap between metrics and meaning, and makes AI not just more powerful — but more trustworthy.
The problem with SHAP values
Over the years, researchers have developed various methods to explain image classification models but each comes with trade-offs. Model-specific approaches like Grad-CAM and Guided Grad-CAM work only with CNNs. They’re fast, but fragile: the explanation you get depends heavily on which layer you decide to visualize, and a different layer can produce an entirely different “reasoning” for the same prediction.
On the other hand, model-agnostic methods can be applied to any architecture. LIME perturbs the input (for images, typically by switching superpixels on and off) and fits a simple local surrogate model to the resulting predictions. While creative, its reliance on random sampling makes it inconsistent across runs.
SHAP (SHapley Additive exPlanations) is often seen as the “gold standard.” Unlike other explainability methods, SHAP has a strong mathematical foundation rooted in game theory. Its logic is simple but powerful: just as in a team sport you could evaluate a player’s contribution by considering every possible lineup of teammates, SHAP evaluates each feature by checking all possible combinations of features, ensuring a fair and stable measure of contribution.
The problem is scale. For an image classifier like ResNet, the input is 224×224 pixels, which means 50,176 features. SHAP would need to evaluate 2^50,176 combinations, a number so astronomically large that no amount of computing power today (or tomorrow) could handle it.
To work around this, SHAP implementations group pixels into superpixels (blocks of m×m pixels). While this reduces computation, it creates a new problem: superpixels highlight where the model looked, but not why. They don’t reveal whether the model’s reasoning aligns with human intuition. In other words, SHAP superpixels may draw colorful heatmaps, but they don’t directly improve the trustworthiness of the model.
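A quick back-of-envelope sketch in Python makes the scale concrete (the 16×16 block size below is an illustrative choice, not a prescription):

```python
# Why exact SHAP is infeasible on raw pixels, and what superpixels buy us.
pixels = 224 * 224                 # 50,176 features for a ResNet-sized input
coalitions = 2 ** pixels           # feature subsets exact SHAP must consider

# Grouping into 16x16 superpixels (illustrative block size)
superpixels = (224 // 16) ** 2     # 196 features; 2**196 coalitions is still
                                   # huge, which is why sampling is used
print(pixels, superpixels)         # 50176 196
print(coalitions.bit_length())     # 50177 bits: a number with ~15,000 digits
```

Even after a 250x reduction in feature count, exact enumeration stays out of reach, so practical SHAP implementations fall back on sampling.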
At AITIS, we see this gap as an opportunity. That’s why we’re developing a new approach — one that goes beyond just showing “patches of attention” and instead delivers explanations that are both computationally feasible and meaningful to humans.
Our approach
The feature selection
If we want SHAP to work for image classification, we have to simplify the input features. The challenge is how. Grouping pixels into arbitrary superpixels makes the computation more tractable, but it doesn’t capture what’s truly meaningful in the image. The result is often heatmaps that are computationally valid but semantically empty.
At AITIS, we wanted a smarter way: one that reveals not just where the model is sensitive, but whether it’s paying attention to the parts of the image that actually matter.
Our solution was to introduce a helper model: a lightweight segmentation network that identifies regions of semantic importance. In practice, this segmentation acts as a guide, answering the question: “What should the classifier be looking at?”
For an animal classifier, it separates the animal from the background.
For a skin cancer detector, it isolates the mole from surrounding skin.
More generally, it splits the image into two features: the subject of interest and the context/noise.
By replacing random superpixels with meaningful segmentation masks, we align SHAP explanations with human intuition. This makes it possible not only to see which parts of the image influenced the decision, but also to ask the deeper question: did the model focus on the right thing?
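As a rough sketch of how such a mask plugs into SHAP-style perturbation (the function and variable names here are hypothetical, not our production code): a feature is “switched off” by replacing its region with a baseline value, such as the per-channel image mean.

```python
import numpy as np

def apply_feature_mask(image, lesion_mask, keep_subject, keep_context):
    """Switch the two segmentation-derived features on or off.

    image: (H, W, 3) float array; lesion_mask: (H, W) bool array marking the subject.
    A feature that is switched off is replaced by the per-channel image mean,
    a common baseline choice for SHAP-style perturbation.
    """
    out = image.copy()
    baseline = image.mean(axis=(0, 1))
    if not keep_subject:
        out[lesion_mask] = baseline      # hide the subject region
    if not keep_context:
        out[~lesion_mask] = baseline     # hide the background region
    return out
```

With only these two features, an explainer has just 2^2 = 4 coalitions to evaluate per image instead of an astronomical number of pixel subsets.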
The AITIS SHAP values
With our features defined, the next step is to compute SHAP values. For a classification model, the output is typically a vector of logits z, where the probability of class c is given by the softmax function:
prob_c = exp(z[c]) / sum([exp(z[j]) for j in range(len(z))])
The predicted class p is the one with the largest logit and therefore satisfies z[p] = max(z). So the higher z[c] is, the more likely c is to be the predicted class. The SHAP output tells us how much each feature f contributed to z[c] for class c.
So SHAP is an array of size F x C (F features and C classes) that we can define as S. Therefore, each entry S[f, c] represents the contribution of feature f to class c.
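A minimal NumPy illustration of these quantities (the logit values are made up):

```python
import numpy as np

z = np.array([1.2, -0.3, 0.4])        # illustrative logits, C = 3 classes
prob = np.exp(z) / np.exp(z).sum()    # softmax turns logits into probabilities
p = int(np.argmax(z))                 # predicted class has the largest logit

F, C = 2, len(z)                      # e.g. two segmentation-based features
S = np.zeros((F, C))                  # SHAP matrix: S[f, c] is the
                                      # contribution of feature f to class c
```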
But here’s the problem: even with these values, it’s still not obvious whether the model is really making its decision for the right reasons. Heatmaps look impressive, but they don’t resolve the fundamental question of trust.
At AITIS we wanted to simplify this further for the regular AI user, so we proceed as follows:
We define predicted_c as the predicted class
We define other_c = [c for c in range(len(z)) if c != predicted_c] as the list of classes that were not predicted
And here is the trick:
If feature f has a negative contribution towards class c, and class c belongs to other_c, then we can assume that f contributes the same amount positively towards predicting the class predicted_c.
If feature f has a positive contribution towards class c, and class c belongs to other_c, then we can assume that f contributes the same amount negatively towards predicting the class predicted_c.
In other words, the contributions to non-predicted classes mirror their influence on the predicted class, just with opposite signs. From a game theory perspective this means: negative contributions of the opposing teams are positive contributions towards my team.
It is good practice to divide the S matrix by its maximum absolute value; this way we focus on relative contributions, ignoring absolute magnitudes.
S_rel = S / max(abs(S))
Therefore we can simplify the SHAP values as:
S_simplified = (S_rel[:, predicted_c] - S_rel[:, other_c].sum(axis=1)) / C
The result is a global view of how each feature drives the prediction: whether it pushes the model toward the predicted class or pulls it away. This is in line with our premise: what is the model looking at to make its decision?
We divide by C, the number of classes, because the expression sums C terms that each lie in [-1, 1]; the division keeps the simplified SHAP values between -1 and 1. In the two-class example below, C = 2.
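The whole simplification can be sketched in a few lines of NumPy. The input matrix below is illustrative (it is the same one used in the worked example later in the article), and we divide by the number of classes, which equals the number of features in this two-class case:

```python
import numpy as np

# Illustrative 2x2 SHAP matrix: rows = features (skin, mole),
# columns = classes (benign, malignant).
S = np.array([[0.32, -0.15],
              [-0.36, 0.25]])
predicted_c = 1                                  # the model predicted "malignant"
other_c = [c for c in range(S.shape[1]) if c != predicted_c]

S_rel = S / np.abs(S).max()                      # relative contributions in [-1, 1]
C = S.shape[1]                                   # dividing by the number of classes
                                                 # keeps the result in [-1, 1]
S_simplified = (S_rel[:, predicted_c] - S_rel[:, other_c].sum(axis=1)) / C
print(np.round(S_simplified, 3))                 # [-0.653  0.847]
```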
The AITIS Explainability Score
Now that we have created our SHAP values, we would still like a metric that helps the end user estimate whether they should trust the model. It needs to be based on the simplified SHAP values, and it should reward contributions from the selected important feature over the other contributions.
So we need to build a function e(x, y) where x represents the contribution of the important feature and y the contribution of the non-important feature. We can define some intuitive guidelines for what to expect of the explainability score:
Bounds for x and y: x and y are contained within [-1, 1]. Meaning -1 ≤ x ≤ 1, -1 ≤ y ≤ 1.
Bounds for e: e is contained within [0, 1]. This keeps the score interpretable like a probability or confidence: the user can easily understand that 0 means “not trustworthy” and 1 means “fully trustworthy.” Meaning 0 ≤ e(x, y) ≤ 1.
Monotonicity: Increasing x should not decrease e, and increasing y should not increase e. Meaning de/dx ≥ 0, de/dy ≤ 0.
Symmetry in negative contributions: If the model gives a strong negative contribution to the important feature (x < 0) or a positive contribution to the non-important feature (y > 0), trust should decrease significantly. Meaning, for x, y > 0: e(-x, y) < e(x, y) and e(x, -y) > e(x, y).
Baselining: If both contributions are 0, the trust score should reflect the ignorance of the model. Meaning e(0, 0) = 0.5.
There are multiple functions that satisfy these conditions. For example:
e(x, y) = (x - y + 2) / 4
e(x, y) = 1 / (1 + exp(-k * (x - y)))
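Both candidates are easy to implement and check against the guidelines. A minimal sketch (the function names are ours, for illustration only):

```python
import math

def e_linear(x, y):
    # e(x, y) = (x - y + 2) / 4
    return (x - y + 2) / 4

def e_sigmoid(x, y, k=1.0):
    # e(x, y) = 1 / (1 + exp(-k * (x - y)))
    return 1 / (1 + math.exp(-k * (x - y)))

# Checking the guidelines:
assert e_linear(0, 0) == 0.5 and e_sigmoid(0, 0) == 0.5   # baselining
assert 0 <= e_linear(1, -1) <= 1                          # bounds at the extremes
assert e_linear(0.9, -0.5) > e_linear(0.1, -0.5)          # increasing in x
assert e_sigmoid(0.5, -0.9) > e_sigmoid(0.5, 0.9)         # decreasing in y
```

The linear version weighs every unit of contribution equally; the sigmoid version lets you tune, via k, how quickly trust saturates toward 0 or 1.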
The visual example
The feature selection
To demonstrate the power of our approach, we will look at the case of a skin cancer model which classifies an image as benign or malignant. Here’s a traditional SHAP approach:

For this approach we divide the image into square superpixels, which allows us to decrease the number of features. This is the usual procedure with SHAP. However, if we look at the corner features, this is what they look like:

The regions look very similar, and using SHAP with those features is not really informative. In the end, we do not care about the different contributions of those skin areas; what we care about is whether the model is paying attention to the mole area. With this approach we generate information that is less informative than we would like, at the cost of computation time.
Now let’s look at our proposed features, extracted with a very simple segmentation model using classical computer vision:

We can see that, even with this very simple model, we have already defined two regions: mole and skin. Therefore our question to the explainability algorithm is: Is the mole area contributing more to the classification than the skin area? Or in other words: Is my model looking at the mole?
For reference, here’s the code that generates the mask above:
import cv2
import numpy as np

# Load the image and convert to grayscale for thresholding.
image = cv2.imread("mole_image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 0)

# Otsu thresholding separates the (darker) mole from the surrounding skin.
_, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Morphological close/open removes small holes and specks in the mask.
kernel = np.ones((5, 5), np.uint8)
morph = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=2)
morph = cv2.morphologyEx(morph, cv2.MORPH_OPEN, kernel, iterations=1)

# Keep only the largest contour, assumed to be the lesion.
contours, _ = cv2.findContours(morph, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
lesion_mask = np.zeros_like(gray)
if contours:
    c = max(contours, key=cv2.contourArea)
    cv2.drawContours(lesion_mask, [c], -1, 255, -1)
At AITIS we use more advanced segmentation methods to ensure higher accuracy, but even a simple method like this is enough to simplify the computation of the SHAP values while enhancing their interpretability.
The AITIS SHAP values
When we take the image and its mask and run them through the SHAP algorithm using our model, we get the following SHAP values matrix for a single input. For this particular example, class 0 is benign and class 1 is malignant, while feature 0 is skin and feature 1 is mole.
shap = [
[ 0.32, -0.15],
[-0.36, 0.25]
]
To further simplify this, we can understand the shap values in terms of the maximum absolute contribution:
shap_normalised = [
[ 0.889, -0.417],
[-1, 0.694]
]
We normalise SHAP values to focus on relative contributions, ignoring their absolute magnitudes. After normalisation:
+1 indicates the feature contributes the most toward that class.
-1 indicates the feature contributes the least (or even opposes) that class.
For example:
Feature 0 (skin):
Contribution to benign (class 0): shap[0, 0] = 0.889, which is strongly positive, nearly the maximum contribution.
Contribution to malignant (class 1): shap[0, 1] = -0.417, indicating a moderate negative contribution.
Feature 1 (mole):
Contribution to benign (class 0): shap[1, 0] = -1, the strongest negative contribution, meaning it strongly opposes benign.
Contribution to malignant (class 1): shap[1, 1] = 0.694, a moderately strong positive contribution toward malignant.
This approach makes it easier to see which features are most influential for each class and in what direction, without worrying about raw magnitudes.
Now, the predicted class for this image was malignant (class 1). Therefore, the simplified SHAP values are:
shap_simplified = (shap_normalised[:, 1] - shap_normalised[:, 0]) / 2
shap_simplified = [-0.653, 0.847]
That means the mole is contributing more than the skin towards the final prediction (|0.847| > |-0.653|). It also means the mole’s contribution is pushing the prediction towards malignant (0.847 > 0), while the skin is pulling it away from malignant (-0.653 < 0).
We can even visualise the SHAP values on a single image:

Now it is very clear to the end user where the model is putting its attention in order to make its final classification decision, and all because of the inclusion of a helper model. For a skin cancer model, the end user would like to see more green on the mole area.
After that, we can apply the selected function:
e(x, y) = (x - y + 2) / 4 ---> e(0.847, -0.653) = 0.88
e(x, y) = 1 / (1 + exp(-k * (x - y))) ---> e(0.847, -0.653, k=1) = 0.82
References
- Devireddy, K. (2025). A Comparative Study of Explainable AI Methods: Model-Agnostic vs. Model-Specific Approaches.
- Chattopadhyay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. (2017). Grad-CAM++: Generalized Gradient-based Visual Explanations for Deep Convolutional Networks. arXiv:1710.11063.
- Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2019). Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. International Journal of Computer Vision, 128(2), 336–359.
- Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier.
- Lundberg, S., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions.

