When I’m not diving into data science, I enjoy chilling out with K-dramas. Lately, I’ve been hooked on “The First Responders,” and I especially love how it handles the medical side of things.

But here’s the twist: even though I’m all about data, I’ve come up with this simple test in my spare time. It’s super cheap and easy, meant for newborn babies. And get this — it can predict whether they might get Pulmonary Arterial Hypertension (PAH) with more than 98% accuracy. Crazy, right?

Now, here’s the kicker: my test is based on something kind of silly — the baby’s name. Yep, you heard that right. According to my idea, if a baby is named “Paul” or “Ari,” they might have a higher chance of getting PAH. Sounds ridiculous, I know, but bear with me.

I even ran it by a family member who’s a doctor. They said it’s not something I should even say out loud, which was a bummer. So, I thought I’d share it here with you guys instead.

But here’s the thing about my test: it’s accurate, sure, but it’s also kind of dumb. And that’s a lesson in itself. See, when you’re making a model to predict something, like whether an email is spam or not, there are four possibilities:

True positive: “This message is spam, and we correctly predicted spam.”
False positive (Type 1 Error): “This message is not spam, but we predicted spam.”
False negative (Type 2 Error): “This message is spam, but we predicted not spam.”
True negative: “This message is not spam, and we correctly predicted not spam.”

We often represent these as counts in a confusion matrix:
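For the spam example, the matrix is just a 2x2 table, with our predictions as rows and the actual labels as columns:

                      Spam              Not spam
Predict spam          True positive     False positive
Predict not spam      False negative    True negative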

My test might seem cool with its high accuracy, but it’s a reminder that accuracy isn’t everything. There’s more to it, and we need to think about the bigger picture when we’re dealing with data.

Now, let’s fit my Pulmonary Arterial Hypertension test into this framework. So, these days, around 5 out of every 1,000 babies are named Paul or Ari. About 1.4% of people will deal with Pulmonary Arterial Hypertension in their lifetime, which means roughly 14 out of every 1,000 folks.

If we assume these factors are independent and throw my “Paul or Ari equals Pulmonary Arterial Hypertension” test into the mix for a million people, here’s what our confusion matrix might look like:
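Working through the numbers: out of 1,000,000 people, about 5,000 are named Paul or Ari and about 14,000 will develop PAH. If the two are independent, then 1.4% of the 5,000 Pauls and Aris, i.e. 70 people, land in both groups, and the remaining cells follow:

                        PAH        No PAH        Total
Named Paul or Ari        70         4,930         5,000
All other names      13,930       981,070       995,000
Total                14,000       986,000     1,000,000

Those four counts (70, 4,930, 13,930, 981,070) are the ones plugged into the calculations below.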

Now, let’s crunch some numbers to gauge how well our model performs. Accuracy is a good starting point. It’s just the proportion of correct predictions out of all predictions made. Here’s how we calculate it:

def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4930, 13930, 981070))  # 0.98114

At first glance, that accuracy figure looks pretty impressive. But hold your horses; this test isn’t all it’s cracked up to be. We can’t just rely on raw accuracy alone. It’s time to delve deeper.

Enter precision. This metric tells us how accurate our positive predictions were:

def precision(tp, fp, fn, tn):
    # Of everything we flagged as positive, how much really was positive?
    return tp / (tp + fp)

print(precision(70, 4930, 13930, 981070))  # 0.014

And then there’s recall, which sheds light on what fraction of the actual positives our model identified:

def recall(tp, fp, fn, tn):
    # Of everything that really was positive, how much did we catch?
    return tp / (tp + fn)

print(recall(70, 4930, 13930, 981070))  # 0.005

These numbers paint a grim picture. Both precision and recall are dismal, reflecting the sorry state of our model. Often, we combine precision and recall into the F1 score, a metric that strikes a balance between the two:

def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

print(f1_score(70, 4930, 13930, 981070))  # ~0.00737

This score, the harmonic mean of precision and recall, always lies between the two, closer to the smaller one. And typically, in choosing a model, there’s a trade-off between precision and recall. A model that shouts “yes” at the slightest hint will likely boast high recall but poor precision. Conversely, a model that screams “yes” only when it’s absolutely certain may have low recall but high precision.

Alternatively, let’s think of this as a trade-off between false positives and false negatives. Saying “yes” too freely leads to loads of false positives, while playing it safe with “no” results in heaps of false negatives.

Now, picture a scenario where there are 10 risk factors for Pulmonary Arterial Hypertension. The more of these factors you possess, the likelier you are to develop the condition. In such a scenario, you can visualize a spectrum of tests: “predict Pulmonary Arterial Hypertension if at least one risk factor,” “predict Pulmonary Arterial Hypertension if at least two risk factors,” and so forth.

As you crank up the threshold, you beef up the test’s precision. After all, folks with more risk factors are more prone to the disease. However, this uptick in precision comes at a cost: the test’s recall takes a hit. With each notch up the threshold ladder, fewer of the eventual disease sufferers meet the criteria.

In cases like these, finding the Goldilocks threshold — a sweet spot where precision and recall strike a harmonious balance — is key. It’s a delicate dance, navigating the trade-off to ensure our test isn’t too trigger-happy or overly cautious.
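As a quick illustration of that spectrum, here’s a small simulation sketch in Python. Everything in it is made up for the sake of the example: each person gets each of the 10 risk factors with probability 0.2, the chance of PAH grows with the number of factors present, and we sweep the rule “predict PAH if at least k risk factors” from k = 1 to 10:

import random

random.seed(0)

def simulate_person():
    # Hypothetical setup: each of the 10 risk factors is present with
    # probability 0.2, and the chance of PAH grows with the count.
    num_factors = sum(random.random() < 0.2 for _ in range(10))
    has_pah = random.random() < 0.01 + 0.03 * num_factors
    return num_factors, has_pah

population = [simulate_person() for _ in range(100_000)]

for k in range(1, 11):
    tp = sum(1 for n, sick in population if n >= k and sick)
    fp = sum(1 for n, sick in population if n >= k and not sick)
    fn = sum(1 for n, sick in population if n < k and sick)
    p = tp / (tp + fp) if tp + fp else float("nan")
    r = tp / (tp + fn) if tp + fn else float("nan")
    print(f"at least {k:2d} factors: precision={p:.3f}, recall={r:.3f}")

With numbers like these, precision should climb as the threshold rises while recall slides toward zero, which is exactly the trade-off described above.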

The Bias-Variance Trade-off

Another lens through which to view the overfitting dilemma is the trade-off between bias and variance. These concepts gauge what would happen if we trained our model multiple times on different sets of training data, all drawn from the same larger population.

Imagine fitting polynomials of different degrees to the same data. For a very simple, low-degree model, any two randomly selected training sets should yield fairly similar fits, because the two sets will have similar average values. We describe that as low variance.

Now, let’s delve into what happens when bias and variance pull the strings. High bias and low variance usually signal underfitting. Picture this: your model just can’t seem to capture the essence of your data, even on its home turf — the training data.

On the flip side, let’s consider a model of degree 9, snugly fitting the training set like a glove. It boasts minimal bias, as it flawlessly moulds itself to the data. However, its Achilles’ heel lies in its sky-high variance. Swap out the training set, and you’re likely to encounter a drastically different model. This, my friend, is the epitome of overfitting.
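To see the variance side of this in action, here’s a minimal sketch using NumPy; the data-generating function, the noise level, and the sample size of 10 points are all invented for illustration. We draw two independent training sets, fit a degree 1 and a degree 9 polynomial to each, and measure how much the two fits of the same degree disagree on a common grid of points:

import numpy as np

rng = np.random.default_rng(42)

def draw_training_set(n):
    # Invented data-generating process: a gentle curve plus noise.
    x = rng.uniform(-1, 1, n)
    y = 2 * x + 0.5 * x ** 2 + rng.normal(0, 0.3, n)
    return x, y

x_grid = np.linspace(-1, 1, 50)

for degree in (1, 9):
    fits = []
    for _ in range(2):  # two independently drawn training sets of 10 points
        x, y = draw_training_set(10)
        fits.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    gap = np.mean(np.abs(fits[0] - fits[1]))
    print(f"degree {degree}: mean disagreement between the two fits = {gap:.2f}")

You should find that the two degree 1 fits land close to each other, while the two degree 9 fits can disagree wildly, which is the high variance that marks overfitting.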

This perspective on model woes offers valuable insights into troubleshooting strategies. If your model suffers from high bias (translating to lackluster performance even on familiar ground), consider beefing up its feature set. On the other hand, if variance runs rampant, you might want to trim down the features. But don’t forget — a game-changer could be obtaining more data, if that’s an option.

Here’s an example: let’s fit a degree 9 polynomial to different-sized samples. When we train the model on just 10 data points, it’s all over the map, just as we’ve seen. But, if we up the ante to 100 data points, the overfitting significantly decreases. And interestingly, when we train on a hefty 1,000 data points, the model starts resembling the simplicity of a degree 1 model.

Keeping the model complexity constant, the more data you throw into the mix, the trickier it becomes to overfit. However, more data won’t magically erase bias. If your model isn’t equipped with enough features to grasp the underlying patterns in the data, no amount of data will come to the rescue.
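Here’s a rough sketch of that experiment, again with an invented data-generating process (a noisy straight line). We fit a degree 9 polynomial to 10, 100, and 1,000 training points and compare the error on the training data with the error on a large held-out sample; a big gap between the two is the signature of overfitting:

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Invented ground truth: a straight line plus noise.
    x = rng.uniform(-1, 1, n)
    y = 3 * x + rng.normal(0, 0.5, n)
    return x, y

x_test, y_test = make_data(10_000)  # a large held-out sample

for n in (10, 100, 1000):
    x_train, y_train = make_data(n)
    coeffs = np.polyfit(x_train, y_train, 9)  # complexity held fixed at degree 9
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"n={n:5d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")

With a tiny training set the training error is near zero while the test error blows up; by 1,000 points the two should be close, and the fitted curve behaves much like a simple degree 1 model, matching the description above.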

Conclusion

In conclusion, while my silly test for predicting PAH in newborns may not be practical, it does highlight the importance of considering multiple factors when evaluating the effectiveness of a model. As data scientists, we must not only focus on accuracy but also take into account other factors such as precision, recall, and F1 score. I hope you enjoyed reading about my love for K-dramas and my wacky prediction model. If you have any thoughts or questions, feel free to leave a comment. Thanks again for stopping by, and I look forward to sharing more insights and stories with you in the future. Check out my latest blog on PCA in Layman’s terms with Math and Code.


✍🏽 Check out my profile for more content like this.

🥇Sign up for my email newsletter to receive updates on new posts!

Have fun reading!
