When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI

Can we guarantee patient privacy without sacrificing model accuracy?

Imagine you’re building your dream home. Just about everything is ready. All that’s left to do is pick out a front door. Since the neighborhood has a low crime rate, you decide you want a door with a standard lock — nothing too fancy, but probably enough to deter 99.9% of would-be burglars.

Unfortunately, the local homeowners’ association (HOA) has a rule stating that all front doors in the neighborhood must be bank vault doors. Their reasoning? Bank vault doors are the only doors that have been mathematically proven to be absolutely secure. As far as they’re concerned, any front door below that standard may as well not be there at all.

You’re left with three options, none of which seems particularly appealing:

  • Concede defeat and have a bank vault door installed. Not only is this expensive and cumbersome, but you’ll be left with a front door that bogs you down every single time you want to open or close it. At least burglars won’t be a problem!
  • Leave your house doorless. The HOA rule imposes requirements on any front door in the neighborhood, but it doesn’t technically forbid you from not installing a door at all. That would save you a lot of time and money. The downside, of course, is that it would allow anyone to come and go as they please. On top of that, the HOA could always close the loophole, taking you back to square one.
  • Opt out entirely. Faced with such a stark dilemma (all-in on either security or practicality), you choose not to play the game at all, selling your nearly-complete house and looking for someplace else to live.

This scenario is obviously completely unrealistic. In real life, everybody strives to strike an appropriate balance between security and practicality. This balance is informed by everyone’s own circumstances and risk analysis, but it universally lands somewhere between the two extremes of bank vault door and no door at all.

But what if instead of your dream home, you imagined a medical AI model that has the power to help doctors improve patient outcomes? Highly-sensitive training data points from patients are your valuables. The privacy protection measures you take are the front door you choose to install. Healthcare providers and the scientific community are the HOA. 

Suddenly, the scenario is much closer to reality. In this article, we’ll explore why that is. After understanding the problem, we’ll consider a simple but empirically effective solution proposed in the paper Reconciling privacy and accuracy in AI for medical imaging [1]. The authors propose a balanced alternative to the three bad choices laid out above, much like the real-life approach of a typical front door.


The State of Patient Privacy in Medical AI

Over the past few years, artificial intelligence has become an ever more ubiquitous part of our day-to-day lives, proving its utility across a wide range of domains. The rising use of AI models has, however, raised questions and concerns about protecting the privacy of the data used to train them. You may remember the well-known case of ChatGPT, just months after its initial release, exposing proprietary code from Samsung [2].

Some of the privacy risks associated with AI models are obvious. For example, if the training data used for a model isn’t stored securely enough, bad actors could find ways to access it directly. Others are more insidious, such as the risk of reconstruction. As the name implies, in a reconstruction attack, a bad actor attempts to reconstruct a model’s training data without needing to gain direct access to the dataset.

Medical records are one of the most sensitive kinds of personal information there are. Although specific regulation varies by jurisdiction, patient data is generally subject to stringent safeguards, with hefty fines for inadequate protection. Beyond the letter of the law, unintentionally exposing such data could irreparably damage our ability to use specialized AI to empower medical professionals. 

As Ziller, Mueller, Stieger, et al. point out [1], fully taking advantage of medical AI requires rich datasets comprising information from actual patients. This information must be obtained with the full consent of the patient. Ethically acquiring medical data for research was challenging enough as it was before the unique challenges posed by AI came into play. But if proprietary code being exposed caused Samsung to ban the use of ChatGPT [2], what would happen if attackers managed to reconstruct MRI scans and identify the patients they belonged to? Even isolated instances of negligent protection against data reconstruction could end up being a monumental setback for medical AI as a whole.

Tying this back into our front door metaphor, the HOA statute calling for bank vault doors starts to make a little bit more sense. When the cost of a single break-in could be so catastrophic for the entire neighborhood, it’s only natural to want to go to any lengths to prevent them. 

Differential Privacy (DP) as a Theoretical Bank Vault Door

Before we discuss what an appropriate balance between privacy and practicality might look like in the context of medical AI, we have to turn our attention to the inherent tradeoff between protecting an AI model’s training data and optimizing for quality of performance. This will set the stage for us to develop a basic understanding of Differential Privacy (DP), the theoretical gold standard of privacy protection.

Although academic interest in training data privacy has increased significantly over the past four years, the principles on which much of the conversation is based were pointed out by researchers well before the recent LLM boom, and even before OpenAI was founded in 2015. Though it doesn’t deal with reconstruction per se, the 2013 paper Hacking smart machines with smarter ones [3] demonstrates a generalizable attack methodology capable of accurately inferring statistical properties of machine learning classifiers, noting:

“Although ML algorithms are known and publicly released, training sets may not be reasonably ascertainable and, indeed, may be guarded as trade secrets. While much research has been performed about the privacy of the elements of training sets, […] we focus our attention on ML classifiers and on the statistical information that can be unconsciously or maliciously revealed from them. We show that it is possible to infer unexpected but useful information from ML classifiers.” [3]

Theoretical data reconstruction attacks were described even earlier, in a context not directly pertaining to machine learning. The landmark 2003 paper Revealing information while preserving privacy [4] demonstrates a polynomial-time reconstruction algorithm for statistical databases. (Such databases are intended to provide answers to questions about their data in aggregate while keeping individual data points anonymous.) The authors show that to mitigate the risk of reconstruction, a certain amount of noise needs to be introduced into the data. Needless to say, perturbing the original data in this way, while necessary for privacy, has implications for the quality of the responses to queries, i.e., the accuracy of the statistical database.

In explaining the purpose of DP in the first chapter of their book The Algorithmic Foundations of Differential Privacy [5], Cynthia Dwork and Aaron Roth address this tradeoff between privacy and accuracy:

“[T]he Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way. The goal of algorithmic research on differential privacy is to postpone this inevitability as long as possible. Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.” [5]

The notion of “learning nothing about an individual while learning useful information about a population” is captured by considering two datasets that differ by a single entry (one that includes the entry and one that doesn’t). An (ε, δ)-differentially private querying mechanism is one for which the probability of any given set of outputs being returned when querying one dataset is at most exp(ε) times the probability of that set being returned when querying the other dataset, plus a small additive slack δ. Denoting the mechanism by M, a set of possible outputs by S, and the datasets by x and y, we formalize this as [5]:

Pr[M(x) ∈ S] ≤ exp(ε) ⋅ Pr[M(y) ∈ S] + δ

Where ε is the privacy loss parameter and δ is the failure probability parameter. ε quantifies how much privacy is lost as a result of a query, while a positive δ allows for privacy to fail altogether for a query at a certain (usually very low) probability. Note that ε sits in an exponent, meaning that even slightly increasing it can cause privacy to decay significantly.
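To make the definition a little more concrete, here is a minimal sketch (not taken from the paper; the function name and the toy counting query are purely illustrative) of the classic Laplace mechanism, which achieves (ε, 0)-DP for numeric queries by calibrating the noise scale to the query’s sensitivity and to ε:

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Return an (epsilon, 0)-differentially private answer to a numeric query.

    Laplace noise with scale sensitivity / epsilon satisfies (epsilon, 0)-DP for a
    query whose output changes by at most `sensitivity` when a single entry is
    added to or removed from the dataset."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many patients are over 60?") has sensitivity 1.
true_count = 42
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0))    # high privacy, noisier answer
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=100.0))  # low privacy, nearly exact
```

A smaller ε forces a larger noise scale, which is exactly the privacy/accuracy tension we return to below.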

An important and useful property of DP is composition. Notice that the definition above only applies to cases where we run a single query. The composition property helps us generalize it to cover multiple queries based on the fact that privacy loss and failure probability accumulate predictably when we compose several queries, be they based on the same mechanism or different ones. This accumulation is easily proven to be (at most) linear [5]. What this means is that, rather than considering a privacy loss parameter for one query, we may view ε as a privacy budget that can be utilized across a number of queries. For example, when taken together, one query using a (1, 0)-DP mechanism and two queries using a (0.5, 0)-DP mechanism satisfy (2, 0)-DP.
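As a toy illustration of how composition lets us treat ε as a budget, here is a sketch of a simple budget tracker (the class and its interface are hypothetical, chosen to mirror the (1, 0) + (0.5, 0) + (0.5, 0) = (2, 0) example above):

```python
class BasicCompositionAccountant:
    """Track cumulative privacy loss under basic (linear) composition."""

    def __init__(self, epsilon_budget: float, delta_budget: float = 0.0):
        self.epsilon_budget, self.delta_budget = epsilon_budget, delta_budget
        self.epsilon_spent, self.delta_spent = 0.0, 0.0

    def spend(self, epsilon: float, delta: float = 0.0) -> None:
        """Record one (epsilon, delta)-DP query, refusing it if the budget would be exceeded."""
        if (self.epsilon_spent + epsilon > self.epsilon_budget
                or self.delta_spent + delta > self.delta_budget):
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.epsilon_spent += epsilon
        self.delta_spent += delta

accountant = BasicCompositionAccountant(epsilon_budget=2.0)
accountant.spend(1.0)   # one (1, 0)-DP query
accountant.spend(0.5)   # first (0.5, 0)-DP query
accountant.spend(0.5)   # second (0.5, 0)-DP query -- budget now fully spent: (2, 0)-DP overall
```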

The value of DP comes from the theoretical privacy guarantees it promises. Setting ε = 1 and δ = 0, for example, we find that the probability of any given output occurring when querying dataset y is at most exp(1) = e ≈ 2.718 times greater than that same output occurring when querying dataset x. Why does this matter? Because the greater the discrepancy between the probabilities of certain outputs occurring, the easier it is to determine the contribution of the individual entry by which the two datasets differ, and the easier it is to ultimately reconstruct that individual entry.

In practice, designing an (ε, δ)-differentially private randomized mechanism entails the addition of random noise drawn from a distribution dependent on ε and δ. The specifics are beyond the scope of this article. Shifting our focus back to machine learning, though, we find that the idea is the same: DP for ML hinges on introducing noise into the training data, which yields robust privacy guarantees in much the same way.
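For a taste of those specifics, here is a small sketch of how the noise scale depends on ε and δ under the classical Gaussian mechanism bound from Dwork and Roth’s book (valid for ε below 1); the helper function is illustrative, not part of any particular library:

```python
import numpy as np

def gaussian_noise_std(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise standard deviation for the classical Gaussian mechanism.

    Choosing sigma >= sqrt(2 * ln(1.25 / delta)) * sensitivity / epsilon
    guarantees (epsilon, delta)-DP for epsilon in (0, 1)."""
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon

# Tighter privacy (smaller epsilon) demands much larger noise:
print(gaussian_noise_std(sensitivity=1.0, epsilon=0.5, delta=1e-5))  # ~9.7
print(gaussian_noise_std(sensitivity=1.0, epsilon=0.1, delta=1e-5))  # ~48
```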

Of course, this is where the tradeoff we mentioned comes into play. Adding noise to the training data comes at the cost of making learning more difficult. We could absolutely add enough noise to achieve ε = 0.01 and δ = 0, making the difference in output probabilities between x and y virtually nonexistent. This would be wonderful for privacy, but terrible for learning. A model trained on such a noisy dataset would perform very poorly on most tasks.

There is no consensus on what constitutes a “good” ε value, or on universal methodologies or best practices for ε selection [6]. In many ways, ε embodies the privacy/accuracy tradeoff, and the “proper” value to aim for is highly context-dependent. ε = 1 is generally regarded as offering high privacy guarantees. Although privacy diminishes exponentially with respect to ε, values as high as ε = 32 are mentioned in literature and thought to provide moderately strong privacy guarantees [1]. 

The authors of Reconciling privacy and accuracy in AI for medical imaging [1] test the effects of DP on the accuracy of AI models on three real-world medical imaging datasets. They do so using various values of ε and comparing them to a non-private (non-DP) control. Table 1 provides a partial summary of their results for ε = 1 and ε = 8:

Table 1: Comparison of AI model performance across the RadImageNet [7], HAM10000 [8], and MSD Liver [9] datasets with δ = 8⋅10⁻⁷ and privacy budgets of ε = 1, ε = 8, and without DP (non-private). A higher MCC/Dice score indicates higher accuracy. Although providing strong theoretical privacy guarantees in the face of a worst-case adversary, DP significantly degrades model accuracy. The negative impact on performance is especially noticeable in the latter two datasets, which are considered small datasets. Image by the author, based on image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 3 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

Even approaching the higher end of the typical ε values attested in literature, DP is still as cumbersome as a bank vault door for medical imaging tasks. The noise introduced into the training data is catastrophic for AI model accuracy, especially when the datasets at hand are small. Note, for example, the huge drop-off in Dice score on the MSD Liver dataset, even with the relatively high ε value of 8.

Ziller, Mueller, Stieger, et al. suggest that the accuracy drawbacks of DP with typical ε values may contribute to the lack of widespread adoption of DP in the field of medical AI [1]. Yes, wanting mathematically-provable privacy guarantees is definitely sensible, but at what cost? Leaving so much of the diagnostic power of AI models on the table in the name of privacy is not an easy choice to make.

Revisiting our dream home scenario armed with an understanding of DP, we find that the options we (seem to) have map neatly onto the three we had for our front door.

  • DP with typical values of ε is like installing a bank vault door: costly, but effective for privacy. As we’ll see, it’s also complete overkill in this case.
  • Not using DP is like not installing a door at all: much easier, but risky. As mentioned above, though, DP has yet to be widely applied in medical AI [1].
  • Passing up opportunities to use AI is like giving up and selling the house: it saves us the headache of dealing with privacy concerns weighed against incentives to maximize accuracy, but a lot of potential is lost in the process.

It looks like we’re at an impasse… unless we think outside the box.

High-Budget DP: Privacy and Accuracy Aren’t an Either/Or

In Reconciling privacy and accuracy in AI for medical imaging [1], Ziller, Mueller, Stieger, et al. offer the medical AI equivalent of a regular front door — an approach that manages to protect privacy while giving up very little in the way of model performance. Granted, this protection is not theoretically optimal — far from it. However, as the authors show through a series of experiments, it is good enough to counter almost any realistic threat of reconstruction. 

As the saying goes, “Perfect is the enemy of good.” In this case, it is the “optimal” — an insistence on arbitrarily low ε values — that locks us into the false dichotomy of total privacy versus total accuracy. Just as a bank vault door has its place in the real world, so does DP with ε ≤ 32. Still, the existence of the bank vault door doesn’t mean plain old front doors don’t also have a place in the world. The same goes for high-budget DP.

The idea behind high-budget DP is straightforward: using privacy budgets (ε values) that are so high that they “are near-universally shunned as being meaningless” [1] — budgets ranging from ε = 10⁶ to as high as ε = 10¹⁵. In theory, these provide such weak privacy guarantees that it seems like common sense to dismiss them as no better than not using DP at all. In practice, though, this couldn’t be further from the truth. As we will see by looking at the results from the paper, high-budget DP shows significant promise in countering realistic threats. As Ziller, Mueller, Stieger, et al. put it [1]:

“[E]ven a ‘pinch of privacy’ has drastic effects in practical scenarios.”

First, though, we need to ask ourselves what we consider to be a “realistic” threat. Any discussion of the efficacy of high-budget DP is inextricably tied to the threat model under which we choose to evaluate it. In this context, a threat model is simply the set of assumptions we make about what a bad actor interested in obtaining our model’s training data is able to do.

Table 2: Comparison of threat models. For all three, we also assume that the adversary has unbounded computational ability. Image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 1 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

The paper’s findings hinge on a calibration of the assumptions to better suit real-world threats to patient privacy. The authors argue that the worst-case model, which is the one typically used for DP, is far too pessimistic. For example, it assumes that the adversary has full access to each original image while attempting to reconstruct it based on the AI model (see Table 2) [1]. This pessimism explains the discrepancy between the reported “drastic effects in practical scenarios” of high privacy budgets and the very weak theoretical privacy guarantees that they offer. We may liken it to incorrectly assessing the security threats a typical house faces, wrongly assuming they are likely to be as sophisticated and enduring as those faced by a bank. 

The authors therefore propose two alternative threat models, which they call the “relaxed” and “realistic” models. Under both of these, adversaries keep some core capabilities from the worst-case model: access to the AI model’s architecture and weights, the ability to manipulate its hyperparameters, and unbounded computational abilities (see Table 2). The realistic adversary is assumed to have no access to the original images and an imperfect reconstruction algorithm. Even these assumptions leave us with a rigorous threat model that may still be considered pessimistic for most real-world scenarios [1].

Having established the three relevant threat models to consider, Ziller, Mueller, Stieger, et al. compare AI model accuracy in conjunction with the reconstruction risk under each threat model at different values of ε. As we saw in Table 1, this is done for three exemplary medical imaging datasets. Their full results are presented in Table 3:

Table 3: Comparison of AI model performance and reconstruction risk per threat model across the RadImageNet [7], HAM10000 [8], and MSD Liver [9] datasets with δ = 8⋅10⁻⁷ and various privacy budgets, including some as high as ε = 10⁹ and ε = 10¹². A higher MCC/Dice score indicates higher accuracy. Image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 3 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

Unsurprisingly, high privacy budgets (exceeding ε = 10⁶) significantly mitigate the loss of accuracy seen with lower (stricter) privacy budgets. Across all tested datasets, models trained with high-budget DP at ε = 10⁹ (HAM10000, MSD Liver) or ε = 10¹² (RadImageNet) perform nearly as well as their non-privately trained counterparts. This is in line with our understanding of the privacy/accuracy tradeoff: the less noise introduced into the training data, the better a model can learn.

What is surprising is the degree of empirical protection afforded by high-budget DP against reconstruction under the realistic threat model. Remarkably, the realistic reconstruction risk is assessed to be 0% for each of the aforementioned models. The high efficacy of high-budget DP in defending medical AI training images against realistic reconstruction attacks is made even clearer by looking at the results of reconstruction attempts. Figure 1 below shows the five most readily reconstructed images from the MSD Liver dataset [9] using DP with high privacy budgets of ε = 10⁶, ε = 10⁹, ε = 10¹², and ε = 10¹⁵.

Note that, at least to the naked eye, even the best reconstructions obtained when using the former two budgets are visually indistinguishable from random noise. This lends intuitive credence to the argument that budgets often deemed too high to provide any meaningful protection could be instrumental in protecting privacy without giving up accuracy when using AI for medical imaging. In contrast, the reconstructions when using ε = 10¹⁵ closely resemble the original images, showing that not all high budgets are created equal.

Based on their findings, Ziller, Mueller, Stieger, et al. make the case for training medical imaging AI models using (at least) high-budget DP as the norm. They note the empirical efficacy of high-budget DP in countering realistic reconstruction risks at very little cost in terms of model accuracy. The authors go so far as to claim that “it seems negligent to train AI models without any form of formal privacy guarantee.” [1]


Conclusion

We started with a hypothetical scenario in which you were forced to decide between a bank vault door or no door at all for your dream home (or giving up and selling the incomplete house). After an exploration of the risks posed by inadequate privacy protection in medical AI, we looked into the privacy/accuracy tradeoff as well as the history and theory behind reconstruction attacks and differential privacy (DP). We then saw how DP with common privacy budgets (ε values) degrades medical AI model performance and compared it to the bank vault door in our hypothetical. 

Finally, we examined empirical results from the paper Reconciling privacy and accuracy in AI for medical imaging to find out how high-budget differential privacy can be used to escape the false dichotomy of bank vault door vs. no door and protect patient privacy in the real world without sacrificing model accuracy in the process.

If you enjoyed this article, please consider following me on LinkedIn to keep up with future articles and projects.

References

[1] Ziller, A., Mueller, T.T., Stieger, S. et al. Reconciling privacy and accuracy in AI for medical imaging. Nat Mach Intell 6, 764–774 (2024). https://doi.org/10.1038/s42256-024-00858-y.

[2] Ray, S. Samsung bans ChatGPT and other chatbots for employees after sensitive code leak. Forbes (2023). https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/.

[3] Ateniese, G., Mancini, L. V., Spognardi, A. et al. Hacking smart machines with smarter ones: how to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10, 137–150 (2015). https://doi.org/10.48550/arXiv.1306.4447.

[4] Dinur, I. & Nissim, K. Revealing information while preserving privacy. Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp Principles Database Syst 202–210 (2003). https://doi.org/10.1145/773153.773173.

[5] Dwork, C. & Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 211–407 (2014). https://doi.org/10.1561/0400000042.

[6] Dwork, C., Kohli, N. & Mulligan, D. Differential privacy in practice: expose your epsilons! Journal of Privacy and Confidentiality 9 (2019). https://doi.org/10.29012/jpc.689.

[7] Mei, X., Liu, Z., Robson, P.M. et al. RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiol Artif Intell 4.5, e210315 (2022). https://doi.org/10.1148/ryai.210315.

[8] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data 5, 180161 (2018). https://doi.org/10.1038/sdata.2018.161.

[9] Antonelli, M., Reinke, A., Bakas, S. et al. The Medical Segmentation Decathlon. Nat Commun 13, 4128 (2022). https://doi.org/10.1038/s41467-022-30695-9.

Statistical Learnability of Strategic Linear Classifiers: A Proof Walkthrough

With the help of an intricate geometric construction, we can prove that instance-wise cost functions quickly drive SVC to infinity.
In the previous article in this series, we examined the concept of strategic VC dimension (SVC) and its connection to the Fundamental Theorem of Strategic Learning. We will make use of both of those in this article, alongside the ideas of achievable labelings and strategic shattering coefficients that we explored in the lead-up to them.

Quantifying the Complexity and Learnability of Strategic Classification Problems

If you haven’t read the first article in this series yet, I encourage you to start from there before moving on to the aforementioned article on SVC:

Extending PAC Learning to a Strategic Classification Setting

With the context from the other articles in this series in place, a basic grasp of set theory and geometry will be all you’ll need to understand the theorem and its proof.


How an Instance-Wise Cost Function Affects SVC: Stating the Theorem

As we saw, SVC can be used as a tool to estimate the expressive power of a hypothesis class within a strategic classification context. Having carefully defined SVC as a generalization of the canonical VC dimension, we understand that the two have much in common. When, though, does SVC diverge from its canonical counterpart? Can we come up with a scenario in which the strategic aspect of a classification problem significantly increases its complexity? It turns out we can, with relative ease: linear classification.

Linear classification involves determining whether a data point should be positively or negatively classified based on a linear function applied to its features. Geometrically, we can imagine a linear classifier inducing a linear decision boundary in d-dimensional real space (ℝᵈ). Anything on one side of the boundary is positively classified, and anything on the other side is negatively classified. In one-dimensional space, the decision boundary is a threshold (as we saw in the previous article). In two-dimensional space, it’s a line dividing the plane. In general, it’s a hyperplane.

In canonical binary classification, the VC dimension of the hypothesis class comprising all linear classifiers in ℝᵈ is d + 1, which is finite. For example, for d = 2 (linear classifiers in ℝ²), the VC dimension is 3. The Fundamental Theorem of Statistical Learning dictates that canonical linear classification is therefore PAC learnable.⁽¹⁾

Intuitively, we might expect the same conclusion to hold for the strategic analog of the problem. After all, linear classifiers are some of the simplest classifiers there are, and reasoning about them can be rather natural.⁽²⁾

However, that simplicity goes out the window as soon as we throw instance-wise cost functions into the mix. As we will prove:

Given a strategic linear classification problem Sᴛʀᴀᴄ⟨H, R, c⟩, there exists an instance-wise cost function c(z; x) = ℓ(z − x) such that SVC(H, R, c) = ∞.

In other words, using the Fundamental Theorem of Strategic Learning, we find that linear classification in a strategic setting equipped with an instance-wise cost function is not generally PAC learnable. Interestingly, it will not be PAC learnable even if we strip away as much complexity as we can. In this case, we will do so by focusing on strategic linear classification on the Cartesian plane ( X ⊆ ℝ²) with the homogeneous preference class (R = { 1 }).

The more general conclusion will follow from the counterexample we will show under those simplifying conditions. If strategic linear classification is not PAC learnable in ℝ², there is no way it could be PAC learnable in any higher dimension. Likewise, every other preference class we laid out in our setup is a strict generalization of the homogeneous preference class. If we could prove PAC learnability for any of those preference classes, we would also be able to do so for the simpler case where R = { 1 }.

From Labelings to Points on the Unit Circle: Proof Setup

Based on the assumptions above, we begin by turning our attention to the special case Sᴛʀᴀᴄ⟨Hₗ, { 1 }, c⟩, with Hₗ being the hypothesis class comprising all linear classifiers in ℝ². We then initialize n two-dimensional feature vectors at the origin: ∀ i ∈ [n]. xᵢ = (0, 0). Since we’re using the homogeneous preference class, we have that ∀ i ∈ [n]. rᵢ = 1. The only difference between the data points will be in how our cost function behaves on each of them. This is where the crux of the proof lies, as we will soon see.

Before we discuss the cost function at length, though, we need to geometrize the possible labelings of our unlabeled data points. As we saw last time, a set of n unlabeled data points must have exactly 2ⁿ possible labelings. Representing a set of labelings (n-tuples) geometrically in ℝ² is relatively straightforward: we simply select an arbitrary point for each possible labeling. In particular, we will choose 2ⁿ such representative points on the unit circle, each assigned to a possible labeling. While the particular coordinates of the representative points themselves are unimportant, we do require that each such point be unique. We also require that no two points be origin-symmetric with respect to one another.

We will denote this set of representative points by S. Having selected our representative points, we use them to define the origin-symmetric set S′, i.e., S′ = { (−x, −y) : (x, y) ∈ S }. Note that S and S′ are disjoint (S ∩ S′ = ∅) as a consequence of how we selected the points in S.
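To make the construction concrete, here is a sketch (my own, not from the paper) of one way to pick the 2ⁿ representative points: placing them at distinct angles strictly inside (0, π) guarantees uniqueness and rules out origin-symmetric pairs, since two origin-symmetric points on the unit circle must differ by an angle of exactly π:

```python
import numpy as np
from itertools import product

def representative_points(n: int) -> dict:
    """Assign each of the 2^n labelings of n data points a unique point on the unit circle.

    All chosen angles lie strictly inside (0, pi), so no two points are
    origin-symmetric (that would require an angle gap of exactly pi)."""
    labelings = list(product([-1, 1], repeat=n))
    angles = np.linspace(0.0, np.pi, len(labelings) + 2)[1:-1]  # interior angles only
    return {lab: np.array([np.cos(t), np.sin(t)]) for lab, t in zip(labelings, angles)}

S = representative_points(2)                 # 2^2 = 4 labelings -> 4 points on the circle
S_prime = {lab: -p for lab, p in S.items()}  # the origin-symmetric set S'
```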

For a particular xᵢ, we define Sᵢ as the subset of S that includes only the points that represent labelings in which xᵢ is positively classified. Similarly, we derive the origin-symmetric Sᵢ′ ⊆ S′ from Sᵢ. In the example below, the points above the x-axis are those representing labelings in which xᵢ is positively classified, i.e., Sᵢ. Those below the x-axis comprise their origin-symmetric set Sᵢ′ (with the numbering matching between symmetric pairs of points). Note that the selection of points above the x-axis is completely arbitrary.

Figure 1: An example of labeling geometrization for an arbitrary xᵢ. Recall that we started by initializing all unlabeled data points as (0, 0). Points in Sᵢ are numbered in black. Their origin-symmetric counterparts in Sᵢ′ are numbered in blue. Image by the author, based on image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

We proceed to construct a convex polygon Gᵢ, with its vertices being the points in Sᵢ ∪ Sᵢ′. The Gᵢ for each unlabeled data point will be key in designing an instance-wise cost function c with which we will always be able to achieve all possible labelings, thus showing that SVC(Hₗ, { 1 }, c) = ∞. Towards this end, the convexity of Gᵢ will prove critical, as will its origin symmetry (stemming from our choice of Sᵢ′).

Figure 2: The convex, origin-symmetric polygon Gᵢ derived from Sᵢ and Sᵢ′ as shown in Figure 1. Image by the author, based on image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

Turning Polygons into Preferences: Constructing the Cost Function

For each of the n origin-initialized, unlabeled data points we started with, we now have a convex, origin-symmetric polygon that represents the labelings in which it is positively classified. Each Gᵢ can now be used to define the behavior of our instance-wise cost function c on its corresponding xᵢ. We will use Gᵢ to define a seminorm⁽³⁾:

∥ y ∥ɢᵢ = inf { ε ∈ ℝ⁺ : y ∈ εGᵢ }

This definition implies that the distance between xᵢ and some point z is less than 1 if and only if z lies within Gᵢ. I.e.:

∥ z − xᵢ ∥ɢᵢ < 1 ⇔ z ∈ Gᵢ

For the rest of the proof, it is sufficient that we understand this connection between ∥ ⋅ ∥ɢᵢ and a point being inside Gᵢ. (See Footnote (3) for a discussion of why ∥ ⋅ ∥ɢᵢ qualifies as a seminorm and for more details about its geometric interpretation.)
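For readers who want to see this numerically, here is an illustrative sketch (not from the paper; the function name is mine) that evaluates the Minkowski functional of a convex polygon containing the origin, using the half-plane representation provided by scipy’s ConvexHull:

```python
import numpy as np
from scipy.spatial import ConvexHull

def minkowski_gauge(y: np.ndarray, vertices: np.ndarray) -> float:
    """Compute ||y||_G = inf { eps > 0 : y in eps * G } for a convex polygon G
    (given by its vertices) whose interior contains the origin."""
    hull = ConvexHull(vertices)
    normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
    # A point z lies inside G iff normals @ z + offsets <= 0 for every facet,
    # and -offsets > 0 because the origin lies strictly inside G.
    ratios = (normals @ y) / (-offsets)
    return max(0.0, float(ratios.max()))

# A square centered at the origin: the gauge is below 1 exactly for points inside it.
G = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])
print(minkowski_gauge(np.array([0.5, 0.0]), G))  # 0.5 -> inside, cost below 1
print(minkowski_gauge(np.array([2.0, 0.0]), G))  # 2.0 -> outside, cost above 1
```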

We thus define the instance-wise cost function c:

c(z; xᵢ) = ℓ(z − xᵢ)

Where:

ℓ(z − xᵢ) = ∥ z − xᵢ ∥ɢᵢ

That is, for each unlabeled data point xᵢ, c behaves as ∥ ⋅ ∥ɢᵢ would. Note that this behavior is different for each data point. This is because we constructed a unique Gᵢ for every xᵢ, and each ∥ ⋅ ∥ɢᵢ is derived from its corresponding polygon Gᵢ.

Data Point Best Response as a Result of the Cost Function

With the instance-wise cost function c in place, we may turn our attention to how our data points interact with linear classifiers. Recall that we have constrained our consideration to the homogeneous preference class, meaning that r = 1 for all of our points. I.e., xᵢ stands to gain a reward of magnitude 1 for being positively classified. Given a linear classifier, each data point will thus be willing to incur any cost less than 1 to manipulate its feature vector to ensure it falls on the positive side of the decision boundary. This will guarantee it receives positive utility as a result of the manipulation.

c is designed so that a data point with feature vector xᵢ has to pay ∥ z − xᵢ ∥ɢᵢ to change its feature vector to z. As we saw, as long as z lies inside Gᵢ, this cost will be less than 1.

Suppose we have a decision boundary that crosses Gᵢ (intersects it at two points) with xᵢ falling on its negative half-plane. As illustrated in Figure 3 below, this creates a sub-polygon such that for any z within it:

  • The cost to move to z is less than 1: c(z; xᵢ) = ∥ z − xᵢ ∥ɢᵢ < 1
  • The realized reward for moving is precisely 1: 𝕀(h(z) = 1) ⋅ r = 1

The utility for data point i, 𝕀(h(z) = 1) ⋅ r − c(z; xᵢ), is therefore positive, in turn making any such z a better response than non-manipulation. In other words, the data point will always want to manipulate its feature vector into one that lies in this sub-polygon.

Figure 3: An example of a linear classifier with a decision boundary that properly intersects Gᵢ, with xᵢ falling on its negative half-plane. The resulting "Goldilocks zone" between the decision boundary and the perimeter of Gᵢ is shaded in green. Changing xᵢ to any z lying in this area confers positive utility. This is because the reward gained is 1 and the cost incurred is less than 1. Image by the author, based on image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

Conversely, given a decision boundary that does not cross Gᵢ, no such sub-polygon exists. The cost of manipulating xᵢ to cross the boundary will always be greater than 1, and thus not worth the reward. The data point best response will be the original feature vector, meaning it’s best to stay put.

Figure 4: An example of a polygon Gᵢ and a linear classifier with a decision boundary that does not cross it. Note how there are no points that are both on the positive side of the decision boundary and inside Gᵢ. Equivalently, there are no vectors into which the data point could manipulate its feature vector for positive utility. Image by the author, based on image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).
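Putting the two cases together, the data point’s decision boils down to a simple geometric test. The sketch below (my own simplification, reusing a square polygon as in the gauge example above) checks whether manipulation is worthwhile for a point with reward r = 1 facing a linear classifier h(z) = sign(w ⋅ z + b):

```python
import numpy as np

def will_manipulate(x: np.ndarray, G_vertices: np.ndarray,
                    w: np.ndarray, b: float) -> bool:
    """Is manipulation profitable for a data point at x with reward r = 1 and
    manipulation cost given by the gauge of the polygon G (listed by its vertices)?

    It is profitable exactly when x sits on the negative side of h(z) = sign(w.z + b)
    while some point of the translated polygon x + G lies strictly on the positive
    side (i.e., reachable at a cost below 1). Since G is convex, checking its
    translated vertices suffices."""
    if w @ x + b >= 0:
        return False  # already positively classified: no reason to move
    return bool((((G_vertices + x) @ w + b) > 0).any())

G = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])  # square "cost polygon"
x = np.zeros(2)                                                     # data point at the origin
print(will_manipulate(x, G, w=np.array([1.0, 0.0]), b=-0.5))  # True: the boundary x1 = 0.5 cuts G
print(will_manipulate(x, G, w=np.array([1.0, 0.0]), b=-2.0))  # False: the boundary x1 = 2 misses G
```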

Isolating Representative Points Using Linear Classifiers

We now understand the strategic implications of whether or not a certain decision boundary crosses Gᵢ. Calling to mind the role of our points on the unit circle as representatives of possible labelings, we can demonstrate the connection between labelings where a data point is positively classified and linear classifiers.

Let 𝓛 be an arbitrary labeling of our n data points, and let sₗ ∈ S be its unique representative point on the unit circle. Let xᵢ be one of our unlabeled data points. We will explore the behavior of the data point with respect to a particular linear classifier, denoted hₗ. We require that the decision boundary induced by hₗ do the following:

  1. Cross the unit circle.
  2. Strictly separate sₗ from all other points in SS’.
  3. Positively classify sₗ.

The structure of S ∪ S′ guarantees that such an hₗ exists.⁽⁴⁾
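One simple concrete construction (a sketch of my own, not the paper’s; it relies only on the representative points being distinct and lying on the unit circle) is to project onto sₗ and place the threshold strictly between sₗ’s own projection, which equals 1, and the largest projection of any other point:

```python
import numpy as np

def isolating_classifier(s_l: np.ndarray, other_points: np.ndarray):
    """Return (w, b) such that h(z) = sign(w.z + b) positively classifies the
    unit-circle point s_l and strictly separates it from every other point.

    Since all points lie on the unit circle, s_l . s_l = 1 while s_l . p < 1 for
    every other point p. Any threshold strictly between those projections works,
    and because it is smaller than 1 in absolute value, the induced decision
    boundary also crosses the unit circle."""
    threshold = (1.0 + float((other_points @ s_l).max())) / 2.0
    return s_l, -threshold  # h(z) = sign(s_l . z - threshold)
```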

With hₗ at our disposal, we may explore how our cost function c interacts with hₗ for xᵢ, depending on whether or not xᵢ should be positively classified under 𝓛. In fact, we will prove that a data point is positively classified by hₗ if and only if it is positively labeled under 𝓛.

Let us first consider the case in which we want xᵢ to be positively labeled (see Figure 5). Recall that we defined Sᵢ as "the subset of S that includes only the points that represent labelings in which xᵢ is positively classified." We know, then, that sₗ ∈ Sᵢ. In particular, sₗ must be one of the vertices of Gᵢ. The fact that hₗ strictly separates sₗ from all other points in S ∪ S′ means that it is strictly separated from the other vertices of Gᵢ. Hence, hₗ must cross Gᵢ, incentivizing the data point to manipulate its feature vector.

Figure 5: If xᵢ should be positively labeled under 𝓛, hₗ crosses Gᵢ. This incentivizes the data point to manipulate its feature vector (see Figure 3). Image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

We proceed to examine the case in which we want xᵢ to be negatively labeled under 𝓛 (see Figure 6). As a result of how we constructed Sᵢ, sₗ ∉ Sᵢ. Additionally, having required that the origin-symmetric S′ be disjoint from S, we know that sₗ ∉ Sᵢ′. It follows that sₗ is not a vertex of Gᵢ. Once again, hₗ strictly separates sₗ from all other points in S ∪ S′, including all the vertices of Gᵢ. Because Gᵢ is convex, we conclude that every point in Gᵢ lies on the opposite side of hₗ's decision boundary from sₗ. In other words, hₗ does not cross Gᵢ. Consequently, the data point will choose to stay put rather than "overpaying" to manipulate its feature vector to cross hₗ.

Figure 6: If xᵢ should be negatively labeled under 𝓛, hₗ does not cross Gᵢ. This incentivizes the data point not to manipulate its feature vector (see Figure 4). Image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

In summary, our unlabeled data point xᵢ will engage in manipulation to cross hₗ if and only if 𝓛 dictates that the data point should be positively classified. In our strategic classification setting, this means that hₗ positively classifies a data point if and only if that data point should be positively labeled according to 𝓛.

Putting It All Together: Inducing an Arbitrary Labeling

Using what we have seen so far, we are able to demonstrate that we can achieve any labeling of our n data points we want. Overlaying all of our data points and their respective polygons (see Figure 7), we can see that given a labeling 𝓛, we are able to achieve it with the help of a corresponding linear classifier hₗ.

Figure 7: Simplified visualization of overlaid data points with their corresponding cost function polygons. sₗ represents a labeling 𝓛 in which data point i should be positively classified and data point j should be negatively classified. The unmanipulated feature vectors of the two overlap at (0, 0). However, data point i will be positively classified by hₗ because it will move to the positive side of the decision boundary induced by hₗ (since the boundary crosses Gᵢ). Data point j will stay on the negative side because the boundary does not cross Gⱼ. Image by the author, based on images by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (use under CC-BY 4.0 license).

Any data point xᵢ that 𝓛 rules should be positively classified will manipulate its feature vector and move to the positive side of the decision boundary created by hₗ (like the case in Figure 5). At the same time, any data point xⱼ that should be negatively classified will not be sufficiently incentivized to manipulate its feature vector, causing it to stay on the negative side of the decision boundary. Across all n data points, those that will be positively classified will be exactly the ones that 𝓛 dictates should be positively classified. In other words, we can induce any labeling we wish.

We therefore have, for every n, a sample of n unlabeled, potentially-manipulated data points that is strategically shattered by Hₗ, the hypothesis class of all linear classifiers in ℝ². Based on how we defined strategic shattering coefficients, we find that σₙ(Hₗ, { 1 }, c) = 2ⁿ for every n. It follows that SVC(Hₗ, { 1 }, c) = ∞.


Conclusion

First, we posed linear classification as a problem that is canonically PAC learnable, but not generally strategic PAC learnable. Second, we simplified our strategic classification problem by limiting our consideration to the homogeneous preference class on the Cartesian plane. Given n origin-initialized data points, we mapped each of their 2ⁿ possible labelings to a unique representative point on the unit circle. We then created a polygon for each data point using the representative points corresponding to the labelings under which it should be positively classified.

Based on those polygons, we constructed an instance-wise cost function, c, for which the cost of feature manipulation is less than 1 only within each polygon. Next, we showed that for any labeling 𝓛, we could find a linear classifier that isolates its respective representative point. Such a classifier, paired with c, ensured that only the data points that are supposed to be positively classified according to 𝓛 are incentivized to manipulate their feature vectors to cross the decision boundary and end up with a positive label. Finally, we explained why that means that the SVC of the problem is infinite.

Acknowledgments

Writing this series of articles has been a wonderful journey, and I’m truly grateful to everyone who took the time to read them, especially those who reached out with feedback. This series wouldn’t have been possible without the authors of PAC-Learning for Strategic Classification: Ravi Sundaram, Anil Vullikanti, Haifeng Xu, and Fan Yao. They set the stage beautifully for a deep dive into this fascinating meeting point between machine learning and Game Theory. Thank you to the TDS Editors, particularly Ben Huberman and Ludovic Benistant, for their support and for giving me such a fantastic platform to share my writing. Lastly, a huge thank you to Prof. Inbal Talgam-Cohen for sowing the seeds that grew into this series in the seminar she taught last winter.

If you enjoyed these articles, please consider following me on Medium and LinkedIn to keep up with future articles and projects.


Footnotes

(1) See a proof of this result, as well as a great interactive explanation of linear classification, here. What we refer to as a "linear classifier" throughout this article is technically an affine classifier.

(2) Indeed, the paper shows that restricting our consideration to instance-invariant cost functions yields a problem that is PAC learnable, just like in the canonical setting. For a refresher on the difference between instance-invariant and instance-wise cost functions, see the first article in this series.

(3) ∥ ⋅ ∥ɢᵢ as we defined it is the Minkowski functional of Gᵢ. In this context, we view Gᵢ as the set of points enclosed in the polygon we constructed. The Minkowski functional generalizes the notion of a seminorm. Recall that in the first article in this series, we assumed our cost function is induced by seminorms. We would be remiss not to reconcile this assumption with the construction of ∥ ⋅ ∥ɢᵢ.

Fortunately, it can be proven that the Minkowski functional of a set A is a seminorm under certain conditions [2]:

  • A is a subset of a vector space over ℝ.
  • A is convex.
  • A is symmetric.
  • A is an absorbing set.

Even more fortunately (or rather, by meticulous design), Gᵢ fulfills all of these conditions:

  • Gᵢ is a subset of ℝ², which is a vector space over ℝ.
  • Gᵢ is a convex set since it is bounded by a convex polygon.
  • Gᵢ is origin-symmetric. (This is why we needed the Sᵢ′ in the first place!)
  • Gᵢ is a convex subset of the topological vector space ℝ² equipped with the standard topology induced by the Euclidean norm. Gᵢ also contains the origin in its interior. Therefore, Gᵢ is an absorbing set [3].

∥ ⋅ ∥ɢᵢ is therefore a seminorm for all i, enabling us to continue the proof without violating our assumptions.

(4) Let P be the convex hull of (S ∪ S′) ∖ { sₗ }. Note that S ∪ S′ is finite. It can be proven that the convex hull of a finite set in ℝ² is compact [4]. As a singleton, { sₗ } is closed. P and { sₗ } are also disjoint: since every representative point lies on the unit circle, sₗ cannot be written as a convex combination of the other points, so it does not belong to P. It follows from the hyperplane separation theorem that there exists a hyperplane that strictly separates the two.

Let us also show that the linear decision boundary induced by the aforementioned hyperplane must intersect the unit circle at two points. Recall that a line may intersect a circle at exactly zero, one, or two points.

  • Suppose for the sake of contradiction that the decision boundary and the unit circle are disjoint. Then all points on the unit circle must lie on the same side of the decision boundary, contradicting the assumption that hₗ separates (S ∪ S′) ∖ { sₗ } from sₗ.
  • Due to the fact that the separation is strict, sₗ may not lie on the decision boundary, whereby the latter cannot be tangent to the unit circle.
  • The only remaining possibility is that the decision boundary crosses the unit circle.

As for ensuring hₗ positively classifies sₗ, it is completely up to us which side of the boundary we wish to classify positively.


References

[1] R. Sundaram, A. Vullikanti, H. Xu, F. Yao. PAC-Learning for Strategic Classification (2021), International Conference on Machine Learning.

[2] ProofWiki contributors. Minkowski functional of symmetric convex absorbing set in real vector space is seminorm.

[3] ProofWiki contributors. Convex subset of topological vector space containing zero vector in interior is absorbing set.

[4] Math Stack Exchange contributors. Is convex hull of a finite set of points in R² closed?

Quantifying the Complexity and Learnability of Strategic Classification Problems

How generalizing the notion of VC dimension to a strategic setting can help us understand whether or not a problem is learnable
Image generated by the author using DALL-E 3.

In the first article in this series, we formally defined the strategic classification problem, denoted Sᴛʀᴀᴄ⟨H, R, c⟩, as a generalization of canonical binary classification. We did so based on the paper PAC-Learning for Strategic Classification (Sundaram, Vullikanti, Xu, & Yao, 2021). Along the way, we explored why we should care about considering the various preferences of rational agents during classification and how we can do so (subject to certain assumptions). We will rely heavily on the concepts introduced in the previous article, so I encourage you to read it if you haven’t already.

Extending PAC Learning to a Strategic Classification Setting

We’ll pick up where we left off, using the definition of Sᴛʀᴀᴄ⟨H, R, c⟩ as a jumping-off point for the useful concept of strategic VC dimension (SVC). Once we make sense of SVC, what I call the Fundamental Theorem of Strategic Learning will follow rather naturally.

While helpful, prior familiarity with shattering coefficients, the canonical VC dimension, and the Fundamental Theorem of Statistical Learning will not be necessary for you to follow along. However, there’s far more depth to each of them than I could ever hope to cover as part of this series, let alone in a single article. The curious reader is referred to Andrew Rothman‘s wonderful and very thorough articles on the (canonical) shattering coefficient and VC dimension.

Statistical Learning Theory Part 5: Shattering Coefficient

Statistical Learning Theory Part 6: Vapnik–Chervonenkis (VC) Dimension

As we’ll see, strategic shattering coefficients and SVC are fairly natural generalizations of their canonical (i.e., non-strategic) counterparts. We will therefore begin with a brief rundown of each of those counterparts before explaining how they can be modified to work in a strategic setting.


Counting Achievable Labelings: Canonical Shattering Coefficients

Verbally defining shattering coefficients seems straightforward at first glance:

Given a hypothesis class H, its nᵗʰ shattering coefficient, denoted Sₙ(H), represents the largest number of labelings achievable by classifiers in H on a sample of n feature vectors.

But what is a "labeling"? And what makes it "achievable"? Answering those questions will help us lay some groundwork in pursuit of a more formal definition.

In the context of binary classification, a labeling of a sample of feature vectors is simply any one of the ways we can assign values from the set { -1, 1 } to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), x₁ = 1 and x₂ = 2.

A visualization of the four possible labelings of the sample x₁ = 1, x₂ = 2. Red points are negatively classified, blue ones are positively classified. Image by the author.

The possible labelings are any combination of the classification values we can assign the individual feature vectors independent of one another. We can represent each labeling as a vector, where the first and second coordinate represent the values assigned to x₁ and x₂, respectively. The set of possible labelings is thus { (-1, -1), (-1, 1), (1, -1), (1, 1) }. Note that a sample of size 2 yields 2² = 4 possible labelings – we’ll see how this generalizes to arbitrarily-sized samples soon.

We say that a labeling is achievable by a hypothesis class H if there exists a classifier h ∈ H from which that labeling can result. Continuing with our simple example, suppose we are limited to classifiers of the form x ≥ k, k ∈ ℝ, that is, one-dimensional thresholds such that anything to the right of the threshold is positively classified. The labeling (1, -1) is not achievable by this hypothesis class. x₂ being greater than x₁ implies that any threshold that classifies x₁ positively must do the same for x₂. The set of achievable labelings is therefore { (-1, -1), (-1, 1), (1, 1) }.

Examples of one-dimensional threshold classifiers that can be used to achieve all but one of the possible labelings of the sample x₁ = 1, x₂ = 2. Image by the author.
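As a quick sanity check on this example, here is a short illustrative sketch (the helper function is mine) that enumerates the labelings achievable by threshold classifiers of the form x ≥ k on a given sample:

```python
from itertools import chain

def achievable_labelings_thresholds(sample):
    """All labelings of `sample` achievable by classifiers h_k(x) = +1 if x >= k else -1."""
    xs = sorted(sample)
    # It suffices to try one threshold below every point, one above every point,
    # and one between each pair of consecutive points.
    candidates = chain([xs[0] - 1, xs[-1] + 1],
                       ((a + b) / 2 for a, b in zip(xs, xs[1:])))
    return {tuple(1 if x >= k else -1 for x in sample) for k in candidates}

print(sorted(achievable_labelings_thresholds([1, 2])))
# [(-1, -1), (-1, 1), (1, 1)] -- the labeling (1, -1) is indeed not achievable
```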

Having understood the basic terminology, we can start to develop some notation to formally express elements of the verbal definition with which we started.

We stick to representing labelings as vectors as we did in our simple example, with each coordinate representing the classification value assigned to the corresponding feature vector. There are 2ⁿ possible labelings in total: there are two possible choices for each feature vector, and we can think of a labeling as a collection of n such choices, each made independently of the rest. If a hypothesis class H can achieve all possible labelings of a sample 𝒞ₙ, i.e., if the number of achievable labelings of 𝒞ₙ equals 2ⁿ, we say that H shatters 𝒞ₙ.

Finally, using the notation from above, we converge on a more rigorous definition of Sₙ(H):
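One standard way to write this, consistent with the verbal definition above (the maximum is taken over all samples 𝒞ₙ = (x₁, …, xₙ) of n feature vectors):

Sₙ(H) = max |{ ( h(x₁), …, h(xₙ) ) : h ∈ H }|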

In keeping with our explanation of shattering, Sₙ(H) equalling 2ⁿ implies that there exists a sample of size n that is shattered by H.

Estimating Hypothesis Class Expressiveness: Canonical VC Dimension

The Vapnik–Chervonenkis (VC) dimension is a way to gauge the expressive power of a hypothesis class. It’s based on the idea of shattering we just defined, and it plays an important role in helping us determine which hypothesis classes are PAC learnable and which aren’t.

Let’s begin by attempting to intuitively define the canonical VC dimension:

Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined to be the greatest natural number n for which there exists a sample of size n that is shattered by H.

Using Sₙ(H) enables us to express this much more cleanly and succinctly:

VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

However, this definition isn’t precise. Note that the set of numbers n for which the shattering coefficient equals 2ⁿ may be infinite. (Consequently, it is possible that VCdim(H) = ∞.) If that’s the case, the set has no well-defined maximum. We address this by taking the supremum instead:

VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

This rigorous and concise definition is the one we’ll use moving forward.
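As a quick illustration (again, my own sketch rather than code from the paper), we can brute-force the shattering check over a finite set of classifiers. With the one-dimensional thresholds x ≥ k from before, a single point can be shattered but the pair x₁ = 1, x₂ = 2 cannot, which is consistent with thresholds having VC dimension 1.

```python
def achievable_labelings(sample, hypotheses):
    """All labelings of `sample` realized by some classifier in `hypotheses`."""
    return {tuple(h(x) for x in sample) for h in hypotheses}

def is_shattered(sample, hypotheses):
    """True iff every one of the 2^n labelings of `sample` is achievable."""
    return len(achievable_labelings(sample, hypotheses)) == 2 ** len(sample)

# One-dimensional thresholds x >= k, restricted to a finite grid of k values
# (enough to realize every labeling these classifiers can produce on the samples below).
thresholds = [lambda x, k=k: 1 if x >= k else -1 for k in (0.5, 1.5, 2.5)]

print(is_shattered([1], thresholds))     # True  -> VCdim of thresholds is at least 1
print(is_shattered([1, 2], thresholds))  # False -> this particular pair is not shattered
```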

Adding Preferences to the Mix: Strategic Shattering Coefficients

Generalizing the canonical notions we just went over so that they work in a strategic setup is fairly straightforward. Redefining shattering coefficients in terms of the data point best response we defined in the previous article is practically all we’ll have to do.

Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σₙ(H, R, c), represents the largest number of labelings achievable by classifiers in H on a set of n potentially-manipulated feature vectors, i.e., n data point best responses.

As a reminder, this is how we defined the data point best response:
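Δ(x, r; h) = argmax over z ∈ X of [ 𝕀(h(z) = 1) ⋅ r − c(z; x) ]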

We can tweak the notation we used in our discussion of canonical shattering coefficients to further formalize this:
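σₙ(H, R, c) = max over samples (x₁, r₁), …, (xₙ, rₙ) of |{ (h(Δ(x₁, r₁; h)), …, h(Δ(xₙ, rₙ; h))) : h ∈ H }|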

The main difference is that each x in the sample has to have a corresponding r. Other than that, putting the data point best response where we had x in the canonical case works smoothly.

As a quick sanity check, let’s consider what happens if R = { 0 }. The realized reward term 𝕀(h(z) = 1) ⋅ r will be 0 across all the data points. Maximizing utility thus becomes synonymous with minimizing cost. The best way to minimize the cost incurred by a data point is trivial: never manipulating its feature vector.

Δ(x, r; h) ends up always just being x, placing us firmly within the territory of canonical classification. It follows that σₙ(H, { 0 }, c) = Sₙ(H) for all n, H, and c. This is consistent with our observation that the impartial preference class represented by R = { 0 } is equivalent to canonical binary classification.

Expressiveness with Preferences: Strategic VC Dimension (SVC)

Having defined the nᵗʰ strategic shattering coefficient, we can simply swap out the Sₙ(H) in the canonical definition of the VC dimension for σₙ(H, R, c).

SVC(H, R, c) = sup{ n ∈ ℕ : σₙ(H, R, c) = 2ⁿ }

Based on the example we considered above, we find that SVC(H, { 0 }, c) = VCdim(H) for any H, c. Indeed, SVC is to VCdim as the strategic shattering coefficient is to its canonical equivalent: both are elegant generalizations of non-strategic concepts.

From SVC to Strategic PAC Learnability: The Fundamental Theorem of Strategic Learning

We can now use SVC to state the Fundamental Theorem of Strategic Learning, which relates the complexity of a strategic classification problem to its (agnostic) PAC learnability.

A strategic classification instance Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity for strategic agnostic PAC learning is m(δ, ε) ≤ Cε⁻² ⋅ (SVC(H, R, c) + log(1/δ)), with C being a constant.

We won’t elaborate too much on how this can be proven. Suffice it to say that it boils down to a clever reduction to the (well-documented) Fundamental Theorem of Statistical Learning, which is essentially the non-strategic version of the theorem. If you’re mathematically inclined and interested in the nuts and bolts of the proof, you can find it in Appendix B of the paper.

This theorem essentially completes our generalization of classic PAC learning to a strategic classification setting. It shows that the way we defined SVC doesn’t just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the Fundamental Theorem, we are well-equipped to analyze strategic classification problems much as we would any old binary classification problem. In my opinion, having the ability to determine whether a strategic problem is theoretically learnable or not is pretty incredible!


Conclusion

We began by introducing the canonical shattering coefficient and VC dimension, which are central to PAC learning (and the Fundamental Theorem of Statistical Learning). Next, we leveraged the data point best response to generalize the aforementioned concepts so that they would work in our strategic setup. We defined the strategic VC dimension (SVC) and showed that when faced with the impartial preference class, it degenerates back into the canonical VC dimension. Finally, we demonstrated how SVC relates to strategic PAC learnability by means of the Fundamental Theorem of Strategic Learning.

In the final article in this series, we’ll build on the concepts I’ve introduced here to break down my favorite proof in the paper, which I think is a beautiful illustration of strategic classification and SVC in action.

Statistical Learnability of Strategic Linear Classifiers: A Proof Walkthrough

References

[1] R. Sundaram, A. Vullikanti, H. Xu, F. Yao. PAC-Learning for Strategic Classification (2021), International Conference on Machine Learning.

The post Quantifying the Complexity and Learnability of Strategic Classification Problems appeared first on Towards Data Science.

Extending PAC Learning to a Strategic Classification Setting https://towardsdatascience.com/extending-pac-learning-to-a-strategic-classification-setting-6c374935dde2/ Thu, 04 Apr 2024 17:26:15 +0000 https://towardsdatascience.com/extending-pac-learning-to-a-strategic-classification-setting-6c374935dde2/ A case study of the meeting point between game theory and fundamental concepts in machine learning

The post Extending PAC Learning to a Strategic Classification Setting appeared first on Towards Data Science.

Last semester, I took a seminar in Incentives and Learning. The papers we discussed throughout the course dealt with the overlap between the fields of game theory and machine learning. I had very little familiarity with formal game theory beforehand, and I thought finding out more about it through the lens of its intersection with machine learning was fascinating. By the end of this article, I hope you’ll think so, too!

The paper my group chose to present was PAC-Learning for Strategic Classification (Sundaram, Vullikanti, Xu, & Yao, 2021). It generalizes the basic machine learning notion of PAC learning to work in a strategic binary classification setting. The word "strategic" here signifies that the data points we want to classify aren’t just data points, but rather represent rational agents with their own individual preferences.

This will be a three-part series on my takeaways from the paper. In this article, I’ll lay the intuitive and formal foundations required to understand the strategic classification model and setup. In the next one, I’ll cover the concept of strategic VC dimension as a generalization of the canonical notion of VC dimension. The final post will be a walkthrough of my favorite proof in the paper, which will tie together the definitions and ideas introduced in the first two.

An understanding of binary classification and the basic notation used for it in the context of machine learning should be all you need to understand the articles in this series. Ultimately, the goal is to present the concepts in a way that makes them as approachable as possible, regardless of your background.


Why Strategic Classification Is Useful: Motivation

Binary classification is a cornerstone of machine learning. It was the first topic I was taught when I took an introductory course on the subject; the real-world example we examined back then was the problem of classifying emails as either spam or not spam. Other common examples include diagnosing a disease and screening resumes for a job posting.

The basic binary classification setup is intuitive and easily applicable to our day-to-day lives, and it can serve as a helpful demonstration of the ways we can leverage machine learning to solve human problems. But how often do we stop to consider the fact that people usually have a vested interest in the classification outcome of such problems? Spammers want their emails to make it through spam filters, not everyone wants their COVID test to come back positive, and job seekers may be willing to stretch the truth to score an interview. The data points aren’t just data points – they’re active participants in the classification process, often aiming to game the system to their own benefit.

In light of this, the canonical binary classification setup seems a bit simplistic. However, the complexity of reexamining binary classification while tossing out the implicit assumption that the objects we wish to classify are uninfluenced by external stakes sounds unmanageable. The preferences that could affect the classification process come in so many different forms – how could we possibly take all of them into account?

It turns out that, under certain assumptions, we can. Through a clever generalization of the canonical binary classification model, the paper’s authors demonstrate the feasibility of designing computationally tractable, gaming-resistant classification algorithms.

From Data Points to Rational Agents: Preference Classes

First, if we want to be as realistic as possible, we have to properly consider the wide breadth of forms that real-world preferences can take among rational agents. The paper mentions five increasingly general categories of preferences (which I’ll call preference classes). The names I’ll use for them are my own, but are based on the terminology used in the paper.

  1. Impartial: No preferences, just like in canonical binary classification.
  2. Homogeneous: Identical preferences across all the agents involved. For example, within the set of people who are willing to fill out the paperwork necessary to apply for a tax refund, we can reasonably expect that everyone is equally motivated to get their money back (i.e., to be classified positively).
  3. Adversarial: Equally-motivated agents aim to induce the opposite of their true labels. Think of bluffing in poker – a player with a weak hand (negatively classified) wants their opponents to think they have a strong hand (positively classified), and vice versa. For the "equally-motivated" part, imagine all players bet the same amount.
  4. Generalized Adversarial: Unequally-motivated agents aim to induce the opposite of their true labels. This isn’t too different from the plain Adversarial case. Still, it should be easy to understand how a player with $100 on the line would be willing to go to greater lengths to deceive their opponents than a player betting $1.
  5. General Strategic: "Anything goes." This preference class aims to encompass any set of preferences imaginable. All four of the previously mentioned preference classes are strict subsets of this one. Naturally, this class is the main focus of the paper, and most of the results demonstrated in the paper apply to it. The authors give the wonderful example of college applications, where "students [who] have heterogeneous preferences over universities […] may manipulate their application materials during the admission process."

How can the canonical classification setup be modified to account for such rich agent preferences? The answer is astoundingly simple. Instead of limiting our scope to (x, y) ∈ X × { -1, 1 }, we consider data points of the form (x, y, r) ∈ X × { -1, 1 } × R. A point’s r value represents its preference, which we can break down into two equally important components:

  • The sign of r indicates whether the data point wants to be positively or negatively classified (r > 0 or r < 0, respectively).
  • The absolute value of r specifies how strong the data point’s preference is. For example, a data point with r = 10 would be much more strongly motivated to manipulate its feature vector x to ensure it ends up being positively classified than a data point with r = 1.
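As a minimal illustration (the field names here are my own, not the paper’s), such strategic data points are easy to represent in code:

```python
from typing import NamedTuple

class StrategicPoint(NamedTuple):
    x: float  # feature vector (one-dimensional here for simplicity)
    y: int    # true label, -1 or 1
    r: float  # preference: sign = desired label, magnitude = strength

eager_applicant = StrategicPoint(x=2.0, y=-1, r=10.0)  # strongly wants a positive label
impartial_point = StrategicPoint(x=1.0, y=1, r=0.0)    # no preference, as in the canonical setup
```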

What determines the preference class we operate within is the set R. We can formally define each of the aforementioned preference classes in terms of R and see how the formal definitions align with their intuitive descriptions and examples:

  1. Impartial: R = { 0 }. (This makes it abundantly clear that the strategic setup is just a generalization of the canonical setup.)
  2. Homogeneous: R = { 1 }.
  3. Adversarial: R = { -1, 1 }, with the added requirement that all data points prefer to be classified as the opposite of their true labels.
  4. Generalized Adversarial: R ⊆ ℝ (and all data points prefer to be classified as the opposite of their true labels.)
  5. General Strategic: R ⊆ ℝ.

Giving Preference Magnitude Meaning: Cost Functions

Clearly, though, R on its own isn’t enough to construct an entire general strategic framework. The very idea of a data point’s preference having a certain magnitude is meaningless without tying it to the cost the data point incurs in manipulating its feature vector. Otherwise, any data point with a positive r, no matter how small, would have no reason not to manipulate its feature vector ad infinitum. This is where the concept of cost functions comes into play.

Let c: X × X → ℝ⁺. For simplicity, we will assume (as the paper’s authors do) that c is induced by seminorms. We say that a test data point (x, y, r) may transform its feature vector x into z ∈ X with cost c(z; x). It’s important to note in this context that the paper assumes that the training data is unmanipulated.

We can divide cost functions into two categories, with the former being a subset of the latter. An instance-invariant cost function is the same across all data points. To put it more formally:

∃ℓ: X → ℝ⁺ . ∀(x, y, r) ∈ X × { -1, 1 } × R . ∀z ∈ X . c(z; x) = ℓ(z − x)

I.e., there exists a single function ℓ such that for all data points and all potential manipulated feature vectors, c(z; x) simply takes the value of ℓ(z − x).

An instance-wise cost function may vary between data points. Formally:

∀(x, y, r) ∈ X × { -1, 1 } × R . ∃ℓ: X → ℝ⁺ . ∀z ∈ X . c(z; x) = ℓ(z − x)

I.e., each data point can have its own function ℓ, and c(z; x) takes the value of ℓ(z − x) for each individual data point.

As we will see in the final article in this series, while the difference between the two types of cost functions may seem subtle, instance-wise cost functions are significantly more expressive and harder to learn.
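Here’s a small sketch of the difference, assuming feature vectors are NumPy arrays; the specific seminorms used (a plain L2 norm, and a per-point weighted L1 norm) are illustrative choices of mine, not the paper’s.

```python
import numpy as np

# Instance-invariant: a single function l applied to z - x for every data point.
def invariant_cost(z, x):
    return np.linalg.norm(z - x, ord=2)  # l(z - x) = ||z - x||_2

# Instance-wise: each data point carries its own seminorm, encoded here by
# per-point nonnegative weights.
def instance_wise_cost(z, x, weights):
    return np.linalg.norm(weights * (z - x), ord=1)  # l_x(z - x) = ||w * (z - x)||_1

x = np.array([1.0, 2.0])
z = np.array([2.0, 2.5])
print(invariant_cost(z, x))                            # identical for every data point
print(instance_wise_cost(z, x, np.array([3.0, 1.0])))  # depends on the point's own weights
```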

Preference Classes and Cost Functions in Action: An Example

Let’s take a look at an example given in the paper to help hammer home the aspects of the setup we’ve covered so far.

Image by R. Sundaram, A. Vullikanti, H. Xu, F. Yao from PAC-Learning for Strategic Classification (used under CC-BY 4.0 license).

In this example, we have a decision boundary induced by a linear binary classifier and four data points with individual preferences. General strategic is the only applicable preference class in this case.

The dotted perimeter around each xᵢ shows the manipulated feature vectors z to which it would cost the point exactly 1 to move. Since we assume the cost function is induced by seminorms, everything inside a perimeter has a cost of less than 1 for the corresponding data point to move to. We can easily tell that the cost function in this example varies from data point to data point, which means it is instance-wise.

As we can see, the leftmost data point (x₁, -1, -1) has no incentive to cross the decision boundary, since it is on the negative side of the boundary while also having a negative preference. (x₄, -1, 2), however, wants to be positively classified, and since the reward for manipulating x₄ to cross the boundary (which is 2) outweighs the cost of doing so (which is less than 1), it makes sense to go through with the manipulation. (x₃, 1, -2) is symmetric to (x₄, -1, 2), also deciding to manipulate its feature vector to achieve its desired classification outcome. Lastly, (x₂, -1, 1), the cost function of which we can see is based on taxicab distance, opts to stay put regardless of its preference to be positively classified. This is because the cost of manipulating x₂ to cross the decision boundary would be greater than 1, surpassing the reward the data point would stand to gain by doing so.

Assuming the agents our data points represent are rational, we can very easily tell when a data point should manipulate its feature vector (benefits outweigh costs) and when it shouldn’t (costs outweigh benefits). The next step is to turn our intuitive understanding into something more formal.

Balancing Costs & Benefits: Defining Data Point Best Response

This leads us to define the data point best response:
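Δ(x, r; h) = argmax over z ∈ X of [ 𝕀(h(z) = 1) ⋅ r − c(z; x) ]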

So we’re looking for the feature vector(s) z ∈ X that maximize… what exactly? Let’s break down the expression we’re aiming to maximize into more manageable parts.

  • h: A given binary classifier (h: X → { -1, 1 }).
  • c(z; x): As stated above, this expresses the cost of modifying the feature vector x to be z.
  • 𝕀(h(z) = 1): Here, 𝕀(p) is the indicator function, returning 1 if the predicate p is upheld or 0 if it isn’t. The predicate h(z) = 1 is true if the vector z under consideration is positively classified by h. Putting that together, we find that 𝕀(h(z) = 1) evaluates to 1 for any z that is positively classified. If r is positive, that’s good. If it’s negative, that’s bad.

The bottom line is that we want to find vector(s) z for which the realized reward, 𝕀(h(z) = 1) ⋅ r, outweighs the cost of manipulating the original x into z by as much as possible. To put it in game theoretic terms, the data point best response maximizes the utility of its corresponding agent in the context of the binary classification under consideration.
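Here’s a minimal sketch of this computation in Python (my own simplification, not the paper’s): the classifier is a one-dimensional threshold at 0, the cost is the seminorm-induced |z − x|, and the search runs over a finite grid of candidate feature vectors rather than all of X.

```python
def best_response(x, r, h, cost, candidates):
    """Approximate the data point best response over a finite set of candidates."""
    def utility(z):
        return (1 if h(z) == 1 else 0) * r - cost(z, x)  # realized reward minus cost
    best = max(candidates, key=utility)
    # Stay put unless manipulating is strictly better (one reasonable tie-breaking
    # convention; the definition above is simply an argmax over all of X).
    return best if utility(best) > utility(x) else x

h = lambda z: 1 if z >= 0 else -1           # threshold classifier at 0
cost = lambda z, x: abs(z - x)              # cost induced by the absolute-value seminorm
grid = [i / 100 for i in range(-300, 301)]  # finite stand-in for X = [-3, 3]

print(best_response(-0.5, 2.0, h, cost, grid))  # 0.0: reward 2 outweighs cost 0.5
print(best_response(-2.0, 1.0, h, cost, grid))  # -2.0: crossing would cost 2 > reward 1
```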

Putting It All Together: A Formal Definition of the Strategic Classification Problem

Finally, we’ve laid all the necessary groundwork to formally define the strategic classification problem.

A diagram illustrating the formal definition of the strategic classification problem. Image by author.

Given a hypothesis class H, a preference class R, a cost function c, and a set of n data points drawn from a distribution D, we want to find a binary classifier h′ that minimizes the loss as defined in the diagram above. Note that the loss is simply a modification of the canonical zero-one loss, plugging in the data point best response instead of h(x).
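Roughly speaking, the population version of this modified loss is the probability that the classifier, applied to a data point’s best response rather than to its original feature vector, disagrees with the point’s true label:

ℓ(h) = Pr over (x, y, r) ∼ D of [ h(Δ(x, r; h)) ≠ y ]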

Conclusion

Starting from the canonical binary classification setup, we introduced the notion of preference classes. Next, we saw how to formalize that notion using an r value for each data point. We then saw how cost functions complement data point preferences. After that, we broke down an example before defining the key concept of data point best response based on the ideas we explored beforehand. Lastly, we used the data point best response to define the modified zero-one loss used in the definition of the strategic classification problem.

Join me next time as I define and explain the strategic VC dimension, which is the natural next step from where we left off this time.

Quantifying the Complexity and Learnability of Strategic Classification Problems

References

[1] R. Sundaram, A. Vullikanti, H. Xu, F. Yao. PAC-Learning for Strategic Classification (2021), International Conference on Machine Learning.

The post Extending PAC Learning to a Strategic Classification Setting appeared first on Towards Data Science.
