Sarah. Thanks for being with us. What are the important contributions that this paper makes to the field?
This paper is really the first of its kind to unify two different subfields. So really this paper is looking at the field of compression, which is all about how do we make machine learning accessible? How do we make machine learning work in resource constrained environments? And it's asking whether optimizing for compactness, which is what allows models to work with low latency requirements and low available memory, can compromise other qualities that we care about, in this case fairness. So does compression amplify algorithmic bias in datasets? And we find that yes, it does. And it paves the way for future work, which in many ways tries to reconcile these subfields. Can we have compact models that also don't disparately harm underrepresented features?
Do you want to just quickly explain a few of those key terms, things like [00:01:00] compression and compactness and algorithmic bias?
So compression is really this idea that a lot of the history of deep neural networks has been models of increasing size. So we started with something like LeNet, which had 70,000 parameters, and we've ended up with models that have more than a billion parameters. And this is in tension with our desire to make AI democratic and accessible. So often if I want to deploy a model to your phone, and most of you have some type of model on your phone, we have to do something which is called compression. We have to use a series of techniques that take these large models and make them far smaller. Some popular ways to do that are pruning, which removes weights, or quantization, which reduces the bit representation of the weights on your phone. The other question that you asked was to explain disparate harm. A lot of the concern when we deploy models is whether they perform uniformly on different parts of the population. So when we deliver a [00:02:00] model to a community, we want it to work the same in, say, a country in Latin America as it does on people in the US. And what we mean by disparate harm is when a model performs worse in a way that is disproportionately correlated with a protected attribute, like nationality or age or race.
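For readers who want to see what these two techniques look like in code, here is a minimal sketch using PyTorch's built-in utilities. It is an illustration only, not code from the paper; the toy model and the 90% sparsity level are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a much larger network (illustrative only).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Pruning: zero out the 90% of weights with the smallest magnitude in the
# first linear layer, then make the change permanent.
prune.l1_unstructured(model[0], name="weight", amount=0.9)
prune.remove(model[0], "weight")

# Quantization: store the Linear layers' weights as 8-bit integers instead of
# 32-bit floats, reducing the bit representation of the model.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```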
Thanks. Let's get into something a little deeper, as far as your motivation. Why did you work on this? What were you going for?
What drives my research in general is how do we deploy models in the wild in a way that really serves everyone. So a lot of my motivation for this was just the widespread use of compression. We use it on such a pervasive basis, most models on your phone use compression, if you have AI on your phone. And I really wanted to ask: are we compromising other things that we want of a model when we compress so much? [00:03:00] And I believe in the case for compression as well. I grew up in Africa, so a lot of my interest is in how do we make AI accessible in the resource constrained environments that I grew up in. But it's also important to understand that when we talk about properties that we want a model to have, we often talk about test set accuracy, which is the most commonly referenced desirable property, so we don't really think about how adding other desirable properties may interact. We might not just want the model to be compact. We may want it to be interpretable. We may want it to be robust. We may want it to be fair. And zooming out, a lot of the message of this paper is that by optimizing for one property, we may be compromising others. We have to look at all these properties that we care about in a more holistic sense. And we need to come up with rigorous frameworks to measure the trade-offs between the properties that we want, because many of these subfields have operated in [00:04:00] isolation. As I mentioned, this is really the first paper within the subfield of model compression that's trying to connect model compression trade-offs with fairness, but that's not unusual: a lot of work on interpretability doesn't really think about how interpretability links to compactness. And the same for robustness. There were only links made recently between robustness and interpretability, showing that more robust models tend to be more interpretable.
Okay, at this point, maybe we understand the motivation. We've gotten some of the key contributions at just the beginning level. And you've also given us a couple of examples that we can play with. Do you want to dive a little bit into the details? What else should we really know about what this work did, and is there anything you really want to highlight going forward?
This work is really looking at two specific techniques in compression. One is pruning, which removes a certain fraction of weights in the network in order to make it smaller and require less memory to store. And so what we do specifically is we look at models trained to different [00:05:00] levels of sparsity, or different levels of pruning. And we're asking: how does the decision boundary change? So how does the predictive capacity of the model change at these different levels? And in particular, we want to know how it impacts underrepresented features, so the features that are low frequency in your dataset. And the reason why is that often fairness considerations coincide with how underrepresented features are treated, particularly if those are protected attributes. So a lot of fairness work—
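As a rough sketch of the kind of comparison being described (not the paper's actual evaluation code), one could prune copies of a trained model to increasing sparsity levels and compare accuracy on the full test set against accuracy on an underrepresented subgroup. Here `model`, `test_loader`, and `subgroup_loader` are assumed to be supplied by the caller.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def accuracy(model, loader):
    """Fraction of correctly classified examples in a DataLoader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

def sparsity_sweep(model, test_loader, subgroup_loader,
                   levels=(0.0, 0.5, 0.7, 0.9)):
    """Prune copies of `model` to each sparsity level and report the gap
    between overall accuracy and accuracy on the underrepresented subgroup."""
    for level in levels:
        pruned = copy.deepcopy(model)
        for module in pruned.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                prune.l1_unstructured(module, name="weight", amount=level)
        overall = accuracy(pruned, test_loader)
        subgroup = accuracy(pruned, subgroup_loader)
        print(f"sparsity={level:.0%}  overall={overall:.3f}  "
              f"subgroup={subgroup:.3f}  gap={overall - subgroup:.3f}")
```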
Do you mind just giving an example to guide that part?
Yeah. So a great way of thinking about this is a recent work that looks at how public APIs perform on different parts of the population. That work is Gender Shades, and it shows that these APIs perform far worse on women of color than on women who are white. And so that's an example of a subpopulation which is underrepresented in the training set being harmed [00:06:00] by the performance of the model at test time. We look at a similar example in our paper, which is a dataset where the most underrepresented attribute is blond males. So in terms of the overall data distribution, the lowest frequency attribute is males who are blond, and we classify blond versus non-blond. And what we show is that when you prune, the disparate harm mirrors the representation of features. So the best represented features are the least harmed by things like pruning, and the least represented features are the most impacted by methods like pruning.
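One simple way to see the pattern Sarah describes is to break test accuracy down by the intersection of the two binary attributes (blond and male, as in the example above) for both the dense model and its pruned counterpart. This is a hedged illustration, not code from the paper; the tensors `images`, `labels`, `blond`, and `male` are assumed inputs.

```python
import torch

def subgroup_accuracy(model, images, labels, blond, male):
    """Print accuracy for each (blond, male) cell of the test set."""
    model.eval()
    with torch.no_grad():
        correct = (model(images).argmax(dim=1) == labels).float()
    for b in (True, False):
        for m in (True, False):
            mask = (blond == b) & (male == m)
            if mask.any():
                acc = correct[mask].mean().item()
                print(f"blond={b} male={m}  n={int(mask.sum())}  acc={acc:.3f}")

# Running this once on the dense model and once on the pruned model would show
# whether the accuracy drop is concentrated in the rarest cell (blond males).
```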
So if I were just to try and play that back to you, what I think you're saying is that the act of pruning, which is reducing the number of weights that are on at a given time, makes it so that the data points that are less represented are the ones that get worse accuracy after pruning. So for example, in your paper, you had [00:07:00] this face dataset where blond males appeared with much less frequency than other characteristics. The model was doing absolutely wonderfully on that part of the dataset before pruning, but after pruning it did really poorly, while the other parts were unaffected. Is that right?
Precisely. It cannibalizes from the least represented features to preserve performance on the rest. Because remember, part of what makes this question interesting is that pruning, as measured by top-level test set accuracy, appears to be a free lunch. You can prune 90% of your model and still have comparable top-line metrics, top-1, top-5. So really what we're showing here is that even though your top-1 might not reflect the damage that has been done, that's because the model sacrifices performance on the underrepresented features to preserve performance on your well-represented features.
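A minimal way to illustrate this point (again a sketch, not the paper's own methodology) is to count how many individual test examples flip their predicted label between the dense and pruned models, and check which subgroups those flips concentrate in, even when top-1 accuracy barely moves. `dense_model`, `pruned_model`, `images`, and `group_ids` are assumed inputs.

```python
import torch

def prediction_flips(dense_model, pruned_model, images, group_ids):
    """Report how often predictions change after pruning, per subgroup."""
    dense_model.eval()
    pruned_model.eval()
    with torch.no_grad():
        dense_pred = dense_model(images).argmax(dim=1)
        pruned_pred = pruned_model(images).argmax(dim=1)
    flipped = dense_pred != pruned_pred
    print(f"total flips: {int(flipped.sum())} / {len(images)}")
    for g in torch.unique(group_ids):
        mask = group_ids == g
        rate = flipped[mask].float().mean().item()
        print(f"group {int(g)}: flip rate {rate:.3f}")
```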
That's extremely well said. Thanks for that summary. We have two more questions for you. First one is, does the [00:08:00] world understand something better now? Does the subfield? And can you encapsulate what it is that we've learned?
The world understands that there is no free lunch. A lot of what this paper points out is that if you optimize for compactness, you sacrifice another aspect, which is fairness: you amplify the algorithmic bias. That being said, it also paves the way for an exciting next chapter. We have a rigorous framework to establish the relationship, and now we can optimize for compression methods that hopefully redistribute some of this disparate harm and don't disproportionately cannibalize these least represented features. So it paves the way. Science often comes in these very exciting iterations, where you identify something that is wrong with our approach, but that paves the way for work which is trying to do things differently. And that's the next chapter.
That's a great segue into our last question for you. What do you view as the important papers that this built upon? What were the parents of this paper that [00:09:00] influenced this work?
This is an interesting paper because in some ways it's trying to unify themes in a way that hasn't been done before. I would say it built upon work that I did last year, which was really looking at this through a different lens, which is memorization. So my work last year was looking at the impact of compression, but it was trying to simply understand what is impacted, and in particular whether the examples which are most impacted by compression require memorization, because that says something interesting about capacity. And we found that yes, they do. Which suggests, and this is an interesting takeaway since we can prune so much and still preserve top-line metrics, that most of the capacity in a model is being used to encode a useful representation of underrepresented features, a long tail. And that's a very interesting way of framing the problem, because it is saying that we have all these millions and millions of parameters, but most of them are being used as a blunt instrument to represent [00:10:00] very low frequency parts of our dataset. So maybe we should rethink how we treat our dataset. Maybe we can be smarter and not throw capacity at the problem, which is a whole exciting area in itself.
Sarah. Thanks so much for your time.