Emily, thanks for joining us today. Can you give us the title of the paper and the nickname that you use?
The paper title is "Bringing the People Back In: Contesting Benchmark Machine Learning Datasets." We don't really have a nickname for this paper, but it fits under a broader research agenda that my collaborators and I are working on, which we've termed "genealogies of data," so that's the way we refer to this project.
What do you consider the research area for genealogies of data?
It's a highly interdisciplinary research agenda. The researchers who make up this agenda, the authors of this paper, come from a variety of disciplinary backgrounds. I myself am trained as a machine learning practitioner, and we have social scientists, philosophers, a variety of different folks coming to the table. I'd say parts of this work can be characterized [00:01:00] as sociology of science work. There's a little bit of philosophy of science, and there's some critical theory and philosophy. We're really bringing a lot of different disciplines to bear in this work, as well as different methodologies.
If this were at a conference, where would it belong?
I would say science and technology studies. We presented the work at an STS conference, and I think it fits pretty squarely there. If this were at a machine learning conference, I would put it under an ML fairness heading, or possibly a datasets track.
Before we dive into the paper's details, it'd be helpful to have a motivating example for the research to help guide the conversation. This could be from the paper or from the subfield.
There's one key motivating aspect underlying this. We started from the observation that data fundamentally underpins [00:02:00] everything that we do in machine learning. Datasets are what we train our models with, but we also use data to evaluate our models, evaluate broader progress within the field, and orient research agendas. It's really, really fundamental, and ImageNet is a classic example of one of these datasets that folks use every day. For folks who are not super familiar with machine learning datasets, ImageNet is a massive dataset: millions of images labeled with thousands of categories. It's a dataset that people use to train what are called object recognition models, which take in an image and identify the different objects that are present. This dataset is really foundational to machine learning. It's used pretty widely across the field as a benchmark for a variety of different methods, but there have been a lot of recent critiques looking at the dataset, looking at the categories that [00:03:00] structure it, looking at the ways people are represented in it, and looking at the highly questionable collection practices, which rely pretty heavily on scraping personal photographs off of Flickr and things like that. So we were looking at the ImageNet dataset and questioning: what is the history of this dataset? How did it come to be? What are the different values, norms, and politics that shaped it? We looked at ImageNet, but we were taking it as an exemplar of broader practices within the field. We aren't solely interested in ImageNet. We're interested in the norms of dataset development and the norms of dataset use, and we used that dataset to structure the inquiry. That's one of the things that kicked off a lot of the questions that we pose and dig into in the paper.
If I understand the example, ImageNet is a dataset that's used ubiquitously across the field, and it's one for which people have noticed [00:04:00] different problems. What's an example of a problem?
There are a couple of different concerns with the ImageNet dataset. There's a great blog post that came out last year by researcher Kate Crawford and artist Trevor Paglen called "Excavating AI." They examined the categorical schema underlying ImageNet, and this is just one of the concerns: the categorical schema contained a range of pretty offensive categories, including derogatory references to people, slurs, things like this. They really dove into the politics of that dataset, so that was one of the jumping-off points for us. Another thing we noticed is that there has been some recent work to rectify some of the concerns with the ImageNet dataset by removing some of the categories, and also by balancing the demographic distributions within each category [00:05:00] of images, in particular within the categories that describe people. So they're not trying to balance the demographic distribution of images of chairs, but for images of nurses, for example, they would try to balance across gender and things like that.
In that example, what you're saying is that there are problems across all the categories, but what people mostly care about is things like how images of nurses are heavily biased towards women. So when you train on this dataset, you end up with problems related to the gender imbalance in certain categories.
Exactly, exactly. And that type of solution, "let's try to ensure that our datasets represent a diversity of people and include many different representations of people," is actually a really common solution in the broader ML fairness world. In the past couple of years, data has become a really serious focus of ML fairness researchers, and a lot of these conversations orient around [00:06:00] these representational concerns within the dataset. There's been work identifying heavily skewed distributions of skin tones within facial analysis datasets, where lighter-skinned faces are heavily overrepresented. Other folks have looked at the geographical regions where images in image classification data come from, finding a pretty serious skew towards Western countries. So there's been a lot of examination of those kinds of properties of datasets, and that often orients the solutions around collecting more data, increasing diversity, increasing representation. What we want to do with this work is not to say that that's not an important contribution or an important way forward in some cases, but that it's not the end of the story. The concerns with datasets go far beyond these statistical properties of who is represented, and that's what we're really trying to get at. The examination of ImageNet, [00:07:00] both from the categorical side and from the distributional side, is what really sparked a lot of our research.
That's a great segue to our next question, which is: what are the top contributions this paper makes to the field?
What we do is outline our research agenda, with four main research questions we plan to pursue over the next couple of years. I would say there are two main contributions of this paper. One of them is these research questions, which I'll dig into in a second. The other is a little more conceptual in nature: laying out the concerns that we see with datasets, and laying out the ways in which datasets fundamentally underpin everything that we're doing in machine learning, often in somewhat invisible or underappreciated ways. So one of the key contributions there is working with this lens of infrastructure. We rely quite heavily on critical infrastructure [00:08:00] studies as a discipline, and draw perspectives from it to examine datasets, the ways in which we use datasets, and the ways in which we rely on datasets in daily machine learning practice. I can speak a little more to what we say there if you want, but that's one key contribution. The other one is outlining this research agenda. We put forward four research questions. Some of them we're currently in the process of working on; others we hope to get to in a couple of years. This is a pretty ambitious and long-term research agenda, but that's the other key contribution.
What was one of the questions?
The first question is really trying to understand how dataset developers describe and motivate the decisions that go into the creation of datasets. The idea here was to take datasets and the publications associated with them, [00:09:00] the websites and all the other artifacts associated with them, and read them as texts, and to try to understand, based on what is said and what is unsaid within those texts, the values, motivations, assumptions, and norms of dataset development. One of the authors on this paper, Morgan Klaus Scheuerman, who was an intern at Google this summer, has been leading this empirical work. He has created a massive corpus of, I think, nearly 200 computer vision datasets, and we've been going through and doing a content analysis of all of the publications associated with these datasets to tease out the motivations and the spoken and unspoken conventions of dataset construction, curation, and annotation.
Is there anything you found from that which you really feel should be shared, and that more people should know about?
We're in the process of writing this up, so stay tuned for the full paper, but I can share a little bit about what we're [00:10:00] finding so far. Some interesting patterns, which are not too surprising but are a little disheartening: basically zero papers talk about IRB approval. The only papers that discuss IRB approval processes are review papers.
IRB is institutional review board?
Exactly. Very few, I think possibly only one paper, discuss ethical considerations of the work. Another really interesting thing is that the vast majority of datasets, or sorry, the vast majority of publications associated with datasets, don't really foreground the dataset as a core contribution. This is something that's really interesting, and we talk about it in the paper: even though datasets are really fundamental to machine learning, we don't really value the construction of datasets in the way in which we value algorithmic and modeling contributions. This is really reflected in our analysis [00:11:00] of these computer vision dataset publications. We found only half of the papers, I think it was a bit less than half, describe the dataset as the primary contribution. And I think it was about 15% of them, or a bit less, that only talk about the dataset in the paper. So for most of them, even if they describe the dataset as a contribution, the details of the development of the dataset, the motivation, and how it came together are really compressed into a paragraph or two in the experimental section. That's telling us a lot about how we value this work and how much we value really careful, thoughtful, intentional dataset creation. Just one other point that relates to that: we're noticing a pretty widespread underspecification of the key considerations that go into dataset development, possibly because people are being incentivized to squish it into a [00:12:00] tiny experimental section. We're seeing a lot of underspecification, which is reminiscent of other work that's come out in the past couple of years examining different types of dataset publications.
It would be really great if you could touch on how that investigation intersects with ImageNet, because it is such an important dataset, and one that you highlighted earlier. What did you find?
We've actually been doing a different type of analysis with ImageNet. ImageNet really gets at research question number two of this paper, which is: what are the histories and contingent conditions of creation of these key benchmark datasets? ImageNet is an example of one where we've been going back and really studying its origins. But to get at your question of how our examinations of the dataset publications relate to the ImageNet work, one thing that we have found with ImageNet is that they spend a lot more of the publication describing the dataset. So [00:13:00] that's actually an example of a dataset that I think does a fairly commendable job of detailing how the dataset was constructed, how the images were collected, and things like that. But we also noticed some similar patterns across ImageNet and the other datasets that we're looking at, such as very little consideration given to who is annotating the images. This is very prevalent across the field. Folks treat annotation tasks, by and large, as non-interpretive, objective tasks, and so the question of who is doing the annotation, and whose perspectives are being brought to bear, is very frequently not acknowledged. That's one thing that we're seeing with the ImageNet dataset that we saw with a wide, wide range of computer vision datasets: there's just very little consideration given to the interpretive nature of the task. Something else that we've seen with ImageNet, which again we're seeing [00:14:00] with a lot of other datasets, comes up when you look at the types of things that are described as the valuable properties of the dataset. With ImageNet we see things like the scale of the dataset (it's a massive, massive dataset) and its being hugely comprehensive. Diversity is talked about a lot, though not cultural diversity or demographic diversity; it's more diversity of orientations of objects, diversity of lighting conditions, and things like that. Those are similar to the properties that we're seeing across a wide range of computer vision datasets.
It's more about what's happening in this image, rather than this image in the context of the greater world.
Exactly, and ImageNet in that sense is really reflecting these broader patterns that we're seeing. I'm looking right now at my notes for our broader content analysis of datasets, and the top things that people describe as really valuing in datasets are things [00:15:00] like large-scale data, big data, realistic data, and comprehensive or complete data, this attempt to cover everything. Quality and accuracy of the data is another thing that folks talk about a little. It's interesting: this intersects with ImageNet, I'll say, in two ways. The first is that a lot of these descriptors appear in the ImageNet publication as well, so in this way we're seeing ImageNet being a nice exemplar for these broader publications. The other is that ImageNet, I think, is pretty responsible for solidifying some of these properties as desirable. Large-scale data, big datasets, really comprehensive benchmark datasets: that's something that has grown, since ImageNet was established, into a key, fundamental, valuable property of datasets. This again points to the really influential role that ImageNet played within the field. It was one [00:16:00] of the key datasets to not only solidify deep learning and neural networks as a modeling paradigm, but to really solidify large-scale benchmark datasets as a valuable way of driving research and measuring progress within the field. So there is that interesting, potentially causal element there.
That's a lot of great color. We do have a couple of other questions, so I just want to move on from that. Why did you guys work on this? What is your motivation in a broader sense?
I already touched a little bit on the motivation earlier, in the sense that there's a ton of ML fairness work that is examining data. I think this is really important. People are realizing that data is pretty fundamental to understanding the harms that might be perpetuated by machine learning systems, and their potentially discriminatory outputs. But as a field, I think we haven't quite gone far enough. There's a lot of examination, as I mentioned, looking at who is [00:17:00] represented in a dataset and looking at different sorts of biases within datasets. But that is essentially a limited perspective on the ways in which data is actually responsible for a lot of the failures of machine learning systems. Just to give a couple of concrete examples: there's been a lot less examination into the categories that structure these datasets. The work of "Excavating AI" that I mentioned earlier was one of the first fairly widely read examples of somebody digging into the categorical schema of one of these datasets. There's also been some recent work by another coauthor on this paper, Morgan Klaus Scheuerman, looking at the operationalization of gender and racial categories within machine learning datasets, within computer vision datasets in particular, and really emphasizing that the concerns are not isolated to who is represented, but that these categorical [00:18:00] questions are really key as well. So that's one main motivating thing that was driving this work: okay, data is critical, but we're not digging deep enough. We really need to be looking at data through a different sort of lens. Another thing that motivated this was trying to look not only at the datasets themselves as static artifacts, not just examining what is represented there, but really looking at the work practices. Again, we're really interested in how the field as a whole is thinking about data, working with data, and developing data. In this respect we use a methodology, let's say, of "studying up," which refers to pointing our research tools upwards, towards the individuals who are constructing the datasets and towards the machine learning practitioners who are using the datasets in their work. [00:19:00] They're the folks with the economic and social power to create these tools and ultimately shape the machine learning systems that are built from them. That was another key motivator here, and that's why we're really interested in digging into the histories of these datasets and understanding the unsaid norms of development and use.
That's a good way to cap it off: there's a history of making these datasets, and there are things that people bring to the table when they do that. Being able to understand that and see where the deficiencies are could lead to better approaches going forward.
Exactly. This really draws upon some of the conceptual setup that we do in the paper, where we argue that datasets right now tend to be viewed as value-neutral, static scientific artifacts. [00:20:00] This is partly related to how machine learning practitioners are taught and what the incentive structures are in the field. We're by and large incentivized to just import a dataset and do something with it. There's not a lot of critical thinking about what the history of this dataset was, or what values are embedded in it. That's just not really part of machine learning practice. But of course these datasets were built by people situated within very specific social and historical contexts, and they are situated artifacts. They come out of those particular histories, and when we treat them as purely value-neutral artifacts, we misrepresent what they're doing. We misunderstand potential harms that might be embedded in them, and we miss this opportunity to critically examine how this really foundational part of machine learning practice is shaping our research and shaping the field.
Do you have a [00:21:00] tagline for what the world understands now?
What I would love somebody to take away from this paper is that datasets are situated. When I say that, I mean datasets are constructed by people in particular situations. They have a variety of different perspectives embedded in them, and not necessarily just the perspectives of the progenitors and creators of the dataset, but also perspectives that might filter in through the socio-technical processes of dataset collection: scraping image search engines at a particular time and place, things like that. Ultimately, datasets embed a very particular perspective of the world, and by treating them as value-neutral artifacts, that's lost. So I'd say key point number one is that, as a field, we really need to recognize that there's no way to remove datasets' situatedness. While it may make sense to think about specific biases when that language [00:22:00] is qualified and very precise, there's no such thing as an unbiased dataset. We need to get away from the idea of "oh, we can just build these unbiased datasets and then machine learning is going to be fair." Instead, we need to really recognize datasets as situated, and part of that is understanding their histories as a way to look forward and shift the ways in which we treat and interact with data.
Emily, thanks so much for your time.