Avoiding Bias When Inferring Race Using Name-based Approaches

Manual validation

The data are presented as distributions of names across racial categories (Table 1). Name-based inference methods commonly apply a threshold to create a categorical distinction: with a 90% threshold, for example, all instances of Juan as a given name would be categorized as Hispanic and all instances of Washington as a family name would be categorized as Black. Any name that does not reach the threshold is excluded (e.g., authors with the family name “Lee” would be removed from the analysis). This approach, however, assumes that names are about equally distinctive across racial categories.
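The thresholding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `classify_by_threshold` is hypothetical, and the shares in `NAME_DISTRIBUTIONS` are invented placeholders, not the values in Table 1.

```python
from typing import Dict, Optional

# Hypothetical name-to-category shares (placeholder values, NOT from Table 1).
NAME_DISTRIBUTIONS: Dict[str, Dict[str, float]] = {
    "Juan": {"Asian": 0.01, "Black": 0.01, "Hispanic": 0.93, "White": 0.05},
    "Lee": {"Asian": 0.43, "Black": 0.17, "Hispanic": 0.02, "White": 0.38},
}


def classify_by_threshold(name: str, threshold: float = 0.90) -> Optional[str]:
    """Return the most likely category if its share meets the threshold.

    Returns None when the name is unknown or no category reaches the
    threshold, which corresponds to excluding the name from the analysis.
    """
    dist = NAME_DISTRIBUTIONS.get(name)
    if dist is None:
        return None
    category, share = max(dist.items(), key=lambda item: item[1])
    return category if share >= threshold else None
```

With these placeholder shares, "Juan" is assigned to a single category while "Lee" returns `None` and drops out, which is the exclusion effect the paragraph describes.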

Table 1. Sample of family names (U.S. Census) and given names (mortgage data).

https://doi.org/10.1371/journal.pone.0264270.t001

To test this assumption, we began our analysis by manually validating name-based inference at three threshold ranges: 70–79%, 80–89%, and 90–100%. We sampled 300 authors from the Web of Science (WoS) database, drawing 25 at random for every combination of racial category and threshold range. Two coders manually queried a search engine for the name and affiliation of each author and attempted to infer a perceived racial category through visual inspection of professional photos and of information listed on websites and CVs (e.g., affiliation with racialized organizations such as Omega Psi Phi Fraternity, Inc., SACNAS, etc.).
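The sampling design (25 authors for each combination of inferred category and threshold range) amounts to a stratified random sample. The sketch below shows one way to draw such a sample, assuming each author record is a tuple of an identifier, the inferred category, and the share of the top category. The helper names `band_of` and `stratified_sample` and the record layout are assumptions made for illustration, not part of the original study.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Author = Tuple[str, str, float]  # (author_id, inferred_category, top_share)


def band_of(share: float) -> Optional[str]:
    """Map a top-category share to one of the three threshold ranges."""
    if share >= 0.90:
        return "90-100%"
    if share >= 0.80:
        return "80-89%"
    if share >= 0.70:
        return "70-79%"
    return None  # below 70%: not part of the validation sample


def stratified_sample(
    authors: Iterable[Author], per_cell: int = 25, seed: int = 0
) -> Dict[Tuple[str, str], List[str]]:
    """Draw up to `per_cell` author ids at random from each (category, band) cell."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    cells: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for author_id, category, share in authors:
        band = band_of(share)
        if band is not None:
            cells[(category, band)].append(author_id)
    return {
        cell: rng.sample(ids, min(per_cell, len(ids)))
        for cell, ids in cells.items()
    }
```

With four categories and three ranges, a population that fills every cell yields 12 cells of 25 authors, or 300 in total.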

Fig 1 shows the number of valid and invalid inferences, as well as the number of authors for whom a category could not be manually identified and those for whom no information was found. Name-based inference of Asian authors was highly valid at every threshold considered. The inference of Black authors, in contrast, produced many invalid or uncertain classifications at thresholds in the 70–80% range, but had higher validity in the 90–100% range. Similarly, the inference of Hispanic authors was accurate only above the 80% threshold. Inference of White authors was highly valid at all thresholds but improved above 90%. These results suggest that a simple threshold-based approach does not perform equally well across all racial categories. We therefore consider an alternative weighting-based scheme that does not assign an exclusive category but uses the full information in the distribution.
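One way to read the weighting-based idea is as fractional counting: instead of assigning each author to a single category, every author contributes to each category in proportion to the name's distribution. The sketch below illustrates that reading only. The function name `fractional_counts` is hypothetical, the shares are invented placeholders rather than values from Table 1, and the scheme actually used in the paper may differ in its details.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical name-to-category shares (placeholder values, NOT from Table 1).
DISTRIBUTIONS: Dict[str, Dict[str, float]] = {
    "Juan": {"Asian": 0.01, "Black": 0.01, "Hispanic": 0.93, "White": 0.05},
    "Lee": {"Asian": 0.43, "Black": 0.17, "Hispanic": 0.02, "White": 0.38},
}


def fractional_counts(
    names: Iterable[str], distributions: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Sum each name's category shares instead of assigning one category.

    Names missing from `distributions` contribute nothing; no name is
    dropped for failing to reach a threshold.
    """
    totals: Counter = Counter()
    for name in names:
        for category, share in distributions.get(name, {}).items():
            totals[category] += share
    return dict(totals)
```

Under this scheme an author named "Lee" is not excluded: the author's weight is split across categories, and the category totals still sum to the number of authors with a known name.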

Fig 1. Manual validation of racial categories.

https://doi.org/10.1371/journal.pone.0264270.g001
