Garbage in? How we can improve the quality of historical data

The spatial coverage of points representing all cities included in the final dataset of Reba et al. 2016. Source: Yale


A week ago the urban archaeologist Mike Smith wrote a scathing post about a new article in the journal Scientific Data. In the article, Meredith Reba and coworkers report on how they “spatialized” the dataset on urban settlements, based on previous publications by Tertius Chandler and George Modelski. As Smith writes in his blog, “The data in both Chandler and Modelski are a mess, routinely dismissed by urban demographic historians as worthless for serious scholarship.” The title of his blog post asks, “Why would a journal called ‘Scientific Data’ publish bad data?”

With all due respect (and it’s not an empty phrase: I know Mike and greatly respect his work and scholarship), his negative critique is unfair and counterproductive.

It’s unfair because Reba et al. have made an important addition to the Chandler and Modelski data by “spatializing” it. In other words, they added geographic coordinates to urban settlements in the Chandler/Modelski datasets. They also did it in a thoughtful and scholarly manner. Read their paper to see how much care they took with locating the cities on the map. As they write in the abstract, “The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point.”

Mike Smith doesn’t criticize them for doing a poor job of locating the urban settlements on the map. His problem is with the estimates that Chandler and Modelski make about the population sizes of these settlements. But developing better population estimates for cities is something that can be done independently of their geographic location. Reba et al. have added to our knowledge of historical cities by giving them spatial coordinates. They incremented our knowledge. It’s up to urban archaeologists and historians to improve the estimates of settlements’ population sizes.

Unfortunately, these specialists are not eager to increment our knowledge. Here’s what Mike Smith writes in his blog:

I have to admit that I really despair of this situation. I am very upset that such obviously poor data are being used by otherwise rigorous scholars, and I am upset that I don’t have better data. I have talked to quite a few colleagues—archaeologists and ancient historians—about this situation. I have asked if any of them were involved in assembling reliable and accurate data on ancient city sizes in their region of specialty, and the answer has been negative. I have asked if they knew of anyone doing systematic urban demographic history in their region, and again the answer is no. In my own region, Mesoamerica, there was a flurry of demographic work on city size in the 1980s, but then scholars lost interest. I have asked if anyone might be interested in mounting such a systematic comparative project, again with a negative answer.

The upshot for anyone who wants to do analyses is: you can’t use the Chandler/Modelski data, and we have nothing better for you to use.

I don’t buy this counsel of despair. In any case, just how bad are the Chandler/Modelski data? Are their estimates off by 50%, by a factor of 2, or even 3? In many analyses we consider settlements ranging in size from the hundreds to the millions (that’s four orders of magnitude), so a mere factor of 2 is not that much of an error. Scholars argue, for example, about whether the population of Rome in the first century BCE was 0.5 or 1 million. After you have log-transformed these numbers, that disagreement is not going to make much difference to a global cross-cultural analysis that includes the whole spectrum of settlement sizes across the last 10,000 years.
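To make the arithmetic concrete, here is a minimal sketch of how large a multiplicative error looks once the numbers are log-transformed. The function name `log10_error` and the choice of base-10 logs are my own assumptions for illustration; the only figures taken from the argument above are the size range (hundreds to millions) and the error factors (50%, 2, and 3).

```python
import math

def log10_error(factor: float) -> float:
    """Size of a multiplicative error on the log10 scale (hypothetical helper)."""
    return abs(math.log10(factor))

# Range of settlement sizes mentioned in the text: hundreds to millions.
smallest, largest = 100, 1_000_000
span = math.log10(largest) - math.log10(smallest)  # four orders of magnitude

# Error factors mentioned in the text: off by 50%, by a factor of 2, or 3.
for factor in (1.5, 2, 3):
    err = log10_error(factor)
    print(f"factor {factor}: {err:.2f} log10 units, {err / span:.1%} of the span")
```

A factor of 2, such as the gap between 0.5 and 1 million for Rome, comes to about 0.3 log10 units, or roughly 7.5% of a four-order-of-magnitude span. Whether that is small enough depends on the analysis, but it gives a sense of scale.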

This is not to say that I endorse the Chandler/Modelski conceptual approach. In the Seshat project we use a much more sophisticated one. First, we don’t simply provide a “point estimate”, e.g. 1,000,000. If there is a significant degree of uncertainty, our research assistants are instructed to code it with a range. It can be, for example, [500,000–2,000,000], and that’s fine: this is a useful datum. Second, when experts disagree, we include both (or more) rival estimates. Finally, these estimates are just the proverbial tip of the iceberg. We also include explanations of where they come from. Eventually we are going to connect to more detailed archaeological databases that provide the solid scientific basis for these estimates. See my post on the Anatomy of a Seshat Fact.
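For readers who think in data structures, the three ideas above (ranges instead of point estimates, rival estimates kept side by side, and an explanation attached to each) could be sketched roughly as follows. This is not the actual Seshat schema: the class names, field names, and the `envelope` method are hypothetical, and the explanation strings are placeholders rather than real citations.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PopulationEstimate:
    """One estimate, coded as a range (hypothetical structure)."""
    low: int
    high: int
    explanation: str  # where the estimate comes from

    def __post_init__(self) -> None:
        if self.low > self.high:
            raise ValueError("low must not exceed high")

@dataclass
class SettlementRecord:
    """A settlement with one or more rival estimates (hypothetical structure)."""
    name: str
    period: str
    estimates: List[PopulationEstimate] = field(default_factory=list)

    def envelope(self) -> Tuple[int, int]:
        """Widest range spanned by all rival estimates."""
        if not self.estimates:
            raise ValueError("no estimates recorded")
        return (min(e.low for e in self.estimates),
                max(e.high for e in self.estimates))

# The two rival figures for Rome mentioned earlier, kept side by side.
rome = SettlementRecord(
    name="Rome",
    period="first century BCE",
    estimates=[
        PopulationEstimate(500_000, 500_000, "placeholder: lower rival estimate"),
        PopulationEstimate(1_000_000, 1_000_000, "placeholder: higher rival estimate"),
    ],
)
print(rome.envelope())
```

The point of the design is that disagreement and uncertainty are stored rather than resolved at data-entry time, so each analysis can decide how to treat them.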

So what the Seshat project offers is an evolutionary way forward that avoids the Scylla and Charybdis of either bad data or despair.

Between Scylla and Charybdis. Source: Flickr


This is how science works. It’s cumulative. We start with naïve ideas, bad approximations, and wrong theories. Then, by applying the scientific method, we get progressively better ideas, more accurate approximations, and logically sound and well-tested theories. So let’s abandon negativism, roll up our sleeves, and get to work!

This post was originally published on Cliodynamica 

