How We Work

Generating a Global Historical Sample

During the early stages of the project, we created an initial stratified sample of past societies. We identified 10 world regions distributed as widely as possible across the Earth’s surface and, within each region, designated three ‘natural geographical areas’ (NGAs) with discrete ecological boundaries, each on average about 10,000 km² in size. This produced an initial sampling scheme of 30 areas around the world, later extended to 35. To maximize diversity within this initial sample, for each world region we chose one NGA in which social complexity emerged relatively early (for example, Egypt), one in which it arose relatively recently (for example, Ghanaian Coast), and one in which it emerged somewhere in the middle of the range (for example, Inland Niger Delta). Our aim was to maximize variability in our global sample while minimizing historical relationships between cultures. We are continually adding new information and expanding the scope of our databank, ‘snowballing’ from our initial sample. We have thus moved beyond the NGA sampling scheme towards a more comprehensive collection of past societies, covering additional areas, including more adjoining regions.
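
The early, middle, and late selection within a single world region can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_early_middle_late`, the area labels, and the onset years are hypothetical placeholders, and the actual choice of NGAs was made by the project team rather than by a mechanical rule.

```python
def pick_early_middle_late(candidates):
    """Pick three areas from one region by onset of social complexity.

    candidates: dict mapping an area label to a (hypothetical) onset year,
    with negative numbers standing for years BCE.
    Returns a tuple (earliest, middle, latest).
    """
    if len(candidates) < 3:
        raise ValueError("need at least three candidate areas")
    # Order area labels from earliest to latest onset year.
    ordered = sorted(candidates, key=candidates.get)
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]


# Placeholder data for one region; these are not real estimates.
region_candidates = {
    "Area A": -3000,
    "Area B": 1500,
    "Area C": -500,
    "Area D": 200,
    "Area E": 900,
}
selection = pick_early_middle_late(region_candidates)
```

Repeating this over 10 regions would give the 30 areas of the initial scheme.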

Data on political systems (polities) that emerged and persisted in each of the NGAs are organized into a continuous time series. For the purposes of the present study, these are queried at 100-year intervals, going back as far into the history of each area as the scholarly literature allows (up to a maximum of roughly 10,000 years before present). Where an NGA contains clusters of very small-scale polities that share a similar culture but are not under a single system of jurisdictional control, we refer to these clusters as ‘quasi-polities’ and code information on all of them generically, unless information is available that would allow us to differentiate between them.
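
Querying a polity time series at 100-year intervals might look like the sketch below. The `Polity` class, the function name, and the example polities with their dates are assumptions made for illustration and do not reflect the actual Seshat schema or data. Years use a simple signed integer scale (negative for BCE), which ignores the absence of a year 0 in historical calendars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Polity:
    name: str        # hypothetical label
    start_year: int  # negative = BCE
    end_year: int    # inclusive


def polities_at_intervals(polities, first_year, last_year, step=100):
    """Return {sampled year: [names of polities existing in that year]}."""
    samples = {}
    for year in range(first_year, last_year + 1, step):
        samples[year] = [
            p.name for p in polities if p.start_year <= year <= p.end_year
        ]
    return samples


# Two made-up polities that succeed one another in the same area.
example_polities = [
    Polity("Polity A", -300, -51),
    Polity("Polity B", -50, 180),
]
sampled = polities_at_intervals(example_polities, -300, 200)
```

Sampled years with an empty list are those for which no polity is recorded in the area.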

Data Coding

Our data-gathering strategy follows a rigorous and closely documented process taking place over several years and involving project experts, research assistants, and a Data Review Board. All variables for which data have been gathered and entered into Seshat are derived from a Seshat Codebook. The Codebook was designed by, and is continually updated and extended in consultation with, a large network of over 100 professional historians, archaeologists, anthropologists, and other specialists whom we refer to as ‘Seshat experts’. Especially during the early phases of data entry, variables in the Codebook were revised and improved through continual discussions among Seshat research assistants (RAs), Seshat experts, and the Data Review Board (see below). Most variables in Seshat either take the form of a number or numerical range, or specify a feature that can be coded as absent, present, or unknown (with the additional codes ‘inferred present’ and ‘inferred absent’ where the evidence is indirect). All data are linked to scholarly sources, including peer-reviewed publications and personal communications from established authorities.
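
The two kinds of coded value described above (a number or numerical range, or a presence code) could be represented roughly as follows. This is a sketch under assumptions, not the Seshat data model: the class names `Presence`, `NumericRange`, and `CodedValue`, the variable names, and the source strings are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Union


class Presence(Enum):
    ABSENT = "absent"
    PRESENT = "present"
    UNKNOWN = "unknown"
    INFERRED_ABSENT = "inferred absent"
    INFERRED_PRESENT = "inferred present"


@dataclass(frozen=True)
class NumericRange:
    low: float
    high: float

    def __post_init__(self):
        # A range must run from its lower to its upper bound.
        if self.low > self.high:
            raise ValueError("low must not exceed high")

    @classmethod
    def point(cls, value):
        """A single number is stored as a range with equal bounds."""
        return cls(value, value)


@dataclass
class CodedValue:
    variable: str
    value: Union[NumericRange, Presence]
    sources: List[str] = field(default_factory=list)  # linked citations


# Placeholder records; the values and sources are invented.
population = CodedValue(
    "population", NumericRange(10_000, 15_000), ["Hypothetical source 1"]
)
writing = CodedValue(
    "writing system", Presence.INFERRED_PRESENT, ["Hypothetical source 2"]
)
```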

Once the variables have stabilized for a particular project, full-scale data collection occurs. Coding begins with fully trained RAs populating the database for all polities with relatively easy-to-acquire information extracted from scholarly publications. First, RAs code machine-readable values. Rather than using an arbitrary scale to code features that vary in magnitude, we prefer to quantify variables (e.g. estimated population size) or fractionate them into multiple features that can be coded as either absent or present (allowing also for ‘inferred’ codes). These machine-readable values are supplemented with narrative paragraphs explaining the rationale for coding each variable a certain way, along with citations to relevant scholarship. While coding, RAs list variables where information is lacking or ambiguous. The next step is to consult Seshat experts on these matters; their feedback is then incorporated into the codes, often iteratively.

Seshat experts are thus involved in reviewing the data, addressing questions of interpretation, filling gaps or confirming that data are unavailable. The names of both Seshat experts and RAs are linked to the data. This information on expert provenance, logged according to the dates of interventions, is used to monitor the state of maturity of the data curation process for any given variable. Disagreements in the literature or among Seshat experts, as well as uncertainty, are recorded as far as possible so that data analysis can take into account alternative interpretations. Where magnitudes are estimated, coders specify the likely range of variation.
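
One way to let an analysis take alternative interpretations into account is to keep every recorded alternative for a variable and then enumerate the combinations, running the analysis once per combination. The sketch below assumes a simple dictionary layout and a hypothetical function name, `interpretation_variants`; it is not the project's actual analysis pipeline, and the example codes are invented.

```python
from itertools import product


def interpretation_variants(coded):
    """Yield one {variable: value} dict per combination of alternatives.

    coded: dict mapping a variable name to a list of alternative values
    recorded for it (a single-item list means no disagreement).
    """
    names = list(coded)
    for combo in product(*(coded[name] for name in names)):
        yield dict(zip(names, combo))


# Invented example: experts disagree on one variable; the other is
# an estimated magnitude stored as a (low, high) range.
coded_polity = {
    "writing system": ["present", "absent"],
    "population": [(1_000, 5_000)],
}
variants = list(interpretation_variants(coded_polity))
```

Here `variants` holds two data sets, one per reading of the disputed variable, each of which could be analysed separately.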

Data Review Board

All Seshat codes go through an iterative process of editing and review before being used in analyses submitted for publication. For each variable, the initial value coded by RAs is later checked, edited, and augmented by Seshat experts, then reviewed by a Data Review Board (DRB). The latter comprises the senior team responsible for data management on a given paper, generally consisting of a combination of humanities scholars (e.g. historians, archaeologists, classicists, and anthropologists) and scientists (e.g. data analysts and complexity scientists). Typically, the DRB will include all members of the Seshat Board of Directors.

The DRB engages Seshat experts to review coding decisions, provide literature recommendations, and assist with the interpretation of complicated or conflicting evidence. When Seshat experts point out disagreements in the literature or disagree among themselves on a particular code, this is recorded so that multiple analyses can be run taking into account contrasting interpretations. The DRB scrutinizes all initial coding decisions and may request further expert review, where appropriate, to address remaining points of uncertainty. The DRB is also responsible for ensuring that coding conventions are consistently applied across NGAs and by all Seshat RAs. The DRB is ultimately responsible for the data presented in published work.

Research Ethics

The Seshat project involves the study of myriad different communities and populations from the past. Some peoples living today trace their ancestry to one or more of these past groups. As researchers, we have an obligation to present fair-minded, responsible, and respectful information concerning the past. While maintaining a commitment to scientific enquiry, we are committed to avoiding biased interpretation or representation of past or contemporary cultures, to refraining from using harmful or disrespectful terminology, and to treating sensitive information or topics with appropriate nuance and respect for the dignity and lived experiences of descendant communities.
