Cleaning History Data: The Conflict Catalogue

Over the past few years at Seshat, we’ve been collecting historical data on conflict and war. We are interested in how different societies, from prehistory to more recent times, respond to periods of pressure and crisis: whether crisis periods result in major war, minor turbulence, or something in between. Early exploration reveals that most historical crises result in fairly severe outcomes, including massive loss of life, famine, and even complete state collapse (Figure 1). But there are some less devastating outcomes as well, raising important questions for our own tumultuous times. What determines the severity of a crisis? How do the foundational structures of a society determine its ability to remain stable in crisis periods? For a more detailed breakdown, Seshat director Peter Turchin has written a related blog post on modeling crisis periods[1].

We have mainly built our early historical datasets from secondary-source research, but there are also several well-known open-source datasets for early modern and modern warfare that we hope to use in our analysis. One of these is Peter Brecke’s Conflict Catalogue. This data has been cleaned for Clio Infra’s datasets on internal and international armed conflicts, but those datasets only indicate the number of wars present in a given year. The original Conflict Catalogue also contains rich data on fatalities, locations, types of conflict, and entities, but it is all currently unusable because it is embedded in the conflict names.

While a good amount of hand-cleaning and fact-checking is necessary here, Brecke’s original scheme has allowed us to begin the process of breaking down the Conflict Catalogue into analyzable pieces in R. A simple script using the tidyverse like the one below can save hundreds of research assistant hours of work!

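The cleaning script uses the following libraries:

library(tidyverse)  # dplyr, tidyr, stringr, ggplot2
library(readxl)     # read_excel()
library(curl)       # curl_download()
library(sf)         # st_read(), st_as_sf(), used for the maps further down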

Importing the XLSX from Brecke’s website:

url <- ""

tf <- tempfile(fileext = ".xlsx")
curl_download(url, tf)

Brecke’s documentation notes that Entities in the Conflict Name column separated by an “-” indicate an Interstate Conflict, while others indicate an Internal Conflict. The following code removes a few extraneous uses of “-” within a name and then sorts the conflicts into Internal and Interstate based on the presence of “-”.

conflicts <-
  read_excel(tf) %>%
  mutate(
    # Remove hyphens that belong to a name rather than separating two sides
    Conflict = str_replace(Conflict, "Ibn-Hafsun", "Ibn Hafsun"),
    Conflict = str_replace(Conflict, "Anglo-Saxons", "Anglo Saxons"),
    Conflict = str_replace(Conflict, "Garde-Freinet", "Garde Freinet"),
    Conflict = str_replace(Conflict, "Sancho-Alphonso", "Sancho Alphonso"),
    # Any remaining hyphen marks an Interstate Conflict
    ConflictType = ifelse(grepl("-", Conflict), "Interstate Conflict", "Internal Conflict")
  )
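As a quick check of the hyphen rule, the sketch below applies the same grepl() test to a few conflict names taken from the examples in this post. The toy data frame is a made-up stand-in for the real spreadsheet, not Brecke's data.

```r
library(dplyr)

# Toy stand-in for the Conflict column (hypothetical, not the real spreadsheet)
toy <- data.frame(
  Conflict = c("Russia-Byzantium", "Norway (nobles revolt)", "Scotland (Dunsinane)"),
  stringsAsFactors = FALSE
)

toy_typed <- toy %>%
  mutate(
    # fixed = TRUE treats "-" as a literal character rather than a regex
    ConflictType = ifelse(grepl("-", Conflict, fixed = TRUE),
                          "Interstate Conflict", "Internal Conflict")
  )

print(toy_typed)
```

Only the first name contains a hyphen, so it is the only one labelled as an Interstate Conflict.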

Then we split the data up into Interstate and Internal Conflicts because they are written in different ways.

An Interstate Conflict is written by Brecke as:

  • Russia-Byzantium
  • Germany-Lombards

And so on. We can easily separate Side A and Side B. In some cases, there is more than one Entity per side. This is tricky in his early dataset because such groups are marked by square brackets with no delimiters between the entities. Separating them out by script would create some erroneous rows:

  • Magyars-[Burgundy France]
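  • Wessex-[East Anglia Mercia]

interstate <-
  conflicts %>%
  filter(ConflictType == "Interstate Conflict") %>%
  # Work on a copy so the original Conflict name is kept
  mutate(Conflictsep = Conflict) %>%
  # Text before the hyphen becomes SideA, text after it becomes SideB
  separate(Conflictsep, into = c("SideA", "SideB"), sep = "-")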
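Because the bracketed groups cannot be split reliably by script, one option is to flag them for hand cleaning instead. The following is a minimal sketch under that assumption; the toy data frame and the column names NeedsHandCleaning and ConflictNoBrackets are hypothetical and not part of the original script.

```r
library(dplyr)
library(stringr)

# Toy examples taken from the names quoted in this post
toy_interstate <- data.frame(
  Conflict = c("Russia-Byzantium", "Magyars-[Burgundy France]", "Wessex-[East Anglia Mercia]"),
  stringsAsFactors = FALSE
)

flagged <- toy_interstate %>%
  mutate(
    # TRUE when a side is a bracketed group of several entities
    NeedsHandCleaning = str_detect(Conflict, "\\["),
    # Strip the brackets but keep the group together as a single string
    ConflictNoBrackets = str_remove_all(Conflict, "[\\[\\]]")
  )

print(flagged)
```

The flagged rows can then be reviewed by a research assistant, who decides where one entity ends and the next begins.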

Internal conflicts are a bit messier. They are written as follows:

  • Norway (nobles revolt)
  • Byzantium (Rebellion of Maniaces)
  • Scotland (Dunsinane)
  • Spain (Insurrection of Arabs in Malaga)

For this, Side A becomes the main state and then Side B (in brackets) becomes the internal combatant. Side B can be a region, a group, or an action like “rebellion” or “insurrection.” This data can be useful for looking at the number of internal conflicts in a country or region in a specific period but needs further cleaning to be useful for more complex analyses.

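internal <-
  conflicts %>%
  filter(ConflictType == "Internal Conflict") %>%
  # The first word of the name becomes SideA (the main state)
  separate(Conflict, into = "SideA", extra = "drop", remove = FALSE) %>%
  # Pull out the parenthesised text, then strip the parentheses themselves
  mutate(SideB = gsub("[\\(\\)]", "", regmatches(Conflict, gregexpr("\\(.*?\\)", Conflict))))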
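As a possible variation, the same split can be sketched with stringr alone. This is not the approach used in the script above: here Side A is everything before the first parenthesis rather than the first word, and names without parentheses get NA for Side B. The toy data frame is hypothetical.

```r
library(dplyr)
library(stringr)

# Toy examples; the last name has no parenthesised part
toy_internal <- data.frame(
  Conflict = c("Norway (nobles revolt)", "Byzantium (Rebellion of Maniaces)", "Scotland"),
  stringsAsFactors = FALSE
)

parsed <- toy_internal %>%
  mutate(
    # Everything before the first "(" is treated as the main state
    SideA = str_trim(str_remove(Conflict, "\\(.*$")),
    # Lookarounds capture the text between the first pair of parentheses;
    # names without parentheses yield NA
    SideB = str_extract(Conflict, "(?<=\\().*?(?=\\))")
  )

print(parsed)
```

Keeping everything before the parenthesis would preserve multi-word state names, at the cost of diverging from the first-word rule used above.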

Then we join the two datasets and we have a cleaner version of the Conflict Catalogue. Further data such as locations, specific types of conflict, and more can be parsed from the conflict names. There are also specific instances that need hand cleaning, like a few cases of two non-state actors in a conflict that our script doesn’t register.

# NOTE: the object name conflicts.long is an assumption; the original name is missing
conflicts.long <-
  internal %>%
  # Remove any leftover parenthesised text from SideB
  mutate(SideB = gsub(" *\\(.*?\\) *", "", SideB)) %>%
  rbind(interstate) %>%
  # One row per conflict and side
  pivot_longer(c("SideA", "SideB"), names_to = "Side", values_to = "Entity")
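To make the effect of the final reshaping step concrete, here is a small sketch of pivot_longer() on a made-up two-row table. The toy_wide data frame is hypothetical; only the column names follow the script.

```r
library(tidyr)

# Hypothetical wide table: one row per conflict, one column per side
toy_wide <- data.frame(
  Conflict = c("Russia-Byzantium", "Norway (nobles revolt)"),
  SideA = c("Russia", "Norway"),
  SideB = c("Byzantium", "nobles revolt"),
  stringsAsFactors = FALSE
)

# Long table: one row per conflict and side
toy_long <- pivot_longer(toy_wide, c("SideA", "SideB"),
                         names_to = "Side", values_to = "Entity")

print(toy_long)
```

Each conflict now appears twice, once per side, which is what lets us count conflicts per Entity later on.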

One interesting exercise we can do with this data is to look at the breakdown of Internal vs. Interstate Conflicts within some of the major states of this period. In the following visualization, I selected the nations with the highest number of conflicts. It is too much of a leap, however, to claim that these were the most turbulent regions of medieval and early modern Europe: they were merely some of the longest-standing and largest states. The visualization raises important historical questions that we can look into further, such as: was being a civilian in Italy more dangerous than in England, or does this change when we account for the early kingdoms of the Italian peninsula and pre-England groups like the Normans?


#--Cleaned Conflict Catalogue Data 
conflicts <- read.csv("Downloads/earlyconflicts.csv")

`%nin%` = Negate(`%in%`)

conflicts.count <-
  conflicts %>%
  # Total number of conflicts per Entity
  group_by(Entity) %>%
  mutate(Total_Conflicts = n()) %>%
  ungroup() %>%
  # Number of conflicts per Entity and conflict type
  group_by(Entity, ConflictType) %>%
  mutate(Total_ConflictType = n()) %>%
  ungroup()

conflicts.count.plot <-
  conflicts.count %>%
  # One row per Entity and conflict type
  distinct(Entity, ConflictType, Total_Conflicts, Total_ConflictType) %>%
  arrange(desc(Total_Conflicts)) %>%
  slice(1:14) %>%
  # Drop Side B labels that are actions rather than states
  filter(Entity %nin% c("Civil War", "Insurrection"))

ggplot(conflicts.count.plot, aes(x = Entity, y = Total_ConflictType, group = Entity, 
  fill = ConflictType)) + 
  geom_col() + 
  theme_minimal() + 
  scale_fill_brewer(palette = "Greens") + 
  labs(x = "Country", y = "Number of Conflicts", fill = "Conflict Type", 
       title = "Conflict Breakdowns (900-1400 CE - Europe)")

Especially interesting is visualizing the total number of conflicts over time. The following visualization is based on the year each conflict began, rounded to the nearest decade. In this preliminary visualization some interesting patterns appear (however, a t-test showed no strong relationship between the number of Internal and Interstate conflicts in a given decade).

conflicts.count2 <-
  conflicts %>%
  # Each conflict appears once per side, so keep one side only
  filter(Side == "SideA") %>%
  distinct(Conflict, ConflictType, StartYear) %>%
  # Round the start year to the nearest decade
  mutate(StartYear = round(StartYear, -1)) %>%
  # Number of conflicts per decade and conflict type
  count(StartYear, ConflictType, name = "Total_ConflictType") %>%
  filter(ConflictType != "Intervention")

ggplot(conflicts.count2, aes(x = StartYear, y = Total_ConflictType, 
  color = ConflictType)) + 
  geom_line() + 
  theme_minimal() + 
  scale_color_brewer(palette = "Set2") + 
  labs(x = "Conflict Year (Starting Year, Rounded to Decades)", y = "Number of Conflicts", 
  color = "Conflict Type", title = "Breakdown of Conflict Types (900-1400 CE - Europe)")
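The decade binning can be illustrated on a few made-up start years. In this sketch, toy_years and the Decade column are hypothetical; round() with a negative digits argument rounds to the nearest ten.

```r
library(dplyr)

# Hypothetical start years, not taken from the Conflict Catalogue
toy_years <- data.frame(
  ConflictType = c("Interstate Conflict", "Interstate Conflict", "Internal Conflict"),
  StartYear = c(1066, 1069, 1204),
  stringsAsFactors = FALSE
)

by_decade <- toy_years %>%
  # digits = -1 rounds to the nearest ten: 1066 and 1069 both become 1070
  mutate(Decade = round(StartYear, -1)) %>%
  # One row per decade and conflict type, with the number of conflicts
  count(Decade, ConflictType, name = "Total_ConflictType")

print(by_decade)
```

Note that rounding moves a conflict that began in 1066 into the 1070 bin; flooring to the decade would be an alternative if bins should start at the round year.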

We can also look at the data geo-spatially over time using our library of historical GIS shapefiles to hypothesize which parts of Europe were more conflict-prone during some of the periods covered by the dataset.


world <- st_as_sf(rnaturalearth::countries110)

#-- Dataset that matches the Conflict Catalogue Entities to our historical shapefile library 
key <- read.csv("Desktop/geokey.csv")
layers <- key$layer

#-- Seshat Historical Shapefiles matched with Conflict Catalogue data by location
shapefile <-
st_read("Desktop/Macrostates 3.3.2021/Macrostates 3.3.2021.shp") %>%
filter(layer %in% layers) %>%
merge(key, "layer") %>%
left_join(conflicts.count, by = c("" = "Entity")) %>%
mutate(StartYear = round(StartYear, -2)) %>%
filter(StartYear == Macrostate.year) %>%
filter(StartYear %in% c(900, 1000, 1100, 1200))
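When a filter should keep several target centuries at once, `%in%` and `==` behave differently in R. The base-R sketch below uses made-up years to show the difference; the vectors years and wanted are hypothetical.

```r
# Hypothetical century values, some of them repeated
years <- c(900, 1000, 1100, 1200, 1300, 900, 1000)
wanted <- c(900, 1000, 1100, 1200)

# Membership test: TRUE wherever the year is one of the wanted centuries
keep_in <- years %in% wanted

# Element-wise comparison: `wanted` is recycled along `years`, so a year is
# only kept if it lines up with the same value in the recycled vector
keep_eq <- suppressWarnings(years == wanted)

print(sum(keep_in))
print(sum(keep_eq))
```

With `==`, the repeated 900 and 1000 at the end are silently dropped because they are compared against 1000 and 1100, so `%in%` is the safer choice for this kind of filter.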

ggplot() + 
geom_sf(data = world, fill = "white", color = "grey") +
geom_sf(data = shapefile, aes(fill = Total_Conflicts)) +
coord_sf(xlim = c(-20, 45), ylim = c(30, 73), expand = FALSE) + 
labs(title = "Macrostate Conflicts (Rounded to Century)", fill = "Total Conflicts") +
facet_wrap(~ StartYear) + 
theme(axis.ticks = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank())

In these maps, a zone of high conflict emerges in the modern-day United Kingdom and France, but this result may well reflect the strength of historical records of violence in these regions. We plan to combine these preliminary results with our own findings in order to balance the record as well as we can. The Conflict Catalogue is just one of hundreds of rich humanities data resources for which a little bit of R knowledge opens up many possibilities in cleaning and visualization.

[1] I am also part of a large research effort, the Consequences of Crisis project, on historical crises and consequences being run through the Seshat Databank, funded by a grant from the V. Kann Rasmussen Foundation.
