This past week in my Humanities Data Analysis class, we looked at mapping as data. We explored ggplot2’s map functions and did some work with ggmap’s geocoding, among other things. One thing that we just barely explored was automatically extracting place names through named entity recognition (NER). It is possible to do named entity recognition in R, though people say it’s probably not the best tool for the job. But in order to stay in R, I used a handy tutorial by the esteemed Lincoln Mullen, found here.
I was interested in extracting place names from the data I’ve been cleaning up for use in a Bookworm, the text of the 6-volume document collection, Naval Documents Related to the United States Wars with the Barbary Powers, published in the 1920s by the U.S. government. It’s a great primary source collection, and a good jumping-off point for any research into the Barbary Wars. The entire collection has been digitized by the American Naval Records Society, with OCR, but the OCRed text is not clean. The poor quality of the OCR has been problematic for almost all data analysis, and this extraction was no exception.
The tutorial on NER is quite easy to follow, so that wasn’t a problem at all. The problem I ran into very quickly was the memory limits on my machine: this process takes a TON of memory, apparently. I originally tried to use my semi-cleaned-up file that contained the text of all 6 volumes, but that was way too big. Even one volume proved much too big. I decided to break up the text into years, instead of just chunking the volumes by size, in order to facilitate a more useful comparison set. For the first 15 years (1785-1800), the file was small enough, and I even combined the earlier years into one file. But starting in 1802, the file was still too large even with only one year. So I chunked each year into 500 KB files, and then ran the program exactly the way the tutorial suggested for multiple files. I then just pushed the results of each chunk back into one results file per year.
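For anyone who wants to reproduce the chunking step, here is a minimal sketch of the idea. It is written in Python rather than R, and it is not the script I actually used. The function name `chunk_text` and the reading of “500 KB” as 500,000 bytes are assumptions for illustration. It breaks only at line boundaries, so that no line (and no place name on it) gets cut in half.

```python
def chunk_text(text, max_bytes=500_000):
    """Split text into chunks of at most max_bytes (UTF-8), breaking at line ends.

    Hypothetical helper: the 500,000-byte default is an assumed reading of "500 KB".
    A single line longer than max_bytes is kept whole as its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.encode("utf-8"))
        # Start a new chunk if adding this line would exceed the limit.
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def merge_results(per_chunk_results):
    """Concatenate the lists of entities found in each chunk into one list per year."""
    merged = []
    for result in per_chunk_results:
        merged.extend(result)
    return merged
```

Each chunk can then be written to its own file and fed to the NER step, and the per-chunk results joined back together with something like `merge_results`.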
Once I got my results, I had to clean them up. I haven’t tested NER on any other type of document, but based on my results, I suspect that the particular genre of texts I am working with causes NER some significant problems. I started by just doing a bit of work with the list in OpenRefine in order to standardize the terrible spelling of 19th-century naval captains, plus OCR problems. That done, I took a hard look at what exactly was in my list.
Here’s what I found:
1. The navy didn’t do NER any favors by naming many of their ships after American places. It’s almost certain that Essex and Chesapeake, for instance, refer to the USS Essex and USS Chesapeake. Less certain are places like Philadelphia, Boston, United States, and even Tripoli, which are all places that definitely appear in the text, but are also ship names. There’s absolutely no way to disambiguate these terms.
2. The term “Cape” proved to be a particular problem. The difficulty here is that the abbreviation for “Captain” is often “Cap” or “Capt,” and often the OCR renders it “Cape” or “Ca.” Thus, people like Capt. Daniel McNeill turn up in a place-name list. Naval terms like “Anchorage” also cause some problems. I guarantee: Alaska does not enter the story at all.
3. The format of many of these documents is “To” someone “from” someone. I can’t be certain, but it seems like the NER process sometimes (though not always) saw those to and from statements as being locational, instead of relational. I also think that journal or logbook entries, with their formulaic descriptions of weather and location, sometimes get the NER process confused about which is the weather and which is the location.
4. To be honest, there are a large number of false hits that I really can’t explain. It seems like lists are particularly prone to being selected from, so I get one member of a crew list, or words like “salt beef,” “cheese,” or “coffee,” from provision lists. But there are other results as well where I just can’t work out why they were selected as locations.
Because of all these foibles, each list requires hand-curation to throw out the false hits. Once I did that, I ran it through R again to geocode the locations using ggmap. Here we also had some problems (which I admittedly should have anticipated based on previous work doing geolocation of these texts). Of course, many of the places had to be thrown out because they were just too vague to be of any use: “harbor,” “island,” and other such terms didn’t make the cut.
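Some of this hand-curation could be given a head start with a rough triage pass. The sketch below is in Python, not R, and is only an illustration of the kinds of checks described above, not something I ran. The function name `triage` and the word lists are hypothetical, seeded with the examples from this post; a real list would need to be much longer. Ambiguous names are flagged for review rather than thrown out, since a ship name may still be a genuine place.

```python
import re

# Hypothetical seed lists, taken from the examples mentioned in this post.
SHIP_NAMES = {"essex", "chesapeake", "philadelphia", "boston", "united states", "tripoli"}
NOT_PLACES = {"salt beef", "cheese", "coffee", "anchorage"}
TOO_VAGUE = {"harbor", "island"}
KNOWN_CAPES = {"cape spartel", "cape degatt", "cape ferina"}

# "Cape", "Capt", "Cap", or "Ca" (optionally with a period) followed by more words
# may be an OCR rendering of "Captain" rather than a real cape.
CAPTAIN_LIKE = re.compile(r"^(cape|capt|cap|ca)\.?\s")


def triage(name):
    """Return 'keep', 'reject', or a 'review: ...' label for one extracted name."""
    key = " ".join(name.lower().split())  # normalize case and whitespace
    if key in NOT_PLACES or key in TOO_VAGUE:
        return "reject"
    if key in SHIP_NAMES:
        return "review: ship or place"
    if CAPTAIN_LIKE.match(key) and key not in KNOWN_CAPES:
        return "review: possibly Captain"
    return "keep"
```

For example, `triage("Capt. Daniel McNeill")` comes back flagged as a possible captain, while `triage("Cape Spartel")` is kept. Anything labeled “review” still needs a human to look at it.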
When I ran the geocoder for the first time, it threw a bunch of errors because of unrecognizable place names. Then I remembered: this is why I’ve used historical maps of the area in the past, to try to track down these place names that are not used today. Examples include “Cape Spartel,” “Cape DeGatt,” and “Cape Ferina.” (I’m not sure why they were all capes.) I discovered that if you set the geocoder’s output option to “more,” the warnings don’t result in a failed geocode, and the extra information gives a better sense of the granularity of the geocode and of what exact identifier the geocoder used to determine each location.
This extra information proved helpful when the geocoded map revealed oddities such as the Mediterranean Sea showing up in the Philippines, or Tunis Bay showing up in Canada. Turns out, the geocoder doesn’t necessarily pick the most logical choice for ambiguous terms: there is, in fact, an Australasian sea sometimes known as the Mediterranean Sea. These seemingly arbitrary choices by the geocoder mean that the map looks more than a little strange.
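One way to catch oddities like these before they reach the map is a simple bounding-box check on the geocoded coordinates. The following is a Python sketch, not part of my R workflow. The region names and the latitude/longitude boxes are rough assumptions of my own, chosen only to cover the Mediterranean and the American east coast; they would need adjusting for real use.

```python
# Hypothetical bounding boxes: (min_lat, max_lat, min_lon, max_lon).
# These are rough guesses, not authoritative region definitions.
REGIONS = {
    "mediterranean": (28.0, 48.0, -12.0, 38.0),
    "us_east_coast": (24.0, 48.0, -85.0, -65.0),
}


def plausible_region(lat, lon):
    """Return the name of the first region containing the point, or None."""
    for region, (min_lat, max_lat, min_lon, max_lon) in REGIONS.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return region
    return None


def flag_outliers(rows):
    """Return the names of geocoded rows that fall outside every expected region.

    Each row is a dict with 'name', 'lat', and 'lon' keys (an assumed layout).
    """
    return [r["name"] for r in rows if plausible_region(r["lat"], r["lon"]) is None]
```

A point geocoded to somewhere near the Philippines would fall outside both boxes and be flagged, at which point the place name can be checked by hand against a historical map.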
So what’s the result here? I can see the potential for named-entity extraction, but for my particular project, it just doesn’t seem logical or useful. There’s not really anything more I can do with this data, except try to clean up my original documents even more. But even so, it was a useful exercise, and it was good practice in working with maps and data in R.