Civil War Navies Bookworm

If you read my last post, you know that this semester I engaged in building a Bookworm using a government document collection. My professor challenged me to try my system for parsing the documents on a different, larger collection of government documents. The collection I chose to work with is the Official Records of the Union and Confederate Navies. My Barbary Bookworm took me all semester to build; this Civil War navies Bookworm took me less than a day. I learned things from making the first one!

This collection is significantly larger than the Barbary Wars collection—26 volumes, as opposed to 6. It encompasses roughly the same time span, but 13 times as many words. Though it is still technically feasible to read through all 26 volumes, this collection is perhaps a better candidate for distant reading than my first corpus.

The document collection is broken into geographical sections, the Atlantic Squadron, the West Gulf Blockading Squadron, and so on. Using the Bookworm allows us to look at the words in these documents sequentially by date instead of having to go back and forth between different volumes to get a sense of what was going on in the whole navy at any given time.

Looking at ship types over the course of the war, across all geographies.
Looking at ship types over the course of the war, across all geographies.

Process and Format

The format of this collection is mostly the same as the Barbary Wars collection. Each document starts with an explanatory header (“Letter to the secretary of the navy,” “Extract from a journal,” etc.). Unlike BW, there are no citations at the end of each document. So instead of using the closing citations as document breakers, I used the headers. Though there are many different kinds of documents, the headers are very formulaic, so the regular expressions to find them were not particularly difficult to write.[ref]Ben had suggested that I do the even larger Civil War Armies document collection; however, that collection does not even have headers for the documents, much less citations, so the document breaking process would be exponentially more difficult. It’s not impossible, but I may have to rework my system—and I don’t care about the Civil War that much. 🙂 However, other document collections, such as the U.S. Congressional Serial Set, have exactly the same format, so it may be worth figuring out.[/ref]

Further easing the pain of breaking the documents is the quality of the OCR. Where I fought the OCR every step of the way for Barbary Bookworm, the OCR is really quite good for this collection (a mercy, since spot-checking 26 volumes is no trivial task). Thus, I didn’t have to write multiple regular expressions to find each header; only a few small variants seemed to be sufficient.

New Features

The high quality OCR enabled me to write a date parser that I couldn’t make work in my Barbary Bookworm. The dates are written in a more consistent pattern, and the garbage around and in them is minimal, so it was easy enough to write a little function to pull out all parts. In the event that certain parts of the dates were illegible, or non-existent, I did make the function find each part of the date in turn and then compile them into one field, rather than trying to extract the dates wholesale. That way, if all I could extract was the year, the function would still return at least a partial date.

Another new feature of this Bookworm is that the full text of the document appears for each search term when you click on the line at a particular date. This function is slow, so if the interface seems to freeze or you don’t seem to be getting any results, give it a few minutes. It will come up. Most of the documents are short enough that it’s easy to scroll through them.

Testing the Bookworm

Some of the same reservations apply to this Bookworm as I detailed in my last post about Barbary Bookworm—they really apply to all text-analysis tools. Disambiguation of ship names and places continues to be a problem. But many of the other problems with Barbary Bookworm are solved with this Bookworm.

The next step that I need to work on is sectioning out the Confederate navy’s documents from the Union navy’s. Right now, you can get a sense of what was important to both navies, but not so easily get a sense of what was important to just one side or the other.

To be honest, I don’t really know enough about the navies of the Civil War to make any significant arguments based on my scrounging around with this tool. There are some very low-hanging fruit, of course.

Unsurprisingly, the terms "monitor" and "ironclad" become more prominent throughout the war.
Unsurprisingly, the terms “monitor” and “ironclad” become more prominent throughout the war.

The Bookworm is hosted online by Ben Schmidt (thanks, Ben!). The code for creating the files is up on GitHub. Please go play around with it!

Feedback

Particularly since I don’t do Civil War history, I’d welcome feedback on both the interface and the content here. What worked? What didn’t? What else would you like to see?

Feel free to send me questions/observations/interesting finds/results by commenting on this post (since there’s not a comment function on the Bookworm itself), by emailing me, or for small stuff, pinging me on Twitter (@abbymullen). I really am very interested in everyone’s feedback, so please scrub around and try to break it. I already know of a few things that are not quite working right, but I’m interested to see what you all come up with.

Text Analysis on the Documents of the Barbary Wars

This past semester, I took a graduate seminar in Humanities Data Analysis, taught by Professor Ben Schmidt. This post describes my final project. Stay tuned for more fun Bookworm stuff in the next few days (part 2 on Civil War Navies Bookworm is here).


 

In the 1920s, the United States government decided to create document collections for several of its early naval wars: the Quasi-War with France, the Barbary Wars, and the Civil War (the War of 1812 did not come until much later, for some reason). These document collections, particularly for the Quasi-War and the Barbary Wars, have become the standard resource for any scholar doing work on these wars. My work on the Barbary Wars relies heavily on this document collection. The Barbary Wars collection includes correspondence, journals, official documents such as treaties, crew manifests, other miscellaneous documents, and a few summary documents put together in the 1820s.[ref]U.S. Office of Naval Records and Library, Naval Documents Related to the United States Wars with the Barbary Powers (Washington: U.S. Govt. Print. Off., 1939); digitized at http://www.ibiblio.org/anrs/barbary.html.[/ref]

It’s quite easy to get bogged down in the multiplicity of mundaneness in these documents—every single day’s record of where a ship is and what the weather is like, for instance. It’s also easy to lose sight of the true trajectory of the conflict in the midst of all this seeming banality. Because the documents in the collection are from many authors in conversation with each other, we can sometimes follow the path of these conversations. But there are many concurrent conversations, and often we do not have the full correspondence. How can we make sense of this jumble?

Continue reading Text Analysis on the Documents of the Barbary Wars