Last week, Ben Schmidt gave a talk at Northeastern, part of which was about developing technical competency in digital methods. This semester, I’ve had the chance to develop my technical competency in working with data, mostly by jumping in with both feet and flailing around in all directions.
The task I was given in the NULab has allowed me to play with several different digital methods. The base project was this: turn strings such as these
10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1
8744 sn83030213/1842-12-08/ed-1 sn86053954/1842-12-14/ed-1
8099 sn84028820/1860-01-05/ed-1 sn88061076/1859-12-23/ed-2
7819 sn85026050/1860-12-06/ed-1 sn83035143/1860-12-06/ed-1
7792 sn86063325/1850-01-03/ed-1 sn89066057/1849-12-31/ed-1
into a usable representation of a pair of newspapers who share a printed text. This snippet is 5 lines of a document of over 2 million lines, so obviously doing the substitutions by hand was not really an option.
David Smith, the computer science professor who wrote the algorithm that generated these pairs, suggested a Python program, using the dictionary data structure, for creating the usable list. That dictionary would draw its key from the text file provided by the Library of Congress for the Chronicling America newspapers. That was all fine, except that I had never even seen a Python script before.
I started very basic: The Programming Historian! Though that program was very helpful in learning the syntax and vocabulary, the brief discussion of dictionaries in The Programming Historian wasn’t sufficient for what I needed. So I turned to other sources of information: Python documentation (not that helpful) and my husband Lincoln (very helpful).
Through a lot of frustration, bother, and translating Ruby scripts into Python, Lincoln and I (95% Lincoln) were able to come up with a working program that generated a .csv file with lines of text that looked like this:
Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 Nashville union and American. (Nashville, Tenn.) 1853-1862
New-York daily tribune. (New-York [N.Y.]) 1842-1866 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865
The next step was pulling out the dates of publication (for the shared texts) and adding them to the .csv file. To do so, I had to update my Python program. I wrote a regular expression that detected the dates by searching for fields that looked like ####/##/##. In order to accommodate the Atlantic Monthly, which didn’t do its dates the same way, I added a variation that found the string beginning with 18 and recorded the 18 plus the next 6 digits. (At some point, I’ll write a separate thing that will add in the hyphens, perhaps?)
Third, I used the command line to remove the parentheses and brackets in the master newspapers file, and tab delimit the fields so that the location was its own column. This command looks like this:
tr '()' '\t' < newspapers-edit.txt | tr ',' '\t' | tr '' '\t' > newspapers-edit-expanded.txt
However, I realized when I did this command that it messes up my newspaper dictionary (from step 1) because the LCCN number, which was the last field, is now in a non-fixed location depending on how many fields were created by moving the comma-separated information into new tab-separated fields. So I did the highest-tech thing I know: I opened the .txt file in LibreOffice Calc (the poor man’s MS Excel) and simply moved the LCCN column in the original newspapers-edit.txt file over so that it wouldn’t be affected when I ran the tab-separating command. Then I ran the command again.
The data set now looks like this:
Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 1854-12-14 Nashville union and American. (Nashville, Tenn.) 1853-1862 1854-12-07
New-York daily tribune. (New-York [N.Y.]) 1842-1866 1842-12-08 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853 1842-12-14
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 1860-01-05 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1859-12-23
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 1860-12-06 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865 1860-12-06
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 1850-01-03 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852 1849-12-31
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 1857-12-10 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1857-12-14
My next task was figuring out how to write the dictionary to draw out the city/state as their own separate fields, which can then be geocoded in ArcGIS. I wrote the dictionary in a sort of stack: the LCCN calls the title; the title calls the city; the city calls the state. When I figured out how to set this up, I felt (for the first time) a major advancement in my understanding of Python syntax.
And this is how the data set has finally ended up looking:
Democrat and sentinel. Ebensburg Pennsylvania 1854-12-14 Nashville union and American. Nashville Tennessee 1854-12-07
New-York daily tribune. New-York New York 1842-12-08 Jeffersonian Republican. Stroudsburg Pennsylvania 1842-12-14
Holmes County Republican. Millersburg Ohio 1860-01-05 Clarksville chronicle. Clarksville Tennessee 1859-12-23
Fremont journal. Fremont Ohio 1860-12-06 Cleveland morning leader. Cleveland Ohio 1860-12-06
Glasgow weekly times. Glasgow Missouri 1850-01-03 Democratic banner. Bowling Green Missouri 1849-12-31
Belmont chronicle. St. Clairsville Ohio 1857-12-10 Clarksville chronicle. Clarksville Tennessee 1857-12-14
At the very beginning, I set up a shortened set (10 lines) of pairwise data to run my tests on, so I wouldn’t super-mess any of the big data up (or wait a really long time to discover that I’d done something wrong and the output wasn’t what I intended). This was a really helpful way to test my program without major consequences.
Each time, when it was time to replace the test file with the real one, I got all knock-kneed, fearful that something would go terribly awry. With the first program, something did go awry: we discovered that the test one worked but the big one didn’t because of mysterious empty lines in the big one. We solved that problem by (1) finding the blank lines and removing–don’t quite know how, to be honest, and (2) writing an exception that skipped over aberrant lines. Since that time, I fixed the aberrant line problem by adding the problem publication (the Atlantic Monthly) into the newspapers master list I’m pulling my dictionary keys from. So in the second iteration of the program, not only were there dates, but all the lines in the file were actually being identified. Troubleshooting these problems was quite beneficial in helping me learn exactly how Python works.
My first experiences with programming, though a very great frustration to me at times, have stretched me a lot in thinking about how data can be manipulated, and the best ways to get the job done. I look forward to continuing to flail around in all directions, both on this project and hopefully on some of my own.