Tag Archives: Viral Texts

On Newspapers and Being Human

Last week, an opinion piece appeared in the New York Times, arguing that the advent of algorithmically derived human-readable content may be destroying our humanity, as the lines between technology and humanity blur. A particular target in this article is the advent of “robo-journalism,” or the use of algorithms to write copy for the news. 1 The author cites a study that alleges that “90 percent of news could be algorithmically generated by the mid-2020s, much of it without human intervention.” The obvious rebuttal to this statement is that algorithms are written by real human beings, which means that there are human interventions in every piece of algorithmically derived text. But statements like these also imply an individualism that simply does not match the historical tradition of how newspapers are created. 2

In the nineteenth century, algorithms didn’t write texts, but neither did each newspaper’s staff write its own copy with personal attention to each article. Instead, newspapers borrowed texts from each other; no one would ever have expected individualized copy for news stories. 3 Newspapers were amalgams of texts from a variety of sources, cobbled together by editors who did more with scissors than with a pen (and they often described themselves this way).


  1. The article also decries other types of algorithmically derived texts, but the case for computer-generated creative fiction or poetry is fairly well argued by people such as Mark Sample, and is not an argument that I have anything new to add to.
  2. This post is based on my research for the Viral Texts project at Northeastern University.
  3. In 1844, the New York Daily Tribune published a humorous story illustrating exactly the opposite, in fact—some readers preferred a less human touch.

Passing on the Scissors and the Quill: Editorial Tenure in Viral Texts

The newspaper business was highly variable in the nineteenth century (in different ways than it is in the 21st century). Changes in editorship, political affiliation, and even location were frequent. Editorial changes were particularly significant, since very few editors maintained exactly the same newspaper that they inherited from a predecessor. Editors came and went quite often, passing on the “scissors and the quill,” in the words of the outgoing editor of the Polynesian, Edwin O. Hall.

A Hoe press, of the type made famous by John McClanahan, editor of the Memphis Daily Appeal (Creative Commons licensed image from flickr user jwyg)


Developing High- and Low-Tech Digital Competencies

Last week, Ben Schmidt gave a talk at Northeastern, part of which was about developing technical competency in digital methods. This semester, I’ve had the chance to develop my technical competency in working with data, mostly by jumping in with both feet and flailing around in all directions.

The task I was given in the NULab has allowed me to play with several different digital methods. The base project was this: turn strings such as these

10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1
8744 sn83030213/1842-12-08/ed-1 sn86053954/1842-12-14/ed-1
8099 sn84028820/1860-01-05/ed-1 sn88061076/1859-12-23/ed-2
7819 sn85026050/1860-12-06/ed-1 sn83035143/1860-12-06/ed-1
7792 sn86063325/1850-01-03/ed-1 sn89066057/1849-12-31/ed-1

into a usable representation of a pair of newspapers that share a printed text. This snippet is 5 lines of a document of over 2 million lines, so obviously doing the substitutions by hand was not really an option.

David Smith, the computer science professor who wrote the algorithm that generated these pairs, suggested a Python program, using the dictionary data structure, for creating the usable list. That dictionary would draw its key from the text file provided by the Library of Congress for the Chronicling America newspapers. That was all fine, except that I had never even seen a Python script before.
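To make the dictionary idea concrete, here is a minimal sketch rather than the program we actually wrote. It assumes the lookup table maps an LCCN such as sn86071378 to a title string. The name parse_pair and the two-entry table are hypothetical stand-ins (the LCCN-to-title matches are inferred from the sample lines in this post); the real table would be built from the Chronicling America newspaper list.

```python
# Hypothetical two-entry lookup table: LCCN -> title string.
# The pairings are inferred from the sample lines shown in this post.
TITLES = {
    "sn86071378": "Democrat and sentinel. (Ebensburg, Pa.) 1853-1866",
    "sn85038518": "Nashville union and American. (Nashville, Tenn.) 1853-1862",
}


def parse_pair(line, titles):
    """Turn '<number> <lccn/date/ed> <lccn/date/ed>' into a pair of titles."""
    # The leading number is kept out of the result; it is not interpreted here.
    _number, first, second = line.split()
    # The LCCN is everything before the first slash in each identifier.
    lccn_a = first.split("/")[0]
    lccn_b = second.split("/")[0]
    return titles[lccn_a], titles[lccn_b]


line = "10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1"
print("\t".join(parse_pair(line, TITLES)))
```

A plain dictionary fits here because each of the 2 million lines needs only two constant-time lookups.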

I started very basic: The Programming Historian! Though that program was very helpful in learning the syntax and vocabulary, the brief discussion of dictionaries in The Programming Historian wasn’t sufficient for what I needed. So I turned to other sources of information: Python documentation (not that helpful) and my husband Lincoln (very helpful).

Through a lot of frustration, bother, and translating Ruby scripts into Python, Lincoln and I (95% Lincoln) were able to come up with a working program that generated a .csv file with lines of text that looked like this:

Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 Nashville union and American. (Nashville, Tenn.) 1853-1862
New-York daily tribune. (New-York [N.Y.]) 1842-1866 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865

The next step was pulling out the dates of publication (for the shared texts) and adding them to the .csv file. To do so, I had to update my Python program. I wrote a regular expression that detected the dates by searching for fields that looked like ####-##-##. In order to accommodate the Atlantic Monthly, which didn’t do its dates the same way, I added a variation that found the string beginning with 18 and recorded the 18 plus the next 6 digits. (At some point, I’ll write a separate thing that will add in the hyphens, perhaps?)
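The date matching might look something like the sketch below. It assumes the usual dates are four digits, two digits, and two digits joined by hyphens, as in the sample identifiers, and that the other form is an unbroken run of eight digits starting with 18. The function name find_date is hypothetical, and the hyphen insertion at the end is the "separate thing" I haven't written yet, shown here only as one possible way to do it.

```python
import re

# Usual form in the sample identifiers, e.g. 1854-12-14.
HYPHENATED = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumed compact form: "18" plus the next six digits, e.g. 18571101.
COMPACT = re.compile(r"18\d{6}")


def find_date(field):
    """Return the date found in field as YYYY-MM-DD, or None if there is none."""
    match = HYPHENATED.search(field)
    if match:
        return match.group(0)
    match = COMPACT.search(field)
    if match:
        digits = match.group(0)
        # Add the hyphens so both forms come out looking the same.
        return f"{digits[:4]}-{digits[4:6]}-{digits[6:]}"
    return None


print(find_date("sn86071378/1854-12-14/ed-1"))
```

Checking the hyphenated pattern first matters: the compact pattern is looser, so it only runs when the stricter one finds nothing.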

Third, I used the command line to remove the parentheses and brackets in the master newspapers file and to tab-delimit the fields so that the location was its own column. The command looks like this:

tr '()' '\t' < newspapers-edit.txt | tr ',' '\t' | tr '[]' '\t' > newspapers-edit-expanded.txt

However, I realized when I ran this command that it broke my newspaper dictionary (from step 1), because the LCCN, which had been the last field, now lands in a different position on each line depending on how many new tab-separated fields the comma-separated information produces. So I did the highest-tech thing I know: I opened the .txt file in LibreOffice Calc (the poor man’s MS Excel) and simply moved the LCCN column in the original newspapers-edit.txt file over so that it wouldn’t be affected when I ran the tab-separating command. Then I ran the command again.
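The shifting-column problem is easier to see in a small sketch. This is a Python stand-in for the tr command, not what I actually ran, and it assumes each line of the master file is a title string followed by a tab and the LCCN. The name expand is hypothetical, and the two LCCN-to-title matches are inferred from the sample lines in this post.

```python
# Map each of ( ) [ ] , to a tab, as the tr pipeline does.
DELIMS = str.maketrans({ch: "\t" for ch in "()[],"})


def expand(line):
    """Replace the delimiters with tabs and split the line into fields."""
    return line.translate(DELIMS).split("\t")


# Assumed layout: title string, a tab, then the LCCN as the last field.
short = "Glasgow weekly times. (Glasgow, Mo.) 1848-1861\tsn86063325"
longer = "Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865\tsn84028820"

# The extra comma in the second line pushes the LCCN one field further along.
print(len(expand(short)), len(expand(longer)))

# With the LCCN moved to the front, it is always field 0.
lccn_first = "sn84028820\tHolmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865"
print(expand(lccn_first)[0])
```

Reading the LCCN from the last field (index -1) instead of a fixed position would be another way around the same problem.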

The data set now looks like this:

Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 1854-12-14 Nashville union and American. (Nashville, Tenn.) 1853-1862 1854-12-07
New-York daily tribune. (New-York [N.Y.]) 1842-1866 1842-12-08 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853 1842-12-14
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 1860-01-05 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1859-12-23
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 1860-12-06 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865 1860-12-06
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 1850-01-03 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852 1849-12-31
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 1857-12-10 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1857-12-14

My next task was figuring out how to write the dictionary to draw out the city/state as their own separate fields, which can then be geocoded in ArcGIS. I wrote the dictionary in a sort of stack: the LCCN calls the title; the title calls the city; the city calls the state. When I figured out how to set this up, I felt (for the first time) a major advancement in my understanding of Python syntax.
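Here is one reading of that stack as a sketch, with three small dictionaries chained together. The dictionary names, the function name locate, and the single entry in each table are hypothetical; the values come from the sample lines in this post.

```python
# Each dictionary answers one question and hands its answer to the next.
TITLE_BY_LCCN = {"sn86071378": "Democrat and sentinel."}
CITY_BY_TITLE = {"Democrat and sentinel.": "Ebensburg"}
STATE_BY_CITY = {"Ebensburg": "Pennsylvania"}


def locate(lccn):
    """Follow the chain LCCN -> title -> city -> state."""
    title = TITLE_BY_LCCN[lccn]
    city = CITY_BY_TITLE[title]
    state = STATE_BY_CITY[city]
    return title, city, state


print("\t".join(locate("sn86071378")))
```

One caution about this design: keying the state on the city name alone assumes no two cities in the data share a name, and keying the city on the title assumes no two papers share a title. Keying everything on the LCCN would avoid both assumptions.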

And this is how the data set has finally ended up looking:

Democrat and sentinel. Ebensburg Pennsylvania 1854-12-14 Nashville union and American. Nashville Tennessee 1854-12-07
New-York daily tribune. New-York New York 1842-12-08 Jeffersonian Republican. Stroudsburg Pennsylvania 1842-12-14
Holmes County Republican. Millersburg Ohio 1860-01-05 Clarksville chronicle. Clarksville Tennessee 1859-12-23
Fremont journal. Fremont Ohio 1860-12-06 Cleveland morning leader. Cleveland Ohio 1860-12-06
Glasgow weekly times. Glasgow Missouri 1850-01-03 Democratic banner. Bowling Green Missouri 1849-12-31
Belmont chronicle. St. Clairsville Ohio 1857-12-10 Clarksville chronicle. Clarksville Tennessee 1857-12-14

At the very beginning, I set up a shortened set (10 lines) of pairwise data to run my tests on, so I wouldn’t super-mess any of the big data up (or wait a really long time to discover that I’d done something wrong and the output wasn’t what I intended). This was a really helpful way to test my program without major consequences.
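Making a short test set can itself be scripted. A minimal sketch, assuming the input is any iterable of lines such as an open file; the function name first_lines and the file names in the comment are hypothetical.

```python
from itertools import islice


def first_lines(lines, n=10):
    """Return the first n lines of any iterable of lines (such as an open file)."""
    # islice stops after n items, so the rest of a huge file is never read.
    return list(islice(lines, n))


# With real files it might be used like this (file names are made up):
# with open("pairs.txt") as big, open("pairs-test.txt", "w") as small:
#     small.writelines(first_lines(big))

sample = (f"line {i}\n" for i in range(2_000_000))
print(len(first_lines(sample)))
```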

Each time I replaced the test file with the real one, I got all knock-kneed, fearful that something would go terribly awry. With the first program, something did go awry: the test file worked but the big one didn’t, because of mysterious empty lines in the big one. We solved that problem by (1) finding the blank lines and removing them (I don’t quite know how, to be honest) and (2) writing an exception handler that skipped over aberrant lines. Since then, I have fixed the aberrant-line problem by adding the problem publication (the Atlantic Monthly) to the newspapers master list I’m pulling my dictionary keys from. So in the second iteration of the program, not only were there dates, but all the lines in the file were actually being identified. Troubleshooting these problems was quite beneficial in helping me learn exactly how Python works.
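The two fixes can be sketched together. This is a reconstruction under assumptions, not our original code: blank lines are passed over, and any line that doesn't split into three fields or whose identifier is missing from the lookup table is set aside instead of stopping the run. The function name convert_lines, the one-entry table, and the "atlantic" identifier are all hypothetical.

```python
def convert_lines(lines, titles):
    """Convert pair lines to title pairs, collecting the lines that fail."""
    converted, skipped = [], []
    for line in lines:
        # Pass over the mysterious empty lines.
        if not line.strip():
            continue
        try:
            _number, first, second = line.split()
            pair = (titles[first.split("/")[0]], titles[second.split("/")[0]])
        except (ValueError, KeyError):
            # ValueError: wrong number of fields; KeyError: unknown identifier.
            skipped.append(line)
            continue
        converted.append(pair)
    return converted, skipped


TITLES = {"sn86071378": "Democrat and sentinel."}
lines = [
    "10138 sn86071378/1854-12-14/ed-1 sn86071378/1854-12-07/ed-1\n",
    "\n",
    "12 atlantic/18571101 sn86071378/1854-12-14/ed-1\n",  # made-up aberrant line
]
converted, skipped = convert_lines(lines, TITLES)
print(len(converted), len(skipped))
```

Keeping the skipped lines in a list, rather than silently dropping them, makes it possible to see afterwards which publication was missing from the master list.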

My first experiences with programming, though a very great frustration to me at times, have stretched me a lot in thinking about how data can be manipulated, and the best ways to get the job done. I look forward to continuing to flail around in all directions, both on this project and hopefully on some of my own.