Tag Archives: NULab

Boston-Area Days of DH Wrap-up

[cross-posted to HASTAC.org]

Now that it’s been almost a month since the Boston-Area Days of DH, I figured I’d better write a wrap-up of the conference. It was my very great pleasure to help Prof. Ryan Cordell organize the conference, and along the way I learned a lot about DH and about scholarly work in general (and about scheduling and organization and making sure the coffee gets to the right place…).

The Boston-Area Days of DH conference was sponsored by Northeastern University’s NULab for Texts, Maps, and Networks. Originally, it was designed to coincide with the worldwide Day of DH, sponsored by CenterNet. It would do in a conference what Day of DH does online: highlight the work that Boston-area digital humanists are doing and start conversations based on that work. In addition, we tried to include sessions to help digital humanists do their work better.

Day 1 Breakdown

Our first session, the lightning talks, was designed to highlight as many projects as possible in a short amount of time. All the presentations were interesting, but I’d like to especially mention a couple. First, the Lexomics group from Wheaton College presented on their text analysis work on Old English texts. This group was unusual both for the work they did and also for their place in the field: all the presenters were undergraduates at Wheaton. I found it very heartening to see undergraduates doing serious scholarly work using digital humanities. Second, Siobhan Senier’s work on Native American literature was especially inspiring. I love how she is using digital tools to help expose and analyze literature of New England Native Americans. She’s using Omeka as a digital repository for Native American literature, much of which is not literature in words, but rather in art or handicraft (such as baskets). I think this is a perfect use for the Omeka platform.

After the lightning talks, we were able to run a set of workshops twice during the first day of the conference. The topics ranged from network analysis (taught by Jean Bauer), to text analysis (taught by David Smith), to historical GIS (taught by Ryan Cordell). I heard lots of good feedback about how helpful these workshops were, though I wasn’t able to attend any myself.

The keynote address has to rate as one of the most entertainingly educational talks I’ve ever heard. Matt Jockers, from the University of Nebraska–Lincoln, sparred with Julia Flanders from Brown University in a mock debate over the relative merits of big data and small data. They’ve posted their whole talk, along with some post-talk comments, on their respective blogs (Matt’s and Julia’s). The talk is certainly well worth the read, so rather than outlining or overviewing it here, I’ll just entreat you to go to the source itself.

Day 2 Breakdown

On Day 2, we suffered an environmental crisis: a sudden snowstorm on Monday night, which made travel an even greater hassle than it already is in Boston. As a result, our numbers were greatly reduced, but we soldiered on, sans coffee and muffins.

Our first session was a series of featured talks about specific projects. Topics ranged from gaming, to GIS, to pedagogy, to large-scale text analysis. Augusta Rohrbach discussed how a game she’s working on, Trifles, brings elements of history and literature into a game environment to teach students about both, while engaging questions about gender and social issues as well. Michael Hanrahan talked about how GIS can reframe questions about rebellions in England in 1381, and on a wider scale, how GIS can reframe questions of information dissemination. Shane Landrum talked about how he uses digital technology to teach at a large, public, urban university, and the challenges of doing DH in a place where computer access and time to “screw around” are real problems. And Ben Schmidt talked about doing textual analysis on large corpora using Bookworm, a tool created at the Harvard Cultural Observatory.

The final session of the conference was a grants workshop with Brett Bobley, director of the NEH’s Office of Digital Humanities. By staging a mock panel discussion such as might occur in a real review of grant proposals, Brett was able to instruct us about what the NEH-ODH is looking for in grant proposals, and how the grant-awarding process works. I found the issues that Brett raised about grant proposals to be helpful in thinking through all of my work: am I being specific about my objectives? about who this will reach? about how exactly it’s all going to get done? These questions ought to inform our practice not just for grants, but for all the work we do.

All in all, despite some environmental setbacks, I think the conference was a great success. A friend, upon seeing the program, remarked to me, “Wow, a digital humanities conference that’s not a THATCamp!” I’m all for THATCamps, but I do think that pairing this sort of conference with the THATCamp model allows us to talk about our work in different ways, all of which are valuable. So, with some trepidation, I will join those who have already called for this conference to become an annual event. (After all, with a year of experience under our belt, what could go wrong?)

Developing High- and Low-Tech Digital Competencies

Last week, Ben Schmidt gave a talk at Northeastern, part of which was about developing technical competency in digital methods. This semester, I’ve had the chance to develop my technical competency in working with data, mostly by jumping in with both feet and flailing around in all directions.

The task I was given in the NULab has allowed me to play with several different digital methods. The base project was this: turn strings such as these

10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1
8744 sn83030213/1842-12-08/ed-1 sn86053954/1842-12-14/ed-1
8099 sn84028820/1860-01-05/ed-1 sn88061076/1859-12-23/ed-2
7819 sn85026050/1860-12-06/ed-1 sn83035143/1860-12-06/ed-1
7792 sn86063325/1850-01-03/ed-1 sn89066057/1849-12-31/ed-1

into a usable representation of a pair of newspapers that share a printed text. This snippet is 5 lines of a document of over 2 million lines, so obviously doing the substitutions by hand was not really an option.

David Smith, the computer science professor who wrote the algorithm that generated these pairs, suggested a Python program, using the dictionary data structure, for creating the usable list. That dictionary would draw its key from the text file provided by the Library of Congress for the Chronicling America newspapers. That was all fine, except that I had never even seen a Python script before.

I started very basic: The Programming Historian! Though that tutorial was very helpful in learning the syntax and vocabulary, its brief discussion of dictionaries wasn’t sufficient for what I needed. So I turned to other sources of information: Python documentation (not that helpful) and my husband Lincoln (very helpful).

Through a lot of frustration, bother, and translating Ruby scripts into Python, Lincoln and I (95% Lincoln) were able to come up with a working program that generated a .csv file with lines of text that looked like this:

Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 Nashville union and American. (Nashville, Tenn.) 1853-1862
New-York daily tribune. (New-York [N.Y.]) 1842-1866 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865
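
For anyone curious what the dictionary step looks like, here is a minimal sketch rather than the program Lincoln and I actually wrote. It assumes the master newspaper file is tab-separated with the title first and the LCCN last; the function names (`load_titles`, `translate_pair`) are made up for illustration, and the post never says what the leading number on each pair line means, so the sketch simply ignores it.

```python
import csv
import io

def load_titles(lines):
    """Build a dict mapping LCCN -> title (assumes tab-separated lines, LCCN last)."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not a usable title/LCCN line
        titles[fields[-1]] = fields[0]
    return titles

def translate_pair(line, titles):
    """Turn '<number> <issue id> <issue id>' into a pair of newspaper titles."""
    _leading, first, second = line.split()
    # The LCCN is the part of each issue id before the first slash.
    return tuple(titles[issue.split("/")[0]] for issue in (first, second))

# Two stand-in lines in the assumed format of the master file.
master = [
    "Democrat and sentinel. (Ebensburg, Pa.) 1853-1866\tsn86071378\n",
    "Nashville union and American. (Nashville, Tenn.) 1853-1862\tsn85038518\n",
]
titles = load_titles(master)
pair = translate_pair(
    "10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1", titles
)

# Write the pair as one .csv row; the csv module quotes titles containing commas.
out = io.StringIO()
csv.writer(out).writerow(pair)
print(out.getvalue().strip())
```

On the real data, the list of stand-in lines would be replaced by the opened master file, and the pairs would be read line by line from the 2-million-line document.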

The next step was pulling out the dates of publication (for the shared texts) and adding them to the .csv file. To do so, I had to update my Python program. I wrote a regular expression that detected the dates by searching for fields that looked like ####-##-##. In order to accommodate the Atlantic Monthly, which didn’t do its dates the same way, I added a variation that found the string beginning with 18 and recorded the 18 plus the next 6 digits. (At some point, I’ll write a separate step that will add in the hyphens, perhaps?)
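
Here is a rough sketch of that date-finding step, assuming the dates are hyphenated as in the sample identifiers (1854-12-14). The function name `find_date` and the Atlantic Monthly-style identifier are hypothetical, since the post doesn't show what those identifiers look like. The sketch also adds the hyphens back in for the compact form.

```python
import re

# Dates as they appear in the sample issue ids, e.g. 1854-12-14.
HYPHENATED = re.compile(r"\d{4}-\d{2}-\d{2}")
# Fallback: "18" plus the next six digits, e.g. 18571101.
COMPACT = re.compile(r"18\d{6}")

def find_date(identifier):
    """Return the date as YYYY-MM-DD, or None if no date is found."""
    match = HYPHENATED.search(identifier)
    if match:
        return match.group(0)
    match = COMPACT.search(identifier)
    if match:
        digits = match.group(0)
        # Add the hyphens so both forms come out looking the same.
        return f"{digits[:4]}-{digits[4:6]}-{digits[6:]}"
    return None

print(find_date("sn86071378/1854-12-14/ed-1"))
print(find_date("hypothetical-atlantic-18571101"))  # made-up identifier
```

One caution: the fallback pattern would also match an LCCN that happened to contain 18 followed by six digits, so the hyphenated pattern is tried first.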

Third, I used the command line to remove the parentheses and brackets in the master newspapers file, and to tab-delimit the fields so that the location was its own column. The command looks like this:

tr '()' '\t' < newspapers-edit.txt | tr ',' '\t' | tr '[]' '\t' > newspapers-edit-expanded.txt

However, I realized when I ran this command that it messed up my newspaper dictionary (from step 1): the LCCN, which was the last field, now landed in a different column from line to line, depending on how many new tab-separated fields the commas, parentheses, and brackets had created. So I did the highest-tech thing I know: I opened the original newspapers-edit.txt file in LibreOffice Calc (the poor man’s MS Excel) and simply moved the LCCN column over so that it wouldn’t be affected when I ran the tab-separating command. Then I ran the command again.
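
The same fix could probably be done in Python instead of a spreadsheet. This is only a sketch under assumptions: that each line is tab-separated with the LCCN last, and that putting the LCCN in the first column is an acceptable place to move it. The function name `expand_line` is made up.

```python
import re

def expand_line(line):
    """Replace ( ) , [ ] with tabs, keeping the LCCN in a fixed first column."""
    # Peel the LCCN off the end before touching the punctuation.
    body, lccn = line.rstrip("\n").rsplit("\t", 1)
    # The same five characters the tr pipeline replaces.
    expanded = re.sub(r"[()\[\],]", "\t", body)
    return lccn + "\t" + expanded

row = expand_line("Democrat and sentinel. (Ebensburg, Pa.) 1853-1866\tsn86071378\n")
print(row.split("\t"))
```

Because the LCCN is set aside first, it stays in one place no matter how many fields the punctuation creates.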

The data set now looks like this:

Democrat and sentinel. (Ebensburg, Pa.) 1853-1866 1854-12-14 Nashville union and American. (Nashville, Tenn.) 1853-1862 1854-12-07
New-York daily tribune. (New-York [N.Y.]) 1842-1866 1842-12-08 Jeffersonian Republican. (Stroudsburg, Pa.) 1840-1853 1842-12-14
Holmes County Republican. (Millersburg, Holmes County, Ohio) 1856-1865 1860-01-05 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1859-12-23
Fremont journal. (Fremont, Sandusky County, [Ohio]) 1853-1866 1860-12-06 Cleveland morning leader. (Cleveland [Ohio]) 1854-1865 1860-12-06
Glasgow weekly times. (Glasgow, Mo.) 1848-1861 1850-01-03 Democratic banner. (Bowling Green, Pike County, Mo.) 1845-1852 1849-12-31
Belmont chronicle. (St. Clairsville, Ohio) 1855-1973 1857-12-10 Clarksville chronicle. (Clarksville, Tenn.) 1857-1865 1857-12-14

My next task was figuring out how to write the dictionary to draw out the city/state as their own separate fields, which can then be geocoded in ArcGIS. I wrote the dictionary in a sort of stack: the LCCN calls the title; the title calls the city; the city calls the state. When I figured out how to set this up, I felt (for the first time) a major advancement in my understanding of Python syntax.

And this is how the data set has finally ended up looking:

Democrat and sentinel. Ebensburg Pennsylvania 1854-12-14 Nashville union and American. Nashville Tennessee 1854-12-07
New-York daily tribune. New-York New York 1842-12-08 Jeffersonian Republican. Stroudsburg Pennsylvania 1842-12-14
Holmes County Republican. Millersburg Ohio 1860-01-05 Clarksville chronicle. Clarksville Tennessee 1859-12-23
Fremont journal. Fremont Ohio 1860-12-06 Cleveland morning leader. Cleveland Ohio 1860-12-06
Glasgow weekly times. Glasgow Missouri 1850-01-03 Democratic banner. Bowling Green Missouri 1849-12-31
Belmont chronicle. St. Clairsville Ohio 1857-12-10 Clarksville chronicle. Clarksville Tennessee 1857-12-14
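
The stacked lookup described above can be sketched roughly like this. The three small dictionaries are hand-made stand-ins, not my real data, and the names (`title_by_lccn`, `city_by_title`, `state_by_city`, `describe`) are made up for illustration.

```python
# All three tables are tiny hand-made stand-ins for illustration.
title_by_lccn = {
    "sn86071378": "Democrat and sentinel.",
    "sn85038518": "Nashville union and American.",
}
city_by_title = {
    "Democrat and sentinel.": "Ebensburg",
    "Nashville union and American.": "Nashville",
}
state_by_city = {
    "Ebensburg": "Pennsylvania",
    "Nashville": "Tennessee",
}

def describe(issue_id):
    """Follow the chain LCCN -> title -> city -> state for one issue id."""
    lccn, date, _edition = issue_id.split("/")
    title = title_by_lccn[lccn]
    city = city_by_title[title]
    state = state_by_city[city]
    return [title, city, state, date]

print("\t".join(describe("sn86071378/1854-12-14/ed-1")))
```

A chain like this breaks down if two papers share a title or two cities share a name, so keying every lookup on the LCCN directly might be a safer design.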

At the very beginning, I set up a shortened set (10 lines) of pairwise data to run my tests on, so I wouldn’t super-mess any of the big data up (or wait a really long time to discover that I’d done something wrong and the output wasn’t what I intended). This was a really helpful way to test my program without major consequences.

Each time, when it was time to replace the test file with the real one, I got all knock-kneed, fearful that something would go terribly awry. With the first program, something did go awry: we discovered that the test file worked but the big one didn’t, because of mysterious empty lines in the big one. We solved that problem by (1) finding the blank lines and removing them (I don’t quite know how, to be honest), and (2) writing an exception that skipped over aberrant lines. Since then, I’ve fixed the aberrant-line problem by adding the problem publication (the Atlantic Monthly) to the newspapers master list I’m pulling my dictionary keys from. So in the second iteration of the program, not only were there dates, but all the lines in the file were actually being identified. Troubleshooting these problems was quite beneficial in helping me learn exactly how Python works.
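
For the record, the two fixes (skipping blank lines and catching aberrant ones) might look something like the sketch below. It is not our actual code: the function name `translate_all` is made up, and the unknown identifier in the sample is hypothetical. Recording the line numbers of skipped lines makes it easier to go back and see what went wrong.

```python
def translate_all(lines, titles):
    """Translate pair lines to titles, skipping blank and aberrant lines."""
    rows, skipped = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # one of the mysterious empty lines
        try:
            _leading, first, second = line.split()
            rows.append(
                (titles[first.split("/")[0]], titles[second.split("/")[0]])
            )
        except (ValueError, KeyError):
            # ValueError: wrong number of fields; KeyError: LCCN not in the list.
            skipped.append(number)
    return rows, skipped

titles = {
    "sn86071378": "Democrat and sentinel.",
    "sn85038518": "Nashville union and American.",
}
sample = [
    "10138 sn86071378/1854-12-14/ed-1 sn85038518/1854-12-07/ed-1\n",
    "\n",
    "9999 hypothetical-id/1857-11-01/ed-1 sn85038518/1854-12-07/ed-1\n",
]
rows, skipped = translate_all(sample, titles)
print(rows)
print(skipped)
```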

My first experiences with programming, though a very great frustration to me at times, have stretched me a lot in thinking about how data can be manipulated, and the best ways to get the job done. I look forward to continuing to flail around in all directions, both on this project and hopefully on some of my own.

Documenting Change over Time with Simile Timeline

The NULab project that I’m working on right now involves documenting connections between newspapers in the nineteenth-century United States. So far, my work has been researching the history of each individual newspaper. It’s been an enlightening and entertaining process. (If you’re interested in one of the most entertaining stories I discovered, check out my Omeka exhibit for my digital humanities class.)

We’re pulling data from the Chronicling America website at the Library of Congress. The newspapers we have range from 1836 to 1860. We don’t have all the newspapers from that range, though. We’re adding new papers all the time. The data I’m working with right now is from the first batch of data.

One of the difficulties I encountered early in the research process was the astonishing number of name changes each paper went through during its lifetime. To get a better sense of how many times and how often these name changes occurred, I decided to plot the changes on a timeline.

Based on a suggestion from Chuck Rybak, and armed with an excellent basic tutorial by Brian Croxall, I built my timeline using MIT Simile Timeline. I found data entry very easy using the timeline interface. (I don’t think the CSS is particularly beautiful; as I have time, I may try to make it nicer-looking.)

For many of the newspapers, the exact date of some changes is uncertain. Various sources disagree on dates, and some information is just not out there for me to find. To compensate for that difficulty, I marked each uncertain date as starting on January 1 of the year and noted the uncertainty in the comment.

The current timeline has another drawback: it can’t filter by newspaper. For some newspapers, it’s easy to tell which names are connected (for instance, it’s pretty easy to tell that the Sunbury American is connected to the Sunbury American and Shamokin Journal). It’s not so easy to see the link between papers such as the Salt River Journal, The Radical, and The Democratic Banner. I’d like to be able to filter the timeline so these connections are self-evident without reading the comments.

Here’s the final result, though as newspapers get added to our dataset, they’ll get added here too. You’ll notice, too, that my timeline spans more than 1836-1860. Though many of the newspapers exist past 1860, I decided to stop my investigations there because going further wouldn’t be that helpful to our project. However, I decided to trace each paper back to its origin, if possible, as a way to get at the characteristics of the paper. For that reason, the start date for some newspapers is well before 1836.

If anyone has suggestions for how to improve this timeline, I would welcome them! Please leave me a comment.

Witticism from the Holmes County Republican

As part of my work at the NULab, I’ve been researching newspapers from the mid-nineteenth century. This little tidbit from a newspaper in Millersburg, OH, caught my eye and I thought I’d share.

P.S. If you know anything about Millersburg in the 1850s, or the Holmes County Republican or Holmes County Farmer, please contact me. I would like to know more about this town’s news.

About Girls’ Names.

If you are a very precise man and wish to be certain of what you get, never marry a girl named Ann; for we have the authority of Lindley Murray, and others, that “an is an indefinite article.”[1. I’ve been listening to the Anne of Green Gables series on my commute to school, and this one seems to hold true, in that case at least.]

If you would like to have a wife who is “one of a thousand,” you should marry an Emily or an Emma, for any printer can tell you that “em’s” are always counted by thousands.

If you do not wish to have a bustling, fly-about wife, you should not marry one named Jenny; for every cotton spinner knows that jennies are always on the go.

If you marry one named Margaret, you may confidentially expect that she will end her days on the gallows; for all the world knows that “days” were made for hanging.[2. I must confess, I don’t understand this one at all. Someone please enlighten me? Update: Apparently this is a misprint. In every other newspaper in which this or a similar witticism appears, the word “days” is replaced by the word “Pegs,” which makes much more sense. What’s weird is that the “days”/“pegs” mistake looks a lot like an OCR problem. Did they have OCR problems in 1856?]

The most incessant writer in the world is he who is always bound to Ad-a-line.

You may adore your wife, but you will be surpassed, in love when your wife is a Dora.

Unless you would have the evil one for a father-inlaw, you should not marry a lady named Elizabeth, for the devil is the father of Lize– (lies.)

If you wish to succeed in life as a porter, you should marry a Caroline, and treat her very kindly, for so long as you continue to do this, you will be good to Carry.

Many men of high moral principles, and who would not gamble for the world, still have not refused to take a Bet.

from the Holmes County Republican, August 21, 1856 (its inaugural issue under that name)