Lincoln A. Mullen: The Making of America’s Public Bible: Computational Text Analysis for Religious History

America’s Public Bible: A Commentary is a website that charts biblical quotations in U.S. newspapers over the nineteenth and early twentieth centuries.[1] The prototype version uses the Chronicling America corpus of over 12 million newspaper pages as a source base. It finds and identifies quotations from the King James Version of the Bible. The prototype was created for the Chronicling America Data Challenge hosted by the National Endowment for the Humanities.[2] The initial version, which will remain online until the revised version is published, is available at the project website: http://americaspublicbible.org (figure 1). A much expanded version (featuring an additional newspaper corpus, more versions of the Bible, and expanded interpretation in the form of visualizations and prose) is in progress.[3]

Figure 1: The front page of the prototype version of America’s Public Bible, created for the Chronicling America Data Challenge.

America’s Public Bible as a digital history or digital religious studies project could be discussed in a variety of ways. The project uses the techniques of machine learning and text analysis to find the quotations, and it uses data analysis and visualization to make sense of them. It is therefore a kind of computational humanities project, like Frederik Elwert’s use of network analysis of ancient Egyptian and Indian religion or Marcus Bingenheimer’s digital analysis of Buddhist texts, both described in this volume. Since such projects are rather unusual for humanities scholars (though increasingly less so), the specific computational techniques and methods that they use are understandably an object of curiosity: What did the data look like? What methods were used to analyze it?

But it is also possible to talk about the methods behind these projects in a more humanistic way, to understand how computational methods drawn from other disciplines can be used within humanities disciplines. In other words, how does one use computational methods to make a historical interpretation? And then it is also possible to discuss such projects in terms of the contributions they make to a specific field of study. What does this project tell us about nineteenth-century American religious history?

In this essay, I invert the typical way of discussing such projects. Instead of discussing methods, then modes of interpretation, then results, I begin with a case study of an interpretative result, describe a general pattern of interpretation that is useful to humanities scholars working with digital methods, show how America’s Public Bible offers an interface that enables such interpretations, and finally describe the computational methods for finding quotations. Scholars in history and religious studies who are contemplating a digital project must acquaint themselves with the various computational methods available to them, to be sure. But the more pressing problem is finding a mode of argumentation which usefully applies those methods to make meaningful interpretations in one’s given field.

In explaining what my digital project is and how it functions, I hope to show two things. First, the project creates serendipitous findings through computational history by surfacing sources that would otherwise go unnoticed. And second, the project disciplines those searches by setting the results in a much broader chronological context in which the typical and the exceptional can be identified. This disciplined serendipity constitutes a method of approaching the past that is relevant to religious studies.[4]

To see how disciplined serendipity can be used as an interpretative method, let’s take up the case of the McKinley assassination.

Interpretation: The McKinley assassination and 2 Chronicles 7:14

William McKinley, the twenty-fifth president of the United States, was shot by an assassin on September 6, 1901, and died eight days later. The Cameron County Press reported on the sermons preached in the churches of Emporium, Pennsylvania, to commemorate the slain president. The Methodist, Catholic, Episcopal, and Presbyterian churches all paid tribute to McKinley. The Rev. Robert McCaslin called his Presbyterian congregation to prayer and repentance, juxtaposing prosperity and repentance. “In our national prosperity we were forgetting God and we were becoming self-reliant,” the minister claimed. He called his hearers to repent using 2 Chronicles 7:14: “If my people, which are called by my name, shall humble themselves, and pray, and seek my face, and turn from their wicked ways; then will I hear from heaven, and will forgive their sin, and will heal their land.”

There could be no mistaking who McCaslin and his hearers thought the “people” in that biblical verse were. “The people” were not ancient Israelites, or even the Presbyterians sitting in the pews of Emporium, but the citizens of the United States. McCaslin drove home this point by encouraging his listeners to heed “the loud call of God to nations to rise and stamp out the curse of anarchy,” regretting that it was “deeply humiliating that such a deed could take place in this christian land.”[5]

This brief scene should be familiar to any historian of American religion. Christian nationalism has long been a subject of concern to religious historians, and this episode in Emporium is a classic example of the jeremiad.[6] A public event in the wake of a national tragedy linked Christianity and the state, and the sacred scriptures were used to undergird the power of the state and condemn its enemies. Like a Puritan fast day, the day was set aside for prayer, preaching, and repentance. And the scriptural verse that was used would later become popular in the late twentieth-century rise of the religious right.

The question, though, is whether this use of the Bible was typical or unusual. When the citizens of Emporium opened their Bibles to 2 Chronicles, would the verse calling them to humility have naturally seemed to mean what their ministers said it meant that day? Or might it have held other meanings that they saw as more obvious? We understand the cultural significance of that text in the light of our recent history, but can we uncover the assumptions of the somewhat more distant past?[7]

In the case of 2 Chronicles 7:14, an answer is readily apparent. In the decade on either side of the McKinley assassination, the connection between that text and Christian nationalism was fairly infrequent. Instead, the verse had two other uses which were far more common.

The first and most frequent use of the verse was as a call to humility for Christians, occasionally in the context of a revival. The Watchman and Southron in Sumter, SC, ran an unsigned column on “Humility,” assembling the verse from 2 Chronicles and other texts, encouraging its readers to “never mind the showy array and costly equipage of the worldly, but to be clothed with humility.”[8] In St. Louis, the pastor of the Presbyterian church preached on the text, reminding his congregants that, as the newspaper headline put it, “to give up habits of sin … is harder than pulling teeth.”[9] In Sacramento, a Congregationalist pastor surprisingly used the text to recommend the practice of Lent, and less surprisingly to encourage preparation for the famous preacher D. L. Moody’s upcoming revival in that city. The pastor did regard the Lenten observance as tied to the responsibility of Christians in the state, observing that “as a nation and people we need many things,” among them “higher ideals of business integrity” and “higher standards of political morality,” besides the “deeper consciousness of God.”[10]

The verse was also popular in revivals. The Pacific Commercial Advertiser in Honolulu tried to gather “the Christian people of our city” in a “call to prayer,” quoting that text as a means of “putting away of sin.”[11] A cooperative revival in Kentucky of “God’s own people,” meaning clearly the members of churches and not the citizens of the United States, was announced by the Mt. Sterling Advocate.[12] The Monroe City Democrat in Missouri specifically addressed the verse to Christians, quoting it as the solution “when a church wants a real revival.”[13]

The second common use of 2 Chronicles 7:14 was as a prayer for rain in response to drought. After all, verse 13 framed the call to repentance as being in a time when “there is no rain.” A Baptist pastor in Clinton, Missouri, promised that “God will answer prayer with rain.”[14] In Utah the Deseret Evening News quoted the verse while worrying about a drought’s effect on the corn crop.[15] In 1901, the governor of Nebraska proclaimed a day of prayer “for relief from destructive winds and drouth.” Ten years later the Omaha Daily Bee ran a commemoration, claiming that by 1 p.m. on the day of prayer it began to rain and that in a few days “the whole state was wet down.” The verse from 2 Chronicles was used to bolster that meteorological claim.[16]

The point of undertaking this brief history of 2 Chronicles 7:14 is not to suggest that Christian nationalism was not a significant factor in American life, but to put it in a larger context. That verse did not provoke automatic and easy assumptions of the role of God’s favor on the United States or his promised defense against anarchists and other enemies. For farmers in Utah or Missouri, the danger against which God might defend them was drought. Their link to ancient Israel was not a theological identity as the people of God but their common agricultural occupation. A pattern of quotation that became a favorite of Jerry Falwell and the Moral Majority in the 1980s had antecedents at the turn of the twentieth century, but those antecedents were unusual rather than typical.

Pattern of argumentation: the typical and the exceptional

This vignette of a civil religious event in the McKinley assassination combined with the more complicated history of a biblical text highlights a fundamental tension in how scholars approach the study of religion and history. The problem lies in knowing whether some phenomenon that we are interested in studying is typical or exceptional. Arguments in the humanities tend to be structured around common patterns. One common pattern explicates some unusual text or event. An equally important strand of research aims at explicating the typical, the everyday, and the ordinary.

But how can we know whether the object of our study is one or the other? Scholars often assert but seldom prove their claims about whether something is exceptional or typical. Such claims usually rest on the scholar’s expertise. This is not without justification. After all, a career spent immersed in the sources does give scholars some ability to distinguish between the ordinary and the extraordinary.

Our intuitive understanding, though, is inadequate in the face of the scarcity and abundance of our sources, a problem faced by all humanistic disciplines, but perhaps especially by historians. Historical sources are scarce in that we always have the problem of an incomplete set of sources, and must therefore be attentive to the silences of the archives.[17] The art of being a historian is taking sources that were created for one purpose and reading against their grain to answer the questions which interest us.

Yet historians also face the problem of an abundance of sources. However partial they may be, the archival and the printed record is vast. Despite all the labors of librarians and archivists, our sources are inadequately cataloged and indexed. For even the most narrowly targeted scholarly question, the sources available outstrip the historian’s time and ability to read. There are no two ways about it: the ways we go about finding our sources are inextricably tied up with chance and happenstance.

This problem of scarcity and abundance is rendered all the more acute by the rise of digitized sources. Digitized sources, as everyone knows, hardly represent the whole of the human record available in libraries and archives, nor do the collections in libraries and archives represent everything that was created in the past. The easy availability of digitized sources leads historians to use those sources rather than other sources which are available in the archive but remain undigitized.[18] Digital sources thus exacerbate the problem of scarcity, in the sense that they are an incomplete and partial record of the past which absorbs scholarly attention.[19] But digitized sources also exacerbate the problem of abundance. Scholars can access a much larger source base than ever before thanks to collections of primary source materials such as newspapers, photographs, and government documents, as well as large book collections including the HathiTrust and Google Books.

We should not have a rosy view of this digitization, not least because of two problems that limit the usability of digitized sources. One is the enclosure of our cultural heritage by the large, for-profit publishers often doing the digitizing. The importance of large-scale, publicly funded projects like the Chronicling America collection of over 15.3 million newspaper pages cannot be overstated. These projects are free for scholars both in terms of cost and in terms of the way that scholars can use them for any purpose, including computational research. (They are both gratis and libre, in the parlance of open source software.) But the norm for digital collections is the subscription database from a for-profit company. Though paid for at great cost by university libraries, such databases are not usually available for text mining.[20]

The second limitation of digitized sources is the way that increasing the scale of a source base tends to decrease the diversity of sources used. A thousand, a million, or ten million newspaper pages are all still just newspaper pages. Historians and most humanities scholars tend to rely on combining many different kinds of sources in order to make useful interpretations, but using large-scale text corpora, for instance, paradoxically narrows the kinds of sources scholars use.[21]

The abundance of digitized sources has already transformed historical research.[22] As Lara Putnam has pointed out in an article on “The Transnational and the Text Searchable,” searching digitized collections is now a basic scholarly practice. Putnam argues that the ability to search for sources without the constraints of the national archive has allowed new angles of vision on transnational history, because “transnational approaches among historians did not become commonplace until technology radically reduced the cost of discovering information about people, places, and processes outside the borders of one’s prior knowledge.” Yet as Putnam points out, “Digital search makes possible radically more decontextualized research” and makes it possible to find examples of what we are looking for without a sense of its significance. To deal with this problem of context, Putnam observes that “computational tools can discipline our term-searching if we ask them to. By measuring proximity and comparing frequencies, topic modeling [or other text analysis methods, we might add] can balance easy hits with evidence of other topics more prevalent in those sources.”[23]

In addition to its contribution to the history of the Bible in the United States, America’s Public Bible is a work of scholarship whose form as a digital project is intended to implement Putnam’s idea of text analysis that facilitates the discovery of new sources at the same time that it disciplines searching. To understand how it accomplishes that end, let me explain how the site’s interface works, and how it was put together.

Interface: disciplined serendipity through interactive visualization

The prototype version of America’s Public Bible has as its centerpiece an interactive time series visualization that lets users explore the trend in quotations for over one thousand of the most quoted verses in the Bible (figure 2).

Figure 2: An example of the interactive verse browser, showing time series of the rate of quotations for the five most quoted verses from Proverbs. Users can enter verse references on their own, or they can choose from pre-selected collections of verses, such as the Lord’s Prayer or verses on marriage and divorce.

Most important, users can disaggregate the time series and find each quotation in the context of the newspaper page at Chronicling America. A table below the visualization shows a row for each instance of a quotation. Users can follow links to that specific newspaper page in Chronicling America (figure 3).

Figure 3: After identifying an interesting verse and seeing the trend in quotations, users can find every quotation which was used to compute that trend line. The links to the right take users to that specific newspaper page in Chronicling America.

When the user arrives at the newspaper page in Chronicling America, the key words in the quotation are highlighted on the page. This allows the user to readily identify the quotation, which would otherwise be quite difficult.

The ability to move between the trend line of the verse’s quotation and its location in the actual primary source is the way that the site enables disciplined serendipity. The serendipity lies in how the site surfaces hundreds of thousands of instances of quotations which the user can readily browse. This may be a subjective judgement, to be sure, but this has been the most fun project that I have ever worked on because I am constantly surprised by the quotation finder. (And not just that it works at all!) For example, how could I have known to look for the time when a Democratic newspaper thought Samuel Tilden had been elected in the disputed presidential race of 1876, and plastered the banner “The Lord called Samuel” across the paper (figure 4)? Biblical jokes are another frequent category that I did not expect.[24] I take this as a sign that the method truly is serendipitous.

Figure 4: In the disputed 1876 election for the president of the United States between Rutherford B. Hayes and Samuel Tilden, several Democratic newspapers announced Tilden’s supposed victory using the verse 1 Samuel 3:8 (“The Lord called Samuel”). Stark County Democrat (Canton, OH), 7 December 1876.

The discipline in this approach lies in how the site contextualizes those quotations. Each text is set within two different contexts. The first context is the trend in the verse’s quotation. This context allows the user both to see how often the verse was quoted in an absolute sense and to compare that trend to other verses to get a sense of its relative popularity.

The second context is the place of the text on the newspaper page itself (figure 5). This context allows the scholar to understand how the Bible verse was used. The Bible was a common yet contested text, and the fact that a verse was quoted only raises the questions of why it was quoted and what meaning was imparted to the text. To add to the example of how 2 Chronicles 7:14 was used in the wake of the McKinley assassination, take the trend for John 15:13 (“Greater love hath no man than this, that a man lay down his life for his friends”). This verse exploded in popularity around World War I, and looking at how the verse was used in specific newspapers confirms that it was popular because of obituaries. Investigating earlier uses of the verse shows that it was not associated with the military in any significant way until the Great War. It was more likely to be used to memorialize medical personnel who died taking care of people infected with cholera or yellow fever.

Figure 5: When a user arrives at the Chronicling America newspaper page containing a biblical quotation, the key words of the quotation are highlighted on the page. On this page from the Cameron County Press (cited above), the quotation from 2 Chronicles 7:14 is highlighted beneath the image of McKinley and the program for his memorial service in Emporium, Pennsylvania. The other highlights on the page are uses of words in the verse such as “humble” and “people.”

This prototype version of the site has implemented an interface that enables disciplined serendipity, but the site needs further work to more fully advance historical interpretations about the role of the Bible in public life.

The expanded version, still in progress, will extend the prototype in several ways. First, it will broaden the source base and improve the reliability of the method used to find the quotations. This will include extending the source base to Gale’s corpus of 19th Century U.S. Newspapers.[25] This corpus is available only by license from Gale, which in my case was made possible because my university library purchased the text mining rights. Restrictions on how the corpus can be used, in particular a requirement that only the briefest of snippets be reproduced from the text of the newspapers, mean that displaying the full text of the quotation and allowing back-and-forth interaction between contexts is simply not possible using the Gale dataset. The Gale corpus is also an order of magnitude smaller than Chronicling America. But it has cleaner OCR than Chronicling America, and it is segmented into articles rather than pages. It is therefore useful for validating the trends identified in Chronicling America. Article-level rather than page-level segmentation may allow me to use computational analysis to understand the topics of articles in which the verses were quoted.

Second, the new version will extend the finding of quotations to other versions of the Bible. The prototype version used the King James Version since it was a standard text for English-speaking Protestants in the United States. The revised version will use other versions of the Bible as well. It is an extremely difficult problem both to identify quotations and to distinguish between versions of the Bible which have only small (if significant) verbal differences. Nevertheless, this is an important addition to the project, since the rise of different translations of the Bible was a crucial moment for American Protestants, and since Catholics and Jews (and non-English speakers, for that matter) have all used versions other than the KJV.[26]

Third, the site will take the exploratory analysis of the prototype and add interpretative work on the history of the Bible. In digital history there is a widespread sense that digital work has failed to make historical interpretations.[27] This digital monograph will feature a series of chapters that interpret various aspects of the history of the Bible, using prose and visualizations. This interpretative process will necessarily involve more data analysis of the quotations, such as clustering the time series of quotation trends in order to find verses that had similar trajectories over time. But at its root will be much close reading of the context of the verses. The project will thus rely on the serendipity of the quotations that are turned up but also on the discipline of its empirical approach.

Method: text analysis and machine learning for quotation identification

Having described the public-facing component of the project, and considered the disciplined serendipity it provides, I want to go behind the scenes of the project to show how it was created. In particular, I want to show how sources, methods, and questions came together to shape the project. The availability of large digital corpora and the knowledge of digital text analysis methods are not sufficient to create a meaningful intellectual project. Nor does having a worthwhile intellectual question in one’s discipline imply that one can apply the right methods to the right sources to answer that question in a meaningful way. Sources, methods, questions: these three must come together to make a worthwhile interpretative project. To the extent that America’s Public Bible is useful as a model (let the reader be the judge), I want to show how that project combined those three considerations.

The initial problem was formulating a worthwhile historical question that was amenable to some kind of computational research. When the National Endowment for the Humanities opened a data challenge for the best project using Chronicling America, I had an incentive to compete but not a question worth asking. The sources were predetermined, but they were not obviously “religious” and so there was no clear connection to my primary research area. I was familiar with an array of methods that might be applied to the dataset. (Topic modelling! the newly initiated to DH is likely to suggest.) But without a question to answer there was no purchase for the methods.

My question came after hearing a paper that Mark Noll gave at the 2016 annual meeting of the American Society of Church History. The paper compared how the Bible was used in sermons after the deaths of Presidents Washington, Lincoln, and Garfield.[28] Noll had counted scriptural references in those sermons, then analyzed the way the quotations were used. That paper, along with the work he had done in his books In the Beginning Was the Word and The Civil War as a Theological Crisis, established that the Bible could be studied as a text that had been used in the public sphere.[29] Other historians have of course studied the Bible from a number of angles, not least its reception by Christians and its use as a material object.[30] The intellectual frame that Noll was using, however, was one that both justified the source base and pointed to the methods that should be used. Why was a large corpus of newspapers worth investigating for a project in the history of the Bible in America? Because it could provide evidence of how the Bible was used in the public sphere. What kind of method would be worth using? Finding quotations was a task that could be turned over to the computer.

America’s Public Bible is an example of a methodological approach that I would label computational history. The project uses the techniques of machine learning to find examples of biblical quotations, then uses data analysis combined with close reading to understand the meaning of those quotations.

Finding the quotations computationally required a series of steps.[31] The first step was to download the plain text versions of the roughly 12 million newspaper pages available at the time. While for many online resources this would be an enormous chore or impossible due to licensing restrictions, the data for Chronicling America is easily downloaded thanks to the Library of Congress’s bulk data downloads. Chronicling America also offers JSON and RDF application programming interfaces (APIs) for downloading the machine-readable metadata for each newspaper, such as its title, dates of publication, and location.[32] I parsed the API responses into a database of newspaper pages and publication information. From there I also computed other basic metadata such as the word count for each page.

Once the data was in hand (or, more precisely, on a RAID array) it was time to apply a series of digital methods. The second step was to identify potential quotations on the page. Detecting quotations can be thought of as a problem of detecting text reuse.[33] While this is a well-known area of research, for my particular problem it was necessary to invent a new kind of method.

The idea behind my method is straightforward. A common process in text analysis is to tokenize the text, that is, to turn it into single words or phrases (called n-grams).[34] The corpus can then be represented in a matrix, often called a document-term matrix, where the rows are documents (in this case a newspaper page), the columns are tokens, and the counts in the cells are the number of times that token appears in that document. I created a document-term matrix for both newspapers and for the Bible. As long as one limits the tokens in the matrix to only those words and phrases which appear in the Bible, it is possible to multiply one matrix by the transpose of the other matrix. The result is a document-document matrix with Bible verses down the rows, newspaper pages across the columns, and the count of the number of biblical words and phrases in the cells. If all that talk of tokenizing and matrix multiplication sounds confusing, just remember that at the root it is all just counting: the new matrix has the counts of how many words and phrases from each Bible verse appeared on each newspaper page. That is a common-sense understanding of what a quotation looks like.
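The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's own code: the function names, the choice of three-word phrases as tokens, and the two sample newspaper pages are all invented for the example. Only the two verse texts come from the essay.

```python
import re
from collections import Counter


def tokenize(text, n=3):
    """Lowercase the text and return its word n-grams (assumed here: 3-word phrases)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def document_term_matrix(documents, vocabulary):
    """One row per document, one column per vocabulary token; cells hold counts."""
    rows = []
    for text in documents:
        counts = Counter(tokenize(text))
        rows.append([counts[token] for token in vocabulary])
    return rows


def multiply_by_transpose(a, b):
    """Compute a x b^T: cell [i][j] is the dot product of row i of a and row j of b."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


verses = {
    "2 Chronicles 7:14": (
        "If my people, which are called by my name, shall humble themselves, "
        "and pray, and seek my face, and turn from their wicked ways; then will "
        "I hear from heaven, and will forgive their sin, and will heal their land."
    ),
    "John 15:13": (
        "Greater love hath no man than this, that a man lay down his life "
        "for his friends"
    ),
}

# Hypothetical page texts, invented for this sketch.
pages = {
    "page_a": (
        "The minister read: If my people, which are called by my name, "
        "shall humble themselves, and pray."
    ),
    "page_b": "The council went into the city to discuss the price of corn.",
}

# Restrict the vocabulary to tokens that appear in the Bible.
vocabulary = sorted({token for text in verses.values() for token in tokenize(text)})

bible_dtm = document_term_matrix(verses.values(), vocabulary)
pages_dtm = document_term_matrix(pages.values(), vocabulary)

# Rows are verses, columns are pages. Each cell sums the products of token
# counts, which for these samples equals the number of shared phrases.
matches = multiply_by_transpose(bible_dtm, pages_dtm)

for reference, row in zip(verses, matches):
    print(reference, dict(zip(pages, row)))
```

A corpus of millions of pages would call for sparse matrices rather than nested lists, but the arithmetic is the same: only the cell for 2 Chronicles 7:14 on the first sample page is nonzero.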

In practice it is not always possible to distinguish between actual quotations and false positives using only the counts of tokens. So I created other features that could indicate a quotation. I weighted the matrices by their TF/IDF score.[35] This technique gives more weight to phrases which are unusual and characterize a document. For example, the four-word phrase “went into the city” appears several times in the Bible and many times in newspapers, but the mere fact that phrase appears on a page is unlikely to indicate that it is a quotation from the Bible. But the four-word phrase “through a glass darkly” cannot be anything other than a biblical allusion to 1 Corinthians 13:12. Another useful feature is the percentage of the Bible verse that is quoted. If every word or phrase from a verse appears on a page, then it is more likely to be a genuine quotation than if a single phrase appears. Then too, a bunch of random phrases from the Bible scattered around the newspaper page are unlikely to be a quotation, but if those phrases are concentrated in a single location, then they are far likelier to be a quotation. The spread of phrases around the page can be computed with a statistical test of randomness called a runs test.
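The three features just described can be sketched as follows. This is a hedged illustration, not the project's implementation: the function names and sample data are hypothetical, the inverse document frequency is the textbook log(N / df) form, and the runs test is the standard Wald-Wolfowitz version applied to a sequence marking which tokens on a page come from the verse. The exact formulas the project used may differ.

```python
import math


def inverse_document_frequency(token, pages):
    """log(N / df): tokens found on few pages get a high weight."""
    df = sum(1 for page_tokens in pages if token in page_tokens)
    return math.log(len(pages) / df) if df else 0.0


def proportion_of_verse(verse_tokens, page_tokens):
    """Fraction of the verse's distinct tokens that appear on the page."""
    verse_set = set(verse_tokens)
    if not verse_set:
        return 0.0
    return len(verse_set & set(page_tokens)) / len(verse_set)


def runs_test_z(flags):
    """Wald-Wolfowitz runs test z-score for a sequence of booleans.

    A strongly negative score means fewer runs than chance would produce,
    that is, the True values are clustered together rather than scattered.
    """
    n1 = sum(1 for flag in flags if flag)
    n2 = len(flags) - n1
    if n1 == 0 or n2 == 0:
        return 0.0
    runs = 1 + sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    total = n1 + n2
    mean = 2 * n1 * n2 / total + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - total) / (total ** 2 * (total - 1))
    if variance == 0:
        return 0.0
    return (runs - mean) / math.sqrt(variance)


# Hypothetical corpus of three pages, each reduced to its set of phrases.
sample_pages = [
    {"went into the city", "through a glass darkly"},
    {"went into the city"},
    {"went into the city"},
]
common_weight = inverse_document_frequency("went into the city", sample_pages)
rare_weight = inverse_document_frequency("through a glass darkly", sample_pages)

# True marks a token position on the page that matches the verse.
clustered = [False] * 10 + [True] * 5 + [False] * 10
scattered = [False, False, True, False, False] * 5

print(common_weight, rare_weight)
print(runs_test_z(clustered), runs_test_z(scattered))
```

In this toy corpus the phrase found on every page receives a weight of zero while the rare phrase receives a positive weight, and the clustered sequence yields a negative z-score while the scattered one yields a positive score.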

Applying this algorithm to the newspapers using my university’s high-performance computing cluster resulted in millions of possible quotations from the Bible. But many of these were false positives, as could be verified by looking up the newspaper page and seeing if the biblical text actually appeared. What was necessary was a way of weeding through all of these potential matches, identifying the correct quotations and discarding the incorrect ones. The computational tool for the job is called machine learning, or more precisely, supervised classification.

The aim of supervised classification is to take inputs (potential quotations) which have certain features (the count of tokens, how unusual those tokens were, etc.) and assign them a label (“quotation” or “not-a-quotation”). Typically when writing a program one takes input data and creates a series of rules about what to do with it. To make up some rules, for instance, one might instruct the computer that matches with more than five tokens are genuine quotations and the rest are noise. However, such rules are extremely difficult to write. Ten meaningless tokens might not be a quotation, but one unusual token could be a quotation.

Machine learning takes a different approach. Instead of writing rules and getting answers, you give the computer a set of answers and get back the rules, called a model. In my case, I went through a number of potential matches and labeled them as quotations or not (see figure 6). The idea behind supervised classification is that by showing a machine-learning model what a bunch of real quotations and a bunch of false quotations look like, the model can learn to tell the difference between them.[36] The accuracy of this model can be estimated by testing its precision (the proportion of its results that are true matches) and its recall (the proportion of all genuine quotations that it has found).

Figure 6: Some sample data used to train the classification model. The combination of the “reference” and “url” columns indicates a single potential quotation. I then filled out the “match” column to indicate whether that row was a genuine quotation (“TRUE”) or just noise (“FALSE”).
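The workflow of labeling examples, fitting a model, and checking precision and recall can be sketched as follows. The chapter does not say which classifier was used, so this sketch swaps in a small logistic regression trained by stochastic gradient descent as a stand-in. The two features (proportion of the verse matched and a clustering score), the numbers, and the function names are all hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability that one potential match is a genuine quotation."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(features, labels, epochs=2000, rate=0.1):
    """Fit weights and a bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - rate * error * xi for w, xi in zip(weights, x)]
            bias -= rate * error
    return weights, bias

def precision_recall(predicted, actual):
    """Precision: share of predicted quotations that are genuine.
    Recall: share of genuine quotations that were predicted."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hand-labeled toy examples: [proportion of verse matched, clustering score].
train_x = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7],
           [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = quotation, 0 = noise

# Held-out examples, labeled the same way, used only for evaluation.
test_x = [[0.9, 0.9], [0.1, 0.1], [1.0, 0.8], [0.2, 0.2]]
test_y = [1, 0, 1, 0]

weights, bias = train_logistic(train_x, train_y)
predicted = [predict_proba(weights, bias, x) > 0.5 for x in test_x]
precision, recall = precision_recall(predicted, test_y)
```

Evaluating on held-out rows rather than on the training rows is what makes the precision and recall estimates meaningful. With real data the two measures usually trade off against each other, which is why the choice of probability cutoff matters.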

The result was a table, where each row was an instance of a Bible verse quoted on a specific newspaper page, along with the probability that it was a quotation, according to the model.[37] This table could then be used for data analysis. The most obvious approach, and the one which is at the center of the prototype site, was to chart time series of the popularity of the verses. Another approach is to look for verses which often appear next to one another, omitting verses which appear in the same passage. And ultimately the research involves lots of reading of the context of the quotations to see how the Bible was actually used.
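A time series of a verse's popularity can be derived from such a table roughly as follows. This is a sketch under assumptions: the rows, the page totals, and the column names are invented, and the choice to normalize by pages per year is one plausible way of computing a relative frequency, not necessarily the project's. The 0.9 cutoff follows the chapter's note on using only predictions above 90 percent.

```python
from collections import Counter

THRESHOLD = 0.9  # keep only predictions above 90 percent, per the chapter's note

def quotation_rate_by_year(predictions, pages_per_year, reference):
    """Quotations of one verse per 1,000 pages, for each year in the corpus."""
    hits = Counter(
        row["year"] for row in predictions
        if row["reference"] == reference and row["probability"] > THRESHOLD
    )
    return {year: 1000 * hits[year] / total
            for year, total in sorted(pages_per_year.items())}

# Toy prediction table and corpus sizes, invented for illustration.
predictions = [
    {"reference": "John 3:16", "year": 1850, "probability": 0.97},
    {"reference": "John 3:16", "year": 1850, "probability": 0.55},
    {"reference": "John 3:16", "year": 1851, "probability": 0.93},
    {"reference": "John 3:16", "year": 1851, "probability": 0.99},
    {"reference": "Psalm 23:1", "year": 1851, "probability": 0.95},
]
pages_per_year = {1850: 2000, 1851: 4000}

rates = quotation_rate_by_year(predictions, pages_per_year, "John 3:16")
```

In the toy data the raw count of confident quotations doubles from 1850 to 1851, but so does the number of pages, so the rate stays flat. Normalizing in this way keeps a growing corpus from masquerading as a growing interest in a verse.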

To return to the initial contention of this chapter, this method of using machine learning to identify biblical quotations is both like and unlike the kinds of keyword searching and, more broadly, the lack of context and evidence of typicality that I was critiquing.

This method is unlike that kind of keyword searching in that it provides a rigorous contextualization of the quotations I’ve found. Unlike a keyword search, which in a corpus of 60+ billion words is guaranteed to turn up at least something, this shows me the relative frequency of occurrence, so that I can see the context of a quotation in terms of the change in quotations over time and in the context of the Bible as a whole.

But this method is still a kind of search. In particular, the way that machine learning works requires you to feed it a bunch of examples, and the model then finds other things that look like the examples. It is impossible for my model to find things other than biblical quotations. Fundamentally, the model cannot surprise because it only ever returns biblical quotations. It is thus subject to the limitations of the question that I have framed.

Broader applications

The method of quotation detection that I have described is only one example of a kind of computational text analysis applicable to the humanities. Other forms of identifying the reuse of text can show the migration of ideas.[38] Another category of textual analysis relies on understanding discursive structures. Topic modeling shows how categories of topics change over time at the document or corpus level.[39] Word-embedding models can show the structure of the relationships between words, allowing a kind of multidimensional map of concepts which can also be compared over time.[40] In short, there is a range of text-analytical methods which can be applied to questions in religious studies.

What does it take to learn these kinds of methods? At a minimum it requires learning some kind of programming language suitable for data analysis, such as R or Python. There are plenty of digital humanities approaches where one can work without ever seeing, let alone writing, a single line of code. But any serious text analysis or data analysis requires the ability to write scripts in a programming language. There are now abundant resources targeted at the humanist who wishes to learn these skills.[41] And in fact learning to program is the least difficult part of what it takes to do computational work in the humanities. More important is gaining a conceptual familiarity with methods such as data analysis or machine learning. And most important of all is the problem of how to frame worthwhile disciplinary questions in the humanities and religious studies.[42]

And since everyone asks, what does this project cost? Both the prototype version and the eventual digital monograph have been created with a minimal budget, apart from a couple hundred dollars I spent out of pocket on web hosting for the first year or so of the project. The lack of apparent costs, however, obscures the very real costs behind the project. The Chronicling America dataset was provided for free by the NEH and the Library of Congress, while the George Mason University Libraries entered into an agreement with Gale to allow text mining of their newspaper collection. I have access to the high-performance computing cluster at George Mason, along with web servers at the Roy Rosenzweig Center for History and New Media to host the site. And it is not to be taken for granted that my department affords me the time to undertake research and has the intellectual flexibility to regard such experimental digital projects as scholarship.

But in this chapter I have set these details about writing code to do text analysis within the larger conceptual problems in computational research. I have described the process of creating a large-scale text analysis project. This project is a kind of search, and therefore subject to the limitation that it depends on the question asked. But on the other hand the project is generative of new knowledge in that it offers a disciplined serendipity in searching, rigorously contextualizing its results but continually turning up surprises. Whatever methods religious studies scholars employ, to the extent that they try to cope with large data sets and text analysis, they may find the problem of distinguishing between the typical and the exceptional a useful starting question, and the kind of disciplined serendipity that I have described a useful method.

[1] Lincoln A. Mullen, America’s Public Bible: A Commentary (Stanford University Press, forthcoming): <http://americaspublicbible.org>.

[2] “NEH Announces the Winners of the Chronicling America Data Challenge,” National Endowment for the Humanities, 27 July 2016, https://www.neh.gov/news/press-release/2016-07-25.

[3] The expanded version will be published by Stanford University Press as a part of their digital publishing program: https://www.sup.org/digital/.

[4] This idea of disciplined serendipity comes out of conversations with my colleague, Mike O’Malley, who shared an unpublished essay on the topic.

[5] Cameron County Press (Emporium, PA), 19 Sept. 1901, p. 1. All citations to newspapers in this chapter come from Chronicling America: Historic American Newspapers, Library of Congress, http://chroniclingamerica.loc.gov/.

[6] Sacvan Bercovitch, The American Jeremiad (Madison: University of Wisconsin Press, 1978); Andrew R. Murphy, Prodigal Nation: Moral Decline and Divine Punishment from New England to 9/11 (New York: Oxford University Press, 2009).

[7] On the varieties of ways in which Christian terminology has been associated with very different political and cultural assumptions in American history, see Matthew Bowman, Christian: The Politics of a Word (Cambridge, MA: Harvard University Press, 2018).

[8] Watchman and Southron (Sumter, SC), 19 Dec. 1882, p. 5.

[9] St. Louis Republic (St. Louis, MO), 17 Feb. 1902, p. 6.

[10] Record-Union (Sacramento, CA), 6 Mar. 1899, p. 4.

[11] Pacific Commercial Advertiser (Honolulu, HI), 5 Apr. 1905, p. 2.

[12] Mt. Sterling Advocate (Mt. Sterling, KY), 22 March 1905, p. 7.

[13] Monroe City Democrat (Monroe, MO), 22 Mar. 1900, p. 6.

[14] St. Louis Republic (St. Louis, MO), 15 July 1901, p. 7.

[15] Deseret Evening News (Salt Lake City, UT), 15 July 1901, p. 4.

[16] Omaha Daily Bee (Omaha, NE), 23 July 1911, p. 3.

[17] Roy Rosenzweig, “Scarcity or Abundance? Preserving the Past in a Digital Era,” American Historical Review 108, 3 (2003): 735–762, https://doi.org/10.1086/ahr/108.3.735.

[18] For a useful example, see Ian Milligan, “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010,” Canadian Historical Review 94, no. 4 (2013): 540–569.

[19] Michael O’Malley, “Evidence and Scarcity,” blog post, 2 October 2010, http://theaporetic.com/?p=176; and Sean Takats, “Evidence and Abundance,” blog post, 18 October 2010, http://quintessenceofham.org/2010/10/18/evidence-and-abundance/.

[20] Thomas Padilla, “Text and Data Mining: Seeking Traction,” LIS Scholarship Archive, 7 March 2018, https://osf.io/preprints/lissa/qxs9j.

[21] For a discussion of this problem, see the Arguing with Digital History working group, “Digital History and Argument,” white paper, Roy Rosenzweig Center for History and New Media (13 November 2017): https://rrchnm.org/argument-white-paper/.

[22] Jennifer Rutner and Roger Schonfeld, “Supporting the Changing Research Practices of Historians,” Ithaka S+R, 11 August 2015, https://doi.org/10.18665/sr.22532, pp. 9, 14–15.

[23] Lara Putnam, “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast,” American Historical Review 121, no. 2 (2016): 377–402, https://doi.org/10.1093/ahr/121.2.377. Quotations at p. 383, 392. See also Tim Hitchcock, “Confronting the Digital: or How Academic History Writing Lost the Plot,” Cultural and Social History 10, no. 1 (2013): 9–23; Ted Underwood, “Theorizing Research Practices We Forgot to Theorize Twenty Years Ago,” Representations 127, no. 1 (2014): 64–72.

[24] “Our own inability to get the joke is an indication of the distance that separates us from the workers of preindustrial Europe. … When you realize that you are not getting something—a joke, a proverb, a ceremony—that is particularly meaningful to the natives, you can see where to grasp a foreign system of meaning in order to unravel it.” Robert Darnton, The Great Cat Massacre: And Other Episodes in French Cultural History (Basic Books, 1984), 77–78.

[25] Gale, 19th Century U.S. Newspapers, https://www.gale.com/c/19th-century-us-newspapers

[26] Peter J. Thuesen, In Discordance with the Scriptures: American Protestant Battles Over Translating the Bible (Oxford University Press, 1999).

[27] Cameron Blevins, “Digital History’s Perpetual Future Tense,” in Debates in the Digital Humanities 2016, ed. Matt Gold and Lauren Klein (University of Minnesota Press, 2016), http://dhdebates.gc.cuny.edu/debates/text/77; “Digital History and Argument” white paper.

[28] Mark A. Noll, “Presidential Death and the Bible: 1799, 1865, 1881” (American Society of Church History, Atlanta, 2016). The latter two presidents died from an assassin’s bullet, so consider the opening of this essay on the McKinley assassination an homage to the conference paper which gave rise to this project.

[29] Mark A. Noll, In the Beginning Was the Word: The Bible in American Public Life, 1492–1783 (Oxford University Press, 2016); Mark A. Noll, The Civil War as a Theological Crisis (University of North Carolina Press, 2006), ch. 3.

[30] James P. Byrd, Sacred Scripture, Sacred War: The Bible and the American Revolution (Oxford University Press, 2013); Valerie C. Cooper, Word, Like Fire: Maria Stewart, the Bible, and the Rights of African Americans (University of Virginia Press, 2011); Philip Goff, Arthur E Farnsley II, and Peter J. Thuesen, eds., The Bible in American Life (Oxford University Press, 2017); Paul C. Gutjahr, An American Bible: A History of the Good Book in the United States, 1777–1880 (Stanford University Press, 1999); Nathan O. Hatch and Mark A. Noll, eds., The Bible in America: Essays in Cultural History (Oxford University Press, 1982); Colleen McDannell, Material Christianity: Religion and Popular Culture in America (Yale University Press, 1995); Seth Perry, “Scripture, Time, and Authority among Early Disciples of Christ,” Church History 85, no. 4 (December 2016): 762–83, https://doi.org/10.1017/S0009640716000780; Jonathan D. Sarna and Nahum M. Sarna, “Jewish Bible Scholarship and Translations in the United States,” in The Bible and Bibles in America, ed. Ernest S. Frerichs (Scholars Press, 1988), 83–116; Stephen J. Stein, “America’s Bibles: Canon, Commentary, and Community,” Church History 64, no. 2 (June 1, 1995): 169–84, https://doi.org/10.2307/3167903; Peter J. Wosh, Spreading the Word: The Bible Business in Nineteenth-Century America (Cornell University Press, 1994); John Fea, The Bible Cause: A History of the American Bible Society (Oxford University Press, 2016); Paul Gutjahr, ed., The Oxford Handbook of the Bible in America (Oxford University Press, 2017).

[31] Users who are interested in how the project was created are welcome to investigate the code for themselves: https://github.com/public-bible. The prototype project has a methodological appendix: http://americaspublicbible.org/methods.html.

[32] Chronicling America’s bulk data and API: https://chroniclingamerica.loc.gov/about/api/

[33] For a useful overview of the problem, see David A. Smith, Ryan Cordell, and Abby Mullen, “Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers,” American Literary History 27, no. 3 (2015): 1–15, https://doi.org/10.1093/alh/ajv029.

[34] For an overview of the text analysis process generally, see Kasper Welbers, Wouter Van Atteveldt, and Kenneth Benoit, “Text Analysis in R,” Communication Methods and Measures 11, no. 4 (October 2, 2017): 245–65, https://doi.org/10.1080/19312458.2017.1387238.

[35] Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Mining of Massive Datasets, 2nd ed. (Cambridge University Press, 2014), section 1.3.1, http://www.mmds.org/.

[36] This description of machine learning is indebted to Francois Chollet and J.J. Allaire, Deep Learning with R (Manning, 2018), section 1.1.2, especially figure 1.2.

[37] Technically any quotation with a probability above 50% was predicted to be a quotation by the model. But I chose to use only predicted quotations above 90% for the initial studies, since I judged the problem of false matches to be worse than the problem of missing genuine quotations. This cutoff may change for the final version of the project.

[38] Ryan Cordell, “Reprinting, Circulation, and the Network Author in Antebellum Newspapers,” American Literary History 27, no. 3 (2015): 417–45, https://doi.org/10.1093/alh/ajv028; Kellen Funk and Lincoln A. Mullen, “The Spine of American Law: Digital Text Analysis and U.S. Legal Practice,” American Historical Review 123, no. 1 (2018): 132–64, https://doi.org/10.1093/ahr/123.1.132.

[39] Andrew Goldstone, Susana Galán, C. Laura Lovin, Andrew Mazzaschi, and Lindsey Whitmore, An Interactive Topic Model of Signs, part of Signs at 40. http://signsat40.signsjournal.org/topic-model.

[40] Ben Schmidt, “Word Embeddings for the Digital Humanities,” blog post, 25 October 2015, http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html; Ben Schmidt, “Rejecting the Gender Binary: a Vector-Space Operation,” 30 October 2015, http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html

[41] E.g., Taylor Arnold and Lauren Tilton, Humanities Data in R (Springer, 2015), http://link.springer.com/10.1007/978-3-319-20702-5; Matthew L. Jockers, Text Analysis with R for Students of Literature (Springer, 2014), http://link.springer.com/10.1007/978-3-319-03164-4.

[42] For that reason my own contribution to the literature on learning computational methods, which is still very much at the beginning stages, focuses primarily on the patterns of argumentation in computational history. Lincoln A. Mullen, Computational Historical Thinking: With Applications in R (2018–): http://dh-r.lincolnmullen.com.

Source: https://opr.degruyter.com/digital-humanities-and-research-methods-in-religious-studies/lincoln-a-mullen-the-making-of-americas-public-bible-computational-text-analysis-for-religious-history/