Some Background of Voyeur

Text analysis tools go back to the first ad-hoc tools that Roberto Busa created for his concordance of the works of Thomas Aquinas and to Andrew Booth’s Mechanical Resolution of Linguistic Problems in the 1950s. A concordance, or keyword in context (KWIC), is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right); see also the definitions from TADA and Wikipedia.
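To make the idea of a concordance concrete, here is a minimal Python sketch of a keyword-in-context listing. It is only an illustration of the general technique, not how Voyeur or any of the tools discussed here is implemented; the function name `kwic`, the `width` parameter and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, match, right) tuples for each occurrence of keyword.

    Hypothetical helper: 'width' is the number of context words kept
    on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text)   # naive word tokenization
    key = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == key:        # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

sample = "The cat sat on the mat. The dog saw the cat and the cat ran."
for left, word, right in kwic(sample, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} | {word} | {right}")
```

Aligning the keyword in a central column, as the print loop does, is what gives a concordance its characteristic look. A real tool would also need to handle punctuation, multi-word keywords and much larger texts.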

Voyeur is a suite of analysis and exploration tools for digital texts. Very few contributions to knowledge and technology are unrecognizable from what preceded them [1], and Voyeur is no exception: it is largely built on the foundations of text analysis tool design and methodology from over 50 years of humanities computing research [2]. The following are some of the tools that have most influenced text analysis tool development and Voyeur in particular:

  • Unix command-line tools (grep, sort, uniq, wc, awk, etc.), since the 1970s. Each Unix tool [3] is designed to do one relatively simple thing very efficiently. The power of these modular tools lies in how they can be combined in endless ways through the piping mechanism (the output of one tool becomes the input of the next in a chain).
  • Oxford Concordance Program (OCP), early 1980s. OCP provided one of the first examples of a generalized tool for producing concordances, the most historically prevalent activity for text analysis in humanities computing. Although other concordancing programs were available before (such as COCOA), OCP gained wide acceptance. The parallel explosion of personal computing also led to a variant of OCP called MicroOCP (for DOS).
  • WordCruncher (mid-1980s and into the 1990s). Whereas OCP was essentially mainframe (shared) computer software, WordCruncher was built for personal computing in DOS, which meant innovative interface solutions needed to be found. Originally called the BYU Concordance Program prior to commercialization, WordCruncher morphed from its DOS form to a Windows-based form in the 1990s. (Mainframe computers are generally larger systems shared by multiple users, similar to modern servers, though those are usually no longer referred to as mainframes. This was the dominant mode of computing until the advent and rise of personal computers in the late 1970s. Like modern high-performance computing systems, mainframes processed work in batches and lacked interactivity.)
  • Textual Analysis Computing Tools (TACT) and TACTWeb, 1990s. TACT was a widely used DOS-based suite of programs that included the usual features for building concordances, frequency lists and collocate lists (frequencies within the context of a keyword), as well as some less common features like finding anagrams within a text. Similar to OCP and WordCruncher, TACT required a step of text preparation that enabled fine-grained searching and retrieval (based, for instance, on the presence of specified tags). TACT also provided some navigational features between the different displays that anticipated similar functionality through hypertext (though again, TACT was DOS-based). The Modern Language Association (MLA) published a volume in 1996 entitled "Using TACT with Electronic Texts", which further extended the reach of this tool and solidified its role as the dominant text analysis tool suite of the 1990s. TACTWeb was an adaptation of TACT to run on the web that was developed by John Bradley and Geoffrey Rockwell.
  • HyperPo, late 1990s until present. HyperPo was the first web-based text analysis suite available. It provided much of the same functionality as TACT, but with a greater focus on interlinking between the original text being analyzed and the data results (for instance, a user can click on a word in a concordance to return to that location in the text). Inspired by work on the Oulipo, HyperPo also provides some more experimental and ludic functions (palindromes, text reversal, text entropy, etc.). As a web-based tool, HyperPo was innovative in allowing users to work with texts from a variety of places (pasted into a text box, uploaded from a local drive, retrieved from a URL) and in a variety of formats (plain text, HTML and XML). Unlike most of its predecessors, HyperPo doesn't require preliminary steps by the user for preparing and indexing a text (a paradigm we call immediate analysis). Finally, HyperPo was designed from the outset to be localized (the interface could be translated into different languages) and to support a variety of character sets (UTF-8, ISO-8859-1, etc.) and languages.
  • Philologic (2000s until present). Philologic is a bit different from the preceding examples in that it is really a back-end framework for ingesting, indexing, and retrieving encoded text – it is not as concerned with the end-user interface (what the researcher might use). Philologic is the back-end system used by front-end interfaces like ARTFL. Philologic is noteworthy in that it emphasizes speed for large corpora while supporting more sophisticated operations on encoded texts and providing common analytic features (concordancing, frequency lists, etc.).
  • GATE and LingPipe (2000s until present). These are two of the most prevalent examples of text analysis frameworks: they are useful both as stand-alone analytic tools for experts and as software libraries for other text analysis tools. Each framework has its respective strengths and weaknesses, but both provide extensive capabilities for such operations as part-of-speech tagging and entity extraction.
  • TAPoRware (mid-2001 to present). Similar to HyperPo, TAPoRware is a suite of web-based tools that allow users to specify their own texts and begin immediate analytic work. TAPoRware provides a model for extensibility and rapid development of experimental text analysis tools: a simple menu provides access to some 50 tools for performing a variety of operations on different text formats.
  • TAPoR Portal (mid-2002 to present). TAPoR is a personalized virtual workbench for text analysis: it provides a persistent web-based space for invoking remote digital tools with remote electronic texts (users are able to define texts and tools of interest that remain accessible between sessions). Although not itself a text analysis tool, the TAPoR Portal served to push notions of tool interoperability, and especially the value of remote tools exposing public APIs and web services. The TAPoR Portal also provides a mechanism for changing its appearance – or skinning – depending on user profiles and preferences.
  • Monk (present). The Monk project is a notable recent attempt to engage in large-scale data mining activities from the perspective of the humanistic – and especially literary – scholar. Among the challenges confronted are 1) how cleanly and extensively encoded do texts need to be to be useful for literary scholarship? 2) how might a user interface be designed in order to expose the sophisticated aspects of data mining while remaining accessible to literary scholars? 3) what types of literary procedures are enabled by work on very large corpora?
  • Google (late 1990s until present). Although it may seem strange to include Google in this list of specialized text analysis software, we do so for three reasons: 1) like the vast majority of search engines, Google is primarily focused on search and retrieval of textual content, which requires text analysis at various stages; anyone using a search engine is also using text analysis; 2) Google set a new standard for simplicity in interface: their default search page is relatively sparse and draws attention to a single search box and a single action button – Google has established a paradigm for a simple user interface to text analysis; 3) Google has aggressively pushed embedding its tools in content that's elsewhere, whether it be the common search box that web authors can include on their pages, web traffic analytics, or even embedded YouTube videos (Google owns YouTube).
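The piping idea described for the Unix tools in the list above (each small tool's output becomes the next tool's input) can be imitated in Python with generators. The following is only a rough sketch under that analogy: the stage names `tokenize`, `drop_short` and `frequency_list`, along with the sample lines, are hypothetical and chosen for this example. It produces a simple word frequency list, one of the common features mentioned for several of the tools.

```python
import re
from collections import Counter

def tokenize(lines):
    """Stage 1: turn lines of text into a stream of lowercase words."""
    for line in lines:
        yield from re.findall(r"\w+", line.lower())

def drop_short(words, min_len=3):
    """Stage 2: filter out words shorter than min_len characters."""
    return (w for w in words if len(w) >= min_len)

def frequency_list(words, top=3):
    """Stage 3: count the words and return the most frequent ones."""
    return Counter(words).most_common(top)

lines = ["the cat sat on the mat", "the dog saw the cat"]

# Chain the stages: the output of each one feeds the next,
# much like a shell pipeline.
result = frequency_list(drop_short(tokenize(lines)), top=2)
print(result)
```

Because each stage does one simple thing, stages can be swapped or reordered (for instance, replacing `drop_short` with a stop-word filter) without touching the others, which is the modularity the Unix tools are credited with.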
  1. Limiting himself to the history of science, Thomas Kuhn provides examples of revolutionary advances in thinking, such as Copernican cosmology or Einstein's Theory of Relativity; Voyeur has much more modest ambitions.
  2. For a brief overview of the history of humanities computing, see Hockey "History", 2002.
  3. Unix is used here as shorthand for both Unix and Unix-like operating systems such as Linux.