Introduction

Introducing Voyeur

Voyeur is the suite of tools used by Hermeneuti.ca to interpret texts and to think about tools. You too can use Voyeur to analyze your own texts and to write essays with embedded hermeneutical panels generated by Voyeur, and you can adapt the code to create your own versions of the tools. This section of the Hermeneuti.ca web site is both a tutorial and a reference.

What you can do with Voyeur

Voyeur is a new type of text analysis tool that you can use across the research cycle. You can:

  • Use it to learn how computer-assisted analysis works. Check out our recipes that show you how to do real academic tasks with Voyeur.
  • Use it to study texts that you find on the web or texts that you have carefully edited and have on your computer. We don't keep your texts, except temporarily to make your analysis run better. When you are finished we discard your texts and the indexes.
  • Use it with TAPoR to create a study space for a group of students or colleagues where the texts and your favorite tools are gathered. Keep interesting results in a research log to return to or to share.
  • Use it to add functionality to your online collections, journals, blogs or web sites so others can see through your texts with analytical tools.
  • Use it to add interactive evidence to your essays that you publish online. Add interactive panels right into your research essays (if they can be published online) so your readers can recapitulate your results.
  • Use it to develop your own tools using our functionality and code.

How to use this manual

This manual is for novice users, experienced users, writers and developers. It works closely with The Rhetoric of Text Analysis, where you will find example essays and discussion of text analysis in general.

Novice Users who want to get started with text analysis should:

  • Start by reading a recipe like Exploring a Theme Through a Text. Try doing it yourself using Voyeur - the recipe will point you to places to find a text and the tools to try.
  • If you are interested in an example essay that used this recipe you can look at Now Analyze That.
  • Read Voyeur Tools for Users, the reference guide to the tools for users. That will give you ideas about what you can do.
  • Read the Introduction: Thinking Through Technology if you want to understand how we think technology can assist in the interpretation of texts.
  • You might also read Tools of Interpretation to understand the types of text analysis tools and how others use them.
  • Oh, and ... read the rest of this Introduction.

Writers, web authors and researchers who want to embed Voyeur panels (we call them hermeneuticons) into their essays, online journals, blogs and so on should:

  • Read Voyeur Tools for Web Authors to understand how to embed hermeneuticons.
  • Follow our recipe Writing an Interactive Essay and try making one yourself. Or, if you have a blog, try the recipe Putting Tools in your Blog.
  • Look at how we do it in an essay like Now Analyze That. Try rewriting the essay or updating it.
  • Get an account on the Text Analysis Developers Alliance wiki and use it as a sandbox (or is it soapbox) where you can put your interactive essays.
  • Read Tools Across Research if you want to know why we think it is important for tools to be embeddable.
  • Read Mashing Blogs and the Knowledge Radio to see what we did with Voyeur and blogs.
  • Join the hermeneutica discussion list if you want to ask questions or be kept informed.

Developers who want to adapt Voyeur tools or develop their own should:

  • Read our Creative Commons license so you understand what your responsibilities are and what we think are our responsibilities.
  • Look at our recipe Making your own see-through tool for a tutorial on how to adapt a Voyeur tool.
  • Read the Voyeur Tools for Tools Developers reference for more information on how to download code and how to use our code in your own projects.
  • Join the hermeneutica discussion list if you want to ask questions or be kept informed.

How Voyeur connects to Hermeneuti.ca, the book

Diagram

Voyeur is the toolset that made possible the analysis reported in Hermeneuti.ca, the book and web site you are now looking at. The book reflects on text analysis, gives examples, and discusses the decisions behind Voyeur. The web site hermeneuti.ca (note how we use the lower case when referring to the web site) includes the sections of the book and the manual for Voyeur (which you are reading now). The two connect like this:

  • The manual for Voyeur is on the web site hermeneuti.ca along with the book Hermeneuti.ca. This is a hybrid project, published both online and in print.
  • You can use the manual to figure out how to use Voyeur tools in your research and in your online publications, and how to adapt our code to make your own tools.
  • You can read about text analysis in Hermeneuti.ca the book (or read it online). You can read example essays that report the results of analysis to see what we did with Voyeur tools. You can read the same essays online with interactive panels that you can experiment with - one of the best ways to get a sense of text analysis.
  • You can read recipes that describe how you can use the analytical methods we used on your own texts! These recipes connect the theory, essays, and tools. They are the tutorial for the tools.

Graphic here

Some Background of Voyeur

Text analysis tools go back to the first ad-hoc tools that Roberto Busa created for his concordance of the works of Thomas Aquinas and to Andrew Booth’s Mechanical Resolution of Linguistic Problems in the 1950s. A concordance, or keyword in context (KWIC) display, is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right). See also the definitions from TADA and Wikipedia.
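To make the idea of a concordance concrete, here is a minimal sketch in Python. It is not Voyeur's code: the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration, and the tokenization (lowercased runs of word characters) is a simplifying assumption.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase split on word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

sample = "The whale surfaced. Ahab watched the whale dive, and the whale was gone."
for left, word, right in kwic(sample, "whale", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15} | {word} | {right}")
```

Each line of output shows one occurrence of the keyword with up to `width` words of context on either side, which is the layout the historical concordance programs discussed in this section produced at much larger scale.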

Voyeur is a suite of analysis and exploration tools for digital texts. Very few contributions to knowledge and technology are unrecognizable from what preceded them [1], and Voyeur is no exception: it is largely built on the foundations of text analysis tool design and methodology from over 50 years of humanities computing research [2]. The following are some of the tools that have most influenced text analysis tool development and Voyeur in particular:

  • Unix command-line tools (grep, sort, uniq, wc, awk, etc.), since the 1970s. Each Unix tool [3] is designed to do one relatively simple thing very efficiently. The power of these modular tools is in how they can be combined in endless ways through the piping mechanism (the output of one tool becomes the input of the next in a chain).
  • Oxford Concordance Program (OCP), early 1980s. OCP provided one of the first examples of a generalized tool for producing concordances, the most historically prevalent activity for text analysis in humanities computing. Although other concordancing programs were available before (such as COCOA), OCP gained wide acceptance. The parallel explosion of personal computing also led to a variant of OCP called MicroOCP (for DOS).
  • WordCruncher (mid 1980s and into 1990s). Whereas OCP was essentially mainframe (shared) computer software, WordCruncher was built for personal computing in DOS, which meant innovative interface solutions needed to be found. (Mainframe computers are generally larger systems shared by multiple users, similar to modern servers, though those are usually no longer referred to as mainframe computers. This was the dominant mode of computing until the advent and rise of personal computers in the late 1970s. Like modern high performance computing systems, processing on mainframe computers was done as batch processes and lacked interactivity.) Originally called the BYU Concordance Program prior to commercialization, WordCruncher morphed from its DOS form to a Windows-based form in the 1990s.
  • Textual Analysis Computing Tools (TACT) and TACTWeb, 1990s. TACT was a widely used DOS-based suite of programs that included some of the usual features for building concordances, frequency lists, and collocate lists (frequencies within the context of a keyword), but also some less common features like finding anagrams within a text. Similar to OCP and WordCruncher, TACT required a step of text preparation that enabled fine-grained searching and retrieval (based, for instance, on the presence of specified tags). TACT also provided some navigational features between the different displays that anticipated similar functionality through hypertext (though again, TACT was DOS-based). The Modern Language Association (MLA) published a volume in 1996 entitled "Using TACT with Electronic Texts" which further extended the reach of this tool and solidified its role as the dominant text analysis tool suite of the 1990s. TACTWeb was an adaptation of TACT to run on the web that was developed by John Bradley and Geoffrey Rockwell.
  • HyperPo, late 1990s until present. HyperPo was the first web-based text analysis suite available. It provided much of the same functionality as TACT, but with a greater focus on interlinking between the original text being analyzed and the data results (for instance, a user can click on a word in a concordance to return to that location in the text). Inspired by work on the Oulipo, HyperPo also provides some more experimental and ludic functions (palindromes, text reversal, text entropy, etc.). As a web-based tool, HyperPo was innovative in allowing users to work with texts from a variety of places (pasted into a text box, uploaded from a local drive, retrieved from a URL) and a variety of formats (plain text, HTML and XML). Unlike most of its predecessors, HyperPo doesn't require preliminary steps by the user for preparing and indexing a text (a paradigm we call immediate analysis). Finally, HyperPo was designed from the outset to be localized (the interface could be translated into different languages) and to support a variety of character sets (UTF-8, ISO-8859-1, etc.) and languages.
  • Philologic (2000s until present). Philologic is a bit different from the preceding examples in that it is really a back-end framework for ingesting, indexing, and retrieving encoded text – it is not as concerned with the end-user interface (what the researcher might use). Philologic is the back-end system used by front-end interfaces like ARTFL. Philologic is noteworthy in that it emphasizes speed for large corpora while supporting more sophisticated operations on encoded texts and providing common analytic features (concordancing, frequency lists, etc.).
  • GATE and LingPipe (2000s until present). These are two of the most prevalent examples of text analysis frameworks: they are useful both as stand-alone analytic tools for experts and as software libraries for other text analysis tools. Each framework has its respective strengths and weaknesses, but both provide extensive capabilities for such operations as part-of-speech tagging and entity extraction.
  • TAPoRware (mid 2001 to present). Similar to HyperPo, TAPoRware is a suite of web-based tools that allow users to specify their own texts and begin immediate analytic work. TAPoRware provides a model for extensibility and rapid development of experimental text analysis tools: a simple menu provides access to some 50 tools for performing a variety of operations on different text formats.
  • TAPoR Portal (mid 2002 to present). TAPoR is a personalized virtual workbench for text analysis: it provides a persistent web-based space for invoking remote digital tools with remote electronic texts (users are able to define texts and tools of interest that remain accessible between sessions). Although not itself a text analysis tool, the TAPoR Portal served to push notions of tool interoperability, and especially the value of remote tools exposing public APIs and web services. The TAPoR Portal also provides a mechanism for changing its appearance – or skinning – depending on user profiles and preferences.
  • Monk (present). The Monk project is a notable recent attempt to engage in large-scale data mining activities from the perspective of the humanistic – and especially literary – scholar. Among the challenges confronted are 1) how cleanly and extensively encoded do texts need to be to be useful for literary scholarship? 2) how might a user interface be designed in order to expose the sophisticated aspects of data mining while remaining accessible to literary scholars? 3) what types of literary procedures are enabled by work on very large corpora?
  • Google (late 1990s until present). Although it may seem strange to include Google in this list of specialized text analysis software, we do so for three reasons: 1) like the vast majority of search engines, Google is primarily focused on search and retrieval of textual content, which requires text analysis at various stages; anyone using a search engine is also using text analysis; 2) Google set a new standard for simplicity in interface: their default search page is relatively sparse and draws attention to a single search box and a single action button – Google has established a paradigm for a simple user interface to text analysis; 3) Google has aggressively pushed embedding its tools in content that's elsewhere, whether it be the common search box that web authors can include on their pages, web traffic analytics, or even embedded YouTube videos (Google owns YouTube).
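The piping idea described for the Unix tools at the top of this list, where the output of one small tool becomes the input of the next, can be imitated in a few lines of Python. This is only a sketch of the principle: the stage names (`tokenize`, `drop_short`, `count`) and the `pipe` helper are invented for this example and do not come from Unix or from Voyeur.

```python
import re
from collections import Counter
from functools import reduce

def tokenize(lines):
    """Split lines into lowercase words (a rough stand-in for tr/grep)."""
    for line in lines:
        yield from re.findall(r"[a-z']+", line.lower())

def drop_short(words, min_len=4):
    """Filter out very short words (a rough stand-in for a grep filter)."""
    for word in words:
        if len(word) >= min_len:
            yield word

def count(words):
    """Tally words, most frequent first (roughly sort | uniq -c | sort -rn)."""
    return Counter(words).most_common()

def pipe(source, *stages):
    """Feed the output of each stage into the next, like a shell pipeline."""
    return reduce(lambda data, stage: stage(data), stages, source)

lines = [
    "Call me Ishmael.",
    "Some years ago, never mind how long precisely,",
    "call it a long voyage.",
]
result = pipe(lines, tokenize, drop_short, count)
for word, freq in result[:3]:
    print(freq, word)
```

Because each stage only consumes and produces a stream of items, stages can be reordered, removed, or replaced without touching the others, which is the modularity the design principles below ask for.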

Design Principles

Although text analysis tool developers might choose to highlight different aspects for their purposes (such as stand-alone software as opposed to web-based software), here are some of the primary design principles for Voyeur, as gleaned from other tools:

  • modularity: tools should be able to fit together in various configurations
  • generalization: tools should be designed to address a variety of types of text and uses
  • domain sensitivity: tools need to be sensitive to the ways in which textual scholars think of and interact with digital texts
  • flexibility: tools should be able to work with local or network sources in different formats
  • internationalization: tools should allow users to work in different languages
  • performance: tools should be reasonably responsive in order to function in a web-based context
  • separation of concerns: it may be best to separate back-end analytic procedures from front-end interface concerns
  • extensibility: it should be easy to create new tools and adapt existing ones, especially for the purposes of experimentation
  • interoperability: tools should provide public APIs so that they can interact with other tools on the web
  • skinnability: tools should be able to present themselves differently for different user needs and preferences
  • scalability: tools should provide functionality both for a small corpus (like a book) and for a large corpus (like many books)
  • simplicity: at least one view of the tools should be maximally simple in its interface
  • ubiquity: tools should lend themselves to being embedded in content elsewhere on the web
  • referenceability: tools and their results should lend themselves to being referenced and cited as academic resources

Though they have existed before to varying degrees in different tools, Voyeur is an attempt to pull together these design principles into a single package. In some cases the principles may in fact be contradictory in practice (for instance, supporting large-scale immediate analysis) and compromises must be found. Working through those tensions is one of the aspects that make Voyeur a worthy intellectual challenge.

HyperPo and TAPoRware are the tools with the strongest affinities to Voyeur [4], but we have devoted considerable thought and attention to improving existing web-based tools in ways further described below.

Scalability. Whereas HyperPo and TAPoRware can readily handle book-length texts for micro-analysis, both reach their practical limits when corpora grow beyond a couple of megabytes. In contrast, Voyeur is designed to handle much larger corpora (dozens of megabytes and beyond). There is still a practical (though undefined) limit to the size of corpora for Voyeur given that it seeks to enable immediate micro-analysis, but the Voyeur architecture is designed with scale in mind. There will always be a tension between indexing speed and retrieval speed: the more time is available for indexing, the faster retrieval tends to be. As such, text analysis tools that require pre-indexing (Philologic, Monk, etc.) will almost always operate faster because pre-processing can be done over the course of hours or even days (building very large relational databases, for instance). In contrast, Voyeur seeks to strike a balance between indexing and retrieval speed: ideally both should happen in a timeframe that seems reasonable in a web-based context. The ever-evolving pace of computing power and the promise of high performance computers obviously make the actual capabilities a moving target.
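The tension between indexing time and retrieval time can be illustrated with a toy example. The sketch below is not how Voyeur is implemented; the function names (`scan`, `build_index`, `lookup`) and the three-document corpus are assumptions made purely for illustration. Without an index, every query re-reads the whole corpus; with an index, the cost is paid once up front and each query becomes a dictionary lookup.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def scan(docs, term):
    """No index: re-tokenize every document on every query."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if term in tokenize(text))

def build_index(docs):
    """One pass up front: map each word to its (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].append((doc_id, position))
    return index

def lookup(index, term):
    """With an index: a single dictionary access per query."""
    return sorted({doc_id for doc_id, _ in index.get(term, [])})

docs = {
    "a": "The sea was calm.",
    "b": "A calm sea and a calm sky.",
    "c": "Storm clouds gathered.",
}
index = build_index(docs)          # slower start, paid once
print(scan(docs, "calm"))          # immediate, but repeats work per query
print(lookup(index, "calm"))       # same answer from the prebuilt index
```

A tool aiming at immediate analysis has to keep the `build_index` step short enough to feel reasonable on the web, whereas a pre-indexed system can spend hours on it and answer queries faster afterwards.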

Ubiquity. As useful as text analysis tools like HyperPo and TAPoRware may be, we recognize a need to allow content providers and producers (like bloggers) to quickly and easily integrate functionality into their own space. The previous model was limited to users bringing their own texts to our tools; we now wish to allow users to bring our tools to their texts as well. In some cases users will wish to have static results, in which case we can provide a mechanism for easily copying and pasting results that can be directly embedded in other content. However, much of the most compelling functionality of Voyeur is interactive and requires considerable client-side scripting: our current approach is to provide a tiny snippet of HTML that is essentially an IFRAME that contains the necessary HTML elements. This approach allows Voyeur code to remain separate from its host while satisfying browser security restrictions on cross-site scripting. There are of course other challenges inherent to code embedded elsewhere, including version management (supporting legacy syntax) and caching of data (both the corpus and results [5]).
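As a rough illustration of what such a snippet might look like, the following Python sketch builds an IFRAME tag for a web author to paste into a page. Everything specific here is hypothetical: the base URL, the `tool` and `corpus` parameter names, and the `embed_snippet` function are placeholders, not Voyeur's actual embedding syntax, which is documented in Voyeur Tools for Web Authors.

```python
from html import escape
from urllib.parse import urlencode

def embed_snippet(base_url, tool, corpus, width=400, height=300):
    """Build an IFRAME tag pointing at a (hypothetical) hosted tool panel."""
    # Parameter names "tool" and "corpus" are assumptions for illustration.
    query = urlencode({"tool": tool, "corpus": corpus})
    # Escape the URL so it is safe inside an HTML attribute (& becomes &amp;).
    src = escape(f"{base_url}?{query}", quote=True)
    return f'<iframe src="{src}" width="{width}" height="{height}"></iframe>'

snippet = embed_snippet("https://example.org/voyeur/embed",
                        "frequencies", "my-blog-posts")
print(snippet)
```

Because the panel loads in its own frame from the tool's server, the host page needs no scripts of its own, and the embedded code stays isolated from the host's content.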

Referenceability. The status of text analysis tools as academic resources has been a point of debate over the years. Scholars feel compelled to cite ideas and texts that come from other authors, but they are much less likely to recognize tools that have contributed to their work (and we would probably not want every scholar to cite search engines such as Google that have been used during research). We feel strongly that text analysis tools can be significant contributors to digital research, whether they were used to help confirm hunches or to lead the researcher into completely unanticipated realms. In any case, we have designed Voyeur to be conducive to citation in various ways, including a general citation to Voyeur and citations for static or dynamic results. An important component of academic knowledge is reproducibility, and providing scholars with more information on the processes followed during research – including the use of text analysis tools – is sure to be useful.

Ultimately, Voyeur is an attempt to learn from the strengths and weaknesses of past tools, to recognize current user needs (e.g., working with much larger corpora), and to anticipate future practices (e.g., referencing text analysis tools and results). We believe that the potential for tools in the interpretive process merits continual rethinking of tool design and functionality, and as such, Voyeur is of course a work in progress.

  1. Limiting himself to the history of science, Thomas Kuhn provides examples of revolutionary advances in thinking, such as Copernican cosmology or Einstein's Theory of Relativity; Voyeur has much more modest ambitions.
  2. For a brief overview of the history of humanities computing, see Hockey "History", 2002.
  3. Unix is used here as shorthand for both Unix and Unix-like operating systems like Linux.
  4. The affinity to Voyeur is not surprising given that Sinclair developed HyperPo and Rockwell developed TAPoRware.
  5. For instance, we wouldn't want to re-run a computationally expensive process each time someone visits a popular blog, but we don't wish to cache everyone's analytic results either.