Voyeur Tools: See Through Your Texts
Introducing Voyeur
Voyeur is a web-based text analysis environment. It is designed to be user-friendly, flexible and powerful. Voyeur is part of Hermeneuti.ca, a collaborative project to develop and theorize text analysis tools and text analysis rhetoric. This section of the Hermeneuti.ca web site provides information and documentation for users and developers of Voyeur.
What you can do with Voyeur:
- use texts in a variety of formats including plain text, HTML, XML, PDF, RTF and MS Word
- use texts from different locations, including URLs and uploaded files
- perform lexical analysis including the study of frequency and distribution data; in particular
- export data into other tools (as XML, tab separated values, etc.)
- embed live tools into remote web sites that can accompany or complement your own content
Voyeur is a work in progress – it is currently in beta. Some things don't work properly, and some planned features aren't available yet. In particular, here are some weaknesses that we recognize:
- lack of more advanced linguistic processing (lemmatization, parts of speech, semantic awareness)
- lack of XML-aware analytic features (though XML is a valid input format)
- the current default skin (configuration of tools) is not well-suited to reading texts
- some of the user documentation is a bit bare
- other functionality:
- proximity searching of terms
- multi-word (n-gram) views (though you can search for specific phrases)
To get started, try viewing one of the screencasts to the right or continue to Workshops -> Voyeur Tools for Users
How to Use this Manual
This manual is for novice users,
experienced users, writers and developers. This manual works closely
with The Rhetoric of Text Analysis where you will find
example essays and discussion of text analysis in general.
Novice
Users who want to get started with text analysis should:
- Start
by reading a recipe like Exploring a Theme Through a Text. Try
doing it yourself using Voyeur - the recipe will point you to places to
find a text and the tools to try.
- If you are interested in an
example essay that used this recipe you can look at Now Analyze That.
- Read
Voyeur Tools
for Users, the reference guide to the tools for users. That will
give you ideas about what you can do.
- Read the Introduction:
Thinking Through Technology if you want to understand how we think
technology can assist in the interpretation of texts.
- You might
also read Tools
of Interpretation to understand the types of text analysis tools
and how others use them.
- Oh, and ... read the rest of this
Introduction.
Writers, web authors and researchers
who want to embed Voyeur panels (we call them hermeneuticons) into
their essays, online journals, blogs and so on should:
- Read Voyeur Tools
for Web Authors to understand how to embed hermeneuticons.
- Follow
our recipe Writing and Interactive Essay and try making one
yourself. Or, if you have a blog, try the recipe Putting Tools in
your Blog.
- Look at how we do it in an essay like Now Analyze That.
Try rewriting the essay or updating it.
- Get an account on the Text
Analysis Developers Alliance wiki and use it as a sandbox (or is it
soapbox) where you can put your interactive essays.
- Read Tools Across
Research if you want to know why we think it is important for tools to
be embeddable.
- Read Mashing Blogs and the Knowledge Radio to see
what we did with Voyeur and blogs.
- Join the hermeneutica
discussion list if you want to ask questions or be kept informed.
Developers
who want to adapt Voyeur tools or develop their own should:
- Read
our Creative Commons license so you understand what your
responsibilities are and what we think are our responsibilities.
- Look
at our recipe Making your own see-through tool for a tutorial on
how to adapt a Voyeur tool.
- Read the Voyeur Tools for
Tools Developers reference for more information on how to download
code and how to use our code in your own projects.
- Join the hermeneutica
discussion list if you want to ask questions or be kept informed.
How Voyeur connects to Hermeneuti.ca, the book

Voyeur
is the toolset that made possible the analysis reported in
Hermeneuti.ca, the book and web site you are now looking at. The book
reflects on text analysis, gives examples, and discusses the decisions
behind Voyeur. The web site hermeneuti.ca (note how we use the lower
case when referring to the web site) includes the sections of the book
and the manual for Voyeur (which you are reading now.) The two connect
like this:
- The manual for Voyeur is on the web site
hermeneuti.ca along with the book Hermeneuti.ca. This is a hybrid
project, published both online and in print.
- You can use the
manual to figure out how to use Voyeur tools in your research, in
your online publications and you can adapt our code to make your own
tools.
- You can read about text analysis in Hermeneuti.ca
the book (or read it online.) You can read example essays that report
the results of analysis to see what we did with Voyeur tools. You can
read the same essays online with interactive panels that you can
experiment with - one of the best ways to get a sense of text analysis.
- You
can read recipes that describe how you can use the analytical
methods we used on your own texts! These recipes connect the theory,
essays, and tools. They are the tutorial for the tools.
Principles of Voyeur
Introducing Voyeur
Voyeur is the suite of tools used by Hermeneuti.ca to interpret texts and to think about tools. You too can use Voyeur to analyze your own texts and to write essays with embedded hermeneutical panels generated by Voyeur, and you can adapt the code to create your own versions of the tools. This section of the Hermeneuti.ca web site is both a tutorial and a reference.
What you can do with Voyeur
Voyeur is a new type of text analysis tool that you can use across the research cycle. You can:
- Use it to learn how computer-assisted analysis works. Check out our recipes that show you how to do real academic tasks with Voyeur.
- Use it to study texts that you find on the web or texts that you have carefully edited and have on your computer. We don't keep your texts, except temporarily to make your analysis run better. When you are finished we discard your texts and the indexes.
- Use it with TAPoR to create a study space for a group of students or colleagues where the texts and your favorite tools are gathered. Keep interesting results in a research log to return to or to share.
- Use it to add functionality to your online collections, journals, blogs or web sites so others can see through your texts with analytical tools.
- Use it to add interactive evidence to your essays that you publish online. Add interactive panels right into your research essays (if they can be published online) so your readers can recapitulate your results.
- Use it to develop your own tools using our functionality and code.
Design Principles
Although text analysis tool developers might choose to highlight different aspects for their purposes (such as stand-alone software as opposed to web-based software), here are some of the primary design principles for Voyeur, as gleaned from other tools:
- modularity: tools should be able to fit together in various configurations
- generalization: tools should be designed to address a variety of types of text and uses
- domain sensitivity: tools need to be sensitive to the ways in which textual scholars think of and interact with digital texts
- flexibility: tools should be able to work with local or network sources in different formats
- internationalization: tools should allow users to work in different languages
- performance: tools should be reasonably responsive in order to function in a web-based context
- separation of concerns: it may be best to separate back-end analytic procedures from front-end interface concerns
- extensibility: it should be easy to create new tools and adapt existing ones, especially for the purposes of experimentation
- interoperability: tools should provide public APIs so that they can interact with other tools on the web
- skinnability: tools should be able to present themselves differently for different user needs and preferences
- scalability: tools should provide functionality both for a small corpus (like a book) or a large corpus (like many books)
- simplicity: at least one view of the tools should be maximally simple in its interface
- ubiquity: tools should lend themselves to being embedded in content elsewhere on the web
- referenceability: tools and their results should lend themselves to being referenced and cited as academic resources
Though they have existed before to varying degrees in different tools, Voyeur is an attempt to pull together these design principles into a single package. In some cases the principles may in fact be contradictory in practice (for instance, supporting large-scale immediate analysis) and compromises must be found. Working through those challenges is one of the aspects that make Voyeur a worthy intellectual challenge.
HyperPo and TAPoRware are the tools with the strongest affinities to Voyeur, but we have devoted considerable thought and attention to improving existing web-based tools in ways further described below.
Scalability. Whereas HyperPo and TAPoRware can readily handle book-length texts for micro-analysis, both reach their practical limits when corpora grow beyond a couple of megabytes. In contrast, Voyeur is designed to handle much larger corpora (dozens of megabytes and beyond). There is still a practical (though undefined) limit to the size of corpora for Voyeur given that it seeks to enable immediate micro-analysis, but the Voyeur architecture is designed with scale in mind. There will always be a tension between indexing speed and retrieval speed: the more time is available for indexing, the faster retrieval tends to be. As such, text analysis tools that require pre-indexing (Philologic, Monk, etc.) will almost always operate faster because pre-processing can be done over the course of hours or even days (building very large relational databases, for instance). In contrast, Voyeur seeks to strike a balance between indexing and retrieval speed: ideally both should happen in a timeframe that seems reasonable in a web-based context. The ever-evolving pace of computing power and the promise of high performance computers obviously make the actual capabilities a moving target.
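The tension between indexing and retrieval can be pictured with a small sketch. The Python below is not Voyeur's code: the function names `tokenize`, `build_index` and `lookup`, the naive tokenizer, and the two sample sentences are all assumptions made for illustration. The point is only that time spent recording word positions up front turns each later query into a simple lookup.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(documents):
    """Indexing step: map each word to the (document, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(tokenize(text)):
            index[word].append((doc_id, position))
    return index

def lookup(index, word):
    """Retrieval step: a dictionary lookup, with no rescanning of the texts."""
    return index.get(word.lower(), [])

# Two invented sample documents.
documents = [
    "The monster spoke. The monster wept.",
    "No monster here, only the sea.",
]
index = build_index(documents)
hits = lookup(index, "monster")
print(hits)
```

A tool built for immediate analysis has to keep the `build_index` step short enough to run while the user waits, whereas a pre-indexing tool can afford a far more elaborate version of it.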
Ubiquity. As useful as text analysis tools like HyperPo and TAPoRware may be, we recognize a need to allow content providers and producers (like bloggers) to quickly and easily integrate functionality into their own space. The previous model was limited to users bringing their own texts to our tools; we now wish to allow users to bring our tools to their texts as well. In some cases users will wish to have static results, in which case we can provide a mechanism for easily copying and pasting results that can be directly embedded in other content. However, much of the most compelling functionality of Voyeur is interactive and requires considerable client-side scripting: our current approach is to provide a tiny snippet of HTML that is essentially an IFRAME that contains the necessary HTML elements. This approach allows Voyeur code to remain separate from its host while satisfying the security limitations on cross-domain scripting. There are of course other challenges inherent to code embedded elsewhere, including version management (supporting legacy syntax) and caching of data (both the corpus and results).
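To give a sense of how small such a snippet can be, here is a sketch in Python that assembles an IFRAME element for a corpus. It is an illustration only: the helper name `embed_snippet`, the `corpus` query parameter on voyeurtools.org, and the width and height values are assumptions, and the snippet that Voyeur itself generates may well differ.

```python
from html import escape
from urllib.parse import urlencode

def embed_snippet(corpus_id, base_url="http://voyeurtools.org/", width=600, height=400):
    """Return an IFRAME element pointing at a corpus URL.

    Assumption: the corpus is addressed with a `corpus` query parameter;
    the width and height defaults are arbitrary.
    """
    src = base_url + "?" + urlencode({"corpus": corpus_id})
    return '<iframe src="{}" width="{}" height="{}"></iframe>'.format(
        escape(src, quote=True), width, height
    )

snippet = embed_snippet("humanist")
print(snippet)
```

Because the host page only carries the IFRAME element, the scripts that drive the panel stay on the server that provides them, which is the separation described above.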
Referenceability. The status of text analysis tools as academic resources has been a point of debate over the years. Scholars feel compelled to cite ideas and texts that come from other authors, but they are much less likely to recognize tools that have contributed to their work (and we would probably not want every scholar to cite search engines such as Google that have been used during research). We feel strongly that text analysis tools can represent a significant contributor to digital research, whether they were used to help confirm hunches or to lead the researcher into completely unanticipated realms. In any case, we have designed Voyeur to be conducive to citation in various ways, including a general citation to Voyeur and citations for static or dynamic results. An important component of academic knowledge is reproducibility, and providing scholars with more information on the processes followed during research – including the use of text analysis tools – is sure to be useful.
Ultimately, Voyeur is an attempt to learn from the strengths and weaknesses of past tools, to recognize current user needs (e.g., working with much larger corpora), and to anticipate future practices (e.g., referencing text analysis tools and results). We believe that the potential for tools in the interpretive process merits continual rethinking of tool design and functionality, and as such, Voyeur is of course a work in progress.
Some Background of Voyeur
Text analysis tools go back to the first ad-hoc tools that Roberto
Busa created for his concordance of the works of Thomas Aquinas and
Andrew Booth’s Mechanical Resolution of Linguistic Problems
in the 1950s. A concordance or keyword in context (KWIC) is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right).
Voyeur is a suite of analysis and exploration tools
for digital texts. Very few contributions to knowledge and technology
are unrecognizable from what preceded, and Voyeur is
no exception: it is largely built on the foundations of text analysis
tool design and methodology from over 50 years of humanities computing
research. The following are some of the tools
that have most influenced text analysis tool development and Voyeur in
particular:
- Unix command-line tools (grep, sort, uniq, wc,
awk, etc.), since the 1970's. Each unix tool is designed to do one relatively
simple thing very efficiently. The power of these modular tools is in
how they can be combined in endless ways through the piping mechanism
(the output of one tool becomes the input of the next in a chain).
- Oxford
Concordance Program (OCP), early 1980's. OCP provided one of the first
examples of a generalized tool for producing concordances, the most
historically prevalent activity for text analysis in humanities
computing. Although other concordancing programs were available before
(such as COCOA), OCP gained wide acceptance. The parallel explosion of
personal computing also led to a variant of OCP called MicroOCP (for
DOS).
- WordCruncher (mid 1980's and
into 1990's). Whereas OCP was essentially mainframe (shared) computer
software, WordCruncher was built for personal computing in DOS, which
meant innovative interface solutions needed to be found. (Mainframe
computers are generally larger systems shared by multiple users, similar
to modern servers; they were the dominant mode of computing until the
rise of personal computers in the late 1970's, and processing on them was
done as batch processes and lacked interactivity.) Originally
called BYU Concordance Program prior to commercialization, WordCruncher
morphed from its DOS form to a Windows-based form in the 1990's.
- Textual Analysis Computing
Tools (TACT) and TACTWeb, 1990's.
TACT was a widely-used DOS-based suite of programs that included some of
the usual features for building concordances, frequency lists,
collocate lists (frequencies within the context of a keyword), but also
some less common features like finding anagrams within a text. Similar
to OCP and WordCruncher, TACT required a step of text preparation that
enabled fine-grained searching and retrieval (based, for instance, on
the presence of specified tags). TACT also provided some navigational
features between the different displays that anticipated similar
functionality through hypertext (though again, TACT was DOS-based). The
Modern Language Association (MLA) published a volume in 1996 entitled
"Using TACT with Electronic Texts" which further extended the reach of
this tool and solidified its role as the dominant text analysis tool
suite of the 1990's. TACTWeb was an adaptation of TACT to run on the web
that was developed by John Bradley and Geoffrey Rockwell.
- HyperPo,
late 1990's until present. HyperPo was the first web-based text analysis
suite available. It provided much of the same functionality as TACT,
but with a greater focus on interlinking between the original text being
analyzed and the data results (for instance, a user can click on a word
in a concordance to return to that location in the text). Inspired by
work on the Oulipo, HyperPo also provides some more experimental and
ludic functions (palindromes, text reversal, text entropy, etc.). As a
web-based tool, HyperPo was innovative in allowing users to work with
texts from a variety of places (pasted into a text box, uploaded from a
local drive, retrieved from a URL) and a variety of formats (plain text,
HTML and XML). Unlike most of its predecessors, HyperPo doesn't require
preliminary steps by the user for preparing and indexing a text (a
paradigm we call immediate analysis). Finally, HyperPo was designed from
the outset to be localized (the interface could be translated into
different languages) and to support a variety of character sets (UTF-8,
ISO-8859-1, etc.) and languages.
- Philologic (2000's until
present): Philologic is a bit different than the preceding examples in
that it is really a back-end framework for ingesting, indexing, and
retrieving encoded text – it is not as concerned with the end-user
interface (what the researcher might use). Philologic is the back-end
system used by front-end interfaces like ARTFL.
Philologic is noteworthy in that it emphasizes speed for large corpora
while supporting more sophisticated operations on encoded texts and
providing common analytic features (concordancing, frequency lists,
etc.).
- GATE and LingPipe (2000's until
present). These are two of the most prevalent examples of text analysis
frameworks: they are useful both as stand-alone analytic tools for
experts and as software libraries for other text analysis tools. Each
framework has its respective strengths and weaknesses, but both provide
extensive capabilities for such operations as part of speech tagging and
entity extraction.
- TAPoRware (mid 2001 to
present). Similar to HyperPo, TAPoRware is a suite of web-based tools
that allow users to specify their own texts and begin immediate analytic
work. TAPoRware provides a model for extensibility and rapid
development of experimental text analysis tools: a simple menu provides
access to some 50 tools for performing a variety of operations on
different text formats.
- TAPoR Portal (mid 2002 to
present). TAPoR is a personalized virtual workbench for doing text
analysis by providing a persistent web-based space for invoking remote
digital tools with remote electronic texts (users are able to define
texts and tools of interest that remain accessible between sessions).
Although not itself a text analysis tool, the TAPoR Portal served to
push notions of tool interoperability, and especially the value of
remote tools exposing public APIs and web services. The TAPoR Portal
also provides a mechanism for changing its appearance – or skinning –
depending on user profiles and preferences.
- Monk
(present). The Monk project is a notable recent attempt to engage in
large-scale data mining activities from the perspective of the
humanistic – and especially literary – scholar. Among the challenges
confronted are 1) how cleanly and extensively encoded do texts need to
be to be useful for literary scholarship? 2) how might a user interface
be designed in order to expose the sophisticated aspects of data mining
while remaining accessible to literary scholars? 3) what types of
literary procedures are enabled by work on very large corpora?
- Google
(late 1990's until present). Although it may seem strange to include
Google in this list of specialized text analysis software, we do so for
three reasons: 1) like the vast majority of search engines, Google is
primarily focused on search and retrieval of textual content, which
requires text analysis at various stages; anyone using a search engine
is also using text analysis; 2) Google set a new standard for simplicity
in interface: their default search page is relatively sparse and draws
attention to a single search box and a single action button – Google has
established a paradigm for a simple user interface to text analysis; 3)
Google has aggressively pushed embedding its tools in content that's
elsewhere, whether it be the common search box that web authors can
include on their pages, web traffic analytics, or even embedded YouTube
videos (Google owns YouTube).
Quick Guide of Voyeur for Users
Voyeur is a web-based tool. To use it go to http://voyeurtools.org. This is what you will see,

To use Voyeur you need to specify a text. You can do this in different ways:
- Type the text into the field and then press reveal. You can also cut and paste from somewhere else.
- Type a URL (Uniform Resource Locator or web address) into the field and press reveal. Voyeur recognizes a URL and retrieves the web page as the text. If you type many URLs, Voyeur will retrieve them all and treat them as a corpus of many documents.
- Upload a text from your hard drive and press reveal.
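How a tool might tell a typed text apart from a list of URLs can be sketched as follows. This is a guess at the general idea, not a description of Voyeur's actual logic; the function names `looks_like_url` and `classify_input` and the one-URL-per-line rule are assumptions.

```python
from urllib.parse import urlparse

def looks_like_url(line):
    """Return True if the line parses as an http(s) URL with a host name."""
    parts = urlparse(line.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

def classify_input(field_text):
    """Treat the input as a list of URLs only if every non-empty line is a URL."""
    lines = [line for line in field_text.splitlines() if line.strip()]
    if lines and all(looks_like_url(line) for line in lines):
        return "urls", [line.strip() for line in lines]
    return "text", field_text

kind, value = classify_input(
    "http://www.gutenberg.org/cache/epub/84/pg84.txt\n"
    "http://taporware.ualberta.ca/sampleDocs/plainText.txt\n"
)
print(kind, len(value))
```

Under this rule, several URLs become a corpus of several documents, while anything else is analyzed as the text itself.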
Voyeur, once it retrieves your text, will index it for analysis and display this simple arrangement of two panels,

Once your text or collection is indexed, Voyeur will present you with a display with two panels. The right hand panel summarizes the text. The left panel shows you the high frequency words. You can show and hide different panels using the double arrow button. You can also see more panels by selecting a word or words to follow. Try the Words in the Entire Corpus panel.

There are a number of features to the Words in the Entire Corpus panel:
- Selection: You can select one or more words by clicking on the selection box. To select all in view click the selection box at the top. You can also use that to de-select all.
- Searching: You can search for a word in the list by typing it into the Search box. If you want a number of words type them with commas as in:
salt, pepper, sardines, oil
- Phrases: You can search for phrases by using quotes as in:
"digital humanities"
- Favorites: Any selected word can be added to a favorites list by clicking the heart with the plus (+). You can add as many words as you want. You can toggle between the full list and the favorites list by clicking the plain heart icon.
- Sorting Columns: Click on the column title to sort the list by that column. Click the arrow to choose direction for sorting and to add other columns.
- Options: For this list of all the words in the collection it is useful to use a stop word list that hides the common words like "the", "a", "an". Click on the Option button in the top right. The Option button looks like a gear.
- Save: You can export and save the results by clicking the button in the top right with an icon for a floppy disk. There are different formats in which you can save or export the results.
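What a word list with a stop word option computes can be sketched in a few lines of Python. This is an illustration rather than Voyeur's implementation: the function name `word_frequencies`, the tiny `STOP_WORDS` set, and the sample sentence are all invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def word_frequencies(text, stop_words=frozenset()):
    """Count word occurrences, skipping any word in stop_words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)

text = "The salt and the pepper, the sardines in oil, and the salt of the sea."
counts = word_frequencies(text, STOP_WORDS)
for word, count in counts.most_common(3):
    print(word, count)
```

Without the stop word list, "the" would top the list for this sentence; with it, the content words come first.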
Once you select one or more words you will get an arrangement of panels like this,

The panels are connected so that clicking in one will trigger updates in the others. Typically the order is the following:
- You select a word or words in the Words in the Entire Corpus panel (1).
- If you are looking at a collection of documents, in the Words within each Document panel (3) you will see how the selected words are distributed across documents.
- In the Word Trends panel (6) you can see a distribution graph if your documents are distributed chronologically.
- If you select a particular document the Word Trends will show distribution just in the document.
- If you click on a document node in the Word Trends display it will show the Keywords in Context for that document.
- Likewise, if you select a particular document in the Words within each Document panel, you can see a Keywords in Context display (4) for the selected words and the document. Clicking the plus (+) button next to each context will show you more context.
- You can save contexts as favorites to compare or export.
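The idea behind a Keywords in Context display can be sketched as follows. The Python is a minimal illustration, not the code behind the panel; the function name `kwic`, the `width` parameter, and the sample sentence are assumptions invented for the example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples with `width` words of context on each side."""
    words = re.findall(r"\w+", text)
    rows = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            rows.append((left, word, right))
    return rows

# An invented sample sentence.
text = "She saw the miserable monster at dawn. The monster held up a lantern."
rows = kwic(text, "monster", width=2)
for left, word, right in rows:
    print("{:>20} [{}] {}".format(left, word, right))
```

Raising `width` corresponds to asking for more context around each occurrence.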
Workshops
DH2010 Introduction to Voyeur
This is an outline for a workshop on Voyeur. It was developed for a workshop before DH 2010 in London, England.
1.0 Introduction
- The workshop leaders will introduce themselves:
- What will happen?
- how to use Voyeur with a single text
- how to use Voyeur with a corpus
- try Voyeur on your corpus
- concluding remarks on advanced features
- Now make sure you can connect to the wireless
- Connect to Hermeneuti.ca and explore the resources there. Here are some useful links:
2.0 Analyzing a Single Text
In the first part of the Workshop we will show you how to use Voyeur to analyze a single text as a way of learning the interface. We will work with the Introduction, Preface, Chapter 1 and Chapter 2 of Mary Shelley's Frankenstein. The plain text is here:
http://taporware.ualberta.ca/sampleDocs/plainText.txt - This is just a couple of chapters
http://www.gutenberg.org/cache/epub/84/pg84.txt - This is the Gutenberg version of the full text
- We will open Voyeur:
- Show how to load a text
- Show the different panels that appear initially
- Discuss the order they open and the Summary panel
- Go over the Words in the Entire Corpus panel (Options, Columns, Search, Favorites)
- Discuss the full set of panels
- Show how to manage panels
- Discuss trigger order of panels (flow within Voyeur)
- Show how to get help (Mention Quick Guide)
- Show how to make a list of favorite words to explore searching for words and saving in favorites
- Now you should try Voyeur with your text or the Frankenstein text above. To open the Frankenstein text, click here:
http://voyeurtools.org/?corpus=1278409278561.646
- Some things to try:
- Experiment with the Options (like the Stop Word list)
- Create a Favorites list for a theme and explore that list
- Search for phrases
3.0 Analyzing a Corpus
In the second part of the Workshop we will look at working with a corpus or collection of many texts. We will use Voyeur on the archives of HUMANIST from 1987 to 2008 (21 documents.) The Voyeur index is at:
http://voyeurtools.org/?corpus=humanist
- We will show you how to:
- Set various options, like stoplists
- Hide and show columns
- Manage multiple documents
- Group results
- Compare documents
- Try looking for trends yourself
4.0 Using your own text
- Now you can try your own text. We will show the different ways of providing Voyeur a text:
- Typing a text or pasting it in
- Typing in one or more URLs
- Uploading a text
- We will then discuss the formats of texts that will work, and what will happen to them:
- file formats: text, HTML, XML, RSS, TEI, PDF, MS Word, RTF
- Finally we will discuss caching and so on
- Now try your own text.
5.0 Exporting Data and Quoting Analytics
We will now show how to export data and quote analytical results:
- How to export tab-separated values, copied and pasted into Excel
- How to export XML results from KWICs (for instance)
- How to quote an analytical result in TADA.
- Go to http://tada.mcmaster.ca/Sandbox/VoyeurWorkshop to try it yourself.
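For readers who have not met the format, tab-separated values are simply rows of text with a tab between columns, which is why they paste cleanly into a spreadsheet. A minimal sketch in Python, with a made-up `to_tsv` helper, invented column names, and invented counts (not results from any real text):

```python
import csv
import io

def to_tsv(rows, header=("word", "count")):
    """Serialize rows as tab-separated values, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Invented counts, for illustration only.
tsv = to_tsv([("monster", 31), ("creature", 22)])
print(tsv, end="")
```

The columns that Voyeur actually exports depend on the panel you save from.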
6.0 Advanced and Other
- There are other beta tools in Voyeur that can be accessed:
7.0 To Prepare
- Make sure we have Voyeur running with a backup
- Sort out how participants can get on wireless
- Powerbars for laptops
- What texts will we use?
- Preindex texts and create a Workshop web page on Hermeneuti.c
Appendices
This section is mostly a parking lot for miscellaneous subsections – content will eventually be moved, integrated elsewhere, or deleted.
Functionality
Some of the discussed functionality:
Input
- from Zotero
- from Firefox plug in
- from portal
- from interactive essays
- from web site
- from results buttons that allow recapitulation (like in portal or taporware)
- from links
- from eclipse
- panels
- command line
- application (eclipse)
- swing based interface
Output:
- out to blogs
- output to portal research log
- output to gathering (a tiddlywiki web page that you can save to computer) –
Tools:
- tzeeker is a panel builder
Priority Tools:
- cleaner
- list words with distribution
- comparative list words
- search concordance
- KWIC
- repeated phrases
- distribution
- visual collocator
Logging:
- ability to log what happens
Interface to frameworks
- api that provides info like progress
- good error response
- Help and stuff through the framework
Temporary Workaround for Additional Tools
At the moment the default interface of Voyeur doesn't expose the range of tools that are available. As an awkward workaround, you can try this:
Click on the export icon to generate a URL of your corpus (notice that here I'm working on voyeurtools.org instead of voyeur.hermeneuti.ca):

This will generate a URL that looks something like this: http://voyeurtools.org/?corpus=1278412802513.7776
What you can do is insert tool/toolName in this address, where toolName is the name of a tool, to see other tools (your mileage may vary):