Some Background of Voyeur
Text analysis tools go back to the first ad-hoc tools that Roberto
Busa created for his concordance of the works of Thomas Aquinas and to
Andrew Booth’s Mechanical Resolution of Linguistic Problems
in the 1950s.
Voyeur is a suite of analysis and exploration tools
for digital texts. Very few contributions to knowledge and technology
are unrecognizable from what preceded them, and Voyeur is
no exception: it is largely built on the foundations of text analysis
tool design and methodology from over 50 years of humanities computing
research. The following are some of the tools
that have most influenced text analysis tool development and Voyeur in
particular:
- Unix command-line tools (grep, sort, uniq, wc,
awk, etc.), since the 1970s. Each Unix tool is designed to do one relatively
simple thing very efficiently. The power of these modular tools is in
how they can be combined in endless ways through the piping mechanism
(the output of one tool becomes the input of the next in a chain).
- Oxford Concordance Program (OCP), early 1980s. OCP provided one of the first
examples of a generalized tool for producing concordances, the most
historically prevalent activity for text analysis in humanities
computing. Although other concordancing programs were available before
(such as COCOA), OCP gained wide acceptance. The parallel explosion of
personal computing also led to a variant of OCP called MicroOCP (for
DOS).
- WordCruncher (mid-1980s and
into the 1990s). Whereas OCP was essentially mainframe (shared) computer
software, WordCruncher was built for personal computing in DOS, which
meant innovative interface solutions needed to be found. Originally
called BYU Concordance Program prior to commercialization, WordCruncher
morphed from its DOS form to a Windows-based form in the 1990s.
- Text Analysis Computing
Tools (TACT) and TACTWeb, 1990s.
TACT was a widely-used DOS-based suite of programs that included some of
the usual features for building concordances, frequency lists,
collocate lists (frequencies within the context of a keyword), but also
some less common features like finding anagrams within a text. Similar
to OCP and WordCruncher, TACT required a step of text preparation that
enabled fine-grained searching and retrieval (based, for instance, on
the presence of specified tags). TACT also provided some navigational
features between the different displays that anticipated similar
functionality through hypertext (though again, TACT was DOS-based). The
Modern Language Association (MLA) published a volume in 1996 entitled
"Using TACT with Electronic Texts", which further extended the reach of
this tool and solidified its role as the dominant text analysis tool
suite of the 1990s. TACTWeb was an adaptation of TACT to run on the web
that was developed by John Bradley and Geoffrey Rockwell.
- HyperPo,
late 1990s until present. HyperPo was the first web-based text analysis
suite available. It provided much of the same functionality as TACT,
but with a greater focus on interlinking between the original text being
analyzed and the data results (for instance, a user can click on a word
in a concordance to return to that location in the text). Inspired by
work on the Oulipo, HyperPo also provides some more experimental and
ludic functions (palindromes, text reversal, text entropy, etc.). As a
web-based tool, HyperPo was innovative in allowing users to work with
texts from a variety of places (pasted into a text box, uploaded from a
local drive, retrieved from a URL) and a variety of formats (plain text,
HTML and XML). Unlike most of its predecessors, HyperPo doesn't require
preliminary steps by the user for preparing and indexing a text (a
paradigm we call immediate analysis). Finally, HyperPo was designed from
the outset to be localized (the interface could be translated into
different languages) and to support a variety of character sets (UTF-8,
ISO-8859-1, etc.) and languages.
- Philologic (2000s until
present): Philologic is a bit different from the preceding examples in
that it is really a back-end framework for ingesting, indexing, and
retrieving encoded text – it is not as concerned with the end-user
interface (what the researcher might use). Philologic is the back-end
system used by front-end interfaces like ARTFL.
Philologic is noteworthy in that it emphasizes speed for large corpora
while supporting more sophisticated operations on encoded texts and
providing common analytic features (concordancing, frequency lists,
etc.).
- GATE and LingPipe (2000s until
present). These are two of the most prevalent examples of text analysis
frameworks: they are useful both as stand-alone analytic tools for
experts and as software libraries for other text analysis tools. Each
framework has its respective strengths and weaknesses, but both provide
extensive capabilities for such operations as part-of-speech tagging and
entity extraction.
- TAPoRware (mid-2001 to
present). Similar to HyperPo, TAPoRware is a suite of web-based tools
that allow users to specify their own texts and begin immediate analytic
work. TAPoRware provides a model for extensibility and rapid
development of experimental text analysis tools: a simple menu provides
access to some 50 tools for performing a variety of operations on
different text formats.
- TAPoR Portal (mid-2002 to
present). TAPoR is a personalized virtual workbench for doing text
analysis: it provides a persistent web-based space for invoking remote
digital tools with remote electronic texts (users are able to define
texts and tools of interest that remain accessible between sessions).
Although not itself a text analysis tool, the TAPoR Portal served to
push notions of tool interoperability, and especially the value of
remote tools exposing public APIs and web services. The TAPoR Portal
also provides a mechanism for changing its appearance – or skinning –
depending on user profiles and preferences.
- Monk
(present). The Monk project is a notable recent attempt to engage in
large-scale data mining activities from the perspective of the
humanistic – and especially literary – scholar. Among the challenges
confronted are 1) how cleanly and extensively encoded do texts need to
be to be useful for literary scholarship? 2) how might a user interface
be designed in order to expose the sophisticated aspects of data mining
while remaining accessible to literary scholars? 3) what types of
literary procedures are enabled by work on very large corpora?
- Google
(late 1990s until present). Although it may seem strange to include
Google in this list of specialized text analysis software, we do so for
three reasons: 1) like the vast majority of search engines, Google is
primarily focused on search and retrieval of textual content, which
requires text analysis at various stages; anyone using a search engine
is also using text analysis; 2) Google set a new standard for simplicity
in interface: their default search page is relatively sparse and draws
attention to a single search box and a single action button – Google has
established a paradigm for a simple user interface to text analysis; 3)
Google has aggressively pushed embedding its tools in content that's
elsewhere, whether it be the common search box that web authors can
include on their pages, web traffic analytics, or even embedded YouTube
videos (Google owns YouTube).
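
Several of the tools listed above share a small set of core operations: frequency lists, keyword-in-context (KWIC) concordances, and collocate lists. The following is a minimal Python sketch of those three operations, offered only as an illustration. It is not the code of Voyeur or of any tool named above; the function names, the simple tokenization rule, and the default context width of three words are all assumptions made for the example, and the sample text is two short passages from A Midsummer Night's Dream.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


def collocates(tokens, keyword, width=3):
    """Count the words that occur within `width` tokens of the keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(window)
    return counts


# Illustrative sample text (hypothetical input, not a prepared corpus).
sample = (
    "Swift as a shadow, short as any dream. "
    "I have had a dream, past the wit of man to say what dream it was."
)
tokens = tokenize(sample)

# Print a small concordance with the keyword aligned in a center column.
for left, word, right in kwic(tokens, "dream"):
    print(f"{left:>20} | {word} | {right}")
```

Each function takes the token list produced by the previous step, loosely echoing the way Unix tools are chained through pipes. A real tool would need a more careful tokenizer and, for large corpora, an index rather than a linear scan.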