DH2012 Workshop in Hamburg

This is a script for a workshop on using Voyant Tools for the Digital Humanities 2012 Conference in Hamburg.

 

1.0 Introduction

 

  • Workshop leaders: Stéfan Sinclair, McGill University and Geoffrey Rockwell
  • Overview
    Voyant Tools is a web-based environment for reading and analyzing digital texts, created by Stéfan Sinclair and Geoffrey Rockwell. It was previously called "Voyeur", so don't be confused if that name is used. Voyant is the next generation in a series of text analysis tools that includes HyperPo and TAPoRware. It provides tables and graphs related to word use across a single document or a collection. Voyant adds, among other things, the ability to handle much larger files than the previous tools could. Voyant is actually a suite of modular tools that can be combined in pre-defined or user-defined combinations called skins. This workshop's primary objective is to better understand how and why one might use Voyant Tools to help in the study of digital texts.
  • Outline
    In this workshop we will:
    • First, look at how to use a single Voyant tool, Cirrus, with a small multilingual corpus.
    • Then learn how to use the normal "skin" (multi-tool interface) of Voyant with a single text.
    • Show how to load your own text(s) into Voyant.
    • Look at some of the more exploratory and advanced tools available in Voyant, such as Bubbles and Correspondence Analysis.
    • Discuss the use of Voyant Tools in a larger research process (embedding tools in remote content, etc.)
  • Help
    Here are some useful links:

N.B. Voyant Tools is in beta and has warts and blemishes. Always view what you're looking at with some circumspection, and if something doesn't work as expected, assume it's a bug, not something that you're misunderstanding (and please tell us about it).


2.0 Using a single Voyant Tool: Cirrus

Voyant Tools has a number of different tools that can be composed into skins or used individually. We will start with just one tool called Cirrus that can then spawn other tools. We will try it with the English version of the Universal Declaration of Human Rights.

http://work.voyant-tools.org/tool/Cirrus/?corpus=unhr&docIndex=0&stopList=stop.en.taporware.txt&toolFlow=simple
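
The link above is an ordinary URL with query parameters, so links like it can be assembled programmatically. The following is a minimal sketch in Python that rebuilds exactly this link. The function name cirrus_url is hypothetical, and the comments on what each parameter does are guesses from the parameter names, not documented behaviour.

```python
from urllib.parse import urlencode

# Base address of the Cirrus tool, copied from the workshop link.
BASE = "http://work.voyant-tools.org/tool/Cirrus/"


def cirrus_url(corpus, doc_index=0,
               stop_list="stop.en.taporware.txt", tool_flow="simple"):
    """Build a Cirrus link (hypothetical helper, not part of Voyant).

    Parameter meanings are assumptions inferred from their names:
    corpus    -- identifier of a corpus already known to the server
    docIndex  -- which document in the corpus to show (assumed 0-based)
    stopList  -- name of the stopword list to apply
    toolFlow  -- how clicking a word spawns further tools (assumed)
    """
    params = {
        "corpus": corpus,
        "docIndex": doc_index,
        "stopList": stop_list,
        "toolFlow": tool_flow,
    }
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(cirrus_url("unhr"))
```

Changing only the corpus argument would, under the same assumptions, point the tool at a different stored corpus.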

The Cirrus tool shows you a word cloud of high frequency words. Some questions to ask yourself:

  • What words did you expect? What words are missing? What words are interesting?
  • How does the tool arrange words and choose colours? Is there any correspondence between size and frequency?

Here are some more Cirrus visualizations to consider:

These types of word clouds are prevalent from academia to advertising – they quickly provide an intriguing representation of a text, as demonstrated by this example of studying gendered languages in toy advertising. But their ability to rapidly convey a picture with words comes at the cost of information reduction, and some are highly critical of word clouds as hermeneutical tools. What do you think?

These Cirrus visualizations don't show all top frequency words: so-called stopwords are missing. Stopwords are function words (like determiners and prepositions) that typically carry less meaning. What to include in a stopword list is a matter of interpretation and purpose. Are numbers (like "one") important? What about words like "against"? A new feature in Voyant Tools is the ability to set and edit stopword lists. To do so, click on the options (gear) icon and then click on "Edit Stop Words".
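
To see why the choice of stopwords matters, it can help to count words by hand. The following is a minimal sketch in Python, not Voyant's own implementation. The sample sentence and the tiny stopword set are made up for illustration, and the function name top_words is hypothetical.

```python
import re
from collections import Counter

# A made-up sentence and a deliberately tiny, hypothetical stopword list.
SAMPLE = "the right to life and the right to liberty and the right of one against all"
STOPWORDS = {"the", "to", "and", "of"}


def top_words(text, stopwords=frozenset(), n=5):
    """Return the n most frequent words in text, skipping stopwords."""
    # Lowercase, then keep runs of letters only (no digits or underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tok for tok in tokens if tok not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_words(SAMPLE, n=2))             # function words dominate
    print(top_words(SAMPLE, STOPWORDS, n=1))  # content words surface
```

Adding "one" or "against" to the stopword set removes them from the counts entirely, which is exactly the interpretive decision described above.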

Try It: Try clicking on a word. It will launch a second tab or window with a list of the texts in the corpus with the frequency of the word you clicked on.

Try It: Now try double-clicking on one of the texts. This should launch another tab or window with a Key Word In Context (KWIC) of the word in that text. Note that you may have to allow pop-ups.
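
A keyword-in-context display lists each occurrence of a word with a few words to its left and right. The following is a minimal sketch of the idea in Python, not Voyant's implementation. The function name kwic and the width parameter are hypothetical, and the sample line is from A Midsummer Night's Dream.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    width is the number of words kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text)  # crude tokenizer: drops punctuation
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


if __name__ == "__main__":
    line = "I have had a dream, past the wit of man to say what dream it was"
    for left, word, right in kwic(line, "dream"):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>20} | {word} | {right}")
```

Aligning the keyword in a central column is what makes patterns in the surrounding words easy to scan.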

Try It: Try some of the other individual tools at docs.voyant-tools.org/tools


3.0 Using a Reading Skin

Voyant Tools can also be composed into "skins" that combine tools as panels so that they can be used interactively. Here is the same corpus in a simple skin.

In this skin, clicking in one panel will often (but not always) update other panels. Try the following:

  • Triggering: Click on words in the Cirrus word cloud. Then click on a text in the Word Trends and play with the KWIC.
  • Changing Settings: Try changing the settings for the Cirrus by clicking on the small gear icon. Try playing with the Word Trends.
  • Showing and Hiding Panels: Try showing and hiding panels using the small up and down arrows in the upper-right of the panels.

When in doubt just restart the session by hitting refresh.


4.0 Using Voyant on Your Own Text

Voyant Tools can be used on your own text or corpus. To do that, go to the simple URL for the tool:

Voyant: work.voyant-tools.org

Just the Cirrus tool in Voyant: work.voyant-tools.org/tool/Cirrus

You will get a panel that asks you for a text. You can provide:

  • One or more URLs to texts on the web
  • Upload a text or a zipped collection of texts
  • Upload plain text, HTML, or XML texts
  • Upload a PDF (and Voyant will try to extract the text)
  • Paste in a text

Voyant is forgiving, but there are nonetheless issues (and bugs).

Note that you can create a persistent URL for your corpus – that way your link can be shared or bookmarked and you won't need to reload the texts into Voyant. Click the save icon (disk icon) in the blue bar at the top and the first URL will be the link for your Voyant corpus.


5.0 Exploratory and Advanced Tools

Voyant Tools is conceived on the notion that text analysis in the humanities is a practice of re-presenting the text, not about producing incontrovertible evidence. Some Voyant Tools are more about aesthetic or ludic aspects of experiencing digital texts, which can directly or indirectly inspire observations that may not be otherwise possible. Here are some examples:

Specialized Skins

At the same time, some tools are more advanced. For instance, one can use a correspondence analysis skin that shows how terms map across multiple documents, such as this view of the Humanist Discussion Group listserv.

Voyant Tools also enables some quick-and-dirty social network analysis. This is possible thanks to a process called named entity recognition (NER) that attempts to automatically identify people, organizations and locations in a text (at the moment Voyant Tools uses the Stanford Natural Language Processing package to perform this automated process). It's worth emphasizing that automated processes like these are subject to several issues and problems. For instance, how should uses of first and/or last names be combined or differentiated? How can one tell whether the same name refers to one or two different people? What should be done when an organization looks like a person's name (e.g. Johns Hopkins)? Still, you can't beat the simplicity of Voyant Tools' RezoViz, especially when working with a mid-size corpus of shorter texts (5-50 articles, for instance). For instance, here is a specialized interface showing connections between people mentioned in emails to the Humanist listserv (RezoViz is in alpha and best experienced in Chrome).
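
One simple way to turn extracted names into a network is to link two names whenever they appear in the same document, and to weight the link by how many documents they share. The following Python sketch illustrates that idea only. It assumes the names have already been extracted, it is not a description of how RezoViz actually computes its links, and the names and the function cooccurrence_edges are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs_entities):
    """Count, for each pair of names, how many documents mention both.

    docs_entities is a list with one list of names per document
    (assumed to be the output of some earlier NER step).
    """
    edges = Counter()
    for names in docs_entities:
        # set() ignores repeats within a document; sorted() makes
        # (a, b) and (b, a) count as the same pair.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    # Hypothetical names standing in for people mentioned in three emails.
    emails = [["Ada", "Ben"], ["Ben", "Ada", "Cy"], ["Cy"]]
    for pair, weight in cooccurrence_edges(emails).items():
        print(pair, weight)
```

Note that the name-matching problems raised above apply before this step: if "Ada" and "A. Lovelace" are not merged, they become two separate nodes.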

As always, the real strength of Voyant Tools is the ability to create your own corpus – you can start at work.voyant-tools.org/tool/RezoViz.


6.0 Voyant as a Scholarly Tool

One of the essential design principles of Voyant Tools is that it tries to be useful not just at the moment of analysis, but through more phases of research. Here are some examples:

  • as we've already seen, you can export a link to a corpus that can be bookmarked, shared by email or Twitter, or otherwise preserved (as a general rule, a corpus in Voyant will remain accessible as long as it has been consulted at least once in the past month)
  • there's built-in Zotero awareness – you can click on the folder/article icon in the Firefox address bar to create a new entry (though you may wish to complete some of the metadata)
  • you can export data for other applications – for instance, produce a tab-separated values view of a table that can be copy-and-pasted into a spreadsheet application (where you can edit the data and produce even more graphs, charts, etc.)
  • you can embed a live tool in remote content (a blog post, a journal article, a term paper, etc.), much as you would embed a YouTube clip – the interactive affordances of Voyant allow you to go beyond static screenshots and images and allow your users/readers to engage with the content and data themselves, as with the DH2012 abstracts
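
Tab-separated values exported from a table can also be read by a short script instead of a spreadsheet. The following is a minimal sketch in Python. The column names "term" and "count" and the sample values are made up for illustration, so check the header row of your own export.

```python
import csv
import io

# Made-up sample standing in for an exported table; real column
# names and values will differ.
SAMPLE_TSV = "term\tcount\nrights\t10\nfreedom\t4\nlaw\t6\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


if __name__ == "__main__":
    rows = read_tsv(SAMPLE_TSV)
    total = sum(int(row["count"]) for row in rows)
    for row in rows:
        share = int(row["count"]) / total
        print(f'{row["term"]:<10} {row["count"]:>4} {share:.0%}')
```

For a file saved to disk, the same reader works with open(path, newline="", encoding="utf-8") in place of io.StringIO.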

7.0 Other Stuff

Here are some other useful resources.

Other Tools:

  • TAPoR 2.0 - Discover and comment on tools. For example, here are the Voyant Tools listed in TAPoR 2.0. Leave a comment on your favorite Voyant tool. Link to a project where you use it.
  • TAPoRware - Simple tools for processing plain text, HTML, and XML
  • CWRC List of Visualization Tools
  • DIRTROD

Other Corpora:

Other Voyant Skins: