DH 2011 Visualization for Literary History (Visualization with Voyeur)

This is an outline for a workshop on visualization with Voyeur. It is based on a workshop given at DH 2010 in London, England.

1.0 Introduction

2.0 Visualizing a Single Text

In the first part of the Workshop we will show you how to use Voyeur to visualize a single text as a way of learning the interface. We will work with the Introduction, Preface, Chapter 1 and Chapter 2 of Mary Shelley's Frankenstein. The plain text is here:

http://taporware.ualberta.ca/sampleDocs/plainText.txt - This is just a couple of chapters

http://www.gutenberg.org/cache/epub/84/pg84.txt - This is the Gutenberg version of the full text

In order to focus on each tool independently, we will open each Voyeur tool separately.

  • First we will look at Cirrus: http://voyeurtools.org/tool/Cirrus/  
  • Cirrus is a visualization tool that displays a word cloud based on the frequency of words appearing in one or more documents. The larger the word, the more frequent the term. You can click on any word in the cloud to obtain detailed information about it.
    • Show how to load a text by copying one of the Frankenstein URLs into the "Add Texts" box
    • Show how hovering over a word reveals the number of times that word occurs in the corpus.
    • Show how clicking on a word produces a list of results on a new page. These results include a count, a relative count, and a trend graph.
  • Next we will look at Links: http://voyeurtools.org/tool/Links/
  • Links finds collocates for words and displays the links between them in a force-directed graph: a web of terms showing which terms occur in proximity to a keyword. Once you arrive at Links, paste in or upload your content and let the tool perform its analysis. You may hover over a word to see data about that word in your corpus, or double-click on any word for a more detailed analysis. Clicking and dragging allows you to rearrange the terms. If there are multiple documents in the corpus, they will be coloured differently.
  • Load a text by copying one of the Frankenstein URLs into the "Add Texts" box
  • If you hover over a term, Voyeur will tell you its linkage within the corpus documents.
  • Try dragging and dropping terms to organize them.
  • If you would like to manipulate the visualization, right-click on any of the terms and choose 'Stick/unstick' or 'Remove'. 'Stick/unstick' pins the term in place so that it does not move when other terms are moved. 'Remove' simply removes the term from the visualization.
  • Clicking on the options button (the button that looks like a gear) will launch a dialog box with various options for the Links tool. The stop words list lets you exclude words from the visualization (usually words such as 'a', 'the', and 'and'). 'Node size determined by type frequency' is the default, and sizes terms by how often they appear in the documents. Choosing 'Node links' will result in terms appearing larger if they are heavily linked with other terms. 'Autofit graph on screen' sizes the graph depending on the size of your browser window. 'Remove orphans' will remove terms which are not linked to any other term in the visualization.
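
For participants who want to see what sits behind a word cloud or a collocate graph, here is a minimal sketch in Python. It is not Voyeur's implementation: the tokenizer, the tiny stop word list, the window size and the definition of "relative count" (count divided by all tokens) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop word list for this sketch; Voyeur ships its own lists.
STOP_WORDS = {"a", "the", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(tokens, stop_words=STOP_WORDS):
    """Map each non-stop term to (raw count, count / total tokens)."""
    counts = Counter(t for t in tokens if t not in stop_words)
    total = len(tokens)
    return {term: (n, n / total) for term, n in counts.items()}


def collocates(tokens, keyword, window=2, stop_words=STOP_WORDS):
    """Count non-stop terms found within `window` tokens of `keyword`."""
    found = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] not in stop_words:
                found[tokens[j]] += 1
    return found


# Made-up sample sentence, not a quotation from Frankenstein.
sample = "The monster saw the fire. The fire warmed the monster, and the monster slept."
tokens = tokenize(sample)
freqs = term_frequencies(tokens)
links = collocates(tokens, "monster")
```

A word cloud would size each term by the counts in `freqs`, while a collocate graph would draw an edge from the keyword to each term in `links`.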
  • Now we will look at Word Trends: http://voyeurtools.org/tool/TypeFrequenciesChart/
  • Term Frequencies Chart shows how terms are distributed across the document(s) in a corpus (documents are shown in the order in which they were added). Every charted line represents one word that is common throughout the corpus. If you hover over a point, you will see information about that word in that document.
  • When you analyze a corpus with Term Frequencies Chart, you will initially see common words at the top of the chart, each with a colour code. The lines in the graph are coloured to match those words. If you click on one of the terms at the top, it will be omitted from the graph.
  • When we hover over the segment points, we can see the frequency of that term in that segment. If you click on the point, Voyeur will open a new window with detailed information of that segment and term within its Document KWICs tool.
  • If you click and drag on a section of the chart it will zoom in to that section. To reset the chart to its original state, click on “reset zoom”.

  • If you would like to see fewer or more segments on the chart, simply click on “Segments” at the bottom left of the chart to choose the desired number of segments.
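
The idea of charting a term across segments can be sketched in a few lines of Python. This is only an illustration under our own assumptions (a pre-tokenized text and segments of roughly equal token length); the function name `segment_counts` is hypothetical and this is not how Voyeur itself computes its chart.

```python
def segment_counts(tokens, term, segments=5):
    """Count occurrences of `term` in each of `segments` roughly equal slices."""
    if segments < 1:
        raise ValueError("segments must be at least 1")
    counts = [0] * segments
    total = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map token position i to a segment index in 0..segments-1.
            counts[i * segments // total] += 1
    return counts


# Made-up token sequence standing in for a tokenized text.
tokens = "a b a c a b b a c a".split()
trend_b = segment_counts(tokens, "b", segments=5)
trend_a = segment_counts(tokens, "a", segments=2)
```

Each number in the returned list corresponds to one point on a charted line; asking for fewer segments gives a coarser trend.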

Other Things
  • We will look at how to get help (mention the Quick Guide)
  • Some things to try:
    • Experiment with the Options (like the Stop Word list)
    • Create a Favorites list for a theme and explore that list
    • Search for phrases

3.0 Analyzing a Corpus

In the second part of the Workshop we will look at working with a corpus, or collection of many texts. We will use Voyeur on the archives of HUMANIST from 1987 to 2008 (21 documents). The Voyeur index is at:

http://voyeurtools.org/?corpus=humanist

 

  • Bubblelines is a visualization tool that helps you understand patterns of word repetition in one or more documents. Each document is represented as a horizontal line and each search term is represented as a bubble; the bubble represents the frequency of the term in the corresponding segment of text (the text is divided into segments of equal length). The larger the bubble, the more frequent the term.
  • Load a text by copying one of the Frankenstein URLs into the "Add Texts" box
  • Hovering over a bubble, or set of bubbles, will cause a box to appear that displays the frequency counts for that segment of text.

  • Similarly, hovering over the number at the end of the line will cause a box to appear that summarizes the frequency for the entire document.

  • When Bubblelines first loads a corpus, you may see terms that have been pre-selected and included in the URL or embedded page. If no terms are specified, Bubblelines automatically fetches the five most frequent terms and displays bubbles based on those.

  • You can remove the default terms by clicking on the "Clear Terms" button.

  • You can add additional terms to be displayed using the "Find Term" box. Note that available terms will appear as you type and you can pick an item from the list to have it added.

  • In addition to adding and removing terms, you can toggle the display of the terms that have been loaded. To do so simply click on the term (active terms are underlined).

  • ScatterPlot creates a scatter plot graph of terms, spaced by their variation from one another. Once you arrive at ScatterPlot, paste in or upload your content and let the tool perform its analysis. You may hover over the dots and click on them for more information.
  • When you first load ScatterPlot, you will see a variety of terms plotted on a graph. If you hover over a term, you will see its variation explained by each component on the x and y axes. If you click on any of these terms, it will bring you to the Document KWICs tool for further analysis.

  • ScatterPlot offers options for changing the plot. The terms button allows you to choose how many terms should be displayed. The dimensions button lets you switch between a two- and a three-dimensional graph. Toggle labels simply removes or adds labels for the terms on the graph.

  • Some other things to try:
    • Set stoplists.  You may want to exclude common words.  To do this, click on the "Options" button, represented by a gear icon in the upper-right. 
    • Manage multiple documents.  
    • Show how to group results
    • Show how to compare documents
  • Try looking for trends yourself using the different tools
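
Several of the tools above hand off to the Document KWICs view, which lists each occurrence of a word with a little context on either side. A minimal keyword-in-context sketch in Python might look like the following; the function name `kwic`, the word-based context width and the simple tokenizer are assumptions for illustration, not Voyeur's own code.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every match."""
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# A line from A Midsummer Night's Dream, used only as sample input.
sample = "I have had a dream, past the wit of man to say what dream it was."
rows = kwic(sample, "dream", width=3)
for left, word, right in rows:
    print(f"{left:>15} | {word} | {right}")
```

Lining the keyword up in a central column, as the `print` call does, is what makes patterns in the surrounding words easy to scan.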

4.0 Using your own text

  • Now you can try your own text. There are different ways of providing Voyeur a text:
    • Typing a text or pasting it in
    • Typing in one or more URLs (as we have done above)
    • Uploading a text, using the "upload" button
  • For uploading, there are a number of formats of texts that will work:
    • file formats: text, HTML, XML, RSS, TEI, PDF, MS Word, RTF
  • Finally, we will discuss caching and so on.
  • Now try your own text.

5.0 Exporting Data and Quoting Analytics

We will now show how to export data and quote analytical results:

  • How to export tab-separated values that can be copied and pasted into Excel
  • How to export XML results from KWICs (for instance)
  • How to quote an analytical result in TADA.
  • Go to http://tada.mcmaster.ca/Sandbox/VoyeurWorkshop to try it yourself.
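
If you compute results of your own and want them in the same spreadsheet-friendly form, tab-separated values are easy to produce. The sketch below uses Python's standard `csv` module; the column names and the numbers are made up for illustration and do not reproduce Voyeur's export format.

```python
import csv
import io


def to_tsv(rows, header=("term", "count", "relative")):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative values only, not real analysis results.
results = [("monster", 3, 0.214), ("fire", 2, 0.143)]
tsv_text = to_tsv(results)
print(tsv_text)
```

Text in this form can be pasted directly into Excel, which places each tab-separated value in its own column.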

6.0 Advanced and Other

7.0 To Prepare

  • Make sure we have Voyeur running with a backup
  • Sort out how participants can get on wireless
  • Powerbars for laptops
  • Preindex texts