This is an outline for a workshop on visualization with Voyeur. It is based on a workshop given at DH 2010 in London, England.
1.0 Introduction
- The workshop leaders will introduce themselves:
- Geoffrey Rockwell, University of Alberta, geoffrey (dot) rockwell (at) ualberta (dot) ca, http://www.geoffreyrockwell.com
- Stan Ruecker, University of Alberta, sruecker (at) ualberta (dot) ca, http://www.ualberta.ca/~sruecker/
- Susan Brown, University of Alberta, University of Guelph, sbrown (at) uoguelph (dot) ca
- Stéfan Sinclair, McMaster University, sgs (at) mcmaster (dot) ca, http://stefansinclair.name/
- Overview
Voyeur is currently a beta release by Stéfan Sinclair and Geoffrey
Rockwell. Voyeur is the next generation in a series of text analysis
tools that include HyperPo and TAPoRware. It provides tables and graphs
related to word use across a single document or a collection. Voyeur
adds, among other things, the ability to handle much larger files than
the previous tools could.
- First, we will look at how to use Voyeur with a single text, and examine some of the visualizations that are possible.
- Then we will learn how to use Voyeur with a corpus.
- You will also have the opportunity to try Voyeur on your own corpus, if you have one.
- Finally, we will examine some of the more advanced features provided by Voyeur.
- Now make sure you can connect to the wireless
- Connect to Hermeneuti.ca and explore the resources there. Here are some useful links:
2.0 Visualizing a Single Text
In the first part of the Workshop we will show you how to use Voyeur to
visualize a single text as a way of learning the interface. We will work
with the Introduction, Preface, Chapter 1 and Chapter 2 of Mary
Shelley's Frankenstein. The plain text is here:
http://taporware.ualberta.ca/sampleDocs/plainText.txt - This is just a couple of chapters
http://www.gutenberg.org/cache/epub/84/pg84.txt - This is the Gutenberg version of the full text
In order to focus on each tool independently, we will open each Voyeur tool separately.
- First we will look at Cirrus: http://voyeurtools.org/tool/Cirrus/
- Cirrus is a visualization tool that displays a word cloud relating to
the frequency of words appearing in one or more documents. One can click
on any word appearing in the cloud to obtain detailed information about
its relative frequency. The larger the word, the more frequent the term.
- Show how to load a text by copying one of the Frankenstein URLs into the "Add Texts" box
- Show how hovering over the words reveals a number showing the word count of the current word in the corpus.
- Show how clicking on a word produces a textual set of results as a list on a new page. These results include a count, a relative count, and a trend graph.
- Next we will look at Links: http://voyeurtools.org/tool/Links/
- Links finds collocates for words and displays the links between them
using a force-directed graph. It shows the frequencies of terms found in
proximity to a keyword, drawn as a web of terms. Once you arrive at
Links, insert or upload your content and let the tool perform its
analysis. You will be presented with a web-type visualization. You may
hover over words to find data pertaining to that word within your
corpus. You may also double-click on any word to find a more detailed
analysis. Clicking and dragging allows you to organize the terms. If
there are multiple documents within the corpus, they will be coloured
differently.
- Load a text by copying one of the Frankenstein URLs into the "Add Texts" box
- If you hover over a term, Voyeur will tell you its linkage within the
corpus documents.
- Try dragging and dropping terms to organize them.
- If you would like to manipulate the visualization, right-click on any of
the terms and choose 'Stick/unstick' or 'Remove'. 'Stick/unstick' pins
the term in place so that it does not move when other terms are moved.
'Remove' simply removes the term from the visualization.
- Clicking on the options button (the button that looks like a gear) will
launch a dialog box with various options pertaining to the Links tool.
The stop words list lets you exclude words from the visualization
(usually words such as 'a', 'the', and 'and'). 'Node size determined by
type frequency' is the default, and sizes each term by how often it
appears in the documents. Sorting by 'Node links' will result in terms
appearing larger if they are heavily linked with other terms. 'Autofit
graph on screen' sizes the graph depending on the size of your browser
window. 'Remove orphans' will remove terms which are not linked to any
other term in the visualization.
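The idea of a collocate used by Links can be illustrated with a short Python sketch. This is not Voyeur's implementation: the function name `collocates`, the simple tokenizer, and the default window of five words are all assumptions made for illustration. It counts the words that occur within a fixed number of tokens on either side of a keyword.

```python
import re
from collections import Counter


def collocates(text, keyword, window=5):
    """Count words found within `window` tokens of each occurrence of `keyword`.

    Hypothetical helper for illustration only; Voyeur may define
    proximity and tokenization differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # do not count the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


sample = "The monster fled. The wretched monster wept, and the monster spoke."
for word, count in collocates(sample, "monster", window=2).most_common(3):
    print(word, count)
```

In a graph such as the one Links draws, each counted word would become a node linked to the keyword, and a stop word list would filter out terms like 'the' before drawing.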
Now we will look at Word Trends: http://voyeurtools.org/tool/TypeFrequenciesChart/
- Term Frequencies Chart shows how terms are distributed across
document(s) in a corpus (documents are shown in the order in which they
were added). Every charted line represents one word common throughout
the entire corpus. If you hover over specific points it will give you
specific information about that word in a specific document.
- When you analyze a corpus with Term Frequencies Chart, you will
initially have common words at the top of the chart with colour codes.
You will see lines within the graph which are coloured to match
those words. If you click on one of the terms at the top, it will omit
that term from the graph.
- When we hover over the segment points, we can see the frequency of that
term in that segment. If you click on the
point, Voyeur will open a new window with detailed information about that
segment and term within its Document KWICs tool.
If you click and drag on a section of the chart it will zoom in to
that section. To reset the chart to its original state, click on “reset
zoom”.
If you would like to see fewer or more segments on the chart, simply
click on “Segments” at the bottom left of the chart to choose the
desired number of segments.
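The per-segment frequencies plotted by a trends chart can be approximated as follows. This is a minimal Python sketch under stated assumptions: the function name `segment_frequencies`, the tokenizer, and the way the text is cut into segments of equal length are illustrative choices, not Voyeur's actual method.

```python
import re


def segment_frequencies(text, term, segments=10):
    """Split the text into equal-length segments and count `term` in each.

    Hypothetical helper; the last segment may be shorter (or empty)
    when the token count does not divide evenly.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0] * segments
    size = -(-len(tokens) // segments)  # ceiling division
    return [
        tokens[s * size:(s + 1) * size].count(term)
        for s in range(segments)
    ]


text = "a b a c a d b a"
print(segment_frequencies(text, "b", segments=4))
```

Changing the `segments` argument plays the same role as choosing a different number of segments in the chart: fewer segments smooth the line, more segments show finer detail.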
Other Things
- We will look at how to get help (Mention Quick Guide)
- Some things to try:
- Experiment with the Options (like the Stop Word list)
- Create a Favorites list for a theme and explore that list
- Search for phrases
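To see what a stop word list changes, it can help to compute the counts behind a word cloud by hand. The Python sketch below is only an illustration: `word_counts` and the three-word `STOP_WORDS` set are hypothetical, and Voyeur's own stop lists and relative-count formula may differ. It returns the raw count and the relative count (share of all tokens) for each remaining word.

```python
import re
from collections import Counter

# Hypothetical minimal stop list; real stop lists are much longer.
STOP_WORDS = {"a", "the", "and"}


def word_counts(text, stop_words=STOP_WORDS):
    """Return {word: (raw count, relative count)} with stop words excluded.

    The relative count is computed against all tokens, including
    stop words, which is an assumption made for this sketch.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(t for t in tokens if t not in stop_words)
    return {word: (n, n / total) for word, n in counts.items()}


for word, (raw, relative) in word_counts("The cat and the dog and a cat").items():
    print(word, raw, relative)
```

A word cloud would then scale each word's size by its raw or relative count.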
3.0 Analyzing a Corpus
In the second part of the Workshop we will look at working with a corpus
or collection of many texts. We will use Voyeur on the archives of
HUMANIST from 1987 to 2008 (21 documents). The Voyeur index is at:
http://voyeurtools.org/?corpus=humanist
- Bubblelines is a visualization tool that helps to understand patterns
of word repetition in one or more documents. Each document is
represented as a horizontal line and each search term is represented as
a bubble; the bubble represents the frequency of the term in the
corresponding segment of text (the text is divided into segments of
equal length). The larger the bubble, the more frequent the term.
- Load a text by copying one of the Frankenstein URLs into the "Add Texts" box
Hovering over a bubble, or set of bubbles, will cause a box to appear
that displays the frequency counts for that segment of text.
Similarly, hovering over the number at the end of the line will cause
a box to appear that summarizes the frequency for the entire document.
When Bubblelines first loads a corpus, you may see terms that have
been pre-selected and included in the URL or embedded page. If no terms
are specified, Bubblelines automatically fetches the five most frequent
terms and displays bubbles based on those.
- You can remove the default terms by clicking on the "Clear Terms" button.
You can add additional terms to be displayed using the "Find Term"
box. Note that available terms will appear as you type and you can pick
an item from the list to have it added.
In addition to adding and removing terms, you can toggle the display
of the terms that have been loaded. To do so simply click on the term
(active terms are underlined).
- ScatterPlot creates a scatter plot graph of terms, spaced by their variation from one another. Once you arrive at ScatterPlot,
insert or upload your content and let the tool perform its analysis. You
may hover over the plotted dots and click on them for more information.
When you first load ScatterPlot, you will see a variety of terms
plotted on a graph. If you hover over the terms, you will see their
variation explained by each component on the x and y axis. If you click
on any of these terms, it will bring you to the Document KWICs tool for
further analysis.
ScatterPlot offers options for changing the plot. The terms button
allows you to choose how many terms should be displayed. The dimensions
button lets you switch between a two- or three-dimensional graph. Toggle
labels simply removes or adds labels for the terms on the graph.
- Some other things to try:
- Set stoplists. You may want to exclude common words. To do this, click on the "Options" button, represented by a gear icon in the upper-right.
- Manage multiple documents.
- Show how to group results
- Show how to compare documents
- Try looking for trends yourself using the different tools
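Several of the tools above hand off to the Document KWICs tool. A keyword-in-context (KWIC) listing shows each occurrence of a word with a few words of context to its left and right. The Python sketch below shows the general idea only; the function name `kwic`, the tokenizer, and the context width are assumptions, not a description of how Voyeur builds its listing.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Hypothetical helper; `width` is the number of words of context
    kept on each side of the keyword.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


line = "Swift as a shadow, short as any dream; Brief as the lightning"
for left, word, right in kwic(line, "dream"):
    print(f"{left:>20} | {word} | {right}")
```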
4.0 Using Your Own Text
- Now you can try your own text. There are different ways of providing Voyeur a text:
- Typing a text or pasting it in
- Typing in one or more URLs (as we have done above)
- Uploading a text, using the "upload" button
- For uploading, there are a number of formats of texts that will work:
- file formats: text, HTML, XML, RSS, TEI, PDF, MS Word, RTF
- Finally, we will discuss caching and so on.
5.0 Exporting Data and Quoting Analytics
We will now show how to export data and quote analytical results:
- How to export tab-separated values that can be copied and pasted into Excel
- How to export XML results from KWICs (for instance)
- How to quote an analytical result in TADA.
- Go to http://tada.mcmaster.ca/Sandbox/VoyeurWorkshop to try it yourself.
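For participants who want to post-process results outside Voyeur, tab-separated values are easy to produce and to paste into a spreadsheet. The Python sketch below writes a small table of term counts as TSV text. The function name `to_tsv` and the column names `term` and `count` are hypothetical; the layout of Voyeur's own export may differ.

```python
import csv
import io


def to_tsv(rows, header=("term", "count")):
    """Serialize rows as tab-separated text with a header line.

    Hypothetical helper; column names are illustrative only.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(to_tsv([("monster", 3), ("creature", 2)]))
```

Text in this form can be pasted directly into Excel, where each tab starts a new column.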
6.0 Advanced and Other
- There are other beta tools in Voyeur that can be accessed:
7.0 To Prepare
- Make sure we have Voyeur running with a backup
- Sort out how participants can get on wireless
- Powerbars for laptops
- Preindex texts