This workshop outline is for a Summer School at Trinity College Dublin. See http://dho.ie/summerschool2011 for the full description. This outline is for Day 3 on Generating Textual Data:
Day 3: Generating Textual Data, Tobias Blanke and Geoffrey Rockwell
Based on the results of Day 2, participants will dig deeper into the details of generating textual data using text and data mining techniques. Participants will learn methods to algorithmically create textual data while critically evaluating existing tools, methods, and solutions as well as their future potential. They will gain insights on how generic services need to be modified to serve the needs of humanities research. Finally, we will investigate how generated output can be reused in the emerging web of data.
This is an outline for a workshop on Voyeur. It was developed for a workshop before DH 2010 in London, England.
1.0 Introduction
- The workshop leaders will introduce themselves:
- What will happen?
- What is text analysis? A very short introduction.
- Voyeur tools: Using simple tools in Voyeur
- Distant Reading: Analyzing a single text
- Distant Reading: Analyzing a corpus
Other things to try:
- Trying Voyeur on your corpus
- Trying different Voyeur tools like the OpenCalais tool and exporting results
- Conclusion: Stepping back and looking again at what text analysis is through a brief historical review.
- Connect to Hermeneuti.ca and explore the resources there. Here are some useful links:
2.0 TAPoRware: A Simple Recipe for Studying Themes in a Text
In the second part of the Workshop we will show you how to use TAPoRware to analyze a single text as a way of thinking about techniques in text analysis. We will work with the Introduction, Preface, Chapter 1 and Chapter 2 of Mary Shelley's Frankenstein.
Glossary: Plain text refers to a text without any additional formatting affecting its human readability, often found in .txt files. Plain text files do not require a specialized program, such as a word processor, to read them.
The plain text is here:
http://taporware.ualberta.ca/sampleDocs/plainText.txt - This is just a couple of chapters
http://www.gutenberg.org/cache/epub/84/pg84.txt - This is the Gutenberg version of the full text
We will also be using some TAPoRware tools and Recipes for TAPoRware. The Tools and Recipes are here:
List Words: http://taporware.ualberta.ca/~taporware/textTools/listword.shtml - Use short Frankenstein
Concordance Tool: http://taporware.ualberta.ca/~taporware/textTools/findtext.shtml - Use short Frankenstein
Glossary: A concordance is a gathering of passages that "concord" or agree, usually passages containing a sought-for word. Concordances are a reading tool that goes back to the Middle Ages. They are typically lists of words with their appearances. A concordance for the Bible, for example, would have entries for all the content words of the Bible in alphabetical order. Each entry would include information about where the word appears and some context. Searching for words on a computer now typically returns a concordance called a Key Word in Context (KWIC), with the sought word down the center and a few words of context on either side. Google returns a type of concordance when you search for a word, with an example of the word in context for each page it recommends. See the Wikipedia entry on Concordance (Publishing).
Weighted Centroid: http://taporware.ualberta.ca/~taporware/otherTools/wcentroid.shtml - Use short Frankenstein
Principal Component Analysis: http://taporware.ualberta.ca/~taporware/betaTools/pca.shtml - Use short Frankenstein
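The List Words and Concordance tools above can be approximated in a few lines of Python, which may help when thinking about what the tools do. This is a minimal sketch and not TAPoRware's own implementation: the regular-expression tokenizer and the function names `tokenize`, `list_words` and `kwic` are assumptions made for illustration. The sample is the passage from A Midsummer Night's Dream that appears in the KWIC glossary example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def list_words(text, top=10):
    """Return the `top` most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(top)


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words on either side."""
    words = tokenize(text)
    lines = []
    for i, word in enumerate(words):
        if word == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>30} | {word} | {right}")
    return lines


sample = (
    "I have had a dream, past the wit of man to say what dream it was: "
    "man is but an ass, if he go about to expound this dream."
)

for word, count in list_words(sample, top=3):
    print(word, count)
for line in kwic(sample, "dream"):
    print(line)
```

Note that this sketch counts every word, including function words such as "to"; the real tools offer options (such as stop word lists) that this example leaves out.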
2.2 Using Voyeur Simple Tools
Cirrus Word Cloud (Frankenstein): http://dev.voyeurtools.org:8080/tool/Cirrus/?corpus=1309937516546.6692&query=&stopList=stop.en.taporware.txt
Glossary: A word cloud is a visual presentation of keywords drawn from a text, visually differentiated based on their position and frequency of use in that text.
Cirrus Word Cloud (Austen): http://voyeur.hermeneuti.ca/tool/Cirrus/?corpus=1308408654248.9846&stopList=stop.en.taporware.txt
Other tools from Voyeur can be found here: http://hermeneuti.ca/voyeur/tools
3.0 Distant Reading: Analyzing a Single Text
In the third part of the Workshop we will show you how to use Voyeur to analyze a single text as a way of learning the interface.
- We will open Voyeur:
- Show how to load a text (Frankenstein: http://www.gutenberg.org/cache/epub/84/pg84.txt). Discuss different types of texts that can be loaded.
- Show the different panels that appear initially
- Discuss the order in which they open and the Summary panel
  Glossary: Web frameworks like the TAPoR Portal organize information into panels (sometimes called portlets or coplets). These can be minimized, maximized and closed using the three buttons in the upper left-hand corner of the panel. With Voyant you can export panels of results and place them into other web sites.
- Discuss common features to panels
- Go over the Words in the Entire Corpus panel (Options, Columns, Search, Favorites)
- Show how to manage panels
- Discuss trigger order of panels (flow within Voyeur)
- Show how to get help (Mention Quick Guide)
- Show how to build a list of favorite words by searching for words and saving them to Favorites
- Now you should try Voyeur with your text or the Frankenstein text above. To open the Frankenstein text, use this link:
http://voyeur.hermeneuti.ca/?corpus=1309937028026.8131
- Some things to try:
- Experiment with the Options (like the Stop Word list)
- Create a Favorites list for a theme and explore that list
- Search for phrases
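The three experiments above (a stop word list, a Favorites list for a theme, and phrase searching) can also be sketched in Python to make clear what each one does to the counts. This is an illustrative sketch only, not how Voyeur works internally: the tiny stop list, the hypothetical favorites list, the function names and the sample sentence are all invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real lists hold several hundred words.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "from"}

# A hypothetical Favorites list for a theme.
FAVORITES = ["monster", "creature", "wretch"]


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count words after removing stop words."""
    return Counter(w for w in tokenize(text) if w not in stop_words)


def favorite_counts(text, favorites=FAVORITES):
    """Count only the words on the favorites list."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in favorites}


def count_phrase(text, phrase):
    """Count a multi-word phrase, ignoring case, punctuation and line breaks.

    Because punctuation is dropped, a match may span a sentence boundary.
    """
    normalized = " ".join(tokenize(text))
    target = " ".join(tokenize(phrase))
    return len(re.findall(r"\b" + re.escape(target) + r"\b", normalized))


# Invented sample text, not a quotation from Frankenstein.
sample = (
    "The monster fled. The creature, a miserable wretch, "
    "watched the monster from afar."
)

print(content_word_counts(sample).most_common(3))
print(favorite_counts(sample))
print(count_phrase(sample, "the monster"))
```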
4.0 Distant Reading: Analyzing a Corpus
In the fourth part of the Workshop we will look at working with a corpus or collection of many texts. We will use Voyeur on the archives of HUMANIST from 1987 to 2008 (21 documents). The Voyeur index is at:
http://voyeurtools.org/?corpus=humanist&skin=scatter&stopList=stop.en.taporware.txt
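The address above carries its settings in the query string (corpus, skin and stopList). A short Python sketch can take such an address apart and rebuild it with a different value. It only manipulates the string and does not contact the server; the helper name `with_params` is an assumption for illustration, and whether the server accepts any particular combination of values is not checked here. The replacement corpus identifier is the one used in the skinning section of this outline.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = (
    "http://voyeurtools.org/"
    "?corpus=humanist&skin=scatter&stopList=stop.en.taporware.txt"
)

# Split the address and read the query string into a dictionary of lists.
params = parse_qs(urlsplit(url).query)
print(params)


def with_params(address, **changes):
    """Return `address` with some query parameters replaced or added."""
    parts = urlsplit(address)
    # Keep the first value of each parameter; these URLs use one value each.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    query.update(changes)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_params(url, corpus="1309931394540.8106"))
```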
- We will discuss:
- Different skins with different panels
- Correspondence analysis and the exploration of a large corpus
- Try looking for trends yourself
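One simple way to look for a trend is to compare how often a word occurs in each document relative to that document's length. The sketch below assumes the corpus is available as a list of (label, text) pairs in chronological order; the function names and the two placeholder documents are invented for illustration and are not HUMANIST content. It does not attempt correspondence analysis.

```python
import re


def relative_frequency(text, word, per=1000):
    """Occurrences of `word` per `per` words of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return per * tokens.count(word.lower()) / len(tokens)


def trend(corpus, word, per=1000):
    """Return (label, relative frequency) for each document in order."""
    return [(label, relative_frequency(text, word, per)) for label, text in corpus]


# Placeholder documents standing in for yearly archives.
corpus = [
    ("year 1", "computing and text and text"),
    ("year 2", "text text text text"),
]

for label, value in trend(corpus, "text", per=100):
    print(label, value)
```

Normalizing by document length matters because the documents in a corpus are rarely the same size; raw counts would mostly reflect which documents are longest.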
5.0 Using your own text
- Now you can try your own text. We will show the different ways of providing Voyeur with a text:
- Typing a text or pasting it in
- Typing in one or more URLs
- Uploading a text
- We will then discuss the formats of texts that will work, and what will happen to them:
- file formats: text, HTML, XML, RSS, TEI, PDF, MS Word, RTF
Glossary: HTML, or Hypertext Markup Language, is a language used in web development to make a text readable by web browsers. HTML is primarily formed of paired elements, such as <body></body> or <p></p>, that apply some characteristic to the text within them. One pair of elements may be nested inside another like this:
<body><p></p></body>
In this case, <body></body> marks the beginning and end of the body of the document, while <p></p> marks the beginning and end of a paragraph within the body. Elements may also be modified by attributes and attribute values:
<p class="hangingindent">
In this case, the paragraph element has the attribute 'class' and the attribute value 'hangingindent'. Attribute/attribute value pairs are frequently used in combination with CSS to apply formatting to the text within the element.
Glossary: XML, or Extensible Markup Language, is a language used in web development to make a text readable by web browsers and/or store data. Like HTML, XML is primarily formed of paired elements. Unlike HTML, the elements are defined by the user, rather than predefined. For example, both <book></book> and <murfle></murfle> are valid element pairs. These elements apply characteristics and metadata to the text within them. One pair of elements may be nested inside another:
<book><title></title></book>
Elements may also be modified by attributes and attribute values:
<book format="hardcover">
In this case, the book element has the attribute 'format' and the attribute value 'hardcover'. In addition to storing metadata about the text, attribute/attribute value pairs are frequently used in combination with CSS to apply formatting to the text within the element.
- Finally, we will discuss caching and related matters
- Now try your own text.
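When a marked-up format such as HTML is analyzed, the tags generally have to be separated from the words before counting. The sketch below shows one way this could be done with Python's standard library. It is an assumption about the general kind of processing involved, not a description of what Voyeur actually does to an uploaded file, and the class name `TextExtractor`, the function name `html_to_text` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the plain text of `markup` with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<body><p class="hangingindent">You will rejoice to hear.</p>'
    "<p>No disaster.</p></body>"
)
print(html_to_text(sample))
```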
6.0 Exporting Data and Quoting Analytics
We will now show how to export data and quote analytical results:
- How to export tab-separated values that can be copied and pasted into Excel
- How to export XML results from KWICs (for instance)
Glossary: A concordance or keyword in context (KWIC) is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right). Here is an example that shows the occurrences of the word "dream" in A Midsummer Night's Dream in TACTweb:
I.1/577.1 | Four nights will quickly dream away the time; | And
I.1/578.2 Swift as a shadow, short as any dream; | Brief as the
II.2/585.1 | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1 this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1 as the fierce vexation of a dream. | But first I will
IV.1/594.2 to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2 rare | vision. I have had a dream, past the wit of man to
IV.1/594.2 the wit of man to | say what dream it was: man is but an
IV.1/594.2 he go | about to expound this dream. Methought I was--there
IV.1/594.2 his heart to report, what my dream | was. I will get Peter
IV.1/594.2 to write a ballad of | this dream: it shall be called
IV.1/594.2 it shall be called Bottom's dream, | because it hath no
V.1/599.1 | Following darkness like a dream, | Now are frolic: not a
V.1/599.2 theme, | No more yielding but a dream, | Gentles, do not
- How to quote an analytical result in TADA.
- Go to http://tada.mcmaster.ca/Sandbox/VoyeurWorkshop to show how to insert a panel.
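To make the two export formats concrete, here is a minimal Python sketch that writes a word list as tab-separated values (which Excel accepts when pasted) and a set of KWIC hits as XML. The element names `kwic`, `hit`, `left`, `keyword` and `right`, the column headers and the function names are hypothetical; they are not Voyeur's actual export schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_tsv(rows, header=("word", "count")):
    """Return rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def kwic_to_xml(keyword, hits):
    """Return KWIC hits, given as (left, right) context pairs, as an XML string."""
    root = ET.Element("kwic", keyword=keyword)
    for left, right in hits:
        hit = ET.SubElement(root, "hit")
        ET.SubElement(hit, "left").text = left
        ET.SubElement(hit, "keyword").text = keyword
        ET.SubElement(hit, "right").text = right
    return ET.tostring(root, encoding="unicode")


print(to_tsv([("dream", 3), ("man", 2)]))
print(kwic_to_xml("dream", [("Swift as a shadow, short as any", "Brief as the")]))
```

Tab-separated values suit tables of numbers, while XML keeps the left context, keyword and right context of each hit as separate, labelled parts.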
7.0 Skinning Voyeur
We will now look at how you can develop a different skin.
- Open a corpus like http://voyeurtools.org/?corpus=1309931394540.8106
- Click on the Export button (the disk button in the upper right) and export to the layout builder
- Drag panels into the blank area to create a custom skin (Warning: many combinations won't work)