Project Blog

History of Humanities Computing

Stéfan and I were discussing the Canadian history of humanities computing and text analysis. Is it true that there is a tradition of text analysis tool development in Canada? Has this been an area of strength? How would we answer this question?

I have put together a wiki of information on the Canadian capacity in this area:

http://tapor.ualberta.ca/taporwiki/index.php/The_Academic_Capacity_of_the_Digital_Humanities_in_Canada

A good history of humanities computing in Canada has yet to be written.

Geoffrey R.

Developing Glossary, Appendix on Text Preparation and Recipes

We are trying to figure out how to integrate the following three useful components:

  • Glossary - We have a glossary of terms now. Should it be expanded?
  • Appendix on Text Preparation - We need to provide more information on how to find and prepare a text. We have a draft appendix, but need to develop it. My sense is that a lot of problems occur in the text preparation.
  • Recipes - Over the years we have created recipes and outlines for workshops. We need to figure out how to weave those into the book. We don't want to overwhelm the book with information that might get dated quickly, but we would like this to genuinely help people try it out.

Cool New Tool

Stéfan has created a cool new tool that you can see with the Humanist archive here:

http://voyant-tools.org/tool/TermsRadio/?corpus=humanist&stopList=stop.en.taporware.txt

This tool plays with ideas we have had about real-time analytics and animation of analytics. It lets you explore the evolution of themes in Humanist.

With this tool you can see clearly the explosion of interest in the web, but what is more interesting is what other words rose or fell with the web. How did the shift to the web affect humanists?

Some of the words that intrigued me were "department" and "London."

Literary text analysis vs linguistic text analysis

One thing we need to do is to figure out the relationship of literary text analysis and linguistic text analysis. Could the two merge? Should literary build on linguistic? Some initial differences:

  • Linguists are interested in analysis of much smaller units while the lit folk are generally looking at large scale trends.
  • Linguists tend to therefore build computational tools that work on smaller units and are designed to demonstrate a theory while literary analysis tools are meant to support reading practices where the theory might be developed through multiple practices.
  • Linguists are interested in developing formal descriptions of linguistic phenomena - developing theories of language, while literary analysts are interested in understanding particular texts. In this way linguistics is more of a science while literary analysis belongs to the humanities.

An interesting exception is corpus linguistics which tends to work on larger corpora, though they use them to develop theories about language not literature.

Collapsing terms

We have added a feature now that will collapse a set of terms in the Word Trends. That allows one to develop a group of related terms and then graph the group as a whole. This is important given the forms a word or theme can take.
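To make the idea concrete, here is a minimal sketch of collapsing terms. It is not the Voyant code. The function names, the toy text and the way the text is cut into segments are all assumptions made for illustration.

```python
from collections import Counter

def segment_counts(words, segments):
    """Cut a word list into at most `segments` roughly equal pieces and count terms in each."""
    size = max(1, -(-len(words) // segments))  # ceiling division
    return [Counter(words[i:i + size]) for i in range(0, len(words), size)]

def collapsed_trend(counts, group):
    """Sum the counts of every term in the group, segment by segment."""
    return [sum(c[term] for term in group) for c in counts]

# Toy text (made up for this sketch), split on whitespace only.
text = "god made the world the deity designed the world gods and deities differ"
words = text.lower().split()
counts = segment_counts(words, 2)

# Graph the group as a whole rather than each form separately.
trend = collapsed_trend(counts, {"god", "gods", "deity", "deities"})
print(trend)  # [2, 2]
```

Each form on its own appears only once in the toy text, so the theme only becomes visible when the group is counted together.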

Stéfan also ran Hume's Dialogues Concerning Natural Religion through Mallet and here are the topics proposed:

It hardly seems fair for only me to have homework. Here's yours: look through the list of 20 topic clusters below to see if there's anything of interest. I took the entire dialogue, distilled the nouns, broke it up into parts, and fed it through Mallet to do topic modelling.

  • 0 cause nature mind idea object effect manner work consequence difficulty word recourse quality meaning phenomenon existence way abstract appellation
  • 1 world order operation ignorance fact faculty difficulty judgement person variety action course disorder thing soul supposition solution antagonist thought
  • 2 man mankind view degree intention theology opposition state temper reflection side law vice reach virtue violence fortune passage affair
  • 3 power inference creature conjecture hypothesis author inconvenience spring rest preservation architect endowment workmanship notion industry stock volition general number
  • 4 experience man arrangement earth similarity perfection house resemblance mean conclusion piety propriety representation situation presumption curiosity moon temerity fancy
  • 5 part universe circumstance appearance question economy present regard understanding absurdity advantage conclusion faculty plan prejudice proportion activity kind bound
  • 6 principle world reason animal thing generation origin rule vegetable vegetation machine step observation planet tree standard species system essence
  • 7 life misery pain pleasure phenomenon benevolence happiness goodness attribute enemy feeling enjoyment rectitude condition death complaint health wickedness folly
  • 8 matter order form motion hypothesis thought system probability position force revolution adjustment experience moment situation element alteration instance change
  • 9 deity attribute supposition production figure anthropomorphism species force respect discovery philosopher ship experiment solidity scale controversy mouth cloud surface
  • 10 religion regard dispute superstition eternity time controversy maxim terror motive artifice inclination morality dogmatist proposition suspense research oath impulse
  • 11 philosophy science scepticism mind opinion evidence reality sect education doctrine kind moral history remark light heart humour earnest life
  • 12 argument body theory art proof place kind system language foundation theism end scruple theist discourse scene air knowledge state
  • 13 principle sense philosopher inquiry passion atheist doubt sceptic certainty composition resemblance school company comprehension disposition uncertainty learning self danger
  • 14 system age truth method event point sense manner determination people satisfaction eye conversation weight turn veneration horror war triumph
  • 15 animal purpose society opinion capacity love scepticism beauty want praise attention prospect abuse energy plant impossibility concession prosperity contentment
  • 16 analogy design contrivance objection case intelligence instance occasion term invention voice parent chaos volume interest instinct structure darkness fertility
  • Topic 16 seems to be words around the argument for design.
  • 17 reason sentiment species conduct influence use difference assent imagination faith structure affection case god circumstance study stroke apprehension belief
  • 18 nature spirit authority account number hand comparison name wisdom disposition eye day expression importance mercy justice master race sorrow
  • 19 necessity argument being thing succession existence time topic contradiction beginning product conception error chain chance dialogue nonexistence accident weakness
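The "broke it up into parts" step described above can be sketched roughly as follows. This is only an illustration, not what Stéfan actually ran. The part size, the file names and the helper functions are assumptions, and the noun extraction step is not shown (the word list below is taken from topic 0). The idea is that each part becomes its own small document, which a topic modelling tool such as Mallet can then treat as a unit.

```python
import tempfile
from pathlib import Path

def split_into_parts(words, part_size):
    """Break a word list into consecutive parts of at most part_size words."""
    return [words[i:i + part_size] for i in range(0, len(words), part_size)]

def write_parts(parts, out_dir):
    """Write one plain-text file per part and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, part in enumerate(parts):
        path = out / f"part_{n:03d}.txt"  # hypothetical naming scheme
        path.write_text(" ".join(part), encoding="utf-8")
        paths.append(path)
    return paths

# Stand-in for the distilled nouns of the dialogue.
nouns = "cause nature mind idea object effect manner work".split()
parts = split_into_parts(nouns, 3)

# Write to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in write_parts(parts, tmp)]

print(len(parts), names)
```

In practice the part size would be much larger, and it is worth trying a few sizes since the topics proposed can shift with how the text is divided.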

 

 

Ideas from Just What Do They Do?

We have an SSHRC-funded project that is looking at just what people do with text analysis. Some of the interesting points that came up:

  • Could we use the colour and orientation in Cirrus for information - could we use colour for clusters? What would orientation then mean?
  • How do people use existing tools? It seems people use Excel, Word, and Google as research tools. How do they use these? Could you do text analysis without specialized tools?
  • Some people want analytics integrated into text environments. What would be the best way to do that?
  • Many people don't see the big picture - they are reacting to what they are shown, but don't have ideas as to what text analysis could/should be.

 

THATcamp Kansas

I (Geoffrey Rockwell) am giving a workshop on Voyant at the Kansas 2012 THATcamp. This time we had a number of backup servers set up and they all worked well. Some participants were working with Arabic, which worked to a degree. Stéfan set up a system that resolves to different servers:

http://bit.ly/VoyantCirrusFrankenstein

resolves to

http://resolve.voyant-tools.org/tool/Cirrus/?corpus=frankenstein&stopLis...

which then redirects to

http://temp.voyant-tools.org/tool/Cirrus/?corpus=frankenstein&stopList=s...

That's in "workshop" mode (where the temp instance is favoured). If you remove the incontext part of the URL

http://resolve.voyant-tools.org/tool/Cirrus/?corpus=frankenstein&stopLis...

it resolves to the main server.

http://voyant-tools.org/tool/Cirrus/?corpus=frankenstein&stopList=stop.e...

Some of the issues/questions that came up:

  • How should projects like ours deal with server load? I think the resolving system that we tried, for the first time in the workshop, actually worked well.
  • The documentation should be consistent. Different links go to different help places. They should probably all go to docs.voyant-tools.org.
  • We need to provide more documentation on the Correspondence Analysis tool.
  • A number of people want to do linguistic work and would like the ability to lemmatize, search by lemmas, and use texts with POS (Part of Speech) information. This will be difficult to do. Can we add some wild card searching for word lists? Could we imagine a special skin for linguistic work with special tools?
  • A number of people have XML-encoded (TEI) texts. We need to document better what we can do with XML and figure out a way to give users control over their XML. One idea is to be able to subset (create a corpus of smaller documents from a text) based on XPath where all passages that fit criteria get aggregated into "documents".
  • We need to test and document the stop word list editing feature. Is there a way to save and reuse a custom list? Can one work with Cyrillic words?
  • We should create a list of known and sharable corpora.
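The XPath subsetting idea in the list above could look something like the following sketch. It is not a Voyant feature as written. The sample markup, the element and attribute names (sp, who) and the subset function are all hypothetical, and it uses Python's standard library, which supports only a limited subset of XPath. Real TEI files also declare a namespace, which would have to be handled.

```python
import xml.etree.ElementTree as ET

# Made-up sample: speeches marked up by part and speaker.
SAMPLE = """<dialogue>
  <part n="1">
    <sp who="A">first passage</sp>
    <sp who="B">second passage</sp>
  </part>
  <part n="2">
    <sp who="A">third passage</sp>
  </part>
</dialogue>"""

def subset(xml_text, xpath, key_attr):
    """Aggregate the text of all elements matching xpath into one
    "document" per value of key_attr."""
    root = ET.fromstring(xml_text)
    docs = {}
    for el in root.findall(xpath):
        key = el.get(key_attr, "unknown")
        # Gather nested text and normalize whitespace.
        text = " ".join("".join(el.itertext()).split())
        docs.setdefault(key, []).append(text)
    return {key: " ".join(passages) for key, passages in docs.items()}

by_speaker = subset(SAMPLE, ".//sp", "who")
print(by_speaker)
```

Here every speech by the same speaker ends up in one "document", so the speakers can be compared with each other as if they were separate texts.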

 

From Concordance to Ubiquitous Analytics

We have finished another chapter, the one that provides a history of text analysis from concordances to ubiquitous analytics. Hurrah!

The Measured Word

The first draft of The Measured Word chapter is done. This is a chapter that goes through what we can do to texts with a computer. It introduces string processing for humanists. The chapter covers stuff that many digital humanists will know, but it is meant as an introduction to thinking like a computer about texts.

The frame of the chapter is a tradition of thinking about artificial life, intelligence and interpretation which includes Pygmalion, Frankenstein, Searle's Chinese room, Dreyfus on AI and Powers. Richard Powers has a brilliant novel, Galatea 2.2, where the narrator (a semi-autobiographical Richard Powers) helps train an AI designed to pass a Master's English exam as a version of a Turing test. The story uses this challenge to revisit the story of Pygmalion and Galatea - the story of an artist (trainer) getting close with their creation. The story deals with computer assisted interpretation - an AI trained to respond to exam questions about a literary text - something we are also trying to do, though differently. We are not trying to build artificial interpreters but interpretative aides or tools to augment our interpretation.

In our conference call today we also discussed the next steps. We looked at some issues with word trends and single texts. We talked about how topic modeling should be promising for the Game Studies experiment.

Change in outline

We reviewed what has to be written and changed our outline a bit. We are introducing a 4th example on Hume's Dialogues Concerning Natural Religion. This will give us:

  • An example of sustained analysis of a literary/philosophical text.
  • An example of using markup (we will mark it up in simple XML for parts and speakers).
  • A chance to experiment with visualizations.

We are also going to experiment with Topic Modelling (on the Game Studies corpus) and Mandala.
