These are notes for our Game Studies experiment. The extreme text analysis will focus on the journal Game Studies. We want to try named entity recognition and network analysis.
While we prepared for this during our conferences, the experiment is taking place on April 4, 5, and 6, 2012.
What we are doing
Corpus: We decided to study the journal Game Studies. Here is how we prepared the corpus:
- We scraped all the issues of Game Studies from 2001 to 2011 so we had a 10 year corpus.
- We concatenated all the articles for each year so we had an HTML file for each year with as many articles as there were that year.
- We made sure that the HTML file has a title with the proper year so
- We cleaned up the HTML files to remove heading information from the original article pages and to make it one valid HTML file.
- We created a Zip that could be loaded into the regular Voyant skin or the special ResoViz skin.
- I (GR) also loaded all the articles into DevonThink to compare how that works and to be able to use it to find/read text.
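The concatenation step above can be sketched roughly as follows. This is a minimal illustration, not the script we actually used: the function name `build_year_file` and the `article` class are hypothetical, and it assumes the article bodies have already been stripped of the site's heading information.

```python
from html import escape


def build_year_file(year, article_bodies):
    """Wrap the cleaned bodies of one year's articles in a single HTML document."""
    # Each article keeps its own container so it can still be told apart later.
    sections = "\n".join(
        f'<div class="article">\n{body}\n</div>' for body in article_bodies
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        # The title carries the year so the document is labelled by year.
        f"<head><title>{escape(str(year))}</title></head>\n"
        f"<body>\n{sections}\n</body>\n"
        "</html>\n"
    )


# Hypothetical input: two article bodies for one year.
page = build_year_file(2001, ["<p>First article text.</p>", "<p>Second article text.</p>"])
```

One file per year produced this way could then be zipped together for loading.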
Questions: The questions we want to ask are:
- Who is important in Game Studies? Who is being quoted? Who is being referenced?
- What is the intellectual legacy of authors in Game Studies?
- What people cluster together?
- How do the named people change over the 10 years? Does this tell us anything about the evolution of the field?
- Do people cluster around certain concepts like ludology, narratology, play and so on?
Working Voyant Corpus:
- Voyant with regular skin of individual years and TAPoRware stopword list
- Voyant with regular skin of individual articles and TAPoRware stopword list
- ResoViz 1 skin using issues as what we look for when counting links between two names
- ResoViz 2 skin using articles as what we look for when counting links
- ResoViz 3 skin is the beta with the ability to export list of names by frequency - this is the current version of the software with things fixed.
- Collocates skin lets us explore the corpus using the collocates panel. This can be used to get a sense of the semantic field of names.
- Scatter skin lets us explore using correspondence analysis
- The latest ResoViz version (June 2012)
First Pass
- We looked at ResoViz 1 with the lowest number of names (25). With the exception of some names that were not people (Mario Bros., Lara Croft, and Homo Ludens) we had a reasonable list of names, both current and historical.
- I used the normal skin of Voyant to study all the names I didn't recognize. One was a publisher (Peter Lang) and some were editors of key collections. Here is the list of names:
Brian Sutton-Smith
Lawrence Erlbaum
Celia Pearce
Sherry Turkle
Peter Lang
Brenda Laurel
Chris Crawford
Henry Jenkins
Jesper Juul
Aki Jarvinen
Eric Zimmerman
Homo Ludens
Will Wright
Mario Bros.
Katie Salen
Edward Castronova
Lara Croft
Espen Aarseth
Roger Caillois
Gonzalo Frasca
Richard Bartle
Sid Meier
Johan Huizinga
Ernest Adams
Andrew Rollings
- Then I looked at ResoViz 2 corpus. This shows the links between names based on articles. Here is the list of names that appear:
Loch Ness Expedition
Espen Aarseth
Geoff King
Tanya Krzywinska
Noah Wardrip-Fruin
Ann Arbor
Janet Murray
Brian Sutton-Smith
Homo Ludens
Miguel Sicart
Celia Pearce
Gonzalo Frasca
Ernest Adams
Katie Salen
Andrew Rollings
Ian Bogost
Johan Huizinga
Roger Caillois
Richard Grusin
Peter Lang
T.L. Taylor
Eric Zimmerman
Jesper Juul
Henry Jenkins
Will Wright
- It is interesting to guess why the lists are different. It probably has to do with spread: names in the second list show up across more articles, whereas names in the first list show up across more issues.
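To see why the choice of unit matters, here is a toy sketch of counting links between names when the unit is an article versus a whole issue. It is not how ResoViz computes links; the function `count_links` and the sample sentences are invented for illustration, and names are matched by simple substring search.

```python
from collections import Counter
from itertools import combinations


def count_links(units, names):
    """Count, for each pair of names, the number of units in which both appear."""
    links = Counter()
    for text in units:
        # Sort so that each pair is always stored in the same order.
        present = sorted(name for name in names if name in text)
        for pair in combinations(present, 2):
            links[pair] += 1
    return links


names = ["Espen Aarseth", "Jesper Juul", "Janet Murray"]
# Toy data: one issue made of two articles.
articles = [
    "Espen Aarseth responds to Janet Murray.",
    "Jesper Juul on rules and fiction.",
]
issues = [" ".join(articles)]

by_article = count_links(articles, names)
by_issue = count_links(issues, names)
```

With the article as the unit only one pair is linked; with the issue as the unit all three names are linked to each other, because they merely share an issue.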
Pass 2
- Stéfan adapted the tool so it could export a list of the high frequency names and it can let you choose as few as 5 names. This is the ResoViz 3 skin.
- Now the top 5 names are: Espen Aarseth, Jesper Juul, Janet Murray, Johan Huizinga, and Homo Ludens. Quite a list.
- One of the things I can do is slowly increment the list of people using the arrow keys. We should make an animation of that.
- Looking at the exported list of high-frequency names, the ranking is different:
| Name | Frequency |
| Homo Ludens | 64 |
| Henry Jenkins | 54 |
| Espen Aarseth | 50 |
| Jesper Juul | 48 |
| Lara Croft | 42 |
| Gonzalo Frasca | 36 |
| Celia Pearce | 34 |
| Mario Bros. | 32 |
| Mario Brothers | 32 |
| Ian Bogost | 28 |
- There seem to be some interesting issues with the list. The frequencies are all even numbers. The frequencies don't match what you get if you search the corpus. One of the problems is that we are using different corpora.
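One way to cross-check a reported frequency is to count the name directly in the text of whichever corpus the tool was given. The sketch below is only an assumption about how such a check might look; `count_name` is a hypothetical helper, the sample string is invented, and it says nothing about why the exported numbers are all even.

```python
import re


def count_name(text, name):
    """Count case-sensitive occurrences of a full name, allowing any whitespace between its parts."""
    parts = r"\s+".join(re.escape(part) for part in name.split())
    # Lookarounds instead of \b so that names ending in a period still match.
    pattern = r"(?<!\w)" + parts + r"(?!\w)"
    return len(re.findall(pattern, text))


# Invented sample text with a line break inside the second occurrence.
sample = "Homo Ludens is cited here. Later, Homo\nLudens appears again."
direct = count_name(sample, "Homo Ludens")
```

Running the same count over each version of the corpus (with and without bibliographies) would show how much of the mismatch comes from using different corpora.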
Interpretation
One way to interpret this is to categorize the high frequency names. I used Excel to create a rubric of categories and then counted the people that fit one category or another. They seem to be of the following sort:
- Theoretically influential historical figures like Marshall McLuhan, Barthes, Deleuze, Dewey
- People influential in the area of new media like Manovich, Jameson, Landow, ...
- Early game studies folk like Bernard Suits, Johan Huizinga, Caillois
- Contemporary game studies folk like Espen Aarseth, Jesper Juul, and Janet Murray
These lists overlap. I wonder if it is useful to create these lists? Is it useful to know who to read to be grounded in a field or community? Some other things we can tell:
- We can sort of tell important edited collections based on editors' names. These can be people that are super-connected through the editing.
- We can tell important designers (Will Wright, Chris Crawford)
- We can tell some important characters from games like Lara Croft and the Mario Brothers.
Next Day: Pass 3
We decided that GR would start writing the essay while SS would try to fix the tool to make it more useful. SS also created a special skin for using the Collocates tool. These are now the tools we are using.
Animations: We are getting interesting animations when friction is low in Chrome. The words will bounce around producing animations that look like a light show. It is interesting to see the outlier names that link to clusters of names in the middle.
Name Frequencies and Link Frequencies: Now we can see the network graph in two ways based on two ways of assigning what are the most common names.
Next Day (April 6): Pass 4
Now we are doing pair analysis: Stéfan is doing the analysis while I take notes.
- We realize that something was changed in the visualization code in the switch from the first version of RezoViz to the current Voyant version. We exported the data and are using the older version. We see a very different graph now.
- We realize there seem to be relatively few edges (links between two nodes) because we are dealing with full names. We are going to ask the Stanford tool to do all names (including last names.)
- We got a new output with single names (as in Rockwell alone is a name.) Interestingly we found that Routledge is one of the top names. The publisher seems important.
It looks like we are going to have to work with a number of different edge files:
- The output of SNER with only double names (as in Jesper Juul)
- Output 1 (double names) cleaned up
- The output of SNER that grabs single names
- Output 3 cleaned up
We are now playing with the single names and different graphing approaches. Routledge comes up high as does Taylor.
Now we are trying to recognize places. This didn't work very well because of the bibliographies. The places are all from
Conclusions
- There is a core of people associated with Espen, the journal and the institute in Copenhagen who show up. The journal is biased that way.
- There are some key historical figures like Caillois and Huizinga
- There are some interesting theoretical influences from Immanuel Kant to Wittgenstein
- You can see who the contemporary game studies folks are
- It looks like Janet Murray and Espen are circling each other. Is that really in the dataset? We need to look at articles to see if that shows up.
Social Network Analysis
How are we applying social network analysis? First of all, we are not describing a social network so much as trying to use analytical tools to identify nodes (people) and edges (relationships between people). Normally you start with a described network and then apply analytical techniques from graph theory to it to predict outcomes and so on. Our system tries to derive a network from text. It only tells us that two people are named in the same article; it doesn't tell us what the connection is. We then need to try to figure out the connection. For that matter, the relationships that the tool finds aren't necessarily there.
What we are trying to do then is:
- Figure out what the connections are and whether they are relevant. One type of connection would be intellectual influence. Another would be disagreement (as in two people are connected because they disagree).
- Traditional sociology tried to describe outcomes and characteristics of people based on other characteristics like class, education and so on. Social network analysis tries to predict outcomes and describe characteristics based on relationships to other actors. A traditional way to survey a field is to look at the major ideas. The epidemiology of ideas tries instead to describe the relationships of intellectual influence and how ideas move. Rather than looking at "essential" characteristics (what they really meant) we look at superficial characteristics (who they mention) to trace intellectual heritage.
Pass 4: Ongoing
We now have a new version with new features: http://beta.voyant-tools.org/tool/RezoViz/?corpus=1333569662640.7239
We are getting strange results, but it has a number of new features:
- You can edit the links and refresh - that way you can eliminate false positives
- You can change what is shown from high-frequency links to high-frequency items (names)
- You can show places and orgs (not quite working yet.)
I'm getting a sense of a recipe:
- Look at high freq items starting at the lowest number
- Ask about people who show - check the word trends
- Delete those that are false positives (Homo Ludens)
- Check why names might be linked - use word trends to narrow in on article
- Then you work your way up.
- At around 60 you get a nice effect where the more popular names are clustered in the centre and you can mouse over to see the web of connections. Here is the chart showing Espen Aarseth.

It gives one a sense of the well connected people: who's got game (studies).
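A rough numerical counterpart to mousing over the graph is to count how many distinct names each person is linked to. The sketch below assumes an edge list shaped like an exported table of name pairs; the `degrees` function and the toy pairs are illustrative only and are not taken from our data.

```python
from collections import defaultdict


def degrees(edges):
    """Return, for each name, the number of distinct names it is linked to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-links
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {name: len(linked) for name, linked in neighbours.items()}


# Toy edge list; real pairs would come from the exported table of name pairs.
edges = [
    ("Espen Aarseth", "Janet Murray"),
    ("Espen Aarseth", "Jesper Juul"),
    ("Espen Aarseth", "Gonzalo Frasca"),
    ("Jesper Juul", "Gonzalo Frasca"),
]
# Most connected first, ties broken alphabetically.
ranked = sorted(degrees(edges).items(), key=lambda item: (-item[1], item[0]))
```

A count like this only says who is named alongside many others; as noted above, it does not say what the connection is.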
Issues
Here are some of the issues that came up:
- Done: In ResoViz I can't get fewer than 25 names
- In RezoViz I can't edit the pairs.
- Exporting a button doesn't keep the settings (like number of names)
- Exporting a PNG blows up.
- Done: It would be nice to have an export of a table of the name pairs.
- In Chrome ResoViz seems to jitter too much.
- We don't have a Collocate tool any longer.
- It would be nice to be able to select the items in the list to decide what is graphed
- In the reader it would be nice to be able to go up a paragraph or two. Hits tend to be at top line and you can read the line before easily.
- Stats - we need to start gathering some
What has to be done next:
- RezoViz finished - option to get only top edges
- RezoViz skin with Trends and Kwic
- We have to create different lists of edges
- We have to put up our different texts
- It would be nice to have corpus wide list of collocates
Meta Thoughts
- There seems to be a level to which the more complex the analysis the more time it takes. This is partly due to the need to hack the tool, but also because there are so many false positives that have to be checked.
1. The latest text (with bibliography) organized by year (clumps of a year) in usual skin. We will use this for diachronic views.
http://voyant-tools.org/?corpus=GameStudiesByYearWithBibs&stopList=stop.en.taporware.txt
2. The latest text (withOUT bibliography) organized by year (clumps of a year) in Scatter skin. We will use this for correspondence analysis.
http://voyant-tools.org/?corpus=GameStudiesByYearWithoutBibs&stopList=stop.en.taporware.txt
http://voyant-tools.org/?corpus=GameStudiesByYearWithoutBibs&stopList=stop.en.taporware.txt&skin=scatter
3. The latest text (with bibliography) organized by order of articles in usual skin. We will use this for diachronic views and for finding articles.
http://voyant-tools.org/?corpus=GameStudiesByArticleWithBibs&stopList=stop.en.taporware.txt
4. The latest text (with bibliography) organized by year in collocates skin. We will use this to see associated words for certain themes.
What do you mean by collocates skin, something like this? Here is the collocates skin by year: Collocates Skin By Year.
5. The latest text (withOUT bibliography) in ResoViz skin. We will use this to look at evolution of names.
http://voyant-tools.org/tool/RezoViz/?corpus=GameStudiesByArticleWithoutBibs
Here's the one-document corpus without bibliography:
http://voyant-tools.org/?corpus=GameStudiesAllWithoutBibs&stopList=stop.en.taporware.txt
in the collocates skin
and with bibliography:
http://voyant-tools.org/?corpus=GameStudiesAllWithBibs&stopList=stop.en.taporware.txt
in the collocates skin
Here is just the Collocates tool with the full corpus.
July 3, 2012
We decided to hand edit the list of people as there are duplicates (Murray, Janet Murray, and Janet H. Murray). I edited the list of higher frequency names down to those that appeared at least 4 times (withOUT the bibliography.) Then Stéfan created a process to run this through the Stanford NER and produce a special version of RezoViz. This version is quite interesting. Here is the link:
http://voyant-tools.org/tool/RezoViz/?corpus=GameStudiesByArticleWithoutBibs&geoffreysFilter=true
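The hand editing described above amounts to two steps: folding variant spellings into one name and dropping names below a frequency threshold. Here is a minimal sketch of that idea, assuming a dictionary of raw counts and a hand-made alias table; `merge_and_filter`, the counts, and the alias table are all hypothetical and do not reproduce Stéfan's actual process.

```python
from collections import Counter


def merge_and_filter(raw_counts, aliases, min_count=4):
    """Fold alias spellings into one canonical name, then keep names seen at least min_count times."""
    merged = Counter()
    for name, count in raw_counts.items():
        # Names without an alias entry are kept under their own spelling.
        merged[aliases.get(name, name)] += count
    return {name: count for name, count in merged.items() if count >= min_count}


# Toy counts; the alias table stands in for the hand-edited list.
raw_counts = {"Murray": 2, "Janet Murray": 3, "Janet H. Murray": 1, "Peter Lang": 2}
aliases = {"Murray": "Janet Murray", "Janet H. Murray": "Janet Murray"}
kept = merge_and_filter(raw_counts, aliases)
```

Note that mapping a bare surname like Murray to one person is itself a judgment call that has to be checked against the articles.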