
Welcome to the Methods Commons.
Computation has produced new and exciting ways of studying texts. Many of these methods do not require the use of expensive programs or detailed programming knowledge, but only the know-how to combine freely accessible resources to perform various tasks.
This book describes common or interesting sequences of actions, or recipes. They are organized according to the objective of the recipe. Recipes fall into three major categories: location and identification of ideas, themes or specific terms; analysis of textual devices or themes; and the construction of new entities or corpora. There is also a set of three tutorial recipes that introduce three common and specific tasks using TAPoR Tools, and a series of experimental draft recipes that are still under construction.
The Methods Commons community benefits from shared experience and learning how others make use of recipes. You can share your experience by adding your own recipes to the collection. More information about recipe and exercise structure and authoring is available on the RecipeStructure page. We also have a Glossary that we hope you will add to.
Methods in this section constitute the heart of text analysis, and allow texts to be measured, visualized, or otherwise dissected in innovative and interesting ways.
This recipe uses the List Words, Concordance and Collocation tools to explore themes in blog discourse. It is important to keep individual blog entries around the same length to ensure consistent results when analyzing your compiled text.
Concordance: A concordance is a gathering of passages that "concord" or agree. Usually it is a gathering of passages with a sought-for word. Concordances are a form of reading tool that go back to the Middle Ages. They are typically lists of words with their appearances. A concordance for the Bible, for example, would have entries for all the content words of the Bible in alphabetical order. Each entry would include information about where the word appears and some context. Searching for words on a computer now typically returns a concordance called a Key Word in Context (KWIC) with the sought word down the center and a few words of context on either side. Google returns a type of concordance when you search for a word, with an example of the word in context for each page it recommends. See the Wikipedia entry on Concordance (Publishing).
Collocation: Collocation refers to the occurrence of words adjacently more often than would be expected by chance. Collocation is the relationship between two words or groups of words that often go together and form a common expression. If the expression is heard often, the words become 'glued' together in our minds. 'Crystal clear', 'middle management', 'nuclear family' and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.
The easiest way to collect blog entries is to use Google Blogs. When choosing which blog entries to include, remain consistent with your search criteria. In other words, keep the size, topic and date range of your search consistent throughout to ensure consistent results.
One way to organize your blog entries is to copy them into a bibliographic database like EndNote. You can then add keywords and metadata with which to sort or select subsets. You can also create styles to export the entries with XML tags for analysis.
XML: XML, or Extensible Markup Language, is a language used in web development to make a text readable by web browsers and/or store data. Like HTML, XML is primarily formed of paired elements. Unlike HTML, the elements are defined by the user, rather than predefined. For example, both <book></book> and <murfle></murfle> are valid element pairs. These elements apply characteristics and metadata to the text within them. One pair of elements may be nested inside another: <book><title></title></book>. Elements may also be modified by attributes and attribute values: <book format="hardcover">. In this case, the book element has the attribute 'format' and the attribute value 'hardcover'. In addition to storing metadata about the text, attribute/attribute value pairs are frequently used in combination with CSS to apply formatting to the text within the element.
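As a rough illustration of exporting entries with XML tags, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The element names (entries, entry, title, keyword, body), the date attribute and the sample record are all hypothetical; this is not an EndNote export style, only an example of user-defined elements wrapped around a blog entry.

```python
import xml.etree.ElementTree as ET

def entries_to_xml(entries):
    """Wrap blog entries (a list of dicts) in user-defined XML elements."""
    root = ET.Element("entries")
    for item in entries:
        # Metadata such as the date is stored as an attribute of the element.
        entry = ET.SubElement(root, "entry", date=item["date"])
        ET.SubElement(entry, "title").text = item["title"]
        for word in item["keywords"]:
            ET.SubElement(entry, "keyword").text = word
        ET.SubElement(entry, "body").text = item["body"]
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample record, for illustration only.
sample = [{"date": "2008-03-01", "title": "First post",
           "keywords": ["election", "media"], "body": "Text of the entry."}]
print(entries_to_xml(sample))
```

Keeping keywords in their own elements makes it easier to select subsets of entries later.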
This recipe takes two works purported to come from the same author and uses tools such as distribution, Word Lists, etc. to suggest whether they may have been created by the same author.
This recipe is applied to a sample text in Exercise to Compare Texts to Verify Authorship
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing a text for analysis, be aware that academic infrastructure included in the text (such as notes) may get in the way of reading the text as it was originally constructed. It may be useful to remove notes and other materials added to the original work by subsequent authors. You can use tools such as TAPoR Extract Text to remove added material.
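For the authorship comparison described in this recipe, the idea of comparing word distributions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of marker words and the distance measure (a sum of absolute differences in relative frequency) are hypothetical choices, not the method used by the TAPoR tools. A small distance suggests similar habits of word use; it does not prove common authorship.

```python
import re
from collections import Counter

def relative_frequencies(text, words):
    """Return the share of all tokens taken up by each word in `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {w: counts[w] / total for w in words}

def profile_distance(text_a, text_b, words):
    """Sum of absolute differences between two frequency profiles."""
    fa = relative_frequencies(text_a, words)
    fb = relative_frequencies(text_b, words)
    return sum(abs(fa[w] - fb[w]) for w in words)

# Hypothetical marker words; common function words are one possible choice.
MARKERS = ["the", "of", "and", "to", "upon", "while"]
```

Because the frequencies are relative, texts of different lengths can be compared, although very short texts will give unstable results.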
This recipe takes data provided by textual analysis tools and uses Microsoft Excel to create graphs to aid in its interpretation.
This recipe needs a good exercise to show how to Compile Textual Data and Visualize the Results Using Excel.
This recipe uses the Googlizer tool to generate Dynamically Aggregated Text and uses tools such as a frequency list and concordance to explore the results.
Googlizer: The Googlizer queries the Google search engine using a word or phrase you provide and returns the results of the Google search. For more information see this TAPoR tutorial.
This recipe is applied to a sample text in Exercise to Explore Dynamically Aggregated Text
This is a recipe to explore simple themes within a text.
This recipe is applied to a sample text in Exercise to Explore Themes within a Text
A word’s sense is the way in which it is used within the context of the text. Word Sense Disambiguation is the process through which the various senses of a word are considered within the context of its specific usage. A service such as Eva at POETS provides a list of senses in which a word can be used.
Context: In text analysis, context refers to the text surrounding a string of characters, which may be as short as a word or as long as a paragraph. Context is particularly important when generating a concordance for a string.
WordNet is one of many web services available which will provide word senses, synonyms, antonyms and other related words for terms that you enter. For more information see WordNet or Eva at POETS.
This recipe takes a text and explores the use of colloquial words within it using tools such as Googlizer, concordance, co-occurrence and collocation.
This recipe is applied in Exercise Exploring Colloquial Word Usage in a Text.
This recipe takes a text which is rich in concepts and uses tools such as word frequency, concordance, co-occurrence and collocation to explore a specific concept.
This recipe needs a good exercise to demonstrate how to Explore Concepts.
Concepts are usually discussed in a text using unambiguous vocabulary, so searching for words associated with a concept will find relevant passages. This recipe shows how you can follow a web of concepts by looking at words that collocate.
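Following a web of concepts starts with finding which words appear near a chosen word. The Python sketch below counts the words found within a few tokens of a keyword. It is a simplified stand-in for a collocation tool, not the TAPoR implementation: the function name, the window size and the tokenizing pattern are assumptions, and raw counts alone do not show that a pairing occurs more often than chance would predict.

```python
import re
from collections import Counter

def collocates(text, keyword, window=3):
    """Count words occurring within `window` tokens of `keyword`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            start = max(0, i - window)
            # Neighbours on the left and on the right, excluding the keyword itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts
```

Calling collocates(text, "liberty").most_common(10) would list the ten most frequent neighbours of "liberty"; picking one of those as the next keyword is one way to move along the web of concepts.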
This recipe takes a text and explores the tenses and senses of word usage by combining the use of a sense finding service with the Concordance and Collocation Tools.
This recipe needs a good exercise to demonstrate how to Explore Word Sense and Tense in a Simple Text.
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing a text for analysis, be aware that academic infrastructure included in the text (such as notes) may get in the way of reading the text as it was originally constructed. It may be useful to remove notes and other materials added to the original work by subsequent authors. You can use tools such as TAPoR Extract Text to remove added material.
The word list can provide a first clue about the nature of the text. Questions which can be asked of the word list may include:
This recipe takes a text and explores its use of theory by using tools such as word list, concordance and collocation.
This recipe is applied to a sample text in Exercise Exploring a Text for Theoretical Foundation
in the frequency list;
This recipe extracts and examines a character’s dialogue from a play to explore a particular discourse in a linear fashion.
This recipe is applied to a sample text in Exercise to Extract Dialogue from a Screenplay to Explore Linear Discourse
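To give a sense of what extracting one character's dialogue involves, here is a minimal Python sketch. It assumes a hypothetical plain-text layout in which each speech starts on a new line as "NAME: text", with the name in capitals, and continues until the next such line. Real play and screenplay files vary widely, so the pattern would need adjusting to the text at hand.

```python
import re

# A speaker label: capital letters (plus spaces, dots, apostrophes), then a colon.
SPEAKER = re.compile(r"^([A-Z][A-Z .']*):\s*(.*)$")

def extract_dialogue(script, character):
    """Return, in order, the speeches attributed to `character`."""
    speeches = []
    current = None  # speaker of the speech currently being read
    for line in script.splitlines():
        match = SPEAKER.match(line)
        if match:
            current = match.group(1).strip()
            if current == character:
                speeches.append(match.group(2))
        elif current == character and line.strip():
            # A continuation line belongs to the speech that is still open.
            speeches[-1] += " " + line.strip()
    return speeches
```

Because the speeches come back in their original order, joining them with line breaks gives a text that can be read, or passed to other tools, in a linear fashion.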
This recipe uses the Googlizer, frequency lists, concordances and collocation to explore how a writer’s use of language changes over a lifetime.
This recipe needs a good exercise to demonstrate how to Follow Changes in Language Use by a Particular Writer
When using an aggregation tool such as the Googlizer, you must be able to save text to the Databench as part of the process. To make this possible you must be logged into the system to maintain your own personal workspace. If you require access to TAPoR please visit the TAPoR signup page.
Databench: The Databench is a temporary workspace where you can store your text analysis results in TAPoR for further use. For more information see this TAPoR tutorial.
After finding the list of words from the writer’s work, it is useful to use the thesaurus tool to find related words that you can use to explore further nuances of the writer’s changing use of language. The key words that you identify as points of exploration may themselves have evolved, and subtle changes in word choice can be identified by contrasting synonyms.
This recipe takes a text with known syntactic dependencies and explores those using tools such as Word List, Concordance, Co-Occurrence and Collocation.
This recipe needs a good exercise showing how to Test Assumptions about Syntactic Dependencies within a Text.
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing a text for analysis, be aware that academic infrastructure included in the text (such as notes) may get in the way of reading the text as it was originally constructed. It may be useful to remove notes and other materials added to the original work by subsequent authors. You can use tools such as TAPoR Extract Text to remove added material.
The word list can provide a first clue about the nature of the text. Questions which can be asked of the word list may include:
This recipe uses frequency lists and concordances to determine the impact and clarity of your own writing in meeting your objectives.
This recipe needs a good exercise to demonstrate how to Use Text Analysis Tools to Clarify the Intentions of Your Own Writing.
This is a recipe to use textual visualization tools to identify streams of thought, trends and potential avenues for scholarly investigation.
This recipe needs a good exercise to demonstrate how to Visualize Scholarly Trends.
Possible sources for electronic abstracts are listed on the Electronic Texts Panel of TAPoR.
Sometimes text analysis requires building digital objects. Methods in this section allow for the creation of tools, diagrams, and other objects.
This recipe extracts information about perceived social networks from a text populated with references to individuals.
This recipe needs a good exercise to demonstrate how to Build a Social Network Map from a Text.
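One simple way to approximate a perceived social network is to count how often two names appear in the same sentence. The Python sketch below does that; it is an illustration under stated assumptions rather than the recipe's own procedure. The list of names has to be supplied by the reader, the sentence splitting is crude, and a shared sentence says nothing about the kind of relationship (kin, ally or enemy), which still has to be judged by reading the passages.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, names):
    """Count how often each pair of names appears in the same sentence."""
    edges = Counter()
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(
            n for n in names
            if re.search(r"\b%s\b" % re.escape(n), sentence)
        )
        # Every pair of names found together adds weight to one edge.
        edges.update(combinations(present, 2))
    return edges
```

The resulting pairs and counts can be read as the edges of a network map, with the count serving as the strength of each link.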
(Can the exercise do this with something like Romeo and Juliet… determine who is related to whom, and which side each one is on?)
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing a text for analysis, be aware that academic infrastructure included in the text (such as notes) may get in the way of reading the text as it was originally constructed. It may be useful to remove notes and other materials added to the original work by subsequent authors. You can use tools such as TAPoR Extract Text to remove added material.
WordNet is one of many web services available which will provide word senses, synonyms, antonyms and other related words for terms that you enter. For more information see WordNet or Eva at POETS. In this case, determining relationships involves distinguishing between objects and people, as well as between the parties relating to one another. To make this distinction…
This recipe uses text analysis tools to extract key words to create an index and table of contents from a body of text.
This recipe is applied to a sample text in Exercise to Create Textual Infrastructure using Textual Analysis Tools
You can use tools such as TAPoR Extract Text to remove added material.
This recipe takes a biographical text and uses tools such as Googlizer and Find Dates to provide a framework for a chronological timeline.
This recipe is applied to a sample text in Exercise to Create a Chronological Timeline from a Biographical Text
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing a text for analysis, be aware that academic infrastructure included in the text (such as notes) may get in the way of reading the text as it was originally constructed. It may be useful to remove notes and other materials added to the original work by subsequent authors. You can use tools such as TAPoR Extract Text to remove added material.
The beauty of this process is in the ability to augment a standalone text with an aggregated collection of supporting textual matter. This aggregation can then be scanned to produce a rich timeline for further editing. A text aggregation tool allows for a shotgun approach to text acquisition and much of what may be trawled is irrelevant. However, there may be small nuggets of useful information acquired that a date finding tool will quickly pinpoint.
The first pass using a Date Finding tool will probably return redundant and possibly erroneous information. However, it allows for easy classification and sorting of the results, and so quickly pares down biographical or other event-based narratives into a chronological timeline.
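The first pass described above can be sketched in Python as follows. This is a simplified stand-in for a date finding tool, not the TAPoR Find Dates tool itself: it assumes that dates appear as four-digit years between 1500 and 2099, splits sentences crudely on punctuation, and will miss dates written in other forms.

```python
import re

# Four-digit years from 1500 to 2099 (an assumed range for illustration).
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")

def timeline(text):
    """Return (year, sentence) pairs sorted by year, with duplicates removed."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    events = set()  # a set drops the redundant hits an aggregated text produces
    for sentence in sentences:
        for year in YEAR.findall(sentence):
            events.add((int(year), sentence.strip()))
    return sorted(events)
```

Sorting the pairs puts the sentences in chronological order, which gives a draft timeline to edit by hand; erroneous hits, such as a four-digit number that is not a year, still need to be removed by the reader.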
This is a recipe for building and maintaining an online bibliography with TAPoR. With a TAPoR account you can manage a bibliography that links to online articles, essays, web sites and so on. The bibliography will have only your "public" items visible and will organize them by your subject tags. It will allow others to link to the items or to analyze them using TAPoR accessible tools.
The TAPoRize Bookmark allows you to browse the web and quickly add pages to your myTexts Library. When you are looking at a web page that you want to add to your library, choose TAPoRize and you will get a small window that lets you add the text. You may be asked to log in to your account if you haven't already.
myTexts: This is an area of TAPoR in which you collect your private texts for analysis. It is also a portal to access publicly available texts which have been added by other users. In this area you can view the catalogue of texts available to you or add, edit, tag, and view the contents of specific texts. For more information see the TAPoR Tutorial.
This recipe uses frequency lists and Googlizer to build meta tags for a web page or website.
This recipe is applied to a sample text in Create Generate Meta Information for a website using TA tools
If you find that your frequency lists do not contain as many keywords as you anticipated, you may want to rewrite some of your text to be more keyword-rich, because many search engines carry out a similar analysis to rank the relevance of your page. You may also want to consider the lists generated by competing sites to ensure that you are highlighting your site appropriately. For most search engines, a suitable keyword density for web text is judged to be between 3% and 9% of the full text.
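Keyword density is simply the number of times a keyword occurs divided by the total number of words, expressed as a percentage. The Python sketch below computes it for a single-word keyword; the function name and the tokenizing pattern are assumptions made for illustration, and search engines may count words differently.

```python
import re

def keyword_density(text, keyword):
    """Percentage of tokens in `text` that equal `keyword` (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0  # empty text: no density to report
    return 100.0 * tokens.count(keyword.lower()) / len(tokens)
```

A keyword that occurs 6 times in a 100-word page has a density of 6%, which falls inside the 3% to 9% range mentioned above.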
Recipes in this section are incomplete or untested. As to function, they may fit into any of the other categories. Please feel free to critique them or to add new ones.
-------------------
If you want to re-run the examples with CATMA, use the attached file JamesBond.txt and load it into CATMA.
This is a recipe to compare different texts.
This recipe is applied to speeches by Obama and Wright in Now, Analyze That
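Concordance: A concordance or keyword in context (KWIC) is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right). Here is an example that shows the occurrences of the word "dream" in A Midsummer Night's Dream in TACTweb:
I.1/577.1 | Four nights will quickly dream away the time; | And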
I.1/578.2 Swift as a shadow, short as any dream; | Brief as the
II.2/585.1 | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1 this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1 as the fierce vexation of a dream. | But first I will
IV.1/594.2 to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2 rare | vision. I have had a dream, past the wit of man to
IV.1/594.2 the wit of man to | say what dream it was: man is but an
IV.1/594.2 he go | about to expound this dream. Methought I was--there
IV.1/594.2 his heart to report, what my dream | was. I will get Peter
IV.1/594.2 to write a ballad of | this dream: it shall be called
IV.1/594.2 it shall be called Bottom's dream, | because it hath no
V.1/599.1 | Following darkness like a dream, | Now are frolic: not a
V.1/599.2 theme, | No more yielding but a dream, | Gentles, do not
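A keyword-in-context listing like the one above can be approximated with a short Python function. This is a minimal sketch, not the TACTweb or TAPoR implementation: it measures context in characters rather than words, omits the act and scene references, and the function name and width parameter are hypothetical.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-justify the left context so every keyword starts at `width`.
        lines.append(left.rjust(width) + m.group(0) + right)
    return lines
```

Printing the returned lines one per row places the sought word down the center, as in the listing above.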
The grep man page: on the terminal type “man grep” or online go to http://www.ss64.com/bash/grep.html
grep: To grep is to search a text for a string or regular expression pattern of characters.
A good online tutorial is: A Tao of Regular Expressions http://jmason.org/software/sitescooper/tao_re ...
Mastering Regular Expressions (by Jeffrey Friedl) - the O’Reilly manual for regular expressions (2nd edition published 2002)
Regular expressions can also be used in many different programming languages.
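As one example, Python provides regular expressions through its standard re module. The short sketch below searches a made-up sentence for all words containing 'cat' and for all words beginning with 'c' and ending in 't'. The sample sentence is invented for illustration, and the pattern syntax shown is Python's, which differs in places from grep's.

```python
import re

# Invented sample sentence, for illustration only.
text = "The cat sat by the catalogue; we concatenate and cut the coat."

# All words containing 'cat': optional word characters on either side.
containing_cat = re.findall(r"\b\w*cat\w*\b", text)

# All words beginning with 'c' and ending in 't'.
c_to_t = re.findall(r"\bc\w*t\b", text)

print(containing_cat)  # ['cat', 'catalogue', 'concatenate']
print(c_to_t)          # ['cat', 'cut', 'coat']
```

Here \b marks a word boundary and \w* matches any run of word characters, so the second pattern does not match 'catalogue', which starts with 'c' but does not end in 't'.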
Things to look out for:
Regular Expression: a means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
A regular expression, sometimes called regex, is an advanced method of searching text using formal language, commonly employed by programming languages. The TAPoR toolset frequently refers to regular expressions as 'patterns'. Using regular expressions allows one to expand a search beyond a simple string of characters ('cat'). Instead, one may search for such instances as all words including 'cat' ('catalogue', 'concatenate'), or all words beginning with 'c' and ending in 't'. This method therefore allows one to search for a pattern within a text with a high degree of precision and flexibility. Please note that TAPoR also supports Unix style searching, a specific form of regular expression used by the Unix operating system. For more information, please see the Wikipedia entry for regular expressions. To learn regular expressions, please see the Open Directory's resource list.
Finite State Machine: a mathematical abstraction sometimes used to design digital logic or computer programs. It is a behavior model composed of a finite number of states, transitions between those states, and actions. An FSM begins in one of its states (called the start state) and moves between states according to its input. It can end in any state, but only a certain set of states (called accept states) marks a successful run.
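The two definitions above are closely related: a simple pattern can be checked either by a regular expression processor or by a hand-written finite state machine. The following is a minimal Python sketch, not part of the TAPoR toolset. The function name accepts_c_to_t, the state names, and the sample sentence are illustrative assumptions. It finds all words containing 'cat', then tests for words that begin with 'c' and end in 't' in two ways.

```python
import re

# Regular expression: every word that contains the string 'cat'.
text = "The catalogue will concatenate each cat entry."
words_with_cat = re.findall(r"\w*cat\w*", text)

# Regular expression: a whole word that begins with 'c' and ends in 't'.
C_TO_T = re.compile(r"c.*t")


def accepts_c_to_t(word):
    """Finite state machine for the same pattern (hypothetical helper).

    States: 'start' (nothing read yet), 'seen_c' (began with 'c'),
    'ends_t' (began with 'c' and the last character read was 't'),
    and 'dead' (can never match). Only 'ends_t' is an accept state.
    """
    state = "start"
    for ch in word:
        if state == "start":
            state = "seen_c" if ch == "c" else "dead"
        elif state in ("seen_c", "ends_t"):
            state = "ends_t" if ch == "t" else "seen_c"
        if state == "dead":
            break
    return state == "ends_t"


for word in ["cat", "concat", "catalogue", "ct", "c", ""]:
    # The regex processor and the hand-written machine should agree.
    assert accepts_c_to_t(word) == bool(C_TO_T.fullmatch(word))
```

The machine has one start state and one accept state; every other ending state counts as a failed match, which is the distinction the definition draws.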
Function:
Ingredients:
image file (see X for converting TIFF to JPG)
associated TEI/TILE-compliant text about the image
Appliances:
TILE
Steps:
1. Open up the TEI/TILE-compliant file with a text editor.
2. Add line elements for the text of each annotation. Follow this model:
<pb facs="#idforfirstimage" />
<lb/>
<l>Something of Interest 1</l>
<lb/>
<l>Something of Interest 2</l>
…
3. Use the importer script to bring the text commentary into the TILE workspace. Use this model:
http://<server>/TILE/importWidgets/impo ... for TEI XML>&rname=text&rnum=0&ipath=<Full path to image folder>
4. Go to a page [Image/Map/etc.].
5. Select an annotation and mark its area(s) of interest using the shape tools provided.
6. Repeat 4-5 as necessary until all annotations on every page are complete.
7. Save your session (JSON data).
7b. For archival purposes, export to XML.
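Step 2 can be tedious to do by hand when there are many annotations. The following is a minimal Python sketch that writes the page-break and line elements in the shape of the model above. It is not part of TILE: the function name tei_annotation_lines and its input format (a list of image id and annotation list pairs) are assumptions made for illustration, and the image ids are assumed to be valid XML identifiers.

```python
from xml.sax.saxutils import escape


def tei_annotation_lines(pages):
    """Build <pb>, <lb/> and <l> elements for each page's annotations.

    `pages` is a list of (image_id, [annotation text, ...]) pairs.
    This helper is hypothetical; it only mirrors the model in step 2.
    """
    lines = []
    for image_id, annotations in pages:
        lines.append(f'<pb facs="#{image_id}" />')
        for note in annotations:
            lines.append("<lb/>")
            # escape() protects characters such as & and < in the text.
            lines.append(f"<l>{escape(note)}</l>")
    return "\n".join(lines)


model = tei_annotation_lines(
    [("idforfirstimage", ["Something of Interest 1", "Something of Interest 2"])]
)
print(model)
```

The output can be pasted into the text portion of the TEI file before running the importer in step 3.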
Examples:
Here’s a TEI file:
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="..//tei_all.rng" type="xml"?>
<?xml-stylesheet type="text/css" href="tei-11-08-08.css"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader></teiHeader>
<facsimile>
<surface xml:id="ham-1611-22277x-bod-c01-image001"><desc>Image 001</desc><graphic url="promontorymoc.png"/></surface>
<surface xml:id="ham-1611-22277x-bod-c01-image002"><desc>Image 002</desc><graphic url="rockart.png"/></surface>
<surface xml:id="ham-1611-22277x-bod-c01-image003"><desc>Image 003</desc><graphic url="DeneYeniseianMap.png"/></surface>
</facsimile>
<text>
<pb facs="#ham-1611-22277x-bod-c01-image001" />
<lb/>
<l>Semi-circular puckered vamp</l>
<lb/>
<l>quill work</l>
<lb/>
<l>grass-lining</l>
<pb facs="#ham-1611-22277x-bod-c01-image002" />
<lb/>
<l>Funky Shoulders</l>
<lb/>
<l>Spear</l>
<pb facs="#ham-1611-22277x-bod-c01-image003" />
<lb/>
<l>Location of the Dene-Yeniseian Languages</l>
<lb/>
<l>Location of the Promontory Point Moccasins</l>
</text>
</TEI>
After using TILE, your data may look like this (notice the new zone tags where areas of interest have been drawn, and the facs attributes on the lb elements that point to them):
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="..//tei_all.rng" type="xml"?>
<?xml-stylesheet type="text/css" href="tei-11-08-08.css"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader/>
<facsimile>
<surface xml:id="ham-1611-22277x-bod-c01-image001"><desc>Image 001</desc><graphic url="promontorymoc.png"/>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="8093.23741778644035" lrx="46468.6227394908201" uly="80" ulx="464" xml:id="1282164543676_0"></zone>
</surface>
<surface xml:id="ham-1611-22277x-bod-c01-image002"><desc>Image 002</desc><graphic url="rockart.png"/>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="2737" lrx="110.2000122070312552" uly="27" ulx="110.20001220703125" xml:id="1282164644663_0"></zone>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="2581" lrx="70.2000122070312535" uly="25" ulx="70.20001220703125" xml:id="1282164654271_0"></zone>
</surface>
<surface xml:id="ham-1611-22277x-bod-c01-image003"><desc>Image 003</desc><graphic url="DeneYeniseianMap.png"/>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="8196" lrx="47.20001220703125134" uly="81" ulx="47.20001220703125" xml:id="1282164715057_3"></zone>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="18524" lrx="109.2000122070312516" uly="185" ulx="109.20001220703125" xml:id="1282164719345_4"></zone>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="20728" lrx="141.2000122070312543" uly="207" ulx="141.20001220703125" xml:id="1282164724041_5"></zone>
<zone xmlns="http://www.w3.org/1999/xhtml" lry="10379" lrx="546.200012207031257" uly="103" ulx="546.2000122070312" xml:id="1282164726809_6"></zone>
</surface>
</facsimile>
<text>
<pb facs="#ham-1611-22277x-bod-c01-image001"/>
<lb/>
<l>Semi-circular puckered vamp</l>
<lb/>
<l>quill work</l>
<lb facs="1282164543676_0"/>
<l>grass-lining</l>
<pb facs="#ham-1611-22277x-bod-c01-image002"/>
<lb facs="1282164644663_0"/>
<l>Funky Shoulders</l>
<lb facs="1282164654271_0"/>
<l>Spear</l>
<pb facs="#ham-1611-22277x-bod-c01-image003"/>
<lb facs="1282164726809_6"/>
<l>Location of the Dene-Yeniseian Languages</l>
<lb/>
<l>Location of the Promontory Point Moccasins</l>
</text>
</TEI>
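To check which annotations were actually linked to a drawn area, the exported XML can be inspected with a short script. The following is a minimal Python sketch using only the standard library. It is not part of TILE: the function name linked_annotations is an assumption, the sample is a shortened copy of the first surface from the export above (with only the ulx and uly coordinates kept), and the script assumes that each lb element applies to the l element that follows it.

```python
import xml.etree.ElementTree as ET

# The xml:id attribute lives in the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened sample based on the exported file above (an assumption
# made for illustration; a real export has more surfaces and zones).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
<facsimile>
<surface xml:id="ham-1611-22277x-bod-c01-image001"><desc>Image 001</desc>
<graphic url="promontorymoc.png"/>
<zone xmlns="http://www.w3.org/1999/xhtml" uly="80" ulx="464" xml:id="1282164543676_0"></zone>
</surface>
</facsimile>
<text>
<pb facs="#ham-1611-22277x-bod-c01-image001"/>
<lb/> <l>Semi-circular puckered vamp</l>
<lb/> <l>quill work</l>
<lb facs="1282164543676_0"/> <l>grass-lining</l>
</text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def linked_annotations(xml_string):
    """Return ({annotation text: zone id}, [annotations without a zone])."""
    root = ET.fromstring(xml_string)
    zone_ids = {el.get(XML_ID) for el in root.iter() if local_name(el.tag) == "zone"}
    linked, unlinked = {}, []
    pending = None  # facs value of the most recent <lb>, if any
    for el in root.iter():  # iter() walks the tree in document order
        name = local_name(el.tag)
        if name == "lb":
            pending = el.get("facs")
        elif name == "l":
            text = (el.text or "").strip()
            zone = pending.lstrip("#") if pending else None
            if zone in zone_ids:
                linked[text] = zone
            else:
                unlinked.append(text)
            pending = None
    return linked, unlinked


linked, unlinked = linked_annotations(SAMPLE)
print(linked)
print(unlinked)
```

Annotations that end up in the unlinked list still need an area drawn for them in steps 4-6.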
Discussion
The result is a JSON file that can be used to present your image(s) with annotations and associated areas using the TILE interface. You also have a saved XML file for archival purposes.
Known Limitations
Using TIF or raw image data can slow down the interface considerably.
TEI/TILE-compliant XML files should be used within TILE. They use certain elements that TILE needs, which are specified in the TILE documentation under “Making your XML TILE-Ready”.
Resources
* TILE Documentation Page: http://mith.umd.edu/tile/documentation/
Next Steps
· Using manuscript data in TILE
· Making your own customized JSON file, using the structure written out in the TILE Documentation
· Add text of annotations in TILE
Introduction:
This recipe performs three different tasks:
(1) plot the cumulative type/token ratio in a text;
(2) track the occurrence of a particular word in a text and plot all occurrences of the word in a dispersion plot;
(3) show graphically the relative frequencies of the word across n equal sub-parts of the text and add to the plot chi-square and a dispersion measure (default is Juilland's D).
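The calculations behind tasks (1) and (2) are small enough to sketch. The recipe itself uses R; the following is a Python illustration of the same arithmetic, with plotting left out. The function names cumulative_ttr and positions and the sample sentence are assumptions made for illustration.

```python
def cumulative_ttr(words):
    """Type/token ratio after each word: distinct words so far / words so far."""
    seen = set()
    ratios = []
    for count, word in enumerate(words, start=1):
        seen.add(word)
        ratios.append(len(seen) / count)
    return ratios


def positions(words, search_word):
    """1-based positions at which the search word occurs (for a dispersion plot)."""
    return [i for i, word in enumerate(words, start=1) if word == search_word]


words = "the cat sat on the mat".split()
ratios = cumulative_ttr(words)
hits = positions(words, "the")
print(ratios)
print(hits)
```

The ratio stays at 1.0 while every word is new and drops each time a word repeats, which is why the curve for a long text usually falls steadily.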
Ingredients:
Raw text file
The R programming/statistical package (base package)
User-specified input:
A search word
Number of parts (n) which the text file will be divided into (task 3)
Dispersion measure to use (task 3)
Steps:
Read a text file into R.
If necessary, clean/organize text.
Tokenize the text file into words.
Make one vector containing all the words of the text file in the order in which they occur in the original text.
Calculate the type/token ratio incrementally for each position and plot it. Show the positions where a search word occurs in red. (task 1 above)
Identify the positions where a search word occurs in the vector.
Make a dispersion plot to graphically show the positions of each occurrence of the search word. (task 2 above)
Divide the vector of words into n equal sub-parts.
Make a barplot showing frequency of the search word within each sub-part.
Calculate frequency and percentages, chi-square and the selected dispersion measure indicating how even/uneven the dispersion of the search word is within a text. Add all these measures to the barplot. (task 3 above)
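The last three steps can be sketched as follows. The recipe itself uses R; this is a Python illustration of the arithmetic only, with plotting left out. Several choices here are assumptions rather than the recipe's own: leftover words are dropped so that all n parts are equal, chi-square is computed against an even spread across the parts, and Juilland's D is computed as 1 - V / sqrt(n - 1), where V is the population standard deviation of the part frequencies divided by their mean. All function names are hypothetical.

```python
import math


def split_equal(words, n):
    """Divide the word vector into n equal parts, dropping any leftover words."""
    size = len(words) // n
    return [words[i * size:(i + 1) * size] for i in range(n)]


def chi_square(freqs):
    """Chi-square against the assumption of an even spread across the parts."""
    expected = sum(freqs) / len(freqs)
    return sum((f - expected) ** 2 / expected for f in freqs)


def juilland_d(freqs):
    """Juilland's D: 1 means perfectly even, 0 means all hits in one part."""
    n = len(freqs)
    mean = sum(freqs) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)  # population sd
    return 1 - (sd / mean) / math.sqrt(n - 1)


def dispersion(words, search_word, n):
    """Frequencies, percentages, chi-square and D for the search word (n >= 2)."""
    freqs = [part.count(search_word) for part in split_equal(words, n)]
    total = sum(freqs)
    if total == 0:
        raise ValueError("search word does not occur in the text")
    percentages = [100 * f / total for f in freqs]
    return freqs, percentages, chi_square(freqs), juilland_d(freqs)


words = "the cat sat the cat ran a dog sat a dog ran".split()
print(dispersion(words, "cat", 2))  # both hits fall in the first half
print(dispersion(words, "sat", 2))  # one hit in each half
```

Other dispersion measures discussed in the literature cited below could be swapped in for juilland_d without changing the rest of the sketch.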
Discussion:
The recipe produces three ".png" plots.
For a critical overview of various dispersion measures, see Gries (2008):
Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403-437.
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion@IJCL....
Additional web resources (dispersion scripts) for this paper available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/link...
Gries, Stefan Th. 2009. Dispersions and adjusted frequencies in corpora: further explorations. In Stefan Th. Gries, Stefanie Wulff, & Mark Davies (eds.), Corpus linguistic applications: current studies, new directions, 197-212. Amsterdam: Rodopi.
Available here: http://www.linguistics.ucsb.edu/faculty/stgries/research/Dispersion_Rodo....
Glossary:
dispersion
vector (in R)
Next steps:
This is a recipe to compare different texts.
Concordance line -
Context parameter -
Readable Format -
Window span -
XML - XML, or Extensible Markup Language, is a language used in web development to make a text readable by web browsers and/or store data. Like HTML, XML is primarily formed of paired elements. Unlike HTML, the elements are defined by the user, rather than predefined. For example, both <book></book> and <murfle></murfle> are valid element pairs. These elements apply characteristics and metadata to the text within them. One pair of elements may be nested inside another:
<book><title></title></book>
Elements may also be modified by attributes and attribute values:
<book format="hardcover">
In this case, the book element has the attribute 'format' and the attribute value 'hardcover'. In addition to storing metadata about the text, attribute/attribute value pairs are frequently used in combination with CSS to apply formatting to the text within the element.
Line - A line is the string of text limited by the width of a page. Lines are often used in tokenization, and may contain parts of one or more sentences. For example
"The quick brown fox jumps over the lazy dog."
is a complete sentence and occurs on one line. By contrast,
"Hard by a great forest dwelt a poor wood-cutter with his wife and his
two children. The boy was called Hansel and the girl Gretel. He had little
to bite and to break, and once when great dearth fell on the land, he
could no longer procure even daily bread."
spans three sentences and four lines.
Context - In text analysis, context refers to the text surrounding a string of characters, which may be as short as a word or as long as a paragraph. Context is particularly important when generating a concordance for a string.
Where to get digitized versions of texts: Project Gutenberg
Case Study
We wish to explore the search term ‘witch(es)’ in contemporary British usage (spoken and written). Specifically, we are interested in what types of objects are described as being possessed by witches in this group.
In this case we have chosen to use a site that provides both the corpus of contemporary British texts and a built-in concordancing tool (Mark Davies’ online BNC ...).
We searched on the lemma WITCH (=witch, witches, witch’s, witches’) and chose 100 lines of the concordance, using the default settings of the interface.
We brought the 100 lines into a spreadsheet with the search word tab-separated from the left and right contexts.
We coded each concordance line for any noun possessed by the search word, e.g. ‘broomstick’ in “a witch’s broomstick”.
Results:
Objects possessed by witches in our sample set include: “all of their belongings”, broom, broomstick, cat, “cone of power”, coven, cow, cottage, hat, “microphone headsets & miniature televisions”, stew
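The coding step in this case study was done by hand, but a rough first pass can be automated. The Python sketch below is only an illustration: the three rows are invented, the tab-separated layout (left context, search word, right context) follows the spreadsheet described above, and the rule "take the word right after a possessive form" is an assumption that will miss phrases such as "the belongings of the witches" and will sometimes pick up an adjective instead of a noun.

```python
import re

# Hypothetical rows: left context <TAB> search word <TAB> right context.
rows = [
    "she flew off on a\twitch's\tbroomstick at midnight",
    "the villagers feared the\twitches\twho lived nearby",
    "they found the\twitches'\tcottage in the woods",
]

# Possessive forms of the lemma WITCH: witch's and witches'.
POSSESSIVE = re.compile(r"^witch(?:'s|es')$", re.IGNORECASE)


def candidate_possessions(lines):
    """Return the word that follows each possessive form of the search word."""
    found = []
    for line in lines:
        left, keyword, right = line.split("\t")
        # Normalise typographic apostrophes before matching.
        keyword = keyword.replace("\u2019", "'")
        right_words = right.split()
        if POSSESSIVE.match(keyword) and right_words:
            found.append(right_words[0].strip(".,;:!?\"'").lower())
    return found


print(candidate_possessions(rows))
```

The output is a list of candidates to review, not a finished coding; each hit still needs to be checked against its concordance line.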
A text may be full of valuable information, but sometimes something specific is sought. Recipes in this section allow users to quickly separate the wheat from the chaff and locate what they are seeking.
This recipe takes a French language text and adds it to the TAPoR workspace for textual analysis.
This recipe is applied to a sample text in Exercise to Add a French language Text to TAPoR
You may require a text editor to encode your text in UTF-8 or Latin-1 so that the accents and special characters of the language are preserved. On a Windows system, this can be done with Notepad, and on Mac OS X with TextEdit. On Unix-based systems, you will find a text editor installed as part of the standard system install. Word processors typically provide a much deeper tool set for formatting text and generally save documents in their native format, which is not appropriate for importing into a text analysis environment. However, they too can be used to save a plain text file with the appropriate encoding. (Plain text refers to a text without any additional formatting affecting its human readability, often found in .txt files. Plain text files do not require a specialized program, such as a word processor, to read them.)
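If you prefer a script to a text editor, the same re-encoding can be done programmatically. The following Python sketch assumes the source file is in Latin-1 and writes a UTF-8 copy; the function name, the file names and the sample sentence are all hypothetical.

```python
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Read a text file in one encoding and write it out in another."""
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding) as dst:
        dst.write(text)
    return text


# Demonstration with a throwaway Latin-1 file.
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "texte_latin1.txt")
utf8_path = os.path.join(workdir, "texte_utf8.txt")
with open(latin1_path, "wb") as handle:
    handle.write("Un été à la forêt".encode("latin-1"))

text = reencode_file(latin1_path, utf8_path)
print(text)
```

If the source encoding is guessed wrongly, accented characters will come out garbled, so it is worth opening the result and checking a few accented words.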
To verify that the web page that you wish to import into TAPoR is encoded in either UTF-8 or Latin-1, you need to check the browser settings. In Internet Explorer, simply go to the View menu and select the Encoding option. This should read Unicode (UTF-8). In Firefox, the option is Character Encoding under the View menu. This should also read Unicode (UTF-8). If this is not the case, then you can manually select the encoding you wish to use from this menu. On other web browsers, the process should be similar. Please consult their help files for specific instructions on character encoding.
If you view the page source for your web page, it may contain the HTML line:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
or
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
which will indicate that it is encoded properly for text analysis.
(HTML, or Hypertext Markup Language, is a language used in web development to make a text readable by web browsers. HTML is primarily formed of paired elements, such as <body></body> or <p></p>, that apply some characteristic to the text within them. One pair of elements may be nested inside another, as in <body><p></p></body>, where <body></body> marks the beginning and end of the body of the document and <p></p> marks the beginning and end of a paragraph within the body. Elements may also be modified by attributes and attribute values: in <p class="hangingindent">, the paragraph element has the attribute 'class' and the attribute value 'hangingindent'. Attribute/attribute value pairs are frequently used in combination with CSS to apply formatting to the text within the element.)
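The charset declaration described above can also be checked with a short script instead of reading the page source by eye. This Python sketch uses a regular expression rather than a full HTML parser, so it is an approximation; the function names are hypothetical, and it only reports what the page declares, which may differ from how the page is actually encoded.

```python
import re

# Matches charset=... inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(
    r"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(html_source):
    """Return the charset named in the first matching meta tag, or None."""
    match = META_CHARSET.search(html_source)
    return match.group(1).lower() if match else None


def ok_for_analysis(html_source):
    """True if the page declares one of the two encodings named in the recipe."""
    return declared_charset(html_source) in {"utf-8", "iso-8859-1"}


page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>'
print(declared_charset(page), ok_for_analysis(page))
```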
This recipe uses the Googlizer, frequency lists, concordances and collocation to efficiently explore information from the web for a particular topic. The Googlizer queries the Google search engine using a word or phrase you provide and returns the results of the Google search. For more information see the TAPoR tutorial.
A concordance or keyword in context (KWIC) is usually represented as a list of occurrences of a word with some limited context shown (words to the left and words to the right). Here is an example that shows the occurrences of the word "dream" in A Midsummer Night's Dream in TACTweb:
I.1/577.1    | Four nights will quickly dream away the time; | And
I.1/578.2    Swift as a shadow, short as any dream; | Brief as the
II.2/585.1   | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1  this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1   as the fierce vexation of a dream. | But first I will
IV.1/594.2   to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2   rare | vision. I have had a dream, past the wit of man to
IV.1/594.2   the wit of man to | say what dream it was: man is but an
IV.1/594.2   he go | about to expound this dream. Methought I was--there
IV.1/594.2   his heart to report, what my dream | was. I will get Peter
IV.1/594.2   to write a ballad of | this dream: it shall be called
IV.1/594.2   it shall be called Bottom's dream, | because it hath no
V.1/599.1    | Following darkness like a dream, | Now are frolic: not a
V.1/599.2    theme, | No more yielding but a dream, | Gentles, do not
See also the definition at Wikipedia.
This recipe is applied in Exercise to Aggregate Information from the Web To Explore a Particular Concept.
When using an aggregation tool such as the Googlizer, you must be able to save text to the Databench, a temporary workspace in TAPoR where you can store your text analysis results for further use (for more information see the TAPoR tutorial). To make this possible you must be logged into the system to maintain your own personal workspace. If you require access to TAPoR, please visit the TAPoR signup page.
This recipe examines a text in a language in which you are not fluent and demonstrates a strategic approach to comprehension using text analysis tools.
This recipe needs a good exercise to demonstrate how to Explore a Text in a Foreign Language?
Exercise to Explore a Text in a Foreign Language?
This is a recipe to identify simple themes within a sample text.
This recipe is applied to a sample text in Identifying Themes within a Text
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing text for analysis, you should be aware that academic apparatus included in the text, such as editorial notes, may obscure the text as it was originally constructed. It may be useful to remove notes and other materials added by subsequent authors from the original work. You can use tools such as TAPoR Extract Text to remove added material.
The word list can provide a first clue about the nature of the text. Questions which can be asked of the word list may include:
A stop list is a series of words that you may choose to exclude from a particular operation because you deem them to be irrelevant or obstructive to your analysis task. If you are searching for descriptive terms, for example, you may choose to exclude function words that normally occur as part of everyday speech. Your interest may lie only in extraordinary words.
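To make the effect of a stop list concrete, here is a minimal Python sketch that builds a word frequency list with and without one. The stop list shown is a tiny hypothetical example, not the list TAPoR uses, and the function name and sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stop list of common English function words.
STOP_LIST = {"the", "a", "of", "and", "to", "in", "is", "was"}


def word_list(text, stop_list=frozenset()):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    # Letters only (accented letters included), with an optional apostrophe part.
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())
    return Counter(w for w in words if w not in stop_list).most_common()


sample = "The witch and the cat and the broom of the witch"
print(word_list(sample)[:3])
print(word_list(sample, STOP_LIST))
```

Without the stop list the top of the list is dominated by function words; with it, the content words that hint at themes rise to the top.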
This recipe uses tools such as Collocation and Co-Occurrence to explore the syntactic dependencies within the textual construction. Co-occurrence is the number of times two patterns occur in a set order within a set distance of one another in a source text. For more information, see the Wikipedia.
This recipe needs a good exercise to demonstrate how to Identify Syntactic Dependencies within a Text?
Sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing text for analysis, you should be aware that… what improves… what can obstruct?.
Recipes in this section teach the use of basic tools. Mastering the use of these tools will greatly increase users' abilities to make use of the methods in the other sections, as well as being useful in and of themselves.
This is a recipe to build a simple concordance from a text.
This recipe is applied to a sample text in Build a Simple Concordance
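For readers who want to see what a concordance tool does under the hood, the following Python sketch builds a simple keyword-in-context list. It is not the TAPoR implementation; the function name, the whitespace tokenization and the default window of four words on each side are assumptions made for illustration.

```python
import re


def kwic(text, keyword, window=4):
    """Return (left context, keyword as found, right context) for each hit."""
    words = re.findall(r"\S+", text)
    lines = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation and without regard to case.
        if word.strip(".,;:!?\"'()").lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append((left, word, right))
    return lines


sample = "I have had a dream, past the wit of man to say what dream it was"
for left, word, right in kwic(sample, "dream", window=3):
    # Right-align the left context so the keyword runs down the centre.
    print(f"{left:>20}  {word}  {right}")
```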
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing text for analysis, you should be aware that academic apparatus included in the text, such as editorial notes, may obscure the text as it was originally constructed. It may be useful to remove notes and other materials added by subsequent authors from the original work. You can use tools such as TAPoR Extract Text to remove added material.
This is a recipe to find collocated words for a key word
This recipe is applied to a sample text in Find Collocated Words
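As a rough illustration of what a collocation tool counts, the Python sketch below tallies the words that appear within a fixed span of a key word. It reports raw neighbour counts only; deciding whether a pair occurs together more often than chance would require a statistical measure that is not shown here. The function name, the span of two words and the sample sentence are assumptions.

```python
import re
from collections import Counter


def collocates(text, keyword, span=2):
    """Count the words found within `span` words on either side of the key word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == keyword:
            neighbours = words[max(0, i - span):i] + words[i + 1:i + 1 + span]
            counts.update(neighbours)
    return counts


sample = "Crystal clear water and crystal clear skies"
print(collocates(sample, "crystal", span=1).most_common())
```

Applying a stop list to the result is a common next step, since function words tend to dominate raw neighbour counts.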
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing text for analysis, you should be aware that academic apparatus included in the text, such as editorial notes, may obscure the text as it was originally constructed. It may be useful to remove notes and other materials added by subsequent authors from the original work. You can use tools such as TAPoR Extract Text to remove added material.
This is a recipe to list words to suggest themes in a text
This recipe is applied to a sample text in List Words to Identify Themes
Possible sources for electronic texts are listed on the Electronic Texts Panel of TAPoR. When preparing text for analysis, you should be aware that academic apparatus included in the text, such as editorial notes, may obscure the text as it was originally constructed. It may be useful to remove notes and other materials added by subsequent authors from the original work. You can use tools such as TAPoR Extract Text to remove added material.
The word list can provide a first clue about the nature of the text. Questions which can be asked of the word list may include: