Corpus Summary

Corpus Summary is a tool that provides a simple, textual overview of the current corpus. This includes number of words, number of unique words, longest and shortest documents, highest and lowest vocabulary density, most frequent words, notable peaks in frequency, and distinctive words.

Getting Started

When you first arrive at the Corpus Summary tool, you will see one of two possible screens:

Corpus Summary without a pre-loaded corpus. See loading texts into Voyeur for help on how to proceed.

Corpus Summary with a pre-loaded corpus. You were probably given a URL that included the corpus, or you’re viewing a page that has an embedded Voyeur tool in it. If you prefer, you can also start without a corpus.

 

Interface Elements

Corpus Summary includes the standard set of interface elements (see image to the right). For more help with these see the Voyeur Tools Standard Interface Elements page.

Standard UI Elements

Reading the tool

The tool displays six categories of information, formatted as a bulleted list.

  • The first bullet provides an overview of the corpus, including number of documents in the corpus, number of words in the corpus, and number of unique words in the corpus.
  • The second point lists the two longest documents in the corpus (by number of words) and the two shortest. The number of words follows each title in brackets. The point also illustrates the distribution of document length across the corpus with a small thumbnail graph just to the right of the point’s keyword. This line graph shows the documents in the order in which they were added, not, for example, from longest to shortest.
  • The third point lists the two documents with the highest vocabulary density and the two with the lowest. The vocabulary density follows each title in brackets. The point also illustrates the distribution of vocabulary density across the corpus with a small thumbnail graph just to the right of the point’s keyword. This line graph shows the documents in the order in which they were added, not, for example, from highest to lowest vocabulary density.
  • The fourth point indicates the five most frequent words in the corpus, with their frequencies indicated to their right in brackets.
  • The fifth point indicates the five words with the most notable peaks in frequency. Each word’s frequency is shown to its right, alongside a small thumbnail graph depicting its relative frequency across the corpus.
  • The sixth point indicates the five most distinctive words of each document. Only the first five documents are visible at first; clicking “Next # of # remaining” lets the user page through the remaining documents. Each word’s frequency is displayed to its right in brackets.
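
To make the first four points more concrete, here is a minimal Python sketch of how such figures could be computed. It is not the tool's own code. The function names `tokenize` and `summarize`, the simple lowercase tokenizer, and the definition of vocabulary density as unique words divided by total words are all assumptions made for illustration. Notable peaks and distinctive words are left out because this page does not say how they are calculated.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def summarize(corpus):
    """Summarize a corpus given as a dict of title -> text, in the order added."""
    tokens = {title: tokenize(text) for title, text in corpus.items()}
    lengths = {title: len(words) for title, words in tokens.items()}
    # Assumed definition: vocabulary density = unique words / total words.
    density = {t: len(set(w)) / len(w) for t, w in tokens.items() if w}
    # Word frequencies across the whole corpus.
    all_words = Counter(word for words in tokens.values() for word in words)
    by_length = sorted(lengths, key=lengths.get, reverse=True)
    by_density = sorted(density, key=density.get, reverse=True)
    return {
        "documents": len(corpus),
        "words": sum(lengths.values()),
        "unique_words": len(all_words),
        "longest": by_length[:2],
        "shortest": by_length[-2:][::-1],
        "highest_density": by_density[:2],
        "lowest_density": by_density[-2:][::-1],
        "most_frequent": all_words.most_common(5),
    }
```

Calling `summarize({"First": "some text", "Second": "more text here"})` returns a dictionary whose keys roughly mirror the first four points of the list above. A real corpus would probably also need stop-word handling and a tokenizer suited to its language.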

Summary-overview

Linking out of the tool

Clicking on any of the text in blue, which includes all of the documents’ titles as well as the invitations “More…” and “All…”, will open the corpus in a new instance of the Voyant Tools default skin. Clicking on a word highlighted in yellow will also open the Voyant Tools default skin, but with a focus on that word.

Exporting

Like all Voyeur tools, Corpus Summary can be reused in a variety of ways:

  • create a link that is specific to the corpus and options that are currently being used
  • embed the current corpus and options as a tool in an external page

For more information see exporting and reusing Voyeur Tools.
