DH2015 Workshop: The New, The Neat & The Gnarly

@VoyantTools

This workshop will focus on the second major release of Voyant Tools (2.0), which addresses several of the major shortcomings and irritants of version 1.0. In addition to performance improvements throughout, the search and filtering functionality has been vastly enhanced, and Voyant now supports proximity and n-gram operations.

We have designed this workshop to be of interest both to new users of Voyant, who will get an introduction to the platform, and to existing users, who will discover all the new functionality 2.0 has to offer. Please note that some URLs in this workshop document are time-sensitive and may not be functional beyond the workshop – you may want to consult the Workshops page to see if there’s a more recent workshop document.

Outline

  • Getting Setup
  • First Steps: Cirrus
  • Next Steps: The Full Environment
  • Bring Your Own Texts
  • Getting to Know the Tools
  • Exploring More Tools
  • Advanced Search Functionality
  • Exporting URLs, Tools & Data
  • Voyant Tools Roadmap

Getting Setup

One of the strengths of Voyant Tools has always been that it’s freely and conveniently accessible online – there’s a hosted version that anyone can use (at voyant-tools.org, though we’ll be using a more recent beta version). There’s also a downloadable version of Voyant Tools that can be run locally and that has several potential advantages:

  • You can keep your texts confidential as they will not be cached on our server.
  • You can restart the server if it slows down or crashes.
  • You can handle larger texts without the connection timing out.
  • You can work offline (without an Internet connection).
  • You can have participants in a group (like in this workshop) run their own instance without encountering load issues on our server.

For this workshop, it’s strongly recommended that you use the standalone local instance of Voyant Tools (available through VoyantServer):

  • download the VoyantServer 2.0 zip archive
  • double-click on the zip archive to expand its contents
  • double-click on VoyantServer.jar
    • on Mac, because of security restrictions on applications that aren’t signed and approved by Apple, you may need to Ctrl-click on the VoyantServer.jar file, select open from the menu, and then click open (not the default button) in the next dialog box
    • you’ll need Java 1.7+ for this; your computer will tell you if you need to download Java

You can find more information about Running VoyantServer, including tips in case of problems. If you’re unable to run VoyantServer (because of a problem with your machine, because you’re using a tablet, or for any other reason), you should be able to follow along using one of the two following URLs (in order of preference):

For most of the Workshop outline below we will provide a list of links for the different URLs possible, such as for the home page [local, workshop, beta].

First Steps: Cirrus

Cirrus is the word cloud tool in Voyant. Have a look at an example [local, workshop, beta].

Voyant Tools (Austen)

  • What text do you think it is a cloud for?
  • What features are metrical (based on measuring the text in some way)? How are the other features generated?
  • What words are missing?

New: All tools and visualizations in Voyant 2.0 are HTML5-based: no more Flash or Java applets (which cause cross-platform compatibility and security issues).

Next Steps: The Full Environment

Voyant Tools is an environment that can host different individual tools (like Cirrus) in different views and layouts. The default view of Voyant is composed of 5 panels where the tools interact with one another. Try opening the Austen corpus [local, workshop, beta]. If you click on a word in Cirrus, the Trends graph will update. If you click on a node in the Trends graph, the Contexts tool will update. Here’s a summary of the 5 visible tools:

Voyant Tools Numbered (Austen)

  1. Cirrus: a simple wordcloud that displays the highest frequency terms in the corpus (that aren’t in the stopword list)
  2. Reader: an infinite-scrolling reader for the actual text in the corpus (this fetches the next part of the text as needed)
  3. Trends: a visualization of word frequency across the corpus or within each document (depending on the mode)
  4. Summary: a high-level summary of data from the corpus
  5. Contexts: a list of occurrences of a specified word (this is sometimes called a concordance or a keyword in context)
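To make the idea behind the Contexts tool more concrete, here is a minimal Python sketch of a keyword-in-context listing. It is not how Voyant itself is implemented; the function name `keyword_in_context`, the simple word tokenizer, and the `context` parameter are all assumptions made for illustration.

```python
import re

def keyword_in_context(text, keyword, context=5):
    """Return (left, match, right) tuples for each occurrence of keyword.

    `context` is the number of words kept on each side of the keyword.
    Matching is case-insensitive; tokenization is a naive word split.
    """
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            rows.append((left, token, right))
    return rows

sample = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife.")

# Print one row per occurrence, with the keyword in the middle column.
for left, match, right in keyword_in_context(sample, "man", context=3):
    print(f"{left:>30} | {match} | {right}")
```

Changing `context` here plays the same role as changing the context size in the Contexts panel: it only affects how many neighbouring words are shown, not which occurrences are found.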

Explore the visible tools (we’ll come back to the other tools later):

  • what happens when you hover over the help icon? what if you click it?
  • which tools trigger responses from which other tools?
  • what scale is each tool (entire corpus, entire document, part of a document, etc.)?
  • what is the visualization in the bottom of the Reader (middle-top) panel?
  • try a simple search in the Reader panel
  • what is relative frequency in the Trends tool?
  • what are vocabulary density and distinctive words in the Summary tool?
  • what does the plus icon do in the Contexts tool?
  • what is the difference between context and expand in the Contexts tool?

New: Voyant 2.0 uses a new, crisper theme (the global appearance of the interface).

Bring Your Own Texts

A strength of Voyant Tools has always been that you can use an existing corpus (such as the Austen corpus we used above), or you can create your own corpus from the home page [local, workshop, beta].

voyant-home

There are three primary ways of creating a corpus:

  1. type or paste text into the large box (you can copy-and-paste text from a webpage or word processor, for instance) – in this case you’ll be creating a corpus with one document
  2. type or paste URLs into the large box, one URL per line – this will create a corpus with as many documents as you have URLs; Voyant will try to fetch the content from the specified locations (so they can’t be behind a password or restrictive firewall), and the URLs can point to documents in various supported formats (see below)
  3. click the upload button and select one or more files to upload – the files can be in a variety of formats, including plain text, HTML, XML, RTF, MSWord, and PDF, or a Zip (archive) file containing documents in one of the supported formats

For the purposes of the workshop it might be best to try first with a simpler file format (like plain text or MSWord), but it’s also possible to use XML very powerfully by clicking on the options icon (which appears when hovering over the Add Texts header) and defining XPath expressions for documents, body content, and metadata such as title and author.
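If XPath is new to you, the following Python sketch may help show what such expressions do. The XML structure, the element names, and the four expressions are invented for this example, and Python's `xml.etree.ElementTree` supports only a subset of XPath, so treat it as an illustration of the idea rather than as the exact syntax or behaviour of Voyant's options dialog.

```python
import xml.etree.ElementTree as ET

# A made-up XML file containing two "documents" (poems).
xml_source = """
<anthology>
  <poem>
    <head><title>First Poem</title><author>A. Writer</author></head>
    <body>Some words of verse.</body>
  </poem>
  <poem>
    <head><title>Second Poem</title><author>B. Writer</author></head>
    <body>More words of verse.</body>
  </poem>
</anthology>
""".strip()

# Hypothetical expressions: one selects each document, the others are
# evaluated relative to a document to find its metadata and body content.
DOCUMENTS_XPATH = ".//poem"
TITLE_XPATH = "head/title"
AUTHOR_XPATH = "head/author"
CONTENT_XPATH = "body"

def split_documents(root):
    """Turn one XML tree into a list of documents with metadata."""
    documents = []
    for node in root.findall(DOCUMENTS_XPATH):
        documents.append({
            "title": node.findtext(TITLE_XPATH),
            "author": node.findtext(AUTHOR_XPATH),
            "content": node.findtext(CONTENT_XPATH),
        })
    return documents

root = ET.fromstring(xml_source)
for doc in split_documents(root):
    print(doc["title"], "by", doc["author"], "->", doc["content"])
```

The point is that one XML file can yield several documents, each with its own title and author, while anything outside the body expression is left out of the analyzed text.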

New: When uploading files, you can now select multiple files at once by using the Ctrl and/or Shift keys.

Getting to Know the Tools

Each of the several tools in Voyant has its own particularities and peculiarities, but here are some general principles that apply to several tools.

Options. Many of the tools provide parameters directly visible (usually in the bottom part of the tool). The Contexts tool for instance (bottom right-hand corner of the default skin) has options for searching, for the context size (how many words to show on each side of the keyword in the table), and for expand size (how many words to show on each side of the keyword when you expand the occurrence by clicking on the plus icon in the first column of the row). In addition to these visible options, some tools also have additional options that can be accessed through the options icon in the top header. The Cirrus tool, for instance, has an option for modifying the stopword list.

Voyant Tools Options

Stopwords. The stopword list contains words that are very common in most texts and usually carry less meaning, such as determiners (“the”, “a”) and prepositions (“to”, “in”, “from”). One person’s stopword is another person’s treasure, so it may be worth looking at the list to see if there are words you’d prefer to show, or words you don’t want to show that should be added to the list. You can edit the list by clicking on the options icon (in Cirrus, for instance) and then clicking the edit button. Note that you can apply the newly selected or edited list to the current tool only or globally to all tools that support stopwords (globally is the default).
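As a rough illustration of what a stopword list does to a frequency count (the kind of count that feeds a word cloud like Cirrus), here is a small Python sketch. The tiny `STOPWORDS` set and the `top_terms` function are assumptions for this example and do not reflect Voyant's actual lists or code.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stopword list (not Voyant's).
STOPWORDS = {"the", "a", "to", "in", "from", "of", "and", "is"}

def top_terms(text, stopwords=STOPWORDS, limit=3):
    """Count lowercased words, skip stopwords, return the most frequent."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(limit)

text = "The cat sat in the hat. The cat and the dog ran from the hat to the cat."

print(top_terms(text, limit=2))                    # with the stopword list
print(top_terms(text, stopwords=set(), limit=2))   # with no stopword list
```

Without the list, "the" dominates the results; with it, the content words rise to the top. Editing the list simply changes which words are skipped before counting.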

Voyant Tools Options

New: Voyant 2.0 now uses auto-detect by default, so it’s no longer necessary to choose a stopword list (unless the auto-detect option doesn’t work for you).

Table/Grid Headers. The column headers in table/grid views include functionality that may not be obvious. First, a help tip will appear when you hover over most column headers to briefly explain what that column is showing. Next, a down arrow will appear in the right part of the column header; clicking on it will allow you to sort by that column (when possible) and to toggle the visibility of columns. Finally, if a column is sortable, you can also click on the header to toggle between ascending and descending order for sorting the table by that column.

Grid Headers

New: Infinite Scrolling Tables/Grids. Tables can sometimes contain a huge number of logical items (for instance, tens of thousands of terms in a document), which would be impractical to load at once. In Voyant 1 there was a paging mechanism that allowed the user to see 50 items at a time by advancing or rewinding by “page”. In Voyant 2 items are loaded on demand as the user scrolls through the table – in most cases that should happen fairly seamlessly.
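The general idea of on-demand loading can be sketched in a few lines of Python. This is only a conceptual model under our own assumptions: `fetch_rows` stands in for a request to the server, and the `LazyTable` class and its chunk size of 50 are hypothetical, not a description of Voyant's internals.

```python
def fetch_rows(all_rows, start, limit):
    """Stand-in for a server request: return one slice of the full table."""
    return all_rows[start:start + limit]

class LazyTable:
    """Keeps only the rows fetched so far and loads more as needed."""

    def __init__(self, source, chunk_size=50):
        self.source = source
        self.chunk_size = chunk_size
        self.loaded = []

    def scroll_to(self, row_index):
        """Fetch further chunks until row_index is loaded or data runs out."""
        while row_index >= len(self.loaded):
            chunk = fetch_rows(self.source, len(self.loaded), self.chunk_size)
            if not chunk:
                break  # no more rows on the "server"
            self.loaded.extend(chunk)
        if row_index < len(self.loaded):
            return self.loaded[row_index]
        return None

terms = [f"term{i}" for i in range(120)]
table = LazyTable(terms)
print(table.scroll_to(10), len(table.loaded))   # only the first chunk is loaded
print(table.scroll_to(75), len(table.loaded))   # a second chunk is fetched
```

Compared with explicit paging, the user never asks for "page 2"; scrolling past the loaded rows is what triggers the next fetch.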

Corpus/Document Modes. Some of the tools can operate at variable scale, either showing data at the corpus level or at the individual document level – this can be a bit confusing if you’re not sure what you’re seeing. For instance, by default Cirrus shows top frequency terms for the entire corpus, but you can also generate a Cirrus from the terms of an individual document – one way to do this is to click on the Documents tab in the lower left-hand panel and click on one of the document rows. The Cirrus that appears will be for just one document, and if you want to revert to Corpus mode you can click on the “reset” button that appears in the lower right-hand corner of the Cirrus tool.

Cirrus Scale

Resizing. The individual tool panels are resizable: the mouse pointer should change to a resize icon when you hover over the inner borders between tools, and you can drag the border to resize. Similarly, the columns in table/grid tools are resizable.

Exploring More Tools

New: The way you access other tools in Voyant 2.0 has been improved and simplified, particularly with the introduction of tabs (multiple tools available from each panel) and the introduction of the tool switching menu.

In addition to the five tools that are displayed by default (Cirrus, Reader, Trends, Summary and Contexts), each of the five panels makes it easy to access additional tools, some of which we’ve mentioned already. Here are the other tools available from the tabs:

  • Corpus Terms: displays frequency and distribution information for terms (types or unique words) in the corpus
  • Links: displays a network graph of the collocates of keywords (the highest frequency terms that occur close to the specified search terms) – you can click on individual terms to fetch more terms and you can drag terms off the tool to remove them
  • Collocates: similar to Links, but this presents collocates of search terms in a table form
  • Documents: lists the documents in the corpus, including some metadata (where available, such as title and author), as well as counts of words/tokens, types and a ratio of types to tokens
  • Phrases: lists the recurring phrases in the corpus (though any phrase must be repeated in a document before it is counted at the corpus level); this is a new tool in Voyant 2.0 and one of the most useful functions can be to see the longest repeating phrases (without having to specify a search query); note that there are different options for handling overlapping phrases
  • Bubblelines: this is another representation of the distribution of terms within each document in the corpus; it can be helpful for perceiving where different terms appear together (overlap)
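The counts shown in the Documents tool (tokens, types, and the ratio of types to tokens) are simple to compute by hand. The Python sketch below shows one way to do it; the `document_stats` function and its naive lowercase tokenizer are assumptions for illustration, and Voyant's own tokenization may give different numbers.

```python
import re

def document_stats(text):
    """Count tokens (all words), types (unique words) and their ratio."""
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ratio": ratio}

# 8 words in total, 5 of them distinct: the, cat, and, hat, bat
print(document_stats("the cat and the hat and the bat"))
```

A ratio close to 1 means few words are repeated, while a low ratio means a small vocabulary is reused heavily. Longer documents tend to have lower ratios, so comparisons work best between texts of similar length.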

All of these tools can be accessed through the tabs, but they can also be invoked from the tool switching menu (a windows-like icon) that appears when you hover over the header of any tool.

Tool Switch

If you click on the tool switching icon a nested menu will appear. The first items will be a list of one or more tools that fit most naturally in that tool panel, but you can also navigate tools by scale (corpus or document) or by tool type (visualizations, tables/grids, other).

The skin header (the blue bar at the top) also has a tool switching menu which allows you to replace the entire page with one tool. This is also a convenient way to access the ScatterPlot tool, which provides a visualization of Correspondence Analysis or Principal Component Analysis (more complex analysis of how terms are shared between documents).

Note that some of the tools from the current 1.0 version of Voyant have not yet been implemented in version 2.0, such as TermsRadio, Knots, and Bubbles. Those should be implemented in the coming months, though some of the other tools may be abandoned, especially those that rely on Flash or Java.

Advanced Search Functionality

New: Much of the advanced search functionality is new in Voyant 2.0 – we’ll go through some highlights below.

Help with the search syntax is displayed when you hover over the question mark icon in a search box. The hovering tip box will disappear after a few seconds, and you can click on the question mark to have a dialog box appear until you dismiss it.

Search Syntax

Search functionality is fairly consistent in all tools that support search. For experimentation, let’s work in the Corpus Terms tool (which is the second tab in the upper left-hand panel where the Cirrus wordcloud is displayed by default). These examples use the Austen corpus [local, workshop, beta].

  • exact match: think – this searches for the exact word (though it’s case insensitive; there’s currently no way to perform a case-sensitive search)
  • wildcard match: think* – this matches the root of a word and includes variants as a single term (think, thinks, thinking, etc.); note that for now wildcards can’t be used at the beginning of words and produce inconsistent results when used in the middle of words
  • expanding wildcard match: ^think* – this is similar to the previous wildcard match, but this time each variant is counted and displayed as a separate term (this can be useful for seeing what terms are actually included in a wildcard match)
  • multiple matches: think*, ^think* – you can search multiple terms (two or more) by separating them with commas; a simple search might be for the exact matches think, thinking, but you can also use more complex searches like think*, ^think* to get the best of both worlds from wildcard matches (counting the total wildcard matches as one term and also seeing the individual matches)
  • combined matches: think|thinking – use a combined match to merge two or more search terms into one result; this might be useful for counting singular and plural forms of a word, but not all wildcard forms (time|times but not timely, etc.)
  • phrase match: "time enough" – this matches an exact phrase or sequence of words; note the use of quotes (if you exclude the quotes you’re essentially performing a combined match for time|enough, though that may change in the future)
  • proximity match: "time enough"~10 – this is essentially a NEAR match, where the terms in quotes (there can be more than two) must occur within a specified number of words (in this case within 10 words, but you can specify a different number for the proximity); note that words can appear in any order, so enough might occur before time; it’s not possible to expand the match with the ^ operator as with wildcard searches, but you can use the Contexts tool to see the actual occurrences that are being matched
  • multiple matches: time*, time|times, "time enough"~10 – it’s possible to mix and match the different syntaxes, as with this example, which joins a wildcard match, a combined match, and a proximity match as multiple matches
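To clarify how the different kinds of match differ in what they count, here is a simplified Python model of the behaviours described above. It is not Voyant's search engine: the function names are invented, the tokenizer is naive, the proximity function handles only two terms, and the way it counts "within N words" is our own assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive lowercase word tokenizer."""
    return re.findall(r"\w+", text.lower())

def wildcard(tokens, prefix):
    """Like think* : every word starting with prefix, counted as one term."""
    return sum(1 for t in tokens if t.startswith(prefix))

def expanded_wildcard(tokens, prefix):
    """Like ^think* : each matching variant is counted separately."""
    return Counter(t for t in tokens if t.startswith(prefix))

def combined(tokens, *terms):
    """Like think|thinking : several exact terms merged into one count."""
    return sum(1 for t in tokens if t in terms)

def phrase(tokens, *words):
    """Phrase match: the words must appear consecutively and in order."""
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == words)

def proximity(tokens, first, second, distance):
    """Proximity match (two terms only): occurrences of `first` that have
    `second` within `distance` words, in either order."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return sum(1 for a in pos_first
               if any(0 < abs(a - b) <= distance for b in pos_second))

tokens = tokenize("I think she thinks that thinking takes time enough, "
                  "and enough time to think.")

print(wildcard(tokens, "think"))                 # one total for all variants
print(expanded_wildcard(tokens, "think"))        # one count per variant
print(combined(tokens, "think", "thinking"))     # excludes "thinks"
print(phrase(tokens, "time", "enough"))          # order matters
print(proximity(tokens, "time", "enough", 1))    # order does not matter
```

In this sample the phrase match finds only "time enough", while the proximity match also finds "enough time", which is the main practical difference between the two.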

Exporting URLs, Tools & Data

A distinguishing feature of Voyant Tools is its ability to generate URLs that can be bookmarked or shared and that point to a specific corpus with specific parameters.

New: The URL in the browser location bar will now update automatically after you create a corpus – you can bookmark or share this URL directly.

To export the URL from the current skin (combination of tools, not just one tool), click on the export icon from the top blue header bar.

Voyant Export

This will cause a dialog box to appear with various export options, the first of which is a simple link that can be copied into the clipboard or clicked to open the URL in a new window.

Voyant Export Skin

The same basic process works for individual tool panels as well (if you just want to export or share, say, the Cirrus visualization), except that additional parameters are usually included with the tool panels (specific search terms that have been selected, for instance).

In addition to exporting a URL, you can also generate a bibliographic entry for Voyant Tools (if you wish to cite it, which would be awfully kind of you :), or export a live dynamic tool panel. The exported tool works much like a YouTube clip that can be embedded into any website – it pulls interactive content from a remote site. For both of these options, expand the “Export View” menu (see the image above).

The HTML snippet for a live tool might look something like this:

<!-- Exported from Voyant Tools: http://voyant-tools.org/.
Please note that this is an early version and the API may change.
Feel free to change the height and width values below: -->
<iframe style='width: 100%; height: 400px' src='http://beta.voyant-tools.org:80/?corpus=austen&view=Cirrus'></iframe>

Which should produce a live tool like this:

Important notes about URLs and embedded tools:

  • During this workshop we’re using special instances of Voyant Tools that may not be accessible to others – that’s certainly true for a standalone (local) instance of Voyant running on your machine, but it’s also true for the workshop and beta URLs where corpora are less likely to remain accessible, unlike the current production version of Voyant Tools where corpora remain accessible as long as they’re visited regularly (at least once every three weeks).
  • Embedding the HTML snippet may be a bit trickier with some Content Management Systems. In WordPress for instance, if you’re not an administrator, you may want to install a plugin like iframe.
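If you need to produce many links or embeds (one per tool, say), you could build them with a short script instead of exporting each by hand. The Python sketch below uses only the `corpus` and `view` parameters that appear in the exported snippet shown earlier; the function names are our own, and since the API may change, treat the URL pattern as an assumption to verify against a freshly exported link.

```python
from html import escape
from urllib.parse import urlencode

def voyant_url(base, corpus, view=None):
    """Build a URL pointing at a corpus, optionally showing a single tool."""
    query = {"corpus": corpus}
    if view:
        query["view"] = view
    return f"{base}/?{urlencode(query)}"

def embed_snippet(url, width="100%", height="400px"):
    """Wrap a URL in an iframe similar to the exported HTML snippet.

    The URL is HTML-escaped, so '&' becomes '&amp;' inside the attribute.
    """
    return (f"<iframe style='width: {width}; height: {height}' "
            f"src='{escape(url)}'></iframe>")

url = voyant_url("http://beta.voyant-tools.org", "austen", view="Cirrus")
print(url)
print(embed_snippet(url))
```

Remember that the resulting links are only as durable as the corpus they point to, as described in the notes above.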

In addition to exporting a URL or embedding an interactive tool, Voyant provides some additional data exporting features, depending on the tool. For instance, some visualizations (like Cirrus, Trends, and Links) allow you to export data as graphics (a PNG or SVG), while the table-oriented tools (like Corpus Terms, Contexts and Phrases) allow you to export data in different formats (HTML, tab-separated values, and JSON). The tab-separated values can be especially useful since you can copy the generated output to the clipboard and paste it directly into a spreadsheet program (like Excel or Google Spreadsheets).

Export Tab-Separated Values

Note that in the current beta version it’s only possible to export the currently visible/loaded data, but in a release in the near future it will be possible to export full datasets.
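Tab-separated values are also easy to process with a script if you prefer that to a spreadsheet. The Python sketch below reads a small export with the standard `csv` module. The column names and numbers in `exported` are made up for illustration; a real export from Voyant will have its own columns.

```python
import csv
import io

# Made-up example of exported data (hypothetical columns and values).
exported = "term\tcount\nthink\t120\ntime\t95\n"

def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]

rows = read_tsv(exported)
total = sum(int(row["count"]) for row in rows)

for row in rows:
    print(row["term"], row["count"])
print("total:", total)
```

Note that every value arrives as text, so numeric columns need to be converted (as with `int` here) before doing arithmetic.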

Voyant Tools Roadmap

Voyant Tools is an ongoing project and we’ll continue to improve and enhance the platform. Here’s a tentative roadmap for future development:

  • by fall 2015 we hope to release Voyant Tools 2.0 and replace the current 1.0 version – some of the major remaining work includes:
    • various bug fixes
    • allowing for adding and reordering documents in existing corpora
    • adding password protection for corpora
    • resolving backwards compatibility issues to ensure that existing Voyant URLs continue to function correctly
  • during fall 2015 and winter 2016 work will resume on Voyant Notebooks, a literate programming environment that allows a combination of writing, code snippets, dynamic tools, and other data output. Voyant Notebooks is intended to leverage the existing analytic and visualization capabilities of Voyant while allowing users to customize some functionality and include a narrative description of their work
  • ongoing work to summer 2016 on the next version of Voyant Tools that will include functionality for part-of-speech tagging, lemmatization, and topic modelling (some work has already been done on each of these, but was put on hold to ensure that Voyant 2.0 could be released)

Please feel warmly encouraged to help improve and guide further development of Voyant Tools by providing us with feedback, including bug reports and feature requests. You can follow developments on GitHub or Twitter, or contact us directly (sgsinclair at Google’s email service).