ScatterPlot displays the correspondence of word use in a corpus. This visualization relies on a statistical analysis that takes each word’s correspondence from each document (where each document represents a dimension) and reduces it to a two- or three-dimensional space, so that the data can be visualized easily in a scatterplot.
- Getting Started
- Interface Elements
When you first arrive at the ScatterPlot tool you will see one of two possible screens:
ScatterPlot with a pre-loaded corpus. You were probably given a URL that included the corpus, or you’re viewing a page that has an embedded Voyeur tool in it. If you prefer, you can also start without a corpus.
ScatterPlot includes the standard set of interface elements (see image to the right). For more help with these see the Voyeur Tools Standard Interface Elements page.
This tool displays the results of a statistical analysis using a scatter plot visualization. There are two types of analysis available: Principal Component Analysis and Correspondence Analysis.
Principal Component Analysis (PCA) is a technique which takes data in a multidimensional space and re-expresses it along a small number of new dimensions, reducing it to a manageable subset. It is a way of transforming the data with respect to its own structure, so that associations between data points become more readily apparent. For example, consider a table of word frequencies for a corpus of ten documents. Each document can be thought of as a dimension, and each word frequency as a data point. Since we cannot visualize a ten-dimensional space, we can apply PCA to reduce the number of dimensions to something feasible, like two or three. This is accomplished by transforming the data into a new space wherein the first few dimensions (or components) represent the largest amount of variability in the data. By discarding all but the first two or three dimensions, we are left with a new data set which ideally contains most of the information in the original, but which is easy to visualize. In the resulting visualization, words that are grouped together are associated, i.e. they follow a similar pattern of usage across the corpus.
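The reduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that ScatterPlot itself runs: the function names `top_eigenpairs` and `pca`, the sample words and the sample frequencies are all made up for the example, and the components are found with a simple power iteration rather than a numerical library.

```python
import math

def top_eigenpairs(matrix, k, iterations=500):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric positive
    semi-definite matrix, found by power iteration with deflation."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        vec = [float(i + 1) for i in range(size)]  # arbitrary non-zero start
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variability left to capture
                return pairs
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(size) for j in range(size))
        pairs.append((value, vec))
        # Deflate: subtract the component just found before looking for the next.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

def pca(rows, n_components=2):
    """rows: one list per word, one column per document (dimension).
    Returns (coordinates per word, variance captured by each component)."""
    n, d = len(rows), len(rows[0])
    # Centre every document column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Covariance between documents (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    pairs = top_eigenpairs(cov, n_components)
    # Project each word onto the leading components.
    coords = [[sum(c[j] * vec[j] for j in range(d)) for _, vec in pairs]
              for c in centred]
    return coords, [value for value, _ in pairs]

# Hypothetical frequencies of four words in three documents.
words = ["love", "war", "sea", "the"]
freqs = [[10, 2, 1], [1, 9, 2], [2, 1, 8], [7, 7, 7]]
coords, variances = pca(freqs, n_components=2)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

Each word ends up with two coordinates that could be plotted directly; words whose frequencies rise and fall together across the documents land near one another.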
Correspondence Analysis is conceptually similar to PCA, but handles the data in such a way that both the rows and columns are analyzed. This means that given a table of word frequencies, both the words themselves and the document segments will be plotted in the resulting visualization.
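To show how both rows and columns can end up in the same plot, here is a minimal sketch of a textbook correspondence analysis in Python. It is an assumption-laden illustration rather than ScatterPlot's own implementation: the names `correspondence_analysis` and `top_eigenpairs` and the sample table are invented for the example, and every row and column of the table is assumed to have a non-zero total.

```python
import math

def top_eigenpairs(matrix, k, iterations=1000):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric positive
    semi-definite matrix, found by power iteration with deflation."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        vec = [float(i + 1) for i in range(size)]  # arbitrary non-zero start
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # nothing left to extract
                return pairs
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(size) for j in range(size))
        pairs.append((value, vec))
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

def correspondence_analysis(table, k=2):
    """table: word counts, one row per word and one column per document.
    Returns (row coordinates, column coordinates, inertia per axis)."""
    n_rows, n_cols = len(table), len(table[0])
    total = float(sum(sum(row) for row in table))
    p = [[x / total for x in row] for row in table]      # proportions
    r = [sum(row) for row in p]                          # row masses
    c = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]  # column masses
    # Standardized residuals: departure of each cell from independence.
    s = [[(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(n_cols)]
         for i in range(n_rows)]
    sts = [[sum(s[i][a] * s[i][b] for i in range(n_rows)) for b in range(n_cols)]
           for a in range(n_cols)]
    pairs = top_eigenpairs(sts, k)
    # Principal coordinates for rows (words) and columns (documents).
    row_coords = [[sum(s[i][j] * vec[j] for j in range(n_cols)) / math.sqrt(r[i])
                   for _, vec in pairs] for i in range(n_rows)]
    col_coords = [[math.sqrt(max(value, 0.0)) * vec[j] / math.sqrt(c[j])
                   for value, vec in pairs] for j in range(n_cols)]
    return row_coords, col_coords, [value for value, _ in pairs]

# Hypothetical counts of three words in three document segments.
words = ["love", "war", "sea"]
segments = ["segment 1", "segment 2", "segment 3"]
table = [[30, 10, 2], [10, 20, 10], [2, 10, 30]]
row_coords, col_coords, inertias = correspondence_analysis(table)
for label, point in zip(words + segments, row_coords + col_coords):
    print(label, [round(value, 3) for value in point])
```

Because words and document segments receive coordinates on the same axes, a word that is over-represented in a segment tends to be drawn in the same direction from the origin as that segment.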
The scatterplot is presented in the main display of the tool, with a legend in the top left-hand corner. Hovering over a word in the graph displays more information about the frequency of occurrence of that word.
Above the main display is the primary toolbar, and to the right of the display is a sub-panel providing a list of words that appear in the corpus as well as their frequencies.
The toolbar mainly comprises options for tweaking and exploring the plotting of the graph.
- “Analysis” allows the user to switch between plotting a Principal Component Analysis and plotting a Correspondence Analysis.
- “Frequency Type” allows the user to switch between plotting the words’ raw and relative frequencies.
- “Terms” allows the user to control how many words are displayed at once.
- “Clusters” allows the user to control the number of groups into which the words are clustered. The clusters are determined automatically by the criteria of the analysis, and words in the same cluster are similar to one another by that measure. Each cluster of terms appears in a single colour.
- “Dimensions” allows the user to switch between two or three dimensions.
- “Labels” allows the user to cycle through the label settings for the graph. The default setting has all the labels on. The next setting removes the legend but retains the labels on the points. The last setting removes all the labels.
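The difference between the two “Frequency Type” settings can be illustrated with a short Python sketch. It assumes that a relative frequency is the raw count divided by the total number of words in the document; the scaling that ScatterPlot actually uses is not stated here, and the function name `term_frequencies` and the sample texts are made up for the example.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return (raw counts, relative frequencies) for a list of word tokens.

    Assumption: relative frequency = raw count / total number of tokens.
    """
    raw = Counter(tokens)
    total = len(tokens)
    relative = {term: count / total for term, count in raw.items()}
    return raw, relative

# Two hypothetical documents with the same proportions but different lengths.
short_doc = "whale sea whale".split()
long_doc = short_doc * 10

raw_short, rel_short = term_frequencies(short_doc)
raw_long, rel_long = term_frequencies(long_doc)

print(raw_short["whale"], raw_long["whale"])   # raw counts differ with length
print(rel_short["whale"], rel_long["whale"])   # relative frequencies match
```

Raw frequencies favour long documents simply because they contain more words, whereas relative frequencies make documents of different lengths comparable.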
More terms can be added to the graph using the “Add Word” box in the top right-hand corner. When you begin typing a word into the box, a drop-down menu suggests auto-completed words. To select a word, click it in the drop-down menu, or finish typing it in the text box and press “enter” on the keyboard. Not all words proposed by the auto-complete box occur in the text; if a selected term does not then appear in the main display, it is not present in the corpus.
To remove a word from the graph, first select it by clicking on its data point in the graph. The data point turns yellow to show that it is selected. Then, in the top right-hand corner of the “Words” panel, select “Remove”.
Clicking on any of the words in the main display or in the “Words” panel to the right opens the corpus in a new instance of the Voyant Default skin with a focus on the selected word.
Like all Voyeur tools, ScatterPlot can be reused in a variety of ways:
- create a link that is specific to the corpus and options that are currently being used
- embed the current corpus and options as a tool in an external page
For more information see exporting and reusing Voyeur Tools.