Data Science: An Introduction/Thinking Like a Visual Artist
Five design elements--Harmony, Rhythm, Flow, Balance, and Focus--are essential considerations to think like a visual artist. A data scientist must think creatively to combine these elements together in appropriate proportions to convey the essential conclusions to the audience.
The Wiktionary defines Harmony as a pleasing combination of elements. In visual art, harmony is the concordant use of shape and color. Humans easily consume and thus are attracted to harmonious visualizations. A data scientist will want to make sure the visual representations of the solution blend together well. One aspect of harmony is to consistently use a coherent color scheme across an entire presentation. One popular set of color schemes is to take 3 or 4 adjacent colors from the color wheel, such as the "earth" tones on the upper right or the "cool" tones on the upper left. A "complementary" color scheme will take colors from opposite sides of the color wheel.
Similarly, when building a series of charts and graphs, the data scientist will want to have a consistent layout across them all, including using the same symbols and typefaces.
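The adjacent and complementary schemes described above can be sketched in R by picking hue angles on a color wheel. This is only an illustration: the base hue, the 20-degree spacing, and the variable names are assumptions chosen for the example, and hcl() from base R stands in for whichever color wheel the figure uses.

```r
# Illustrative sketch: build color schemes from hue angles on a color wheel.
# hcl() ships with base R (grDevices); the hue values below are arbitrary choices.
base.hue <- 30                                   # starting hue in degrees (an assumption)

# Adjacent ("analogous") scheme: four neighboring hues, 20 degrees apart
analogous.hues <- (base.hue + c(0, 20, 40, 60)) %% 360
analogous <- hcl(h = analogous.hues, c = 60, l = 65)

# Complementary scheme: the base hue and the hue on the opposite side of the wheel
complementary.hues <- (base.hue + c(0, 180)) %% 360
complementary <- hcl(h = complementary.hues, c = 60, l = 65)

# Show both schemes as swatches so they can be compared side by side
barplot(rep(1, 6), col = c(analogous, complementary), axes = FALSE,
        names.arg = c(rep("adjacent", 4), rep("opposite", 2)),
        main = "Adjacent and Complementary Colors")
```

Once a scheme is chosen, passing the same vector of colors to every chart in a presentation is one simple way to keep the color use consistent.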
The Wikipedia defines Rhythm as a pattern marked by the regulated succession of strong and weak elements, or of opposite or different conditions across time and/or space. When a data scientist discovers rhythms in the data, she must then consider how to communicate these visually. A sine wave shows a consistent rhythm over time.
One useful device is to show deviations (blue line) in the data from the background rhythm (black line), such as this chart showing climate change over the last 550 million years.
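The same device can be tried in R with simulated data. The sketch below is not the climate chart; it assumes a sine wave as the background rhythm and adds random noise to stand in for the observed series.

```r
# Illustrative sketch: a background rhythm (black) and a series that deviates from it (blue).
# All values are simulated; nothing here comes from the climate chart.
set.seed(1)
temp.t <- seq(0, 4 * pi, length.out = 200)        # time axis
temp.rhythm <- sin(temp.t)                        # consistent background rhythm
temp.obs <- temp.rhythm + rnorm(200, sd = 0.3)    # observations that deviate from it

plot(temp.t, temp.rhythm, type = "l", col = "black", lwd = 2,
     ylim = range(temp.obs),
     main = "Deviations from a Background Rhythm",
     xlab = "Time", ylab = "Value")
lines(temp.t, temp.obs, col = "blue")
legend("topright", legend = c("Background rhythm", "Observed"),
       col = c("black", "blue"), lty = 1, lwd = c(2, 1))
```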
Another example is a depiction of the simulation of two waves interfering with each other.
A final example shows the correlated distributions between two variables. Because the highs and lows are off center but repeat symmetrically, the correlations are not linear, but regular and offset by a constant amount.
The Wiktionary defines Flow as the movement of a fluid. In visual art, flow is the illusion of movement through the manipulation of color and form. The human eye will naturally follow visual cues in the image. A data scientist will want to make sure the visual representations of the solution tell a "story" by inviting the eye to move from start to finish. Peter Paul Rubens was a Flemish Baroque painter who perfected artistic flow. In his painting, "The Fall of Phaeton," one cannot help but move the eye toward the upper right.
So it is with data scientists. Their analysis of the data leads to some conclusion. The trick is to depict the data in a way that helps lead the recipient to that conclusion. Sometimes flow is an inherent component of the data, such as the depiction of a fluid moving around a solid, with the resulting turbulence.
Many times, the data represent concepts that are inter-related and those relationships are shown as a flow diagram, such as this one from the July 1987 Psychological Review.
Balance is an important visual cue. The simplest form of balance is symmetry. The Wikipedia defines Symmetry as self similarity across time, space, or scale. Human beings find symmetry comforting. Many living things, such as butterflies, show symmetry in space.
Humans also create objects with symmetry, such as buildings, like the Taj Mahal.
A data scientist will play to the human preferences for symmetry in visually presenting the solutions to problems. In the following example, the four graphs show size symmetry--and three of them show scale symmetry. They are also placed so as to be both horizontally and vertically symmetric. This multiple symmetry aids in the understanding of the information being presented.
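A symmetric arrangement of this kind can be sketched in R with par(mfrow=). The example assumes four simulated series; giving every panel the same axis limits is one way to keep both the size and the scale of the panels matched.

```r
# Illustrative sketch: four panels of equal size on a symmetric 2-by-2 grid.
# The data are simulated; shared axis limits give the panels a common scale.
set.seed(2)
temp.x <- 1:50
temp.panels <- list(A = cumsum(rnorm(50)), B = cumsum(rnorm(50)),
                    C = cumsum(rnorm(50)), D = cumsum(rnorm(50)))
temp.ylim <- range(unlist(temp.panels))           # one scale for every panel

old.par <- par(mfrow = c(2, 2))                   # 2 rows by 2 columns
for (temp.name in names(temp.panels)) {
  plot(temp.x, temp.panels[[temp.name]], type = "l", ylim = temp.ylim,
       main = paste("Series", temp.name), xlab = "Index", ylab = "Value")
}
par(old.par)                                      # restore the previous layout
```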
Balance: Rule of Thirds
While symmetrical balance can help an audience to feel comfortable with data presentations, asymmetrical balance is often more visually appealing. We are not talking about just any random asymmetry, but two particular asymmetries. The first one we will examine is the Rule of Thirds and the second is the Golden Ratio. According to the Wikipedia, the "Rule of Thirds" proposes that an image should be imagined as divided into nine equal parts by two equally-spaced horizontal lines and two equally-spaced vertical lines, and that important compositional elements should be placed along these lines or their intersections. Proponents of the technique claim that aligning a subject with these points creates more tension, energy and interest in the composition than simply centering the subject would.
In the example below, the picture has been cropped without and with the rule of thirds. The right hand picture has the mountain tops along the axis of the bottom third, the stone outcropping is positioned along the axis of the left third, and the clouds hover just above the top third.
Thinking like a visual artist means thinking of visualizing data in terms of the "rule of thirds." For example, the chart below takes advantage of the rule of thirds by placing the horizontal rules at the thirds and adding the comment on the upper horizontal rule at about the position of the (imagined) right-most rule.
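A rough way to experiment with this in R is to compute where the thirds of the plotting region fall and draw the rules there. The data and the comment text below are placeholders, and the variable names are made up for the example.

```r
# Illustrative sketch: place guide lines at the thirds of the plotting region.
set.seed(3)
temp.x <- 1:30
temp.y <- cumsum(rnorm(30, mean = 0.5))

plot(temp.x, temp.y, type = "l", main = "Chart with Rules at the Thirds",
     xlab = "Independent Variable", ylab = "Dependent Variable")

# par("usr") returns the limits of the plot region as c(x1, x2, y1, y2)
temp.usr <- par("usr")
temp.h <- temp.usr[3] + (temp.usr[4] - temp.usr[3]) * c(1, 2) / 3
temp.v <- temp.usr[1] + (temp.usr[2] - temp.usr[1]) * c(1, 2) / 3

abline(h = temp.h, lty = 2, col = "gray50")       # horizontal rules at the thirds
# Put a comment on the upper rule, near the imagined right-most vertical rule
text(x = temp.v[2], y = temp.h[2], labels = "Key comment goes here", pos = 3)
```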
Balance: Golden Ratio
Although somewhat similar to the Rule of Thirds, the Golden Ratio is a much better developed concept, both in mathematical theory and in application to real world problems. According to the Wikipedia, two quantities are in the "Golden Ratio" if the ratio of the sum of the quantities to the larger quantity is equal to the ratio of the larger quantity to the smaller one. The figures below illustrate the relationship.
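The golden ratio is expressed algebraically, for a larger quantity $a$ and a smaller quantity $b$ with $a > b > 0$, as:
$$\frac{a+b}{a} = \frac{a}{b} = \varphi$$
where the Greek letter "Phi" ($\varphi$) represents the golden ratio. Its value is:
$$\varphi = \frac{1+\sqrt{5}}{2} \approx 1.6180339887$$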
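As a quick numerical check, the golden ratio can be computed in R and tested against its definition. The variable names are made up for the example; the last step uses the well-known fact that ratios of consecutive Fibonacci numbers approach the golden ratio.

```r
# Compute the golden ratio and check the defining property numerically.
phi <- (1 + sqrt(5)) / 2
phi

# For a larger quantity a and a smaller quantity b with a / b = phi,
# the ratio (a + b) / a should equal a / b.
temp.b <- 1
temp.a <- phi * temp.b
all.equal((temp.a + temp.b) / temp.a, temp.a / temp.b)

# Ratios of consecutive Fibonacci numbers approach phi
temp.fib <- c(1, 1)
for (i in 3:20) temp.fib[i] <- temp.fib[i - 1] + temp.fib[i - 2]
temp.fib[20] / temp.fib[19]
```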
Many artists and architects have proportioned their works to approximate the golden ratio (especially in the form of the golden rectangle, in which the ratio of the longer side to the shorter is the golden ratio), believing this proportion to be aesthetically pleasing. A golden rectangle can be cut into a square and a smaller rectangle with the same aspect ratio. Mathematicians since Euclid have studied the golden ratio because of its unique and interesting properties. The golden ratio is also used in the analysis of financial markets, in strategies such as Fibonacci retracement. The golden ratio is commonly used in everyday design, for example in the shapes of postcards, playing cards, posters, wide-screen televisions, photographs, and light switch plates. Psychologists have devised studies to test the idea that the golden ratio plays a role in human perception of beauty, such as female waist to hip ratios, male shoulder to hip ratios, and forehead to face ratios. While some early studies showed support for this hypothesis, later attempts to carefully test the hypothesis have been inconclusive.
- One classic example of the application of the golden ratio is the Great Mosque of Kairouan, built in 670 AD in Tunisia. The golden ratio is repeated from the overall design to the individual rooms and columns. Not only is it one of the oldest places of worship in the Islamic world, it is also one of the largest and most impressive Islamic monuments in North Africa. The mosque is a masterpiece of both architecture and Islamic art.
- Another example is Salvador Dalí's The Sacrament of the Last Supper. The dimensions of the canvas are a golden rectangle. A huge dodecahedron, in perspective so that edges appear in golden ratio to one another, is suspended above and behind Jesus and dominates the composition. See an image of the painting at the National Gallery of Art.
As a data scientist develops more sophisticated analytical skills, she will also need to develop more sophisticated visual presentation skills. The Golden Ratio is one way to put analytical and visual sophistication in harmony with each other. One simple way is to divide up the presentation charts in sections like the example below:
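One way to try this in R is to give layout() column widths in the golden ratio. This is a sketch under assumptions: the data are simulated, and the choice of a histogram beside a boxplot is only for illustration.

```r
# Illustrative sketch: split a figure into two panels whose widths are in the golden ratio.
phi <- (1 + sqrt(5)) / 2
set.seed(4)
temp.y <- rnorm(500)

# layout() takes relative column widths; the main panel is phi times wider
temp.widths <- c(phi, 1)
layout(matrix(c(1, 2), nrow = 1), widths = temp.widths)
hist(temp.y, main = "Main Chart", xlab = "Value")     # wider panel
boxplot(temp.y, main = "Supporting Chart")            # narrower panel
layout(1)                                             # return to a single panel
```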
The Wiktionary defines Focus as the concentration of attention. The data scientist will want to create visualizations to draw the audience's attention to the important point. Visual artists create focus by contrasting size (scale), color, and page position. It is important to make sure the visual elements have a function that supports the content. For example, in Francisco de Goya's painting, "The Vintage," he wants to highlight the grape harvest. First, he uses a triangular positioning of the people so that the grapes are at the apex of the triangle. He also places the grapes in the center of a bright (triangular) cloud that is surrounded by a dark cloud. He masterfully uses both design and color to bring focus to the subject of his painting.
Here is a good use of focus to highlight conclusions based on data. The following charts depict above and below average country GDP per capita from the World Factbook and malaria risk by country from the Centers for Disease Control.
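Color contrast is one of the simplest focus devices to try in R. The sketch below uses made-up categories and values (not the GDP or malaria data) and mutes every bar except the one the audience should notice.

```r
# Illustrative sketch: draw attention to one bar by muting the others.
# The categories and values are made up for the example.
temp.values <- c(North = 12, South = 15, East = 31, West = 14)
temp.focus <- names(temp.values)[which.max(temp.values)]   # the bar to emphasize

# Gray for context, a saturated color for the point of the chart
temp.cols <- ifelse(names(temp.values) == temp.focus, "firebrick", "gray80")

temp.mid <- barplot(temp.values, col = temp.cols, border = NA,
                    ylim = c(0, max(temp.values) * 1.2),
                    main = "Focus Through Color Contrast", ylab = "Value")
# Label only the emphasized bar
text(x = temp.mid[names(temp.values) == temp.focus],
     y = max(temp.values), labels = temp.focus, pos = 3)
```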
The data scientist must use creativity to combine these five elements (Harmony, Rhythm, Flow, Balance, and Focus) together in appropriate proportions to convey the messages in interesting and informative ways to the audience. The creative process includes divergent thinking, which involves the generation of multiple answers to a problem; conceptual blending, in which solutions arise from the intersection of two quite different frames of reference; and, honing, in which an acceptable solution emerges from iterating over many successive unacceptable versions of the solution. In practice, creativity is often a team sport. When several people from diverse backgrounds come together to solve a problem, they can more easily engage in divergent thinking and conceptual blending. Honing is just good, old-fashioned elbow grease. Please do not fall in love with your first attempt. Think of your first attempt as the beginning (not the end) of a conversation that will engage both members of the data science team, as well as members of the potential audience for the team's findings.
The following graph is a good example of a science graphic using all five visual elements. See if you can see how harmony, rhythm, flow, balance, and focus are used.
- The image shows that the atoms in a molecule can be modeled as charged spheres connected by springs that maintain bond lengths and angles. The charged atoms interact with each other (via Coulomb's law) and with solvent. The shroud represents the region of hydrophobic repulsion, where the strength of the hydrophobic effect is approximately proportional to the surface area of the shroud. The shroud, shown extending only over the back of the molecule, actually extends all the way around it. The model shown is called a molecular mechanics potential energy function, and it is used by programs like Folding@Home to simulate how molecules move and behave. The molecule shown is an alanine dipeptide.
Use R to create some tables and plots. Get into groups of 2 to 3 students. Try to work with at least one other person you have not been in a group with before. Be sure everyone in your group understands all of the R code as you execute it. Let's start by examining a categorical variable.
#Generate Table for Categorical Variable
#Remove Objects in workspace and print date
rm(list=ls())
paste("Today is:", date())
#Create a nominal (categorical) variable with 10,000 observations
#Use the sample() function, and store the result as a factor so R treats it as categorical
temp.a <- factor(sample(LETTERS[1:4], 10000, replace=TRUE, prob=c(0.1, 0.2, 0.65, 0.05)))
#List the values
temp.a
#Get a summary - for nominal variables (factors) it gives a count
summary(temp.a)
#Produce a simple table showing the frequency distribution of the values of the variable
table(temp.a)
#Do a summation across all the categories, a total that we would put in the margin
#Note that in order for the margin.table() function to work properly it needs a numeric input
#The table(temp.a) produces four numbers, which are passed to margin.table()
margin.table(table(temp.a))
#Now let's get R to print the frequencies and margins together
addmargins(table(temp.a))
#Similarly, in order to calculate the percentage distribution, we need to pass a numeric argument
prop.table(table(temp.a))
addmargins(prop.table(table(temp.a)))
#The Hmisc library has a function that simply puts all of these together
library(Hmisc)
describe(temp.a)
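A frequency table often reads more easily as a bar chart. The sketch below regenerates a variable like the one above (with a fixed seed, which is an addition for reproducibility) and plots its percentage distribution.

```r
# Illustrative sketch: plot the percentage distribution of a categorical variable.
set.seed(6)
temp.a <- factor(sample(LETTERS[1:4], 10000, replace = TRUE,
                        prob = c(0.1, 0.2, 0.65, 0.05)))

# prop.table() converts counts to proportions; multiply by 100 for percentages
temp.pct <- 100 * prop.table(table(temp.a))

barplot(temp.pct, ylim = c(0, 100),
        main = "Percentage Distribution of temp.a",
        xlab = "Category", ylab = "Percent")
```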
Let's try exploring a continuous variable.
#Generate Table for Continuous Variable
#Remove Objects in workspace and print date
rm(list=ls())
paste("Today is:", date())
#Create a continuous (numerical) variable with 10,000 observations
#Use the rnorm() function to generate 10,000 random numbers with a mean of 0 and a standard deviation of 1
temp.a <- rnorm(10000, mean = 0, sd = 1)
#List the values
temp.a
#Get a summary - for continuous variables it gives quartiles
summary(temp.a)
#Get descriptive statistics
min(temp.a) #minimum
max(temp.a) #maximum
range(temp.a) #range
median(temp.a) #median
mean(temp.a) #mean
var(temp.a) #variance
sd(temp.a) #standard deviation
# The "describe" function in the "psych" library gives all these with one command
# item name, item number, nvalid, mean, sd, median, mad (median absolute deviation),
# min, max, skew, kurtosis, se (standard error of the mean)
library(psych)
describe(temp.a)
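The descriptive statistics can be paired with a picture of the distribution. This sketch assumes the same kind of simulated variable (with a fixed seed added for reproducibility) and overlays a normal curve built from the sample mean and standard deviation.

```r
# Illustrative sketch: look at the distribution of a continuous variable.
set.seed(5)
temp.a <- rnorm(10000, mean = 0, sd = 1)

# Histogram on the density scale so a curve can be drawn over it
hist(temp.a, breaks = 40, freq = FALSE,
     main = "Distribution of temp.a", xlab = "Value")
# Overlay the normal curve implied by the sample mean and standard deviation
curve(dnorm(x, mean = mean(temp.a), sd = sd(temp.a)), add = TRUE, lwd = 2)

# quantile() returns the minimum, quartiles, and maximum that summary() reports
temp.q <- quantile(temp.a)
temp.q
```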
Now try this simple plot.
#Generate plots
#Remove Objects in workspace and print date
rm(list=ls())
paste("Today is:", date())
#The plot character (pch=) parameter specifies which symbol will be used in the plot
#Need x, y, and pch vectors from 1 to 25
temp.x <- 1:25
temp.y <- 1:25
temp.p <- 1:25
#Set up plot of y on x using default plot character (no pch= specified)
plot(
  main="Simple Plot of Y on X",
  x=temp.x, xlim=c(0,26), xlab="Independent Variable",
  y=temp.y, ylim=c(0,26), ylab="Dependent Variable"
)
You will need to add the library "calibrate" to your R workspace in order to use the "textxy" command in the following example. Use the "package manager" and "package installer" commands from the R console pull down menus. The packages chapter in the R programming Wikibook is a place to start.
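The package can also be installed from the R console rather than the menus. The sketch below checks whether "calibrate" is already available and only attempts an installation in an interactive session; the variable names are made up for the example, and the download itself assumes a working connection to a CRAN mirror.

```r
# Illustrative sketch: install the calibrate package only if it is missing.
temp.pkg <- "calibrate"

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
temp.have <- requireNamespace(temp.pkg, quietly = TRUE)

# Only try to download in an interactive session (needs access to a CRAN mirror)
if (!temp.have && interactive()) {
  install.packages(temp.pkg)
}
temp.have
```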
Now, let's replace that first plot command with the following.
#Set up plot of y on x specifying a different plot character for each point
plot(
  main="Simple Plot of Y on X with Plot Characters",
  x=temp.x, xlim=c(0,26), xlab="Independent Variable",
  y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
  pch=temp.p
)
#Use textxy to label points with value of pch,
#Shift label positions slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-0.2
#Use character expansion (cex=); older releases of calibrate named this argument cx=
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cex=.6)
Be sure to print out a copy of that plot to use for reference.
Now, let's try changing the colors.
#Set up plot of y on x specifying different colors for each point
plot(
  main="Simple Plot of Y on X with Colored Characters",
  x=temp.x, xlim=c(0,26), xlab="Independent Variable",
  y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
  pch=16, col=temp.p
)
#Use textxy to label points with the value of temp.p (here, the color number),
#Shift label positions slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-.2
#Use character expansion (cex=); older releases of calibrate named this argument cx=
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cex=.6)
And now, let's change the size of the points in the plot.
#Set up plot of y on x specifying different sizes for each point
#Use the character expansion (cex=) parameter
#A cex=1 is the default; a cex=2 is twice the default size; and a cex=.5 is half the default size
#So as not to make the points too big, let's
#transform the temp.p variable from 1-25 to 0.2-5
temp.c <- temp.p/5
plot(
  main="Simple Plot of Y on X with Sized Characters",
  x=temp.x, xlim=c(0,26), xlab="Independent Variable",
  y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
  pch=16, col=3, cex=temp.c
)
#Use textxy to label points with the value of temp.p (here, five times the cex value),
#Shift label positions slightly so labels don't sit on top of points
temp.xshift <- temp.x-1.25
temp.yshift <- temp.y-.2
#Use character expansion (cex=); older releases of calibrate named this argument cx=
library(calibrate)
textxy(temp.xshift, temp.yshift, temp.p, cex=.6)
Finally, let's draw a line through the points.
#Set up plot of y on x drawing a line through the points
plot(
  main="Simple Plot of Y on X with Fitted Line",
  x=temp.x, xlim=c(0,26), xlab="Independent Variable",
  y=temp.y, ylim=c(0,26), ylab="Dependent Variable",
  pch=15, col=5, cex=2
)
#Find the slope and intercept of the line
#Use the linear model (lm()) function and store results into the R object (temp.line)
#The linear model is: Y = a + bX, where a is the intercept and b is the slope
temp.line <- lm(temp.y~temp.x)
temp.line
#Now plot a line with that intercept (a) and slope (b)
abline(temp.line)
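The intercept and slope can also be pulled out of the fitted model as numbers. The sketch below repeats the fit so it runs on its own; because the example data lie exactly on the line Y = X, the intercept comes out as 0 and the slope as 1, up to rounding error.

```r
# Illustrative sketch: extract the intercept (a) and slope (b) from a fitted line.
temp.x <- 1:25
temp.y <- 1:25
temp.line <- lm(temp.y ~ temp.x)

# coef() returns a named vector: "(Intercept)" is a, "temp.x" is b
temp.coef <- coef(temp.line)
temp.coef[["(Intercept)"]]
temp.coef[["temp.x"]]

# Use the fitted line to predict Y at a new value of X
temp.pred <- predict(temp.line, newdata = data.frame(temp.x = 30))
temp.pred
```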
- Livio, Mario (2003). The Golden Ratio: The Story of PHI, the World's Most Astonishing Number. New York: Broadway. ISBN 978-0767908160.
- Rosling, Hans (2006). "Stats that reshape your world-view". Video. TED. http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html. Retrieved 15 August 2012.
- Tufte, Edward (2001). The Visual Display of Quantitative Information (2nd ed.). Columbia, MD: Graphics Press. ISBN 978-0961392147.
- "Tableau Visual Guidebook". White Paper. Tableau Software. http://www.tableausoftware.com/learn/whitepapers/tableau-visual-guidebook. Retrieved 19 November 2012.
- See, for example, Douglas Kipperman and Melissa McKinstry (2008). "Design Rules of Thumb". WriteDesignOnline. http://www.writedesignonline.com/resources/design/rules/index.html. Retrieved 15 August 2012.
- Higgins, E. Tory (July 1987). "Self-discrepancy: A theory relating self and affect.". Psychological Review 94 (3): 319-340. http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&id=1987-34444-001. Retrieved 25 August 2012.
- Dalí, Salvador (1955). "The Sacrament of the Last Supper". Painting. National Gallery of Art. http://www.nga.gov/volunteer/pdf/dali_infosheet.pdf. Retrieved 15 August 2012.