Take-home Exercise 2

Take-home Exercise

Creating data visualisation beyond default: peer critique and visualization remake of take-home exercise 1.

Leslie Long Nu https://www.linkedin.com/in/leslielongnu/ (SMU, MITB)https://scis.smu.edu.sg/master-it-business
May 1, 2022

1. Overview

This take-home exercise is done based on a take-home exercise 1 submission prepared by a classmate. The peer submission will be critiqued in terms of clarity and aesthetics, and the original design will be remade using the data visualization principles and best practice learnt in Lesson 1 and 2.

The dataset used in take-home exercise 1 and 2 is downloaded from VAST Challenge 2022, and processed by RStudio tidyverse family of packages and visualized by ggplot2 and its extensions.

2. Data Preparation

Data preparation steps taken by the original author of the critiqued graphs are listed here for easy reference. As this is not the focus of this exercise, I will not go into details about it.

Installing packages and loading libraries:

hide
# install and load libraries 
packages = c('tidyverse', 'knitr', 'ggdist', 'scales', 'grid', 'gridExtra',
             'formattable', 'patchwork')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Import data:

hide
# read csv file
participants <- read_csv('data/Participants.csv')

Bin age variable:

hide
# bin age variable into 5 different age groups 
one <- c("18", "19", "20")
two <- c("21", "22", "23", "24", "25", "26", "27", "28", "29", "30" )
three <- c("31", "32", "33", "34", "35", "36", "37", "38", "39", "40")
four <- c( "41", "42", "43", "44", "45", "46", "47", "48", "49", "50")
five <-  c( "51", "52", "53", "54", "55", "56", "57", "58", "59", "60")

participants <- participants %>%
  mutate(age_modified = case_when(
    age %in% one ~ "<=20",
    age %in% two ~ "20's",
    age %in% three ~ "30's",
    age %in% four ~ "40's",
    age %in% five ~ "50+")) %>%
  select(-age)

3. Visualization Critique and Remake

There are four graphs in total in this take-home exercise 1 and they will be reviewed and remade in terms of clarity and aesthetics.

3.1 Distribution of Joviality

3.1.1 Critique

A histogram is used to reveal the distribution of residents’ joviality.

hide
# current graph
ggplot(data=participants,
       aes(x = joviality)) +
  geom_histogram(bins=20,
                 boundary = 50,
                 color="black",
                 fill="light blue") +
  coord_cartesian(xlim=c(0.01,1)) +
  labs(title = "Distribution of Joviality",
       caption = "demographic information, Ohio USA") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 1),
        plot.caption = element_text(hjust = 0))

Clarity

hide
# comparing shapes of histograms with different number of bins
p1 <- 
  ggplot(data=participants,
       aes(x = joviality)) +
  geom_histogram(bins=20,
                 boundary = 50,
                 color="black",
                 fill="light blue") +
  coord_cartesian(xlim=c(0.01,1)) +
  labs(title = "Distribution of Joviality (bins=20)",
       caption = "demographic information, Ohio USA") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 1),
        plot.caption = element_text(hjust = 0))

p2 <- 
  ggplot(data=participants,
       aes(x = joviality)) +
  geom_histogram(bins=10,
                 boundary = 50,
                 color="black",
                 fill="light blue") +
  #coord_cartesian(xlim=c(0.0, 1)) +
  labs(title = "Distribution of Joviality (bins=10)",
       caption = "demographic information, Ohio USA") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 1),
        plot.caption = element_text(hjust = 0))

p1|p2

Aesthetics

3.1.2 Remake

hide
# remake
ggplot(participants, 
       aes(joviality)) +
  # plot histogram
  geom_histogram(bins = 20,
                 color="#666666",
                 fill="light blue") +
  # plot density curve (y= ..density..*60 to match the second y-axis)
  geom_density(aes(y= ..density..*60), color= '#ff7373') +
  # add second y-axis, which is transformed from the primary y-axis (/60)
  scale_y_continuous('No. of\nResidents',
                     expand = c(0.05, 0.2),
                     sec.axis = sec_axis(~. /60, name= 'Density',
                                         breaks= seq(0, 1.2, 0.5) )) +
  # add mean joviality line
  geom_vline(aes(xintercept = mean(joviality)),
             linetype= 'dashed',
             size= .4,
             color= '#0e2f44') +
  # annotation for mean joviality line 
  geom_text(aes(mean(joviality), 65, 
                label= 'Mean Joviality Index',
                hjust= -0.1),
            check_overlap = T,
            color= '#0e2f44') +
  labs(y= 'No. of\nResidents', x= 'Joviality',
       title = "Distribution of Residents' Joviality Index",
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(axis.title.y= element_text(angle=0), 
        axis.title.y.right = element_text(angle=0, vjust = 1),
        panel.grid.major = element_line(color= 'grey', size = 0.1),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'),
        plot.caption = element_text(hjust = 0))

The makeover graph kept the good features of the original graph, such as the caption and the overall format, and improved on a few elements:

Clarity

Aesthetics

3.2 Distribution of Age for Different Household Types

3.2.1 Critique

A stacked bar chart is chosen to reveal the distribution of residents’ age for different household sizes.

hide
# current graph
ggplot(data=participants, 
       aes(x=age_modified, fill = educationLevel)) +
  geom_bar()+
  labs(
    title = "Distribution of Age for different household types",
    caption = "demographic information, Ohio USA"
    ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
    )

Clarity

Aesthetics

3.2.2 Remake

Further data preparation is first done before remaking graph:

hide
# data preparation for remake
# rename '<=20' and '50+' for a more formal formatting
participants2 <- participants
participants2$age_modified <-
  sub('<=20', '20 and below', participants$age_modified)
participants2$age_modified[participants2$age_modified == 'above 50+'] <- 'above 50'

# rename 'HighSchoolOrCollege' values in dataset for ease of reading
participants2$educationLevel <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    participants2$educationLevel)

# calculate %composition of age groups for different education levels
percent <- 
  participants2 %>% 
  group_by(age_modified, educationLevel) %>%
  summarise(edu_size= n()) %>%
  mutate(edu_pct= percent(edu_size/sum(edu_size))) 
hide
# plot remake graph
# plot p1: bar chart of age group distribution
p1 <-  
  ggplot(participants2, aes(x= age_modified)) +
  geom_bar(fill= '#468499') +
  scale_y_continuous(expand = c(0.2, 0.2)) +
  labs(y= 'No. of\nResidents', 
       title = "Residents' Age Group Distribution",
       subtitle= 'Demographics in Engagement, Ohio') +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'grey'),
        panel.grid.major.y = element_line(size= 0.2, color = "grey"),
        axis.title.x = element_blank(),
        plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5))

# prepare sorted order of education levels from low to high levels
edu_lvl <- c('Low', 'High School or College', 'Bachelors', 'Graduate')

# plot p2: composition of corresponding age group in terms of education levels
p2<- 
  ggplot(percent, 
         aes(x= age_modified, y= edu_pct, 
             group= educationLevel, 
             color= factor(educationLevel, levels = edu_lvl))) +
  geom_line() +
  scale_color_discrete(name= 'Education Level') +
  scale_y_continuous(labels = percent_format(),
                     expand = c(0.2, 0.2)) +
  labs(y= 'Percent', x= 'Age Group', 
       title = "Composition of Residents' Education Level in Different Age Groups",
       caption = "Source: VAST Challenge 2022")+
  theme(legend.position = 'top', legend.direction = 'horizontal',
        axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'),
        panel.grid.major = element_line(size= 0.2, color = "grey"),
        legend.key = element_rect(fill= NA), 
        legend.title = element_text(size = 8.5),
        plot.title = element_text(hjust=0.5),
        plot.caption = element_text(hjust = 0))

# use patchwork to stack 2 graphs 
p1/p2

Clarity

Aesthetics

3.3 Joviality Measure 1

3.3.1 Critique

This graph is a horizontal stacked bar chart designed to reveal the joviality distribution in relation to residents’ education level, age group and whether having kids.

hide
# current graph 
ggplot(data=participants, 
       aes(x= joviality, 
           y= educationLevel, fill = haveKids)) +
  geom_col() +
  theme_classic()+
  labs(
    title = "Joviality Measure",
    caption = "demographic information, Ohio USA"
    ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
    )+
  facet_wrap(~age_modified)

Clarity

Aesthetics

3.3.2 Remake

In this remake, I plan to split the four variables included in the current graph into two graphs. I feel the author’s main intention is to explore how other variables affect joviality, therefore, the first remake will be on relationship between whether having kids and joviality while the second will be on joviality distribution in different educational and age groups.

hide
# remake part 1
ggplot(participants2,
       aes(x= joviality, color= haveKids)) +
  geom_density() +
  labs(y= 'Density', x= 'Joviality',
       title = "Are Residents with Kids more Jovial?",
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  scale_color_discrete(name= 'Have Kids?', labels = c('True', 'False')) +
  theme(axis.title.y= element_text(angle=0),
        panel.grid.major = element_line(color= 'grey', size = 0.1),
        panel.background= element_blank(), 
        axis.line= element_line(color= 'grey'),
        plot.caption = element_text(hjust = 0),
        legend.key = element_rect(fill= NA))

This graph shows the difference in density curves of residents who have and do not have kids. It can be concluded that residents who have kids have a higher probability (area under the curve) of having lower joviality.

hide
# remake part 2
ggplot(participants2,
       aes(x= fct_rev(factor(educationLevel, levels= edu_lvl)), y= joviality)) +
  geom_violin(fill= '#66cdaa',
              color= NA,
              bw= 0.4) +
  geom_boxplot(width= 0.2,
               color = '#065535',
               alpha= 0.3) +
  stat_summary(aes(color= 'Mean'),
               fun= 'mean',
               size= 0.2) + 
  geom_hline(aes(yintercept = 0.5),
             color= 'black',
             linetype= 'dashed',
             size= .6) +
  scale_color_manual(name= 'Statistics', 
                     values = (Mean= '#f6546a')) +
  facet_wrap(~ age_modified) +
  labs(title= 'Joviality Distribution in Relation to Age and Interest Groups', 
       x= 'Interest Group', y= 'Joviality',
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(panel.background = element_blank(),
        axis.title.y = element_text(angle=0),
        axis.ticks.y = element_blank(),
        axis.line= element_line(color= 'grey'),
        panel.grid.major.y = element_line(color= 'grey', size = 0.1),
        plot.caption = element_text(hjust=0),
        legend.key = element_rect(fill= NA),
        axis.text = element_text(size=10),
        axis.title = element_text(size= 12),
        plot.title = element_text(size = 18),
        plot.subtitle = element_text(size = 14),
        legend.text = element_text(size= 12)) +
  coord_flip()

The remade graph shows the distribution of joviality for different age groups and education levels. Additional information, such as mean joviality and joviality=0.5, is marked on the graph for audience’s easy reference. The overall format and theme are also improved for a cleaner look.

3.4 Joviality Measure 2

3.4.1 Critique

A faceted bar chart is used to reveal the distribution of joviality with respect to different age groups and interest groups.

hide
# current graph 
ggplot(data=participants, 
       aes(x= interestGroup, 
           y= joviality)) +
  geom_col() +
  theme_classic()+
  labs(
    title = "Joviality Measure",
    caption = "demographic information, Ohio USA"
    ) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 1),
    plot.caption = element_text(hjust = 0)
    )+
  facet_wrap(~age_modified)

Clarity

Aesthetics

3.4.2 Remake

hide
# remake graph
ggplot(participants2,
       aes(x= interestGroup, y= joviality)) +
  geom_violin(fill= '#66cdaa',
              color= NA,
              bw= 0.4) +
  geom_boxplot(width= 0.2,
               color = '#065535',
               alpha= 0.3) +
  stat_summary(aes(color= 'Mean'),
               fun= 'mean',
               size= 0.2) + 
  geom_hline(aes(yintercept = 0.5),
             color= 'black',
             linetype= 'dashed',
             size= .6) +
  scale_color_manual(name= 'Statistics', 
                     values = (Mean= '#f6546a')) +
  facet_wrap(~ age_modified) +
  labs(title= 'Joviality Distribution in Relation to Age and Interest Groups', 
       x= 'Interest Group', y= 'Joviality',
       subtitle= 'Demographics in Engagement, Ohio',
       caption = "Source: VAST Challenge 2022") +
  theme(panel.background = element_blank(),
        axis.title.y = element_text(angle=0),
        axis.ticks.x = element_blank(),
        axis.line= element_line(color= 'grey'),
        panel.grid.major.y = element_line(color= 'grey', size = 0.1),
        plot.caption = element_text(hjust=0),
        legend.key = element_rect(fill= NA))

Clarity

Aesthetics

4. Learning Points

Take-home exercise 2 serves as a mirror image to take-home exercise 1. It provided me with an opportunity to reflect on what I have done for take-home exercise 1 after seeing it from different perspectives of classmates’ works. My key takeaways are: