Creating data visualisation beyond default: an analysis on the demographic of the city of Engagement, Ohio USA (VAST Challenge 2022).
In this take-home exercise, the demographic of the city of Engagement, Ohio USA will be revealed by applying the skills learnt in ISSS608 Visual Analytics Lesson 1 and Hands-on Exercise 1. It is assumed that the volunteer participants are representative of the city’s population.
The data used in this exercise is from VAST Challenge 2022,
and processed by RStudio tidyverse family of packages and
visualized by ggplot2 and its extensions.
Challenge 1: Locate and Prepare Relevant Data
One of the challenges faced in this take-home exercise is that the dataset from VAST Challenge 2022 includes a considerable large number of files, 88 in total, and it is challenging to locate all the demographic related data. Upon review, while the participant file contains the essential data on demographic of the city, such as age, education level, family size and so on, it does not provide any data on residents’ income level.
To include income information into the demographic reveal, data wrangling is required to derive income data, either from the status logs files -> job ID -> jobs file -> hourly rate; or from the financial journal -> wage. Considering that, although working days and hours are provided in jobs file, no information on public holidays, vocations taken or leaves applied are provided, which could affect the ultimate income as job pay is given in hourly rate instead of monthly lump sump. Therefore, income data will be derived from financial journal -> wage as it represents the actual income received by the residents.
Challenge 2: Outliers
From the financial journal, it is observed that the wage values range widely. Thus, derived monthly income is likely to have outliers. Outliers affect the clarity and effectiveness of distribution graphs, such as scatter plots and boxplots, as the majority of the data points becomes visually trivial in the attempt to fit the outliers on the graph.
In order to eliminate this problem, data should either be transformed and plotted in percentiles, or the format of visualization must be chosen carefully to represent the true distribution well.
Challenge 3: Variable Preparation
The variables need to be prepared and checked properly for visualization. For example, the age variable should be binned by intervals, such as below 20, 20-24, …, to allow for a more holistic understanding of the demographic representation. Another example is the education level variable, as it is in character format, when plotted the levels will be arranged alphabetically, first ‘Bachelor’, followed by ‘Graduate’, ‘High School or College’ and ‘Low’, which is not the logical order that we are used to perceive education level. For issues like this, manual adjustment using R code is required to produce a clear and comprehensive graph.
Challenge 4: Large Raw Data File
The financial journal file used to derive income data is 80.7 MB, larger than the 50 MB size limit set by GitHub. Therefore, the raw data cannot be pushed to GitHub repository directly.
This is resolved by preparing the dataframe for visualization, and then saving it in RDS format. The raw data is not pushed or commited while only the saved RDS file is referenced for visualization, and pushed to GitHub repository.
The following code chunk installs the required R packages and loads them onto RStudio environment.
packages = c('tidyverse', 'knitr', 'ggdist', 'scales', 'grid', 'gridExtra',
'formattable', 'patchwork')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The datasets used for this exercise are provided by VAST Challenge 2022.
Data is imported using read_csv() of readr package, which
is useful for reading delimited files into tibbles.
participants <- read_csv('rawdata/Participants.csv')
financial <- read_csv('rawdata/FinancialJournal.csv')
The following code chunk is used to have an overview of the datasets.
To find out the income level of the residents, residents’ monthly income is derived by calculating the total sum of wage divided by 15. The result is the average (mean) monthly income received by the residents as the data ranges for 15 months.
income <- financial %>%
filter(category == 'Wage') %>% # extract only wage data
select(participantId, amount) %>% # extract participant ID and amount columns
group_by(participantId) %>% # group by participant ID
summarise(Monthly_Income = sum(amount)/15)
# calculate average monthly income for each participant
# check the derived file income
summary(income)
It is observed from summary(income) that the monthly
income’s variance is very high as it ranges from min value of 139.9 to
max value of 17369.0. It has outliers too as the mean and median values
are 3328.3 and 3668.5 respectively. This indicates that we need to
choose an appropriate form of data visualization, or transform the data
into percentiles so that the plot is not distorted by the outliers.
Income file is joined with participants file based on participant ID to
allow for comparison of all demographic variables. Inner join is used as
both data contain exactly 1011 entries and unique identifier of
participant ID ranging from 0 to 1010.
participants <- inner_join(x= participants, y= income, by= 'participantId')
# confirm tables are joined correctly
head(participants)
The columns and values of participants file are renamed with below code chunk for better formatting and ease of reading.
# rename columns
participants <- participants %>%
rename('Participant_ID' = 'participantId',
'Household_Size' = 'householdSize',
'Have_Kids' = 'haveKids',
'Age' = 'age',
'Education_Level' = 'educationLevel',
'Interest_Group' = 'interestGroup',
'Joviality' = 'joviality')
# verify if the columns have been renamed correctly
colnames(participants)
#rename value
participants$Education_Level <- sub('HighSchoolOrCollege',
'High School or College',
participants$Education_Level)
Age variable is binned with the following code chunk:
Dataframe participants is saved and read in RDS format to avoid uploading large files to Git.
saveRDS(participants, 'data/participants.rds')
# A tibble: 6 × 9
Participant_ID Household_Size Have_Kids Age Education_Level
<dbl> <dbl> <lgl> <dbl> <chr>
1 0 3 TRUE 36 High School or College
2 1 3 TRUE 25 High School or College
3 2 3 TRUE 35 High School or College
4 3 3 TRUE 21 High School or College
5 4 3 TRUE 43 Bachelors
6 5 3 TRUE 32 High School or College
# … with 4 more variables: Interest_Group <chr>, Joviality <dbl>,
# Monthly_Income <dbl>, Age_Group <fct>
A bar chart was plotted using ggplot2 on residents’ age
groups distribution as follows:
geom_text() is used to add annotations of the count and
% values of each age grouptheme() is used to remove the x axis ticks as it is
categorical scale.From the bar chart it is observed that residents’ age distribution is mostly even, with 20 & Below as the smallest group (7.1%).
ggplot(data= participants,
aes(x= Age_Group)) +
geom_bar(fill= '#468499') +
ylim(0, 150) +
geom_text(stat = 'count',
aes(label= paste0(stat(count), ', ',
round(stat(count)/sum(stat(count))*100,
1), '%')), vjust= -0.5, size= 2.5) +
labs(y= 'No. of\nResidents', x= 'Age Group',
title = "Distribution of Residents' Age") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey'))

The bar chart of household size indicates that there are no single parent households in the city, as all households of size 2 have no kids.
ggplot(data= participants,
aes(x= Household_Size,
fill = Have_Kids)) +
geom_bar()+
ylim(0, 400) +
geom_text(stat = 'count',
aes(label= stat(count)),
vjust= -0.5,
size= 3) +
labs(title = 'Household Size of the Residents',
y= 'No of\nResidents', x= 'Household Size') +
theme(axis.title.y= element_text(angle=0),
axis.ticks.x= element_blank(),
panel.background= element_blank(),
axis.line= element_line(color= 'grey'))

Distribution of residents’ education level shows that more than half of the residents are high school or college educated, followed by bachelors, graduate and lastly low education levels.
participants %>%
mutate(Education= fct_infreq(Education_Level)) %>%
ggplot(aes(x= Education)) +
geom_bar(fill= '#6897bb') +
geom_text(stat = 'count',
aes(label= paste0(stat(count), ', ',
round(stat(count)/sum(stat(count))*100,
1), '%')), vjust= -0.5, size= 3) +
labs(y= 'No. of\nResidents', title = "Distribution of Residents' Education Level") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey'))

Volin and boxplot of age distribution for different household sizes are plotted as follows:
scale= 'count' is used for geom_voilin()
as it gives a true representation of the distribution because areas are
scaled proportionally to the number of observations.as.factor(Household_Size) is used as household size is
categoricalEduLevels <- c('Low', 'High School or College', 'Bachelors', 'Graduate')
ggplot(data=participants,
aes(x= as.factor(Household_Size), y= Age)) +
geom_violin(fill= '#66cdaa',
scale = 'count',
color= NA,
bw= 0.4) +
geom_boxplot(width= 0.2,
color = '#065535',
alpha= 0.3) +
stat_summary(geom= 'point',
fun= 'mean',
color= '#ff7373',
size= 2) +
facet_grid(~factor(Education_Level, levels = EduLevels)) +
labs(title= 'Age Distribution for Different Household Sizes',
x= 'Household Size') +
theme(panel.background = element_blank(),
axis.ticks.x = element_blank(),
axis.line= element_line(color= 'grey'),
panel.grid.major.y = element_line(color= 'grey', size = 0.1))

From the plot it can be observed:
The distribution of both residents’ interest groups and education levels are plotted as follows:
patchwork is used to stack two graphs together, one
showing the total size of the interest groups, another showing the
corresponding composition of education levels.# plot p1: bar chart of interest group distribution in descending order
number <-
participants %>%
mutate(Interest= fct_infreq(Interest_Group))
p1 <-
ggplot(data= number, aes(x= Interest)) +
geom_bar(fill= '#468499') +
scale_y_continuous(expand = c(0.2, 0.2)) +
labs(y= 'No. of\nResidents',
title = "Total Number of Residents in Interest Groups") +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey'),
panel.grid.major.y = element_line(size= 0.2, color = "grey"),
axis.title.x = element_blank(),
plot.title = element_text(hjust=0.5))
# plot p2: composition of corresponding interest group in terms of education levels
int_lvl <- c('J', 'H', 'G', 'F', 'A', 'C', 'D', 'I', 'B', 'E')
percent <-
participants %>%
group_by(Interest_Group, Education_Level) %>%
summarise(edu_size= n()) %>%
mutate(edu_pct= percent(edu_size/sum(edu_size)))
p2<-
ggplot(data= percent,
aes(x= factor(Interest_Group, levels = int_lvl), y= edu_pct,
group= Education_Level,
color= factor(Education_Level, levels = EduLevels))) +
geom_line() +
scale_color_discrete(name= 'Education Level') +
scale_y_continuous(labels = percent_format(),
expand = c(0.2, 0.2)) +
labs(y= 'Percent', x= 'Interest Group',
title = "Composition of Residents' Education Level in Interest Groups")+
theme(legend.position = 'top', legend.direction = 'horizontal',
axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey'),
panel.grid.major = element_line(size= 0.2, color = "grey"),
legend.key = element_rect(fill= NA), legend.title = element_text(size = 8.5),
plot.title = element_text(hjust=0.5))
# use patchwork to stack 2 graphs
p1/p2

Distribution of residents’ interest groups shows that residents’ interests are fairly evenly distributed among the 10 interest groups, with group J having the largest size, 116 residents, and group E having the smallest size, 83 residents. The education level composition of the interest groups indicates:
The following plot is created to show the distribution of joviality index for different interest groups using the half-eye plot. It indicates the density plot (grey shade above) as well as the medians (dots) and quantile intervals (line and shape below).
While the overall joviality distribution across different interest groups does not differ significantly, it is observed from the plot that interest group E has the highest median joviality index while interest group H has the lowest.
ggplot(participants,
aes(x= fct_rev(Interest_Group), y= Joviality)) +
stat_halfeye(adjust = .35,
width = .6,
color = '#20b2aa',
justification = -.15,
position = position_nudge(x = .12)) +
scale_x_discrete(expand= c(0.1, 0.1)) +
geom_hline(aes(yintercept = 0.5),
linetype= 'dashed',
color= '#f08080',
size= .6) +
coord_flip() +
labs(x = 'Interest Group',
title = 'Joviality Distribution in Different Interest Groups') +
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
axis.ticks.y = element_blank(),
panel.grid.major = element_line(size= 0.2, color = "grey"))

Joviality versus monthly income distribution is plotted with following:
ggplot(data= participants,
aes(x= Monthly_Income, y= Joviality, color= Joviality)) +
scale_color_gradient(low= '#133337', high = '#a0db8e') +
geom_point() +
scale_x_continuous(breaks= c(0,5000,10000,15000),
labels = c('0', '5K', '10K', '15K')) +
geom_vline(aes(xintercept = median(Monthly_Income,
na.rm = T)),
color= 'red',
linetype= 'dashed',
size= .6) +
geom_text(aes(median(Monthly_Income), 1,
label= 'Median Monthly Income', hjust= -0.1)) +
geom_hline(aes(yintercept = 0.5),
color= 'red',
linetype= 'dashed',
size= .6)+
labs(title = "Joviality versus Monthly Income", x= 'Monthly Income')+
theme(panel.background= element_blank(), axis.line= element_line(color= 'grey'),
panel.grid.major = element_line(size= 0.2, color = "grey"))

This plot revealed that here is no positive correlation between joviality and income level. In fact, some residents have very high joviality index with low income, while majority of those with monthly income higher than 10k have low joviality index.
This plot is done with the following:
# prepare data for background fill
d <- participants
d_bg <- d[, -5]
# prepare mean and median monthly income by different education levels
med_in <- participants %>%
group_by(Education_Level) %>%
summarise(med_in = median(Monthly_Income),
mean_in = mean(Monthly_Income))
ggplot(data= d,
aes(x= Monthly_Income, fill= Education_Level)) +
geom_histogram(data= d_bg,
fill= 'grey',
alpha= .5) +
geom_histogram(color= 'black') +
geom_vline(data= med_in,
aes(xintercept = med_in, color= 'Median'),
linetype= 'dashed',
size= .4) +
geom_vline(data= med_in,
aes(xintercept = mean_in, color= 'Mean'),
linetype= 'dashed',
size= .4) +
scale_color_manual(name= 'Statistics',
values = c(Median= 'dark blue', Mean= 'Red')) +
facet_wrap(~ factor(Education_Level, levels = EduLevels)) +
guides(fill= 'none') +
labs(y= 'No of Residents', x= 'Monthly Income',
title = 'Monthly Income Distribution Faceted by Education Level') +
theme_bw()

This plot indicates, unsurprisingly, that the income distribution is higher for higher education levels. In addition, it is worth noting that, the mean value is higher than median for higher education levels, it is the opposite case for low education level, and they are almost the same for high school or college education. This is because mean value is affected significantly by outliers, and there are more outliers (very high income) for bachelors and graduate levels. Also, from the overall income distribution shown in the background, it is observed that there is a group with very low income level, and they consist of low and high school or college education levels.
This take-home exercise provides an excellent opportunity to get
familiarized with data preparation and data visualization using R
packages, especially using tidyverse and
ggplot2 and their extensions. My key takeaways are:
ggplot2 in R is highly customizable as it
is based on the grammar of graphics, which allows for numerous ways of
graphical composition as required. The more I use it, the more I
understand systematically how complicated graphics can be constructed
layer by layer.