Project 1 IBM HR Analytics

Author:Jiaying Chen, Yingjin He, Sabrina Seow, Kashish Solanki, Roman Vazquez Lorenzo, Maximilian Stock

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
head(hr_cleaned)
## # A tibble: 6 × 19
##     age attrition daily_rate department      distance_from_ho… education  gender
##   <dbl> <chr>          <dbl> <chr>                       <dbl> <chr>      <chr> 
## 1    41 Yes             1102 Sales                           1 College    Female
## 2    49 No               279 Research & Dev…                 8 Below Col… Male  
## 3    37 Yes             1373 Research & Dev…                 2 College    Male  
## 4    33 No              1392 Research & Dev…                 3 Master     Female
## 5    27 No               591 Research & Dev…                 2 Below Col… Male  
## 6    32 No              1005 Research & Dev…                 2 College    Male  
## # … with 12 more variables: job_role <chr>, environment_satisfaction <chr>,
## #   job_satisfaction <chr>, marital_status <chr>, monthly_income <dbl>,
## #   num_companies_worked <dbl>, percent_salary_hike <dbl>,
## #   performance_rating <chr>, total_working_years <dbl>,
## #   work_life_balance <chr>, years_at_company <dbl>,
## #   years_since_last_promotion <dbl>
skim(hr_cleaned)
Table 1: Data summary
Name hr_cleaned
Number of rows 1470
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition 0 1 2 3 0 2 0
department 0 1 5 22 0 3 0
education 0 1 6 13 0 5 0
gender 0 1 4 6 0 2 0
job_role 0 1 7 25 0 9 0
environment_satisfaction 0 1 3 9 0 4 0
job_satisfaction 0 1 3 9 0 4 0
marital_status 0 1 6 8 0 3 0
performance_rating 0 1 9 11 0 2 0
work_life_balance 0 1 3 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 36.92 9.14 18 30 36 43 60 ▂▇▇▃▂
daily_rate 0 1 802.49 403.51 102 465 802 1157 1499 ▇▇▇▇▇
distance_from_home 0 1 9.19 8.11 1 2 7 14 29 ▇▅▂▂▂
monthly_income 0 1 6502.93 4707.96 1009 2911 4919 8379 19999 ▇▅▂▁▂
num_companies_worked 0 1 2.69 2.50 0 1 2 4 9 ▇▃▂▂▁
percent_salary_hike 0 1 15.21 3.66 11 12 14 18 25 ▇▅▃▂▁
total_working_years 0 1 11.28 7.78 0 6 10 15 40 ▇▇▂▁▁
years_at_company 0 1 7.01 6.13 0 3 5 9 40 ▇▂▁▁▁
years_since_last_promotion 0 1 2.19 3.22 0 0 1 3 15 ▇▁▁▁▁

The hr_cleaned dataset holds fictional information of a total of 1470 employees on variables such as the employees’ education, job specifics, performance and attrition. All in all, there are 9 numeric and 10 non-numeric variables

Given the wide range of information, one can try to make inferences about dependencies among variables. For example, it might be interesting to find out if a combination of variables might be a good predictor for an employee’s decision to leave the company (attrition = Yes).

  • How often do people leave the company (attrition)
number_attrition<-hr_cleaned%>%
  group_by(attrition)%>%
  summarize(number_count = count(attrition))%>%
  mutate(perc = number_count/sum(number_count)) #we introduce a second column, namely percent, to represent the count as a percentage of the total

number_attrition
## # A tibble: 2 × 3
##   attrition number_count  perc
##   <chr>            <int> <dbl>
## 1 No                1233 0.839
## 2 Yes                237 0.161

Looking at the sheer counts of YES and NO entries for the attrition column, around 237 of all employees have left the company. This corresponds to a percentage of around 16% of all recorded employees. At the same time, 1233 employees have been staying with the company.

In addition, it might be interesting to plot the distribution of age faceted by the attrition value to see whether one can make any inferences from that relationship.

Looking at the two density plots, it is observable that values in the age variable tend to be lower for Attrition = Yes. One might speculate that younger employees tend to leave the company more often than older employees, because, on average, it is easier for them to find and adapt to a new job. Furthermore, as the analysis further below shows, compared to their older colleagues, they tend to have a lower salary, giving them less reason to stay with the company.

  • How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
n<-length(hr_cleaned$age) 

pic_data<-data.frame(cbind(c(hr_cleaned$age,hr_cleaned$monthly_income, 
                  hr_cleaned$years_since_last_promotion), 
                c(rep("age",n),rep("monthly_income",n),rep("years_since_last_promotion",n)))) 
colnames(pic_data)<-c("Num","Attribute") 
pic_data$Num<-as.integer(pic_data$Num) 

distribution<-ggplot(pic_data, aes(x=Num))+ 
  geom_histogram()+ 
  facet_wrap(vars(Attribute), scale="free")+ 
  labs(title = "Distribution across the dataset", subtitle = "Counts of values for age, monthly income and years since last promotion", x="",y="Frequency")+ 
  
  theme_bw() 
distribution 

Looking at the three graphs, values for the age variable seem to be the most normally distributed, even though the distribution is skewed slightly to the right. This can be explained by the fact that the majority of the workforce lies in the age group 25-40. The remaining two graphs are skewed heavily to the right. The skewness of monthly income can be explained by the fact that most salaries tend to center around the mean, while there are large outliers on the higher side of income. Similarly, the distribution of years_since_last_promotion is in accordance with the frequency of employee promotion; people usually get promoted every 1-3 years.

  • How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total.
hr_satis <- hr_cleaned %>% 
  group_by(job_satisfaction) %>% 
  count(sort=TRUE) %>% 
  rename(count=n)%>%
  mutate(percent=count/1470)%>%
  mutate(percent = percent(percent))


hr_satis 
## # A tibble: 4 × 3
## # Groups:   job_satisfaction [4]
##   job_satisfaction count percent
##   <chr>            <int> <chr>  
## 1 Very High          459 31%    
## 2 High               442 30%    
## 3 Low                289 20%    
## 4 Medium             280 19%
work_life_satis <- hr_cleaned %>% 
  group_by(work_life_balance) %>% 
  count(sort=TRUE) %>% 
  rename(count=n)%>%
  mutate(percent=count/1470)%>%
  mutate(percent = percent(percent))
 
work_life_satis 
## # A tibble: 4 × 3
## # Groups:   work_life_balance [4]
##   work_life_balance count percent
##   <chr>             <int> <chr>  
## 1 Better              893 61%    
## 2 Good                344 23%    
## 3 Best                153 10%    
## 4 Bad                  80 5%

As for job satisfaction, 61% of employees reported that they are very highly or highly satisfied with their work while 19% of employees showed a low job satisfaction level. In general, job satisfaction levels seem to be evenly distributed throughout employees.

As for work-life balance,the majority (60%) of employees indicate that they have a better balance and only 5% think their work-life balance is bad.

  • Is there any relationship between monthly_income and education? monthly_income and gender?
plot1 <- ggplot(hr_cleaned, aes(x = fct_reorder(education, -monthly_income), y=monthly_income))+ #we include a negative "-" sign infront of monthly_income in the fct_reorder function, in order to sort the x axis based on decreasing y values
  geom_boxplot()+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Education vs income", subtitle = "A look at an employee's academic background and monthly income", x = "Education", y = "Monthly income")+
  
  theme_bw()
plot1 

plot2 <-ggplot(hr_cleaned, aes(x = fct_reorder(gender, -monthly_income), y=monthly_income))+ 
  geom_boxplot()+
  
  scale_y_continuous(label = dollar)+
  
  labs(title = "Gender vs income", subtitle = "A look at an employee's sex and monthly income", x = "Gender", y = "Monthly income")+
  
  theme_bw()
plot2 

The average monthly income seems to be based on the level of education. We can infer from the box plot that higher the education level, the higher the monthly income. There also exist a couple of outliers within the dataset - as some masters students tend to be paid higher than others. This could be relative for specific schools / universities and exceptional candidates that are paid highly for particular roles.

Interestingly, salaries don’t differ across genders as may be expected.

  • Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first
income_vs_job <- ggplot(hr_cleaned, aes(x=fct_reorder(job_role, -monthly_income), y=monthly_income)) + 
  geom_boxplot()+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Income distribution by job", subtitle = "A look at the relationship between job titles and monthly income", x = "Job" , y="Monthly income")+
  
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) #we adjust the labels on the x axis such that they are vertically aligned

  

income_vs_job

Across job roles, salaries differ significantly. For instance the monthly median salary for Managers is $17,500 while the one for Sales Representatives is just above $2,500. In addition, one can observe three salary groups: 1) around $2,700 2) around $7,000 and 3) around $17,000)

  • Calculate and plot a bar chart of the mean (or median?) income by education level.
#We opted for plotting the MEDIAN monthly income, as it is typically less prone to outliers

median_income <- hr_cleaned%>%
  group_by(education)%>%
  summarise(median_monthly_income = median(monthly_income))%>%
  
  ggplot(aes(x = fct_reorder(education, -median_monthly_income), y = median_monthly_income))+
    geom_col(fill = "lightblue")+
  
    scale_y_continuous(labels = dollar)+
  
    labs(title = "Median income by education", subtitle = "A look at the relationship between academic background and median monthly income", x = "Education" , y="Median monthly income")+
  
    theme_bw()

median_income

The median salaries across education levels are aligned with the distribution of the salaries. PhDs have the highest median salary while employees without a college degree have the lowest median salary. Interestingly, the median salary for College graduates is slightly higher than the one for Bachelor graduates, given that the higher qualification is a Bachelors degree.

  • Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
plot7 <- hr_cleaned %>%  
  group_by(education) %>%  
  ggplot(aes(x=monthly_income))+ 
    geom_histogram(fill = "lightblue")+
    facet_wrap(vars(education), scales = "free_y")+
  
    scale_x_continuous(labels = dollar)+
  
    labs(title = "Income distribution by education", subtitle = "A look at the relationship between edcuation level and monthly income", x = "Monthly income" , y="Count")+
  
    ggthemes::theme_stata()+
    theme(plot.background = element_blank()) #we remove the blue background to make the design more coherent
plot7 

Even though the total number of employees differs heavily across the educational backgrounds, the overall distribution stays similar. One can observe a reduction in counts of employees for salaries between $12,000 and $17,000. The frequency of employees tends to increase at the higher end of monthly income levels.

A similar picture is shown in the following density plot:

  • Plot income vs age, faceted by job_role
plot8 <- ggplot(hr_cleaned, aes(x=age, y=monthly_income))+ 
  geom_point(size=0.1)+ 
  geom_smooth(method = "lm")+
  facet_wrap(vars(job_role), scales = "free")+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Income by age and job title", subtitle = "A look at the relationship between age, job title and monthly income level", x = "Age" , y="Monthly income")+
  
  ggthemes::theme_stata()+
  theme(plot.background = element_blank())

plot8 

The positive relationship between age and income stays consistent across job titles, i.e. the higher the age, the higher the monthly income. The strength of relationship, however, varies. For instance, the increase in salaries for research directors is higher than for research scientists.