Descriptive analysis in R – Data Science Portfolio

This post walks through a simple descriptive statistical analysis of the Mid-Atlantic Wage Data, drawing some boxplots and checking whether the data are normally distributed.

The dataset can be found here: https://github.com/selva86/datasets/blob/master/Wage.csv

The fields in the data are the following:

year: Year when the data was collected.
maritl: Marital status: 1. Never Married, 2. Married, 3. Widowed, 4. Divorced, and 5. Separated.
age: Worker's age.
race: 1. White, 2. Black, 3. Asian, and 4. Other.
education: Education level: 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree.
region: Always Mid-Atlantic.
jobclass: Job type: 1. Industrial, 2. Information.
health: Health status: 1. <=Good, 2. >=Very Good.
health_ins: Health insurance: 1. Yes, 2. No.
logwage: Natural logarithm of wage.
wage: Wage (in $1000s).

1 Data analysis

Reading the file

data <- read.csv2("./Wage.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
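One detail of this call deserves a quick sketch: read.csv2() is meant for files that use a comma as the decimal mark, and overriding sep leaves dec = "," in place. That is the likely reason why columns written with a decimal point come out as factors in the str() output that follows. The toy example below uses two made-up rows in a similar layout (not the real file) and compares the call with read.csv(), which assumes a decimal point.

```r
# Two illustrative rows in a layout similar to Wage.csv (values are made up)
csv_text <- "year,age,logwage,wage\n2006,18,4.32,75.04\n2004,24,4.26,70.48"

# read.csv2() keeps dec = "," even though sep is overridden,
# so values such as "4.32" are read as text and then turned into factors
with_csv2 <- read.csv2(text = csv_text, header = TRUE, sep = ",",
                       stringsAsFactors = TRUE)

# read.csv() defaults to sep = "," and dec = ".",
# so the same text is parsed as numbers
with_csv <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(with_csv2$wage)  # "factor"
class(with_csv$wage)   # "numeric"
```

Under that assumption, reading the real file with read.csv("./Wage.csv", stringsAsFactors = TRUE) should give numeric logwage and wage columns directly.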

Summary of the variable types and levels:

str(data)

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 1 level "2. Middle Atlantic": 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : Factor w/ 508 levels "3","3.04139268515822",..: 126 105 354 426 126 342 452 287 315 347 ...
##  $ wage      : Factor w/ 508 levels "100.013486924706",..: 397 376 117 189 397 105 215 50 78 110 ...

Although logwage and wage hold numeric values, they have been imported as factors, so we convert both variables and check the dataset again with str().

data$logwage <- as.numeric(as.character(data$logwage))
data$wage <- as.numeric(as.character(data$wage))
str(data)

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 1 level "2. Middle Atlantic": 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
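The conversion goes through as.character() for a reason: calling as.numeric() directly on a factor returns the internal level codes, not the values printed as labels. A minimal sketch with a toy factor (the values are illustrative, not taken from the file):

```r
# A small factor whose labels look like numbers (toy values)
f <- factor(c("75", "70.5", "131", "75"))

# Levels are sorted as text: "131", "70.5", "75"
levels(f)

as.numeric(f)                # 3 2 1 3: internal level codes, not the values
as.numeric(as.character(f))  # 75.0 70.5 131.0 75.0: the values we want
as.numeric(levels(f))[f]     # same values, converting each level only once
```

The last form gives the same result as the one used in the post and can be somewhat faster on long vectors with few distinct levels.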

2 Descriptive Analysis and Visualization

2.1 Descriptive Analysis

The first step is to show a statistical summary of the data.

summary(data)

##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##               education                  region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic:3000   1. Industrial :1544  
##  2. HS Grad        :971                             2. Information:1456  
##  3. Some College   :650                                                  
##  4. College Grad   :685                                                  
##  5. Advanced Degree:426                                                  
##                                                                          
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34

We start with the numerical variables. The mean and the median of year are the same as printed, which suggests a roughly symmetric distribution. The same holds for logwage, where the two values differ only slightly (4.654 vs. 4.653). Age and wage look more skewed, since the gap between mean and median is larger, most clearly for wage (111.70 vs. 104.92), where a mean above the median points to a longer right tail.
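Comparing mean and median is only a rough indicator of skewness. The sketch below illustrates the idea on simulated stand-ins, not on the real data: a roughly normal variable playing the role of logwage and its exponential playing the role of wage. The mean, standard deviation, seed, and the helper skewness() are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-ins, NOT the real data
sim_logwage <- rnorm(3000, mean = 4.65, sd = 0.35)
sim_wage <- exp(sim_logwage)

# Moment-based sample skewness (hypothetical helper):
# about 0 for symmetric data, positive for a long right tail
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

compare <- data.frame(
  variable = c("sim_logwage", "sim_wage"),
  mean     = c(mean(sim_logwage), mean(sim_wage)),
  median   = c(median(sim_logwage), median(sim_wage)),
  skewness = c(skewness(sim_logwage), skewness(sim_wage))
)
print(compare)
```

For the symmetric variable the mean and median nearly coincide, while exponentiating it pulls the mean above the median, the same pattern the summary shows for logwage and wage.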

The summary also lists the levels of the factor variables and the number of observations per level. The variables health and health_ins are less balanced than jobclass, and region takes a single value.
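To put numbers on that balance, the level counts from the summary() output can be turned into shares. This small sketch hard-codes the counts printed above, so it runs without the file:

```r
# Level counts copied from the summary() output above
counts <- list(
  jobclass   = c("1. Industrial" = 1544, "2. Information" = 1456),
  health     = c("1. <=Good" = 858, "2. >=Very Good" = 2142),
  health_ins = c("1. Yes" = 2083, "2. No" = 917)
)

# Share of each level within its variable, rounded to three decimals
shares <- lapply(counts, function(n) round(n / sum(n), 3))
shares

# On the data frame itself the equivalent would be, for example:
# prop.table(table(data$health))
```

The split is about 51/49 for jobclass, 29/71 for health, and 69/31 for health_ins.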

2.2 Visualization

Let’s start with some boxplots to check the distribution of the variables and spot outliers.

2.2.1 Race vs age

As the first boxplot shows, the age distribution is similar across races, and there are only two outliers.

plot(x = data$race, y = data$age, xlab = "race", ylab = "age")
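plot() draws a boxplot here because x is a factor and y is numeric. The numbers behind such a plot, including the points flagged as outliers, can be read from boxplot() directly. The sketch below uses toy data with one planted extreme value, not the Wage dataset; the group labels, sizes, and seed are assumptions.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Toy data, NOT the Wage dataset: two groups with ages spread from 18 to 65
toy <- data.frame(
  race = factor(sample(c("1. White", "2. Black"), 200, replace = TRUE)),
  age  = round(runif(200, min = 18, max = 65))
)
toy$age[1] <- 95  # plant one value far above the rest

# plot = FALSE returns the boxplot statistics without drawing;
# drop it to get the same kind of figure as plot(data$race, data$age)
bp <- boxplot(age ~ race, data = toy, plot = FALSE)

bp$stats            # per group: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out              # values beyond the whiskers (more than 1.5 * IQR from the box)
bp$names[bp$group]  # the group each outlier belongs to
```

Run on the real data, boxplot(age ~ race, data = data, plot = FALSE)$out would list the ages drawn as individual points in the figure.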

2.2.2 Jobclass vs age

The age distribution is also similar across job classes, with some outliers in both. The median age of people with an industrial job class seems slightly lower than that of people with an information job class.

plot(data$jobclass, data$age, xlab = "jobclass", ylab = "age")

2.2.3 Health status vs age

The ages of people with a very good health status seem lower than those of people with a good or worse health status.

plot(data$health, data$age, xlab = "health", ylab = "age")

2.2.4 Health insurance vs age

Likewise, the median age of people without health insurance seems lower than that of people with health insurance.

plot(data$health_ins, data$age, xlab = "health_ins", ylab = "age")
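The thick line inside each box is the group median, not the mean, so comparisons read off a boxplot are comparisons of medians. Both statistics can be computed per group with tapply(). A minimal sketch on toy values (not the Wage dataset), chosen so that the two statistics disagree in one group:

```r
# Toy data, NOT the Wage dataset: five ages per insurance group
toy <- data.frame(
  health_ins = factor(rep(c("1. Yes", "2. No"), each = 5)),
  age        = c(30, 40, 45, 50, 60,   20, 25, 30, 35, 70)
)

# Median and mean age per group
group_median <- tapply(toy$age, toy$health_ins, median)
group_mean   <- tapply(toy$age, toy$health_ins, mean)

rbind(median = group_median, mean = group_mean)
```

On the real data, tapply(data$age, data$health_ins, median) would give the numbers behind the last boxplot.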

2.3 Normality test

Let’s check visually whether the wage variable is normally distributed.

First, we create a density plot. The distribution does not look normal.

library(ggplot2)
ggplot(data, aes(x=wage)) + 
  geom_density()

Next, we draw a Q-Q plot, which compares the quantiles of the data with those of a normal distribution. The data points move away from the reference line, so we can say that the wage variable does not follow a normal distribution. qqPlot() also prints the indices of the two most extreme observations.

library(car)
qqPlot(data$wage)

## [1]  207 1230
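Both checks above are visual. A formal test could complement them, although that step is not part of the original analysis. The sketch below applies shapiro.test() from base R's stats package to simulated stand-ins (not the real data), so the numbers it prints say nothing about the actual wages. The distribution parameters and the seed are assumptions.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated stand-ins, NOT the real data
sim_normal <- rnorm(3000, mean = 4.65, sd = 0.35)
sim_skewed <- exp(sim_normal)  # right-skewed by construction

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
# (the function accepts between 3 and 5000 observations)
sw_normal <- shapiro.test(sim_normal)
sw_skewed <- shapiro.test(sim_skewed)

# W close to 1 is consistent with normality; a small p-value rejects it
c(W_normal = unname(sw_normal$statistic),
  W_skewed = unname(sw_skewed$statistic))
sw_skewed$p.value
```

On the real data the call would be shapiro.test(data$wage). With 3000 observations even small departures from normality tend to produce small p-values, so the plots remain useful for judging how large the departure is.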