# Descriptive analysis in R

This post walks through a simple descriptive statistical analysis of the Mid-Atlantic Wage data set, using a few boxplots and a visual check for normality.

The dataset can be found here: https://github.com/selva86/datasets/blob/master/Wage.csv

The fields in the data are the following:

• year: Year when the data was collected.
• maritl: marital status: 1. Never Married, 2. Married, 3. Widowed, 4. Divorced, and 5. Separated.
• age: worker’s age.
• race: 1. White, 2. Black, 3. Asian, and 4. Other.
• education: 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, and 5. Advanced Degree.
• region: Always Mid-Atlantic.
• jobclass: Job type 1. Industrial, 2. Information.
• health: Health status: 1. <=Good, 2. >=Very Good.
• health_ins: Health insurance 1. Yes, 2. No.
• logwage: logarithm of wage.
• wage: wage (in $1000s).

## 1 Data analysis

Reading the file:

data<-read.csv2("./Wage.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Summary of the variable types and levels:

str(data)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 1 level "2. Middle Atlantic": 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : Factor w/ 508 levels "3","3.04139268515822",..: 126 105 354 426 126 342 452 287 315 347 ...
##  $ wage      : Factor w/ 508 levels "100.013486924706",..: 397 376 117 189 397 105 215 50 78 110 ...
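
The factor type reported for `logwage` and `wage` is most likely a side effect of `read.csv2`, which by default expects a decimal comma (`dec = ","`), while this file uses a decimal point. The following is a minimal sketch of that behaviour, not part of the original analysis: it uses a tiny inline CSV with made-up rows (the `csv_text`, `with_csv2` and `with_csv` names are illustrative assumptions) and compares `read.csv2` with `read.csv`, which expects a decimal point.

```r
# Tiny inline CSV standing in for Wage.csv (made-up rows, same number format).
csv_text <- "year,age,logwage,wage
2006,18,4.318063,75.04315
2004,24,4.255273,70.47602"

# read.csv2 defaults to dec = ",", so values such as "4.318063" are not
# recognised as numbers and the column ends up as a factor.
with_csv2 <- read.csv2(text = csv_text, header = TRUE, sep = ",",
                       stringsAsFactors = TRUE)

# read.csv defaults to sep = "," and dec = ".", so the same column is numeric.
with_csv <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(with_csv2$wage)  # "factor"
class(with_csv$wage)   # "numeric"

# Converting the factor through character recovers the numeric values;
# as.numeric() applied directly to a factor would return the level codes.
recovered <- as.numeric(as.character(with_csv2$wage))
all.equal(recovered, with_csv$wage)
```

Reading the real file with `read.csv("./Wage.csv", stringsAsFactors = TRUE)` should therefore make the later conversion step unnecessary, although this has not been checked against the actual file here.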

Although logwage and wage have been imported as factors, they are clearly numerical, so we convert both variables and check the dataset again with str.

data$logwage<-as.numeric(as.character(data$logwage))
data$wage<-as.numeric(as.character(data$wage))
str(data)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 1 level "2. Middle Atlantic": 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

## 2 Descriptive Analysis and Visualization

### 2.1 Descriptive Analysis

The first thing to do is to show a statistical summary of the data.

summary(data)
##       year           age                     maritl           race
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55
##  Max.   :2009   Max.   :80.00
##               education                    region               jobclass
##  1. < HS Grad      :268   2. Middle Atlantic:3000   1. Industrial :1544
##  2. HS Grad        :971                             2. Information:1456
##  3. Some College   :650
##  4. College Grad   :685
##  5. Advanced Degree:426
##
##             health      health_ins      logwage           wage
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38
##                                      Median :4.653   Median :104.92
##                                      Mean   :4.654   Mean   :111.70
##                                      3rd Qu.:4.857   3rd Qu.:128.68
##                                      Max.   :5.763   Max.   :318.34

We start with the numerical variables. The mean and the median of year are the same (2006), which suggests a symmetric distribution. The same holds for logwage, where the two values are not identical but very close (4.654 vs. 4.653). Age and wage look more skewed, since the gap between their mean and median is larger (42.41 vs. 42.00 and 111.70 vs. 104.92).

The summary also shows the levels of the factor variables and the number of observations per level. The variables health and health_ins are less balanced than jobclass, and region has only one value.

### 2.2 Visualization

Let's start with some boxplots to check the distribution of the variables and look for outliers.

#### 2.2.1 Race vs age

As the first boxplot shows, the distribution of age is similar across races, and there are only two outliers.

plot(x=data$race,y=data$age, xlab = "race", ylab = "age")

#### 2.2.2 Jobclass vs age

The distribution of age is also similar across job classes, with some outliers in both cases. The median age (the central line of each box) of people in the industrial job class seems slightly lower than that of people in the information job class.

plot(data$jobclass, data$age, xlab = "jobclass", ylab = "age")

#### 2.2.3 Health status vs age

The ages of people with very good health seem lower than those of people with good or worse health.

plot(data$health, data$age, xlab = "health", ylab = "age")

#### 2.2.4 Health insurance vs age

Likewise, the median age of people without health insurance seems lower than that of people with health insurance.

plot(data$health_ins, data$age, xlab = "health_ins", ylab = "age")

### 2.3 Normality test

Let's check visually whether the wage variable follows a normal distribution. First, we create a density plot; the distribution does not look normal.

library(ggplot2)
ggplot(data, aes(x=wage)) + geom_density()

Let's perform a second check with a Q-Q plot, which compares the data points against the quantiles of a normal distribution. The data points move away from the reference line, so we can say that the wage variable does not have a normal distribution.

library(car)
qqPlot(data$wage)

## [1]  207 1230
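
The two visual checks can be complemented with a formal test. The following is a minimal sketch, not part of the original analysis, using the Shapiro-Wilk test (`shapiro.test` from base R's stats package). To stay self-contained it runs on a simulated right-skewed sample rather than on `data$wage`; the name `wage_sim` and the log-normal parameters are arbitrary assumptions chosen only to mimic a skewed wage-like variable.

```r
set.seed(1)

# Simulated right-skewed sample standing in for data$wage (NOT the real data).
# shapiro.test() accepts between 3 and 5000 observations.
wage_sim <- rlnorm(3000, meanlog = 4.65, sdlog = 0.35)

# Null hypothesis: the sample comes from a normal distribution.
res_raw <- shapiro.test(wage_sim)

# The same test on the log scale, analogous to logwage.
res_log <- shapiro.test(log(wage_sim))

res_raw$p.value  # a small p-value is evidence against normality
res_log$p.value
```

With the real data the call would be `shapiro.test(data$wage)`. A small p-value would support the conclusion drawn from the density plot and the Q-Q plot, keeping in mind that with 3000 observations the test can flag even minor departures from normality.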