Inferential analysis in R – Data Science Portfolio

This post continues the analysis of the Mid-Atlantic Wage dataset by performing some inferential statistics with language R.

The data used is the same as in the post about Descriptive Analysis with R.

We start reading the dataset and performing the transformations described in that article:

dordorica_Actividad_4_Publish_inferential.utf8

data<-read.csv2("./Wage.csv",header = TRUE, sep = ",", stringsAsFactors = TRUE )

data$logwage<-as.numeric(as.character(data$logwage))
data$wage<-as.numeric(as.character(data$wage))

Age Confidence Interval

Let’s calculate the 95% confidence interval of the mean age variable, that means that we can be 95% certain that the mean of the population of the study is between the resultant values.

Let’s check the normality of the variable with shapiro test and visually with qqplot.

shapiro.test(data$age)

## 
##  Shapiro-Wilk normality test
## 
## data:  data$age
## W = 0.99106, p-value = 9.416e-13

qqPlot(data$age)

## [1]  329 1566

According to the previous tests, the distribution of the data is not normal, but as we have a sufficiently large sample size, we can consider the data as normally distributed by applying the Central Limit Theorem, so we can use t-test to calcualte the confidence interval.

We start calculating standard deviation, degrees of freedom and standard error.

#Calculate standard deviation using sd function that substract 1 from the divisor
s = sd(data$age)
paste('Standard deviation:',s,sep=' ')

## [1] "Standard deviation: 11.5424056099517"

paste('Degrees of freedom:',length(data$age)-1, sep=' ')

## [1] "Degrees of freedom: 2999"

#Calculate the standard error
es = s/sqrt(length(data$age))
paste('Standard error:',es,sep=' ')

## [1] "Standard error: 0.21073453068149"

Next, we calculate the t critical value. As the contrast is bilateral we have to divide by 2 the confidence.

qt(0.025,length(data$age)-1)

## [1] -1.960755

\[t_{0.025,204} = \pm1.960755\]

Next, we calculate margin of error and mean.

t=1.960755
#El margen de error es t * el error estÃ¡ndar
ME=t*es
paste("Margin of error:",ME,sep=' ')

## [1] "Margin of error: 0.413198784706385"

mean1 = mean(data$age)
paste ("Mean:",mean1,sep= ' ')

## [1] "Mean: 42.4146666666667"

interval1=mean1-ME
paste("Interval-1:",interval1,sep = ' ')

## [1] "Interval-1: 42.0014678819603"

interval2=mean1+ME
paste("Interval-2",interval2,sep= ' ')

## [1] "Interval-2 42.8278654513731"

So the confidence interval is:

\[(42.41467-0.4131988,42.41467+0.4131988) = (42.00147,42.82787)\]

We can be 95% certain that the mean age of the population is between 42.00147 y 42.82787

Let’s repeat the exercise to calculate separated confidence intervals for the jobclass, so we can see which is the difference between the mean age of the population by this parameter.

First, we take from the dataset only samples from industrial jobclass and perform the same calculations.

dataindustrial<-data[data$jobclass=='1. Industrial',]


s = sd(dataindustrial$age)
paste('Standard deviation:',s,sep=' ' )

## [1] "Standard deviation: 11.6915431496252"

## [1] 11.69154

paste('Degrees of freedom:',length(dataindustrial$age)-1,sep=' ')

## [1] "Degrees of freedom: 1543"

es = s/sqrt(length(dataindustrial$age))
paste('Standard error:',es,sep=' ')

## [1] "Standard error: 0.297541938981843"

t critical value.

qt(0.025,length(dataindustrial$age)-1)

## [1] -1.961503

\[t_{0.025,1543} = \pm1.961503\]

Margin of error and mean.

t=1.961503
#El margen de error es z * el error estÃ¡ndar
ME=t*es
paste("Margin of error:",ME,sep=' ')

## [1] "Margin of error: 0.583629405938702"

mean1 = mean(dataindustrial$age)
paste("Mean:",mean1,sep=' ')

## [1] "Mean: 41.3983160621762"

paste("Interval-1:",interval1,sep=' ')

## [1] "Interval-1: 42.0014678819603"

interval2=mean1+ME
paste("Interval-2:",interval2,sep=' ')

## [1] "Interval-2: 41.9819454681149"

So the 95% confidence interval of the mean age of the population with industrial jobs is:

\[(41.39832-0.5836294,41.39832+0.5836294) = (40.81469,41.98195)\]

Now, we repeat the process with the information jobs:

datainformation<-data[data$jobclass=='2. Information',]

s = sd(datainformation$age)
paste('Standard deviation:',s,sep=' ')

## [1] "Standard deviation: 11.2865189449133"

paste('Degrees of freedom:',length(datainformation$age)-1,sep=' ')

## [1] "Degrees of freedom: 1455"

es = s/sqrt(length(datainformation$age))
paste('Standard error:',es,sep=' ')

## [1] "Standard error: 0.295787166733181"

qt(0.025,1455)

## [1] -1.961596

\[t_{0.025,1455} = \pm1.961596\]

Next we calculate the margin of error and the mean:

t=1.961596
#El margen de error es z * el error estÃ¡ndar
ME=t*es
paste("Margin of error:",ME,sep=' ')

## [1] "Margin of error: 0.580214923115141"

mean1 = mean(datainformation$age)
paste("Mean:",mean1,sep=' ')

## [1] "Mean: 43.4924450549451"

interval1=mean1-ME
paste("Interval-1:",interval1,sep=' ')

## [1] "Interval-1: 42.9122301318299"

interval2=mean1+ME
paste("Interval-2",interval2,sep=' ')

## [1] "Interval-2 44.0726599780602"

So the 95% confidence interval is:

\[(43.49245-0.5802149,43.49245+0.5802149) = (42.91223,44.07266)\]

In conclusion, we can say with a 95% of confidence that the mean age of the population with information jobs and the mean of age of the population with industrial jobs are not overlapped and the mean age of the population with information jobs is greater than the mean age of the population with information jobs.

Mean difference contrast

Let’s check if the mean wage of the workers with a health insurance is 20k$ greater than the mean wage of the workers without a health insurance. We will use a 95% confidence level.

Having m1 as the mean wage of the workers with a health insurance and m2 as the mean wage without a health insurance, the null and alternative hypotheses would be:

Null hypotheses: m1-m2+20 > 0 Alternative hypotheses: m1-m2-20 <= 0

We consider wage variable as normally distributed, so first we have to check if the variances of both populations are equal or not. We have to apply Snedecor F test

First of all we spli the data to separate both groups:

data_healthins<-data[data$health_ins=='1. Yes',]
data_no_healthins<-data[data$health_ins=='2. No',]

Apply the Sndecor F test:

#Calculate degrees of freedom
m=nrow(data_healthins)-1
n=nrow(data_no_healthins)-1

paste("Degrees of freedom:",m,',',n,sep=' ')

## [1] "Degrees of freedom: 2082 , 916"

#Calculate variances
sd2Healthins=sd(data_healthins$wage)^2
sd2_No_Healthins=sd(data_no_healthins$wage)^2

paste("Variance 1:",sd2Healthins,sep=' ')

## [1] "Variance 1: 1700.48872792992"

paste("Variance 2:",sd2_No_Healthins,sep=' ')

## [1] "Variance 2: 1293.97780299454"

f=sd2Healthins/sd2_No_Healthins
paste("Test Statistic:",f,sep=' ')

## [1] "Test Statistic: 1.31415602647482"

Critical values calculation.

qf(0.025,2082,916)

## [1] 0.8969443

qf(0.975,2082,916)

## [1] 1.117556

We need a 95% confidence, and the test is bilateral, so we look for the critical values in the 0.025 table: \[f({0.025,2082,916})\simeq0.90\] \[f({0.975,2082,916})\simeq1.11\]

The test statistic is outside the acceptance region so we must reject the null hypothesis, so we cannot infer that both variances are equal.

To continue with the test, we must apply a t-test, but we must estimate the degrees of freedom because the population variances are unknown and different.

#In order to get the degrees of freedom we must apply an estimation.

num = (sd2Healthins/nrow(data_healthins)+sd2_No_Healthins/nrow(data_no_healthins))^2
denom = (1/(nrow(data_healthins)-1)*((sd2Healthins/nrow(data_healthins))^2))+(1/(nrow(data_no_healthins)-1)*((sd2_No_Healthins/nrow(data_no_healthins))^2))
df=num/denom
paste("Degrees of freedom:",df,sep=' ')

## [1] "Degrees of freedom: 1989.49191186935"

#now, we calculate the test statistic 

m1=mean(data_healthins$wage)
m2=mean(data_no_healthins$wage)
paste("Mean with a health insurance:",m1,sep=' ')

## [1] "Mean with a health insurance: 120.238313749863"

paste("Mean without a health insurance:",m2,sep=' ')

## [1] "Mean without a health insurance: 92.3167034506711"

#Test statistics

t=(m1-m2-20)/sqrt((sd2Healthins/nrow(data_healthins))+(sd2_No_Healthins/nrow(data_no_healthins)))
paste("Test statistic:",t,sep=' ')

## [1] "Test statistic: 5.30772122287136"

cv=qt(0.05,df,lower.tail = FALSE)
paste("Critical value:",cv,sep=' ')

## [1] "Critical value: 1.6456198945794"

P-value calculation.

pv=pt(5.307721,df,lower.tail=FALSE)
paste("P-value:",pv,sep=' ')

## [1] "P-value: 6.1692031996626e-08"

The test statistic is greater than the critical value, and the p-value is smaller than 0.05 so, we can reject the null hypothesis. We cannot ensure that the wage mean of the workers with a health insurance is 20k$ greater than the mean wage of the workers without one.