Data Mining in R

This post describes an analysis of an online news dataset. We perform data cleaning, data transformation, and dimensionality reduction, then fit supervised and unsupervised models, such as decision trees, clustering, and logistic regression, to assess how accurately they predict the popularity of the news.


Introduction


For this analysis I have chosen a dataset with features about news published on the website www.mashable.com. The dataset is available at the following address: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The dataset contains parameters collected from the published news articles; its usefulness lies in being able to build predictive models of the popularity of other news based on these parameters.

Popularity is based on the number of times the page is shared, indicated in the "shares" column of the dataset.

The reasons for choosing this dataset are several: it has enough variables to make dimensionality reduction worthwhile, and it contains many continuous variables, which allows discretization to be performed. It also supports supervised models, since a target variable is available, as well as unsupervised models, by ignoring it.

The variables contained in the dataset are the following (in summary):

  • url: URL of the article.
  • timedelta: days between the article's publication and the dataset's collection.
  • Number of words, unique words, stop words (prepositions, pronouns, articles), and unique stop words.
  • Number of links, and links to other pages on the same site.
  • Number of images and videos.
  • Average word length.
  • Number of keywords.
  • Type of channel where the article was published.
  • Keyword rankings (best, worst, and average).
  • Maximum number of keywords.
  • Maximum, minimum, and average number of references to the article from the same site.
  • Day of the week on which the article was published.
  • Metrics of the article's category model (LDA).
  • Other sentiment-analysis metrics, such as positive or negative polarity, or subjectivity.
  • Number of times the article has been shared (this will be the target variable for determining popularity).

We start by reading the dataset and displaying a summary of the data. All variables except the url are numeric; some are binary, indicating whether the article was published on a certain day of the week or belongs to a particular channel type. Most are continuous; some, like subjectivity and polarity, have a defined range between 0 and 1 or between -1 and 1, while others, like the number of words or keywords, take a wider range of values.

data<-read.csv('../Datos/OnlineNewsPopularity.csv')
summary(data)
##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39638  
##    timedelta     n_tokens_title n_tokens_content n_unique_tokens   
##  Min.   :  8.0   Min.   : 2.0   Min.   :   0.0   Min.   :  0.0000  
##  1st Qu.:164.0   1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4709  
##  Median :339.0   Median :10.0   Median : 409.0   Median :  0.5392  
##  Mean   :354.5   Mean   :10.4   Mean   : 546.5   Mean   :  0.5482  
##  3rd Qu.:542.0   3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6087  
##  Max.   :731.0   Max.   :23.0   Max.   :8474.0   Max.   :701.0000  
##                                                                    
##  n_non_stop_words    n_non_stop_unique_tokens   num_hrefs     
##  Min.   :   0.0000   Min.   :  0.0000         Min.   :  0.00  
##  1st Qu.:   1.0000   1st Qu.:  0.6257         1st Qu.:  4.00  
##  Median :   1.0000   Median :  0.6905         Median :  8.00  
##  Mean   :   0.9965   Mean   :  0.6892         Mean   : 10.88  
##  3rd Qu.:   1.0000   3rd Qu.:  0.7546         3rd Qu.: 14.00  
##  Max.   :1042.0000   Max.   :650.0000         Max.   :304.00  
##                                                               
##  num_self_hrefs       num_imgs         num_videos    average_token_length
##  Min.   :  0.000   Min.   :  0.000   Min.   : 0.00   Min.   :0.000       
##  1st Qu.:  1.000   1st Qu.:  1.000   1st Qu.: 0.00   1st Qu.:4.478       
##  Median :  3.000   Median :  1.000   Median : 0.00   Median :4.664       
##  Mean   :  3.294   Mean   :  4.544   Mean   : 1.25   Mean   :4.548       
##  3rd Qu.:  4.000   3rd Qu.:  4.000   3rd Qu.: 1.00   3rd Qu.:4.855       
##  Max.   :116.000   Max.   :128.000   Max.   :91.00   Max.   :8.042       
##                                                                          
##   num_keywords    data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   : 1.000   Min.   :0.00000           Min.   :0.000                
##  1st Qu.: 6.000   1st Qu.:0.00000           1st Qu.:0.000                
##  Median : 7.000   Median :0.00000           Median :0.000                
##  Mean   : 7.224   Mean   :0.05295           Mean   :0.178                
##  3rd Qu.: 9.000   3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :10.000   Max.   :1.00000           Max.   :1.000                
##                                                                          
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.0000         Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.0000         1st Qu.:0.0000      
##  Median :0.0000      Median :0.0000         Median :0.0000      
##  Mean   :0.1579      Mean   :0.0586         Mean   :0.1853      
##  3rd Qu.:0.0000      3rd Qu.:0.0000         3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.0000         Max.   :1.0000      
##                                                                 
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   445   1st Qu.:  141.8  
##  Median :0.0000        Median : -1.00   Median :   660   Median :  235.5  
##  Mean   :0.2126        Mean   : 26.11   Mean   :  1154   Mean   :  312.4  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  357.0  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##                                                                           
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172847   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :244572   Median :1024  
##  Mean   : 13612   Mean   :752324   Mean   :259282   Mean   :1117  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:330980   3rd Qu.:2057  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##                                                                   
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3562   1st Qu.: 2382   1st Qu.:   639           
##  Median :  4356   Median : 2870   Median :  1200           
##  Mean   :  5657   Mean   : 3136   Mean   :  3999           
##  3rd Qu.:  6020   3rd Qu.: 3600   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##                                                            
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0.0           Min.   :0.000    
##  1st Qu.:  1100            1st Qu.:   981.2           1st Qu.:0.000    
##  Median :  2800            Median :  2200.0           Median :0.000    
##  Mean   : 10329            Mean   :  6401.7           Mean   :0.168    
##  3rd Qu.:  8000            3rd Qu.:  5200.0           3rd Qu.:0.000    
##  Max.   :843300            Max.   :843300.0           Max.   :1.000    
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1864     Mean   :0.1875       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1438    Mean   :0.06188     Mean   :0.06904   Mean   :0.1309  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##                                                                          
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18460   Mean   :0.14126   Mean   :0.21632   Mean   :0.22377  
##  3rd Qu.:0.24096   3rd Qu.:0.15083   3rd Qu.:0.33422   3rd Qu.:0.37576  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.92653  
##                                                                         
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.39375         
##  1st Qu.:0.02857   1st Qu.:0.3962      1st Qu.: 0.05776         
##  Median :0.04073   Median :0.4535      Median : 0.11912         
##  Mean   :0.23403   Mean   :0.4434      Mean   : 0.11931         
##  3rd Qu.:0.39999   3rd Qu.:0.5083      3rd Qu.: 0.17783         
##  Max.   :0.92719   Max.   :1.0000      Max.   : 0.72784         
##                                                                 
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02838            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03902            Median :0.015337           Median :0.7105     
##  Mean   :0.03962            Mean   :0.016612           Mean   :0.6822     
##  3rd Qu.:0.05028            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15549            Max.   :0.184932           Max.   :1.0000     
##                                                                           
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3062        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3588        Median :0.10000      
##  Mean   :0.2879      Mean   :0.3538        Mean   :0.09545      
##  3rd Qu.:0.3846      3rd Qu.:0.4114        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##                                                                 
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3284       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2533       Median :-0.5000      
##  Mean   :0.7567        Mean   :-0.2595       Mean   :-0.5219      
##  3rd Qu.:1.0000        3rd Qu.:-0.1869       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##                                                                   
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1500     Median : 0.00000        
##  Mean   :-0.1075       Mean   :0.2824     Mean   : 0.07143        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.15000        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##                                                                   
##  abs_title_subjectivity abs_title_sentiment_polarity     shares      
##  Min.   :0.0000         Min.   :0.0000               Min.   :     1  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3418         Mean   :0.1561               Mean   :  3395  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :843300  
## 
str(data)
## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

Data cleaning


We start by saving the original dataset, in case we need it later.

dataorig<-data

Let's check whether there are null or empty values in the dataset. None are found, so the dataset is ready for the rest of the analysis.

colSums(is.na(data))
##                           url                     timedelta 
##                             0                             0 
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                        shares 
##                             0
colSums(data=="")
##                           url                     timedelta 
##                             0                             0 
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                        shares 
##                             0

We check the variables that should be binary to verify that they take no value outside their domain (0 or 1); in this case everything is correct.

apply(data[,c(14:19,32:39)],2,function(x) levels(as.factor(x)))
##      data_channel_is_lifestyle data_channel_is_entertainment
## [1,] "0"                       "0"                          
## [2,] "1"                       "1"                          
##      data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## [1,] "0"                 "0"                    "0"                 
## [2,] "1"                 "1"                    "1"                 
##      data_channel_is_world weekday_is_monday weekday_is_tuesday
## [1,] "0"                   "0"               "0"               
## [2,] "1"                   "1"               "1"               
##      weekday_is_wednesday weekday_is_thursday weekday_is_friday
## [1,] "0"                  "0"                 "0"              
## [2,] "1"                  "1"                 "1"              
##      weekday_is_saturday weekday_is_sunday is_weekend
## [1,] "0"                 "0"               "0"       
## [2,] "1"                 "1"               "1"

It seems that the variable is_weekend is redundant: it conveys the same information as weekday_is_saturday and weekday_is_sunday combined. Let's remove it.
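Before dropping it, we can verify the redundancy. A minimal sketch with a toy data frame of hypothetical rows; the same row-wise comparison applies to the full dataset:

```r
# Toy frame (hypothetical values): is_weekend should equal the element-wise
# OR (here via pmax) of the Saturday and Sunday indicators.
df <- data.frame(
  weekday_is_saturday = c(1, 0, 0, 0),
  weekday_is_sunday   = c(0, 1, 0, 0),
  is_weekend          = c(1, 1, 0, 0)
)
redundant <- all(df$is_weekend == pmax(df$weekday_is_saturday, df$weekday_is_sunday))
redundant  # TRUE
```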

data$is_weekend<-NULL

Let's now check that the binary variables that should be mutually exclusive never have several "1" values for the same record, nor all values at 0.

#Day of the week
nrow(data[rowSums(data[,c(32:38)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(32:38)])<1,])
## [1] 0
#Channel type
nrow(data[rowSums(data[,c(14:19)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(14:19)])<1,])
## [1] 6134

We find 6134 rows without any channel type assigned. Rather than delete them, we create a new variable, data_channel_is_other, with value 1 in these rows and 0 elsewhere, so that these records are also classified.

#Create column with value 0
data$data_channel_is_other<-0

#Assign value 1 to rows without an assigned channel type
data[rowSums(data[,c(14:19)])<1,]$data_channel_is_other<-1
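After this step, every row should have exactly one channel indicator set to 1. A minimal sketch of the check on a toy frame with hypothetical values; on the real data the same rowSums test would be run over the seven channel columns:

```r
# Toy channel dummies (hypothetical): each row must sum to exactly 1.
ch <- data.frame(
  data_channel_is_bus   = c(1, 0, 0),
  data_channel_is_tech  = c(0, 1, 0),
  data_channel_is_other = c(0, 0, 1)
)
one_channel_each <- all(rowSums(ch) == 1)
one_channel_each  # TRUE
```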

Let's look at the correlations between the continuous variables. Given the number of variables, we replace the row and column names with numbers so that the plot is somewhat clearer.

library(corrplot)
M<-cor(data[,c(3:13,20:31,40:60)])

#Save the variable names
coln<-colnames(M)
rown<-rownames(M)
colnames(M)<-1:44
rownames(M)<-1:44
corrplot(M, type = "upper", method = "circle", tl.cex = 0.6)

We see a high correlation between some variables; let's redraw the plot with only the rows and columns of interest to see it better.

#Restore the variable names to make the matrix clearer.

colnames(M)<-coln
rownames(M)<-rown
corrplot(M[c(3:5,12,13,19,29),c(3:5,14,16,20,32)], type = "upper", method = "number", tl.cex = 0.9)

The variables n_unique_tokens, n_non_stop_words and n_non_stop_unique_tokens show maximum correlation. The first refers to the total number of different words; the other two refer to empty, meaningless words (prepositions, articles or pronouns). The latter seem to add no information beyond the total number of different words, so we remove them. The remaining variables, even where highly correlated, are not perfectly correlated (1), so we keep them, since we will apply the SVD algorithm to them later.

data$n_non_stop_words<-NULL
data$n_non_stop_unique_tokens<-NULL

Descriptive Analysis


We are going to perform a descriptive analysis of the dataset.

First, let's look at the distributions of the variables. After studying all of them, and given how many there are, for clarity we show only the most representative ones; the rest have distributions similar to one of these.

We can observe that the distributions are quite heterogeneous: none is normal, although some come close. Some variables concentrate most of their density in a narrow range, with far less density over the rest of their values.

library(dplyr)
library(ggplot2)

dens <- lapply(colnames(data[,c(3,4,10,11,23,40,42,46,47,49,52)]), function(cn)
  ggplot(data, aes(x = .data[[cn]])) +
    geom_density(color = "darkblue", fill = "lightblue") +
    labs(x = cn))
gridExtra::marrangeGrob(grobs=dens, nrow=2, ncol=2)