Data Mining in R


This post describes an analysis performed on an online news dataset. Data cleaning, data transformation, and dimensionality reduction are performed. Next, we try some supervised and unsupervised models, such as decision trees, clustering and logistic models, to check their accuracy in predicting the popularity of the news.


Introduction


For this analysis I have chosen a dataset with features of news articles published on the website www.mashable.com. This dataset can be found at the following address: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

The dataset contains parameters collected from the published news articles; its usefulness lies in being able to build predictive models of the likely popularity of other articles based on these parameters.

Popularity is based on the number of times the page is shared, indicated in the "shares" column of the dataset.

The reasons for choosing this dataset are several: it has enough variables to make dimensionality reduction worthwhile, and many of them are continuous, which allows discretization to be performed. Since the target variable is available, supervised models can be applied; ignoring it, unsupervised models can be applied as well.

The variables contained in the dataset are the following (in summary):

  • url: URL of the article.
  • timedelta: days between the article's publication and the dataset's collection.
  • Number of words, unique words, stop words (prepositions, pronouns, articles) and unique stop words.
  • Number of references and references to the same page.
  • Number of images and videos.
  • Average word length.
  • Number of keywords.
  • Type of channel where the article is published.
  • Keyword rankings (best, worst and average).
  • Maximum number of keywords.
  • Maximum, minimum and average number of references to the article from the same page.
  • Day of the week on which the article was published.
  • Metrics of the article's category model (LDA).
  • Other sentiment analysis metrics, such as positive or negative polarity, or subjectivity.
  • Number of times the article has been shared (this will be the target variable for determining popularity).

We start by reading the dataset and displaying a summary of the data. We see that all the variables except the url are numeric. Some are binary: those indicating whether the article was published on a certain day of the week and whether it belongs to a particular type of channel. Most are continuous; some, like subjectivity and polarity, have a defined range between 0 and 1 or between -1 and 1, while others, like the number of words or keywords, have a wider range of values.

data<-read.csv('../Datos/OnlineNewsPopularity.csv')
summary(data)
##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39638  
##    timedelta     n_tokens_title n_tokens_content n_unique_tokens   
##  Min.   :  8.0   Min.   : 2.0   Min.   :   0.0   Min.   :  0.0000  
##  1st Qu.:164.0   1st Qu.: 9.0   1st Qu.: 246.0   1st Qu.:  0.4709  
##  Median :339.0   Median :10.0   Median : 409.0   Median :  0.5392  
##  Mean   :354.5   Mean   :10.4   Mean   : 546.5   Mean   :  0.5482  
##  3rd Qu.:542.0   3rd Qu.:12.0   3rd Qu.: 716.0   3rd Qu.:  0.6087  
##  Max.   :731.0   Max.   :23.0   Max.   :8474.0   Max.   :701.0000  
##                                                                    
##  n_non_stop_words    n_non_stop_unique_tokens   num_hrefs     
##  Min.   :   0.0000   Min.   :  0.0000         Min.   :  0.00  
##  1st Qu.:   1.0000   1st Qu.:  0.6257         1st Qu.:  4.00  
##  Median :   1.0000   Median :  0.6905         Median :  8.00  
##  Mean   :   0.9965   Mean   :  0.6892         Mean   : 10.88  
##  3rd Qu.:   1.0000   3rd Qu.:  0.7546         3rd Qu.: 14.00  
##  Max.   :1042.0000   Max.   :650.0000         Max.   :304.00  
##                                                               
##  num_self_hrefs       num_imgs         num_videos    average_token_length
##  Min.   :  0.000   Min.   :  0.000   Min.   : 0.00   Min.   :0.000       
##  1st Qu.:  1.000   1st Qu.:  1.000   1st Qu.: 0.00   1st Qu.:4.478       
##  Median :  3.000   Median :  1.000   Median : 0.00   Median :4.664       
##  Mean   :  3.294   Mean   :  4.544   Mean   : 1.25   Mean   :4.548       
##  3rd Qu.:  4.000   3rd Qu.:  4.000   3rd Qu.: 1.00   3rd Qu.:4.855       
##  Max.   :116.000   Max.   :128.000   Max.   :91.00   Max.   :8.042       
##                                                                          
##   num_keywords    data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   : 1.000   Min.   :0.00000           Min.   :0.000                
##  1st Qu.: 6.000   1st Qu.:0.00000           1st Qu.:0.000                
##  Median : 7.000   Median :0.00000           Median :0.000                
##  Mean   : 7.224   Mean   :0.05295           Mean   :0.178                
##  3rd Qu.: 9.000   3rd Qu.:0.00000           3rd Qu.:0.000                
##  Max.   :10.000   Max.   :1.00000           Max.   :1.000                
##                                                                          
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.0000         Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.0000         1st Qu.:0.0000      
##  Median :0.0000      Median :0.0000         Median :0.0000      
##  Mean   :0.1579      Mean   :0.0586         Mean   :0.1853      
##  3rd Qu.:0.0000      3rd Qu.:0.0000         3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.0000         Max.   :1.0000      
##                                                                 
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   445   1st Qu.:  141.8  
##  Median :0.0000        Median : -1.00   Median :   660   Median :  235.5  
##  Mean   :0.2126        Mean   : 26.11   Mean   :  1154   Mean   :  312.4  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  357.0  
##  Max.   :1.0000        Max.   :377.00   Max.   :298400   Max.   :42827.9  
##                                                                           
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:172847   1st Qu.:   0  
##  Median :  1400   Median :843300   Median :244572   Median :1024  
##  Mean   : 13612   Mean   :752324   Mean   :259282   Mean   :1117  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:330980   3rd Qu.:2057  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3613  
##                                                                   
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3562   1st Qu.: 2382   1st Qu.:   639           
##  Median :  4356   Median : 2870   Median :  1200           
##  Mean   :  5657   Mean   : 3136   Mean   :  3999           
##  3rd Qu.:  6020   3rd Qu.: 3600   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##                                                            
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0.0           Min.   :0.000    
##  1st Qu.:  1100            1st Qu.:   981.2           1st Qu.:0.000    
##  Median :  2800            Median :  2200.0           Median :0.000    
##  Mean   : 10329            Mean   :  6401.7           Mean   :0.168    
##  3rd Qu.:  8000            3rd Qu.:  5200.0           3rd Qu.:0.000    
##  Max.   :843300            Max.   :843300.0           Max.   :1.000    
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.0000       Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.0000       1st Qu.:0.0000     
##  Median :0.0000     Median :0.0000       Median :0.0000     
##  Mean   :0.1864     Mean   :0.1875       Mean   :0.1833     
##  3rd Qu.:0.0000     3rd Qu.:0.0000       3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.0000       Max.   :1.0000     
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.00000     Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.00000     1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.00000     Median :0.00000   Median :0.0000  
##  Mean   :0.1438    Mean   :0.06188     Mean   :0.06904   Mean   :0.1309  
##  3rd Qu.:0.0000    3rd Qu.:0.00000     3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.00000     Max.   :1.00000   Max.   :1.0000  
##                                                                          
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03334   Median :0.04000   Median :0.04000  
##  Mean   :0.18460   Mean   :0.14126   Mean   :0.21632   Mean   :0.22377  
##  3rd Qu.:0.24096   3rd Qu.:0.15083   3rd Qu.:0.33422   3rd Qu.:0.37576  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.92653  
##                                                                         
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.00000   Min.   :0.0000      Min.   :-0.39375         
##  1st Qu.:0.02857   1st Qu.:0.3962      1st Qu.: 0.05776         
##  Median :0.04073   Median :0.4535      Median : 0.11912         
##  Mean   :0.23403   Mean   :0.4434      Mean   : 0.11931         
##  3rd Qu.:0.39999   3rd Qu.:0.5083      3rd Qu.: 0.17783         
##  Max.   :0.92719   Max.   :1.0000      Max.   : 0.72784         
##                                                                 
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02838            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03902            Median :0.015337           Median :0.7105     
##  Mean   :0.03962            Mean   :0.016612           Mean   :0.6822     
##  3rd Qu.:0.05028            3rd Qu.:0.021739           3rd Qu.:0.8000     
##  Max.   :0.15549            Max.   :0.184932           Max.   :1.0000     
##                                                                           
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1852      1st Qu.:0.3062        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3588        Median :0.10000      
##  Mean   :0.2879      Mean   :0.3538        Mean   :0.09545      
##  3rd Qu.:0.3846      3rd Qu.:0.4114        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##                                                                 
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3284       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2533       Median :-0.5000      
##  Mean   :0.7567        Mean   :-0.2595       Mean   :-0.5219      
##  3rd Qu.:1.0000        3rd Qu.:-0.1869       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##                                                                   
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1500     Median : 0.00000        
##  Mean   :-0.1075       Mean   :0.2824     Mean   : 0.07143        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.15000        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##                                                                   
##  abs_title_subjectivity abs_title_sentiment_polarity     shares      
##  Min.   :0.0000         Min.   :0.0000               Min.   :     1  
##  1st Qu.:0.1667         1st Qu.:0.0000               1st Qu.:   946  
##  Median :0.5000         Median :0.0000               Median :  1400  
##  Mean   :0.3418         Mean   :0.1561               Mean   :  3395  
##  3rd Qu.:0.5000         3rd Qu.:0.2500               3rd Qu.:  2800  
##  Max.   :0.5000         Max.   :1.0000               Max.   :843300  
## 
str(data)
## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

Data cleaning


We start by saving a copy of the original dataset, in case we need it later on.

dataorig<-data

Let's check whether there are null or empty values in the dataset. In this case none are found, so the dataset is ready for the rest of the analysis.

colSums(is.na(data))
##                           url                     timedelta 
##                             0                             0 
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                        shares 
##                             0
colSums(data=="")
##                           url                     timedelta 
##                             0                             0 
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                        shares 
##                             0

We check the variables that should be binary to verify that none of them takes a value outside its domain (0 or 1); in this case everything is correct.

apply(data[,c(14:19,32:39)],2,function(x) levels(as.factor(x)))
##      data_channel_is_lifestyle data_channel_is_entertainment
## [1,] "0"                       "0"                          
## [2,] "1"                       "1"                          
##      data_channel_is_bus data_channel_is_socmed data_channel_is_tech
## [1,] "0"                 "0"                    "0"                 
## [2,] "1"                 "1"                    "1"                 
##      data_channel_is_world weekday_is_monday weekday_is_tuesday
## [1,] "0"                   "0"               "0"               
## [2,] "1"                   "1"               "1"               
##      weekday_is_wednesday weekday_is_thursday weekday_is_friday
## [1,] "0"                  "0"                 "0"              
## [2,] "1"                  "1"                 "1"              
##      weekday_is_saturday weekday_is_sunday is_weekend
## [1,] "0"                 "0"               "0"       
## [2,] "1"                 "1"               "1"

It seems that the variable is_weekend is redundant, since it carries the same information as the variables weekday_is_saturday and weekday_is_sunday combined. Let's remove it.

data$is_weekend<-NULL
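The redundancy can also be confirmed programmatically before dropping the column. A minimal sketch on a toy data frame (on the real data, the same expression applied to `data` should return TRUE if is_weekend is indeed redundant):

```r
# Toy data frame mimicking the weekend indicator columns
df <- data.frame(weekday_is_saturday = c(1, 0, 0),
                 weekday_is_sunday   = c(0, 1, 0),
                 is_weekend          = c(1, 1, 0))

# is_weekend is redundant if it always equals saturday + sunday
all(df$is_weekend == df$weekday_is_saturday + df$weekday_is_sunday)
```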

Let's now check that the binary variables that should be mutually exclusive do not have several "1" values in the same record, nor all values at 0.

#Day of the week
nrow(data[rowSums(data[,c(32:38)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(32:38)])<1,])
## [1] 0
#Channel type
nrow(data[rowSums(data[,c(14:19)])>1,])
## [1] 0
nrow(data[rowSums(data[,c(14:19)])<1,])
## [1] 6134

We see that 6134 rows have no channel type assigned. Rather than delete them, we create a new variable called data_channel_is_other, with value 1 in these rows and 0 in the rest, so that these records are also classified.

#Create column with value 0
data$data_channel_is_other<-0

#Assign value 1 to rows without an assigned channel type
data[rowSums(data[,c(14:19)])<1,]$data_channel_is_other<-1
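As a quick sanity check (a sketch, assuming the channel indicator columns are still at positions 14:19), every row should now have exactly one channel flag set once data_channel_is_other is counted:

```r
# Each row should sum to exactly 1 across the six original channel
# indicators plus the new data_channel_is_other column
table(rowSums(data[,c(14:19)]) + data$data_channel_is_other)
```

If the fix worked, the table shows a single entry for the value 1 covering all rows.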

Let's look at the correlations between the continuous variables. Given the number of variables, we replace the row and column names with numbers so that the plot is somewhat clearer.

library(corrplot)
M<-cor(data[,c(3:13,20:31,40:60)])

#Save the variable names
coln<-colnames(M)
rown<-rownames(M)
colnames(M)<-1:44
rownames(M)<-1:44
corrplot(M, type = "upper", method = "circle", tl.cex = 0.6)

We see a high correlation between some variables; we plot only the rows and columns of interest to see it better.

#Restore the variable names to the matrix to make the plot clearer.

colnames(M)<-coln
rownames(M)<-rown
corrplot(M[c(3:5,12,13,19,29),c(3:5,14,16,20,32)], type = "upper", method = "number", tl.cex = 0.9)

The variables n_unique_tokens, n_non_stop_words and n_non_stop_unique_tokens are maximally correlated. The first refers to the total number of distinct words; the other two refer to stop words, i.e. empty, meaningless words (prepositions, articles or pronouns). The latter do not seem to add information beyond the total number of distinct words, so we remove them. The remaining variables, although highly correlated, are not perfectly correlated (1), so we keep them, since we will apply the SVD algorithm to them later.

data$n_non_stop_words<-NULL
data$n_non_stop_unique_tokens<-NULL
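Instead of reading the pairs off the plot, the highly correlated pairs can also be listed programmatically. A sketch using the correlation matrix M computed above (the 0.9 threshold is an arbitrary choice):

```r
# Indices of upper-triangle entries with |correlation| above 0.9
# (upper.tri avoids the diagonal and duplicate symmetric pairs)
high <- which(abs(M) > 0.9 & upper.tri(M), arr.ind = TRUE)

# Show each pair of variable names alongside its correlation
data.frame(var1 = rownames(M)[high[, 1]],
           var2 = colnames(M)[high[, 2]],
           cor  = M[high])
```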

Descriptive Analysis


We are going to perform a descriptive analysis of the dataset.

First let's look at the distributions of the variables. Given how many there are, for clarity we show only the most representative ones; the rest have distributions similar to one of these.

We can observe that the distributions are quite heterogeneous: none of them is normal, although some come close. Some variables concentrate most of their density in a narrow range, with much less density over the rest of their values.

library(dplyr)
library(ggplot2)

dens <- lapply(colnames(data[,c(3,4,10,11,23,40,42,46,47,49,52)]),function(cn) ggplot(data,aes(x=data[,cn]))+geom_density(color="darkblue", fill="lightblue")+labs(x=cn))
gridExtra::marrangeGrob(grobs=dens, nrow=2, ncol=2)

Next we normalize the data to the range (0,1) and obtain the density plots again for comparison.

library("BBmisc")
datanorm<-data[,c(3,4,10,11,23,40,42,46,47,49,52)]

datanorm<-normalize(datanorm,method="range")
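normalize() with method="range" performs a min-max rescaling. The same transformation can be written by hand, which makes the formula explicit; a sketch on a toy vector:

```r
# Min-max rescaling to [0, 1]: (x - min) / (max - min)
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

range01(c(2, 4, 6, 10))
# 0.00 0.25 0.50 1.00
```

Because the transformation is a linear shift and scale, it changes the range of each variable but not the shape of its distribution, which is exactly what the plots below confirm.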

We obtain the plots with the normalized data.

We see that the distributions of the variables do not change: the plots are the same except that their range is now 0 to 1. This allows comparisons between variables, since they are now on the same order of magnitude.

library(dplyr)
library(ggplot2)

dens <- lapply(colnames(datanorm[,c(1:ncol(datanorm))]),function(cn) ggplot(datanorm,aes(x=datanorm[,cn]))+geom_density(color="darkblue", fill="lightblue")+labs(x=cn))
gridExtra::marrangeGrob(grobs=dens, nrow=2, ncol=2)


Discretization


Let's see how each of these variables behaves by applying different types of discretization.

To perform this discretization we will use the discretize function of the arules package.

First we perform discretization with equal-amplitude intervals.

Amplitude

library(arules)

disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks=5,method="interval")))+geom_bar(color="darkblue",fill="lightblue")+
                 xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))
gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)

We can observe that in most cases this type of discretization groups the values quite irregularly. The shape of the interval bar chart closely mirrors the shape of each variable's density plot, with more observations in the intervals where the variable's density is higher, which is to be expected with equal-width intervals.

Frequency

Next, we perform discretization with equal-frequency intervals.

library(arules)

disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks=5,method="frequency")))+geom_bar(color="darkblue",fill="lightblue")+
                 xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))

gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)

In this case we see that most variables produce more uniform bar charts in terms of counts, and their shape no longer resembles the density plot. However, depending on the variable we obtain very "wide" intervals containing very different values; for example, in LDA_03 the last interval is quite wide, since there is little diversity of values in that range. The case of min_positive_polarity is striking: given the irregular density of its values, discretization with equal frequencies is not possible.
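The failure of equal-frequency binning on min_positive_polarity can be reproduced in base R: when a single value concentrates most of the mass, several quantile cut points coincide, and fewer than the requested number of bins survive. A sketch on a made-up vector, using `cut` and `quantile` instead of arules to keep it self-contained:

```r
set.seed(42)
# 80% of the mass sits on a single value, as happens with min_positive_polarity
x <- c(rep(0.1, 80), runif(20, 0.2, 1))

# Equal-width: 5 bins always exist, but almost everything falls in the first one
width_bins <- cut(x, breaks = 5)
table(width_bins)

# Equal-frequency: the 20%, 40%, 60% and 80% quantiles all equal 0.1,
# so after removing duplicated breaks fewer than 5 intervals remain
q <- quantile(x, probs = seq(0, 1, length.out = 6))
freq_bins <- cut(x, breaks = unique(q), include.lowest = TRUE)
nlevels(freq_bins)   # fewer than the 5 bins requested
```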

K-means

library(arules)

disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) ggplot(data,aes(discretize(data[,cnum],breaks = 5,method="cluster")))+geom_bar(color="darkblue",fill="lightblue")+
                 xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1)))

gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)

In this case the algorithm tries to minimize the variation of values within each cluster. This makes the real distribution of the values more visible, with higher frequencies where the variable's density is higher; in that respect the plots resemble those of equal-width discretization, although in some cases more regularity is achieved.
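The cluster method of `discretize` relies on k-means. The idea can be sketched in one dimension with base R: cluster the values, sort the centers, and cut halfway between consecutive centers. This is a simplified illustration of the principle on made-up data, not the exact arules implementation:

```r
set.seed(7)
# Two well-separated groups of values
x <- c(rnorm(50, mean = 0, sd = 0.3), rnorm(50, mean = 5, sd = 0.3))

km <- kmeans(x, centers = 2)
centers <- sort(km$centers[, 1])

# Break points: data range plus the midpoints between consecutive centers
mids <- (head(centers, -1) + tail(centers, -1)) / 2
bins <- cut(x, breaks = c(min(x), mids, max(x)), include.lowest = TRUE)
table(bins)   # each group ends up in its own interval
```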

Normal function

We are going to perform discretization based on the normal distribution; although none of the variables seems to follow it, we will see how they respond. We create a function that computes the probabilities corresponding to the chosen z values (-1.25, -0.5, 0.5, 1.25) and builds the intervals according to these proportions.

#Function that creates the intervals

zdiscr<-function(x)
{
#Save the probabilities of the chosen z values (-1.25,-0.5,0.5,1.25)
 probs<-c(pnorm(-1.25),pnorm(-0.5)-pnorm(-1.25),pnorm(0.5)-pnorm(-0.5),pnorm(1.25)-pnorm(0.5),1-pnorm(1.25))
 #Sort the values
 col<-sort(x,decreasing=FALSE)
 #Define the intervals
 #Get an array with the unique values
 
 uniq<-unique(col)

 #Index of the value in the interval
 cut1<-1
 #Value in the interval
 lowint<-col[1]
 
 intervals<-vector() #Interval vector (return value)
 interval_levels<-vector() #levels vector; saved and applied at the end so that R does not reorder the intervals.
 
 
 preuniq<-0 #Controls the index of uniq to sum the values in proportion to z
 for (i in 1:length(probs))
 {
    #we look for the cutoff number of values * probability of z.
    uniqindex<-round(length(uniq)*probs[i])
    
    #number in the selected position
    highint<-uniq[preuniq+uniqindex]
   
 
    
	#We check if the last interval is possible. It is possible that only the last value is left and we have left the array.
    #In this case highint is null. We assign the same value of lowint and the last bin will be only that value.
    if(is.na(highint))
    {
       highint=lowint
       cut2<-length(col)
    }
    else
    {
      
      #If there are few values, highint may coincide with the smallest value; in that case we adjust
      #and take the next one, since a first interval containing only a single value is not possible.

      if(highint==min(uniq))
      {
        uniqindex<-uniqindex+1
        highint<-uniq[preuniq+uniqindex]

      }
       #We look for the position of the last value that will go in the interval.

      if(i<length(probs))
        cut2<-max(which(col<highint))
      else
        #In the last interval we are looking for the value to be included because it will be closed on the right side
        cut2<-max(which(col<=highint))
   
    }
    
    preuniq<-preuniq+uniqindex
    
    #It is only possible to create the interval if there is cut2. In cases
    #where the largest value of the interval coincides with the minimum value of the column, such an interval is not possible.
    if(is.finite(cut1) && is.finite(cut2))
    {
         #Create as many entries in the interval array as there are values in that interval.
		 if(i<length(probs))
        {
          #We have to save the levels to prevent R from sorting them.
          interval_levels<-c(interval_levels,paste0("[",round(lowint,3),",",round(highint,3),")"))
          #Copy the interval into the array as many times as there are values of the interval in the original array.
          intervals<-c(intervals,rep(paste0("[",round(lowint,3),",",round(highint,3),")"),cut2-cut1+1))
        }
        else
        {
          #The last interval will be closed on the right.
          interval_levels<-c(interval_levels,paste0("[",round(lowint,3),",",round(highint,3),"]"))
          intervals<-c(intervals,rep(paste0("[",round(lowint,3),",",round(highint,3),"]"),cut2-cut1+1))
        }
          
        #We move the cutoff, for the following interval
        cut1<-cut2+1
    }
    #The lower bound of the next interval will be the upper bound of the previous one.
    lowint<-highint
 
 }
 
#We change the result to type dataframe, we assign the levels again 
 #to be in the original order and return the result.

 intervals<-as.data.frame(intervals)
 intervals$intervals<-factor(intervals$intervals,levels=interval_levels)
  return(intervals)
}
 
#We apply the function to all the columns and obtain the graph.

disc <- lapply(c(3,4,10,11,23,40,42,46,47,49,52),function(cnum) {
           intervals<-zdiscr(data[,cnum])      
           ggplot(intervals,aes(x=intervals))+geom_bar(stat="count",color="darkblue",fill="lightblue")+
                 xlab(colnames(data)[cnum])+theme(axis.text.x = element_text(angle = 45, hjust = 1))
})

      

gridExtra::marrangeGrob(grobs=disc, nrow=2, ncol=2)

Although none of the variables follows a normal distribution, we can see that some of them respond well to this type of discretization.

The highest frequency of values tends to concentrate in the central intervals, which is expected, since that is where the normal distribution places most of its probability. In several variables the resulting plot resembles the normal distribution, although the interval widths are quite uneven when the density of the variable concentrates at one end, as happens with LDA_03. We also observe that when there are few distinct values we obtain intervals containing a single value, especially the last one; this occurs in num_keywords and kw_avg_max.


SVD dimensionality reduction


We are going to perform dimensionality reduction of the dataset by applying SVD (singular value decomposition) on the continuous numerical variables of the dataset.

SVD decomposes a matrix into singular values and singular vectors: a vector "d" of singular values and two matrices of singular vectors, "u" and "v", where "u" relates to the rows of the matrix and "v" to its columns. The length of "d" equals the number of columns of the matrix, "u" has the same dimensions as the original matrix, and "v" is square, with as many rows and columns as the dataset has columns.

Next we apply svd on the continuous variables and display the values of the vectors of our dataset. To simplify the presentation, only the first 5 rows of u and v are displayed.

#We create a dataset with the continuous variables
datanum <- (data[,c(2:10,17:28,36:56)])

#Scaling the data 
datanum<-scale(datanum)

#We apply SVD and display the data.
datanum.svd<-svd(datanum)
"d"
## [1] "d"
datanum.svd$d
##  [1] 435.86495908 392.97389437 373.00211802 329.50144643 300.54098598
##  [6] 288.33208123 259.87679145 246.96165444 240.03495172 238.71929170
## [11] 233.14862176 220.20153443 210.79590732 207.06389298 199.54850472
## [16] 197.60486053 193.48271139 187.50067045 179.32521247 166.75167621
## [21] 160.72846361 154.20653132 152.64297871 148.43576736 140.48371299
## [26] 134.89712031 134.31326141 128.75125574 126.60830703 116.45830674
## [31] 108.21799033 105.02227950  85.88362175  83.93061932  73.04281379
## [36]  62.37521050  56.29744729  55.59221505  48.63510408  39.93920605
## [41]  32.11331643   0.02854457
"u"
## [1] "u"
datanum.svd$u[1:5,]
##               [,1]        [,2]         [,3]         [,4]         [,5]
## [1,] -3.385812e-03 0.006482267 -0.006124425  0.003765417 -0.007649380
## [2,]  7.160552e-05 0.010893263 -0.004993432  0.003369068 -0.003279016
## [3,] -8.940032e-03 0.008645131 -0.005460680  0.001766840 -0.004323186
## [4,] -1.973601e-03 0.005191147 -0.009384980  0.004877264 -0.004584371
## [5,] -9.191780e-03 0.008729192 -0.004061428 -0.001309566  0.004055826
##               [,6]         [,7]          [,8]          [,9]         [,10]
## [1,]  0.0008022925  0.005712196 -5.810998e-03 -0.0023044690 -0.0001962938
## [2,]  0.0015301889 -0.004711192 -6.837987e-03 -0.0002711066 -0.0038951485
## [3,] -0.0023831448 -0.003891976 -1.620292e-03  0.0004742731  0.0009639064
## [4,] -0.0004544707 -0.001793962 -7.970181e-05  0.0009011354 -0.0017215056
## [5,]  0.0132565096  0.002553895  3.154651e-03 -0.0029947192  0.0065904581
##             [,11]         [,12]        [,13]         [,14]         [,15]
## [1,] -0.004417958  0.0013223745  0.004017870 -0.0003510615  0.0018983328
## [2,]  0.002964934  0.0033843555  0.004391979 -0.0055342393  0.0002399426
## [3,] -0.003068118 -0.0006437289 -0.006344687  0.0013045685 -0.0009672438
## [4,] -0.003810086  0.0040769780  0.002817938  0.0042052519  0.0005383038
## [5,] -0.001261883 -0.0054531252 -0.004692972  0.0019186179  0.0028051413
##             [,16]         [,17]         [,18]         [,19]         [,20]
## [1,] -0.003207650  0.0054956882  0.0017243019  0.0059939961 -0.0058545529
## [2,] -0.000585199  0.0009942088  0.0013589702  0.0008725356 -0.0001222914
## [3,]  0.001199733 -0.0031858479  0.0057170035 -0.0044088114 -0.0043207743
## [4,] -0.002522068 -0.0027396626 -0.0012290974 -0.0039576367 -0.0030495272
## [5,] -0.003332712  0.0091605835  0.0007881861  0.0068524002  0.0011518221
##              [,21]         [,22]         [,23]         [,24]         [,25]
## [1,]  0.0009219086  0.0008116234  0.0006152438  0.0009257692  0.0029916862
## [2,] -0.0053754431  0.0025061767 -0.0014188080  0.0019004582 -0.0007882620
## [3,]  0.0014927664 -0.0024788368  0.0065281555 -0.0026304395 -0.0002210594
## [4,] -0.0006189323 -0.0005332388  0.0026593720 -0.0010316101  0.0029397696
## [5,] -0.0075918668 -0.0012860026  0.0103827916  0.0053658913  0.0025345687
##              [,26]        [,27]         [,28]        [,29]        [,30]
## [1,] -2.419917e-03 -0.001785278 -0.0018219301 -0.005497054 -0.004750886
## [2,]  5.550875e-03 -0.003660889 -0.0004846979 -0.002622815  0.001913659
## [3,] -3.168172e-03 -0.001859394 -0.0055794480 -0.003508295 -0.014145093
## [4,]  7.931757e-05 -0.002427204 -0.0062993395 -0.003865837  0.003513539
## [5,]  5.583558e-04  0.003984597 -0.0006616196 -0.001520384 -0.005504363
##              [,31]        [,32]         [,33]        [,34]       [,35]
## [1,]  0.0030965488 -0.010320805 -0.0005335837  0.001892414 -0.03463568
## [2,] -0.0019229863 -0.012759956 -0.0020608487  0.001078080 -0.03443611
## [3,]  0.0063091260 -0.008155755 -0.0042873231  0.001326227 -0.03550967
## [4,] -0.0004069632 -0.005223992 -0.0033778086  0.014733847 -0.03473534
## [5,]  0.0009561588 -0.005076812 -0.0047335245 -0.001599254 -0.03613900
##             [,36]         [,37]        [,38]        [,39]        [,40]
## [1,] -0.003873997  0.0007563766 -0.009059118 -0.008668327 0.0019382965
## [2,]  0.001449192 -0.0010396272 -0.009097761  0.002476414 0.0003082738
## [3,]  0.001506646 -0.0068628751 -0.009921063  0.008989540 0.0028175504
## [4,] -0.003497095 -0.0023530477 -0.008894878  0.002151546 0.0023306861
## [5,]  0.001489364 -0.0077657183 -0.006413718 -0.001870882 0.0023400405
##              [,41]         [,42]
## [1,]  0.0032175328 -0.0038690460
## [2,]  0.0019817553 -0.0016703504
## [3,]  0.0004969286 -0.0006988299
## [4,] -0.0041086339  0.0050197330
## [5,]  0.0055072330  0.0027956778
"v"
## [1] "v"
datanum.svd$v[1:5,]
##              [,1]         [,2]         [,3]          [,4]         [,5]
## [1,] -0.160176528  0.212655760 -0.087917290  0.3006809405 -0.127347683
## [2,]  0.057624242 -0.072963877  0.015681959 -0.0839603606  0.002522261
## [3,] -0.138756231 -0.120186226 -0.118620240 -0.1098440427  0.323525008
## [4,] -0.001886699  0.001056695  0.000614258  0.0000989348 -0.005865528
## [5,] -0.151918979 -0.139421275 -0.008222312 -0.0630964546  0.215453543
##              [,6]         [,7]         [,8]          [,9]       [,10]
## [1,]  0.117861359 -0.079799426  0.018431306 -0.0509609438 -0.12012821
## [2,] -0.001433014  0.249892639 -0.005786841 -0.0134710244  0.08172337
## [3,]  0.322829291 -0.025924881 -0.084098736  0.0166977660 -0.03816705
## [4,] -0.003166784 -0.007444727  0.015356609  0.0005415314  0.01111672
## [5,]  0.281549104 -0.110350543  0.174827337 -0.0046773757 -0.11503890
##             [,11]       [,12]       [,13]        [,14]        [,15]
## [1,]  0.058916882 -0.07596585  0.03692845  0.004615699  0.001669931
## [2,] -0.082433542  0.19048019 -0.12721847  0.150026085  0.097274270
## [3,]  0.012894253  0.01190671  0.02189248  0.052507880 -0.001656651
## [4,] -0.004631027 -0.04208872  0.11083770 -0.031435153  0.889656752
## [5,] -0.098173031 -0.05976804  0.11049366 -0.181561340  0.013260949
##            [,16]        [,17]       [,18]       [,19]        [,20]
## [1,] -0.01189324 -0.035835667 -0.06846093  0.07692079  0.014744969
## [2,] -0.07750860  0.719846877  0.42223127 -0.01179334  0.238552938
## [3,] -0.01955257  0.008153615  0.12450141 -0.19118297 -0.195908668
## [4,]  0.42276868 -0.067618755  0.04806171 -0.05294360 -0.002468255
## [5,] -0.07532025  0.096635603 -0.16154077  0.11702986 -0.017647491
##            [,21]        [,22]       [,23]       [,24]         [,25]
## [1,] -0.08091506 -0.041017076 -0.00530985  0.08481055  1.042435e-01
## [2,] -0.03235169  0.003388664 -0.03830150 -0.06659852 -5.409947e-02
## [3,]  0.16466115  0.013542924 -0.06230209  0.11593981 -3.239975e-02
## [4,]  0.03424717 -0.028228721  0.02885058 -0.03467925 -8.355276e-06
## [5,]  0.10974381 -0.045298941 -0.28812328 -0.49280635  2.411568e-01
##            [,26]        [,27]       [,28]        [,29]         [,30]
## [1,]  0.12961870  0.040784050 -0.22389909 -0.765407997  0.0110983098
## [2,]  0.08309065  0.029003632 -0.15065757 -0.150774013  0.0015689583
## [3,] -0.17315350  0.131474721  0.08361562 -0.066177029  0.6204194884
## [4,]  0.00751278 -0.001757247 -0.01010677 -0.003471738 -0.0005628256
## [5,]  0.13278299 -0.138708900 -0.42363307  0.158593423 -0.0138376222
##             [,31]         [,32]        [,33]         [,34]         [,35]
## [1,]  0.079271564  0.2402076559 -0.011142287 -0.0759035859  0.0373670293
## [2,] -0.004399013  0.0101972524 -0.009848919 -0.0046400713  0.0070068610
## [3,]  0.340866487  0.0186629140 -0.047343141 -0.0587009917 -0.0054278059
## [4,]  0.005633113 -0.0001072322  0.006400669  0.0003668092  0.0005151918
## [5,] -0.073535012  0.0758025349  0.034274521  0.0092647868 -0.0221446993
##             [,36]         [,37]         [,38]         [,39]        [,40]
## [1,]  0.017364096 -0.0166638004 -0.0270754730 -0.0011494122 -0.027729431
## [2,]  0.003595495 -0.0002643839  0.0144764347  0.0003837793 -0.010483728
## [3,] -0.085136156 -0.0605387853  0.0323258233  0.0499971586 -0.014185718
## [4,] -0.001710734  0.0030371435  0.0001139472 -0.0019293816  0.001175496
## [5,] -0.031068643  0.0131640133 -0.0375182867 -0.0136114318  0.026783131
##             [,41]         [,42]
## [1,]  0.005351511  1.889274e-05
## [2,]  0.016655576  6.602757e-08
## [3,]  0.035034272 -1.350970e-04
## [4,] -0.014758312 -8.276680e-03
## [5,] -0.043744018 -2.845891e-06

In the following graph we can observe the singular values of the vector "d" in decreasing order; the smaller ones contribute less to explaining the total variation of the dataset.

plot( datanum.svd$d, xlab="Column", ylab="Singular value", pch=19 )

The following graph shows the proportion of variance explained by each value of "d"; they appear in decreasing order. From roughly the 30th singular value onwards, the explained variability is very close to 0, so those components matter little. This means we can drop them without fear of losing much information when applying a model.

plot(prop.table(datanum.svd$d^2),ylab="Percent variability explained")

We see that the first 30 columns account for 96.5% of the explained variability, so we can discard the remaining singular values without losing much information.

sum(prop.table(datanum.svd$d^2)[1:30])
## [1] 0.9653823
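The rule used here — keep the smallest number of components whose cumulative share of d² reaches a target — can be wrapped in a small helper. The function and the synthetic matrix below are our own sketch, not part of the original analysis:

```r
# Smallest k whose cumulative proportion of d^2 reaches the target
k_for_variance <- function(d, target = 0.965) {
  which(cumsum(prop.table(d^2)) >= target)[1]
}

set.seed(3)
# Synthetic matrix whose variance concentrates in the first three directions
M <- matrix(rnorm(200 * 10), ncol = 10) %*% diag(c(10, 8, 5, rep(0.5, 7)))
s <- svd(scale(M, center = TRUE, scale = FALSE))

k_for_variance(s$d, target = 0.90)   # only a few components are needed
```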

The decomposition of the matrix into singular values and vectors allows us to recover the original data with the following multiplication:

X = u %*% diag(d) %*% t(v)

We show the first 5 rows of the resulting dataset which will be the same as the original.

X=datanum.svd$u %*% diag(datanum.svd$d) %*% t(datanum.svd$v)
X[1:5,]
##          [,1]       [,2]        [,3]         [,4]       [,5]       [,6]
## [1,] 1.757858  0.7574377 -0.69520168  0.032771459 -0.6074549 -0.3355619
## [2,] 1.757858 -0.6616483 -0.61878600  0.016055679 -0.6957005 -0.5949557
## [3,] 1.757858 -0.6616483 -0.71218294  0.007644432 -0.6957005 -0.5949557
## [4,] 1.757858 -0.6616483 -0.03293246 -0.012619006 -0.1662272 -0.8543496
## [5,] 1.757858  1.2304663  1.11542538 -0.037654383  0.7162282  4.0741340
##            [,7]       [,8]       [,9]      [,10]      [,11]      [,12]
## [1,] -0.4265204 -0.3042644  0.1564722 -0.5195597 -0.3749188 -0.2991069
## [2,] -0.4265204 -0.3042644  0.4328325 -0.5195597 -0.3749188 -0.2991069
## [3,] -0.4265204 -0.3042644 -0.1834123 -0.5195597 -0.3749188 -0.2991069
## [4,] -0.4265204 -0.3042644 -0.1697560 -0.5195597 -0.3749188 -0.2991069
## [5,]  1.8600374 -0.3042644  0.1593979 -0.5195597 -0.3749188 -0.2991069
##           [,13]      [,14]     [,15]     [,16]      [,17]      [,18]
## [1,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [2,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [3,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [4,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
## [5,] -0.5031815 -0.2347523 -3.507303 -1.919153 -0.9821441 -0.9275832
##          [,19]      [,20]      [,21]      [,22]      [,23]      [,24]
## [1,] -2.378984 -0.1774565 -0.2396733 -0.2723188  1.2006179  1.0788132
## [2,] -2.378984 -0.2025848 -0.2517627 -0.2723188  2.3392236 -0.4151391
## [3,] -2.378984 -0.1560771 -0.2293875 -0.2723188  0.1262222 -0.4912049
## [4,] -2.378984 -0.2025848 -0.2517627 -0.2723188 -0.5933112  1.2655193
## [5,] -2.378984 -0.1749741  0.1382189 -0.2723188 -0.5930846 -0.5118729
##           [,25]      [,26]      [,27]      [,28]      [,29]     [,30]
## [1,] -0.6249133 -0.6182679 -0.6705322  0.6705848 -0.2759425 0.3463989
## [2,] -0.5891460 -0.5883279 -0.6363734 -0.8752172  0.3057706 0.2015314
## [3,] -0.6484942 -0.6451289  1.5497390  2.2183898  2.1048455 0.9895888
## [4,]  0.9864774 -0.6601322 -0.7104755 -0.1158723 -0.1919372 0.1036470
## [5,] -0.6654226 -0.6612604  2.2525402  0.6010382  1.6681430 2.0083034
##            [,31]       [,32]      [,33]      [,34]       [,35]      [,36]
## [1,] -0.26907511  0.45782152 -0.3660720  0.2373340  0.06386407 -0.2289379
## [2,] -0.08550635  0.26909259 -0.1361899 -0.6400320 -0.87095652 -0.2289379
## [3,] -0.65880896  0.92001481 -0.9290484  1.3583836  0.06386407  0.9817856
## [4,]  0.37897951 -0.08140399  0.2907339  0.3074379  0.57376621  0.1746366
## [5,] -0.41423531  0.93616673 -0.9487223  0.5481279 -0.87095652  0.9817856
##           [,37]       [,38]       [,39]      [,40]      [,41]      [,42]
## [1,] -0.7083605 -0.26889113 -0.96987381  0.6712369 -0.9754199 -1.8106960
## [2,]  1.1021597  1.36740642  0.07864114 -0.8707956 -0.2690728  0.8377381
## [3,] -1.6217761 -0.95785852 -0.27086384 -0.8707956 -0.2690728  0.8377381
## [4,] -0.8625735 -0.26889113 -0.62036882 -0.8707956 -0.2690728  0.8377381
## [5,]  0.3079398  0.07559256  0.60289861  0.5310522  0.2446342 -1.5699293

In this case, since we are trying to reduce dimensionality, we are interested in working with a reduced version of the data, both to apply the models more easily and to make the algorithms run more efficiently. To obtain the relevant information it is enough to multiply the reduced versions of "u" and "d", keeping only the columns of interest, so that we can build the models on the most relevant information.

datareduced=datanum.svd$u[,1:30] %*% diag(datanum.svd$d[1:30]) 
datareduced[1:5,]
##             [,1]     [,2]      [,3]       [,4]       [,5]       [,6]
## [1,] -1.47575683 2.547362 -2.284423  1.2407105 -2.2989522  0.2313267
## [2,]  0.03121033 4.280768 -1.862561  1.1101126 -0.9854788  0.4412026
## [3,] -3.89664683 3.397311 -2.036845  0.5821763 -1.2992944 -0.6871371
## [4,] -0.86022337 2.039985 -3.500617  1.6070654 -1.3777914 -0.1310385
## [5,] -4.00637470 3.430344 -1.514921 -0.4315038  1.2189420  3.8222770
##           [,7]        [,8]        [,9]       [,10]      [,11]      [,12]
## [1,]  1.484467 -1.43509356 -0.55315311 -0.04685913 -1.0300407  0.2911889
## [2,] -1.224329 -1.68872056 -0.06507506 -0.92984709  0.6912702  0.7452403
## [3,] -1.011434 -0.40014993  0.11384211  0.23010305 -0.7153275 -0.1417501
## [4,] -0.466209 -0.01968329  0.21630400 -0.41095659 -0.8883163  0.8977568
## [5,]  0.663698  0.77907774 -0.71883727  1.57326948 -0.2942063 -1.2007865
##           [,13]       [,14]       [,15]      [,16]      [,17]      [,18]
## [1,]  0.8469506 -0.07269215  0.37880947 -0.6338472  1.0633207  0.3233078
## [2,]  0.9258112 -1.14594114  0.04788018 -0.1156382  0.1923622  0.2548078
## [3,] -1.3374341  0.27012904 -0.19301206  0.2370730 -0.6164065  1.0719420
## [4,]  0.5940097  0.87075582  0.10741771 -0.4983728 -0.5300773 -0.2304566
## [5,] -0.9892594  0.39727649  0.55976174 -0.6585601  1.7724145  0.1477854
##           [,19]      [,20]       [,21]       [,22]       [,23]      [,24]
## [1,]  1.0748746 -0.9762565  0.14817696  0.12515763  0.09391265  0.1374173
## [2,]  0.1564676 -0.0203923 -0.86398671  0.38646882 -0.21657108  0.2820960
## [3,] -0.7906110 -0.7204964  0.23993006 -0.38225283  0.99647710 -0.3904513
## [4,] -0.7097040 -0.5085138 -0.09948003 -0.08222891  0.40593446 -0.1531278
## [5,]  1.2288081  0.1920683 -1.22022908 -0.19831000  1.58486024  0.7964902
##            [,25]       [,26]      [,27]       [,28]      [,29]      [,30]
## [1,]  0.42028318 -0.32643986 -0.2397866 -0.23457578 -0.6959727 -0.5532801
## [2,] -0.11073797  0.74879709 -0.4917060 -0.06240546 -0.3320702  0.2228615
## [3,] -0.03105525 -0.42737721 -0.2497413 -0.71836093 -0.4441793 -1.6473136
## [4,]  0.41298974  0.01069971 -0.3260057 -0.81104787 -0.4894471  0.4091807
## [5,]  0.35606562  0.07532059  0.5351842 -0.08518436 -0.1924933 -0.6410288
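An equivalent way to obtain this reduced matrix is to project the original data onto the first singular vectors: since X = u %*% diag(d) %*% t(v) and v is orthogonal, X %*% v[,1:k] equals u[,1:k] %*% diag(d[1:k]). A quick check of this identity on a small synthetic matrix:

```r
set.seed(11)
X <- scale(matrix(rnorm(50 * 6), ncol = 6))
s <- svd(X)

k <- 3
reduced   <- s$u[, 1:k] %*% diag(s$d[1:k])   # construction used in the post
projected <- X %*% s$v[, 1:k]                # projection onto the first k directions

max(abs(reduced - projected))   # numerically zero
```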

Data conversion and creation of train and test sets

To optimize results, we are going to convert the binary variables to factors and normalize the numerical ones, except shares, which is the target variable; we will remove it for the models once we create another column with the thresholds. We apply the scale function as we did earlier with the SVD algorithm.

library(BBmisc)
datascale<-data
datascale[,c(12:17,30:36,59)]<-lapply(c(12:17,30:36,59), function(x) as.factor(datascale[,x]))
datascale[,c(3:11,18:29,37:57)]<-lapply(c(3:11,18:29,37:57), function(x) as.numeric(scale(datascale[,x], center=TRUE, scale=TRUE)))

str(datascale)
## 'data.frame':    39644 obs. of  59 variables:
##  $ url                          : Factor w/ 39644 levels "http://mashable.com/2013/01/07/amazon-instant-video-browser/",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  0.757 -0.662 -0.662 -0.662 1.23 ...
##  $ n_tokens_content             : num  -0.6952 -0.6188 -0.7122 -0.0329 1.1154 ...
##  $ n_unique_tokens              : num  0.03277 0.01606 0.00764 -0.01262 -0.03765 ...
##  $ num_hrefs                    : num  -0.607 -0.696 -0.696 -0.166 0.716 ...
##  $ num_self_hrefs               : num  -0.336 -0.595 -0.595 -0.854 4.074 ...
##  $ num_imgs                     : num  -0.427 -0.427 -0.427 -0.427 1.86 ...
##  $ num_videos                   : num  -0.304 -0.304 -0.304 -0.304 -0.304 ...
##  $ average_token_length         : num  0.156 0.433 -0.183 -0.17 0.159 ...
##  $ num_keywords                 : num  -1.165 -1.689 -0.641 -0.117 -0.117 ...
##  $ data_channel_is_lifestyle    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
##  $ data_channel_is_bus          : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
##  $ data_channel_is_socmed       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ data_channel_is_tech         : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
##  $ data_channel_is_world        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ kw_min_min                   : num  -0.375 -0.375 -0.375 -0.375 -0.375 ...
##  $ kw_max_min                   : num  -0.299 -0.299 -0.299 -0.299 -0.299 ...
##  $ kw_avg_min                   : num  -0.503 -0.503 -0.503 -0.503 -0.503 ...
##  $ kw_min_max                   : num  -0.235 -0.235 -0.235 -0.235 -0.235 ...
##  $ kw_max_max                   : num  -3.51 -3.51 -3.51 -3.51 -3.51 ...
##  $ kw_avg_max                   : num  -1.92 -1.92 -1.92 -1.92 -1.92 ...
##  $ kw_min_avg                   : num  -0.982 -0.982 -0.982 -0.982 -0.982 ...
##  $ kw_max_avg                   : num  -0.928 -0.928 -0.928 -0.928 -0.928 ...
##  $ kw_avg_avg                   : num  -2.38 -2.38 -2.38 -2.38 -2.38 ...
##  $ self_reference_min_shares    : num  -0.177 -0.203 -0.156 -0.203 -0.175 ...
##  $ self_reference_max_shares    : num  -0.24 -0.252 -0.229 -0.252 0.138 ...
##  $ self_reference_avg_sharess   : num  -0.244 -0.264 -0.226 -0.264 -0.134 ...
##  $ weekday_is_monday            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weekday_is_tuesday           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_wednesday         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_thursday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_friday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_saturday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_sunday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LDA_00                       : num  1.201 2.339 0.126 -0.593 -0.593 ...
##  $ LDA_01                       : num  1.079 -0.415 -0.491 1.266 -0.512 ...
##  $ LDA_02                       : num  -0.625 -0.589 -0.648 0.986 -0.665 ...
##  $ LDA_03                       : num  -0.618 -0.588 -0.645 -0.66 -0.661 ...
##  $ LDA_04                       : num  -0.671 -0.636 1.55 -0.71 2.253 ...
##  $ global_subjectivity          : num  0.671 -0.875 2.218 -0.116 0.601 ...
##  $ global_sentiment_polarity    : num  -0.276 0.306 2.105 -0.192 1.668 ...
##  $ global_rate_positive_words   : num  0.346 0.202 0.99 0.104 2.008 ...
##  $ global_rate_negative_words   : num  -0.2691 -0.0855 -0.6588 0.379 -0.4142 ...
##  $ rate_positive_words          : num  0.4578 0.2691 0.92 -0.0814 0.9362 ...
##  $ rate_negative_words          : num  -0.366 -0.136 -0.929 0.291 -0.949 ...
##  $ avg_positive_polarity        : num  0.237 -0.64 1.358 0.307 0.548 ...
##  $ min_positive_polarity        : num  0.0639 -0.871 0.0639 0.5738 -0.871 ...
##  $ max_positive_polarity        : num  -0.229 -0.229 0.982 0.175 0.982 ...
##  $ avg_negative_polarity        : num  -0.708 1.102 -1.622 -0.863 0.308 ...
##  $ min_negative_polarity        : num  -0.2689 1.3674 -0.9579 -0.2689 0.0756 ...
##  $ max_negative_polarity        : num  -0.9699 0.0786 -0.2709 -0.6204 0.6029 ...
##  $ title_subjectivity           : num  0.671 -0.871 -0.871 -0.871 0.531 ...
##  $ title_sentiment_polarity     : num  -0.975 -0.269 -0.269 -0.269 0.245 ...
##  $ abs_title_subjectivity       : num  -1.811 0.838 0.838 0.838 -1.57 ...
##  $ abs_title_sentiment_polarity : num  0.1389 -0.6896 -0.6896 -0.6896 -0.0871 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...
##  $ data_channel_is_other        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...

Given the large size of the dataset, to speed up processing and to be able to run functions that would otherwise exhaust memory, we will use the sample.split function to carry out the exercise with 35% of the data.

set.seed(277)

library(caTools)
datascale$division=sample.split(datascale$shares,SplitRatio=.35)
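sample.split stratifies on the target variable so that both subsets keep a similar distribution of shares. For reference, an unstratified base-R equivalent of taking roughly 35% of the rows — a sketch on a toy size, not what the post actually runs:

```r
set.seed(277)
n <- 1000   # toy size; in the post this would be nrow(datascale)
division <- seq_len(n) %in% sample(n, size = round(0.35 * n))
sum(division) / n   # exactly 0.35
```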

summary(datascale)
##                                                              url       
##  http://mashable.com/2013/01/07/amazon-instant-video-browser/  :    1  
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/   :    1  
##  http://mashable.com/2013/01/07/apple-40-billion-app-downloads/:    1  
##  http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/      :    1  
##  http://mashable.com/2013/01/07/att-u-verse-apps/              :    1  
##  http://mashable.com/2013/01/07/beewi-smart-toys/              :    1  
##  (Other)                                                       :39638  
##    timedelta     n_tokens_title    n_tokens_content  n_unique_tokens    
##  Min.   :  8.0   Min.   :-3.9728   Min.   :-1.1601   Min.   : -0.15571  
##  1st Qu.:164.0   1st Qu.:-0.6616   1st Qu.:-0.6379   1st Qu.: -0.02197  
##  Median :339.0   Median :-0.1886   Median :-0.2919   Median : -0.00255  
##  Mean   :354.5   Mean   : 0.0000   Mean   : 0.0000   Mean   :  0.00000  
##  3rd Qu.:542.0   3rd Qu.: 0.7574   3rd Qu.: 0.3598   3rd Qu.:  0.01718  
##  Max.   :731.0   Max.   : 5.9608   Max.   :16.8273   Max.   :198.95195  
##                                                                         
##    num_hrefs       num_self_hrefs        num_imgs       
##  Min.   :-0.9604   Min.   :-0.85435   Min.   :-0.54687  
##  1st Qu.:-0.6075   1st Qu.:-0.59496   1st Qu.:-0.42652  
##  Median :-0.2545   Median :-0.07617   Median :-0.42652  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.2750   3rd Qu.: 0.18323   3rd Qu.:-0.06549  
##  Max.   :25.8662   Max.   :29.23534   Max.   :14.85731  
##                                                         
##    num_videos       average_token_length  num_keywords    
##  Min.   :-0.30426   Min.   :-5.3863      Min.   :-3.2600  
##  1st Qu.:-0.30426   1st Qu.:-0.0827      1st Qu.:-0.6410  
##  Median :-0.30426   Median : 0.1372      Median :-0.1172  
##  Mean   : 0.00000   Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.:-0.06083   3rd Qu.: 0.3631      3rd Qu.: 0.9304  
##  Max.   :21.84842   Max.   : 4.1370      Max.   : 1.4542  
##                                                           
##  data_channel_is_lifestyle data_channel_is_entertainment
##  0:37545                   0:32587                      
##  1: 2099                   1: 7057                      
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  0:33386             0:37321                0:32298             
##  1: 6258             1: 2323                1: 7346             
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  data_channel_is_world   kw_min_min        kw_max_min     
##  0:31217               Min.   :-0.3893   Min.   :-0.2991  
##  1: 8427               1st Qu.:-0.3893   1st Qu.:-0.1838  
##                        Median :-0.3893   Median :-0.1280  
##                        Mean   : 0.0000   Mean   : 0.0000  
##                        3rd Qu.:-0.3175   3rd Qu.:-0.0399  
##                        Max.   : 5.0392   Max.   :77.0469  
##                                                           
##    kw_avg_min        kw_min_max         kw_max_max        kw_avg_max     
##  Min.   :-0.5048   Min.   :-0.23475   Min.   :-3.5073   Min.   :-1.9192  
##  1st Qu.:-0.2748   1st Qu.:-0.23475   1st Qu.: 0.4241   1st Qu.:-0.6398  
##  Median :-0.1238   Median :-0.21061   Median : 0.4241   Median :-0.1089  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.0719   3rd Qu.:-0.09851   3rd Qu.: 0.4241   3rd Qu.: 0.5307  
##  Max.   :68.4868   Max.   :14.30841   Max.   : 0.4241   Max.   : 4.3228  
##                                                                          
##    kw_min_avg         kw_max_avg         kw_avg_avg     
##  Min.   :-0.98302   Min.   :-0.92758   Min.   :-2.3790  
##  1st Qu.:-0.98214   1st Qu.:-0.34352   1st Qu.:-0.5716  
##  Median :-0.08221   Median :-0.21340   Median :-0.2016  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.82608   3rd Qu.: 0.05948   3rd Qu.: 0.3523  
##  Max.   : 2.19427   Max.   :47.99950   Max.   :30.6731  
##                                                         
##  self_reference_min_shares self_reference_max_shares
##  Min.   :-0.20258          Min.   :-0.25176         
##  1st Qu.:-0.17021          1st Qu.:-0.22495         
##  Median :-0.14179          Median :-0.18352         
##  Mean   : 0.00000          Mean   : 0.00000         
##  3rd Qu.:-0.07086          3rd Qu.:-0.05677         
##  Max.   :42.52066          Max.   :20.30270         
##                                                     
##  self_reference_avg_sharess weekday_is_monday weekday_is_tuesday
##  Min.   :-0.26441           0:32983           0:32254           
##  1st Qu.:-0.22388           1: 6661           1: 7390           
##  Median :-0.17354                                               
##  Mean   : 0.00000                                               
##  3rd Qu.:-0.04963                                               
##  Max.   :34.56639                                               
##                                                                 
##  weekday_is_wednesday weekday_is_thursday weekday_is_friday
##  0:32209              0:32377             0:33943          
##  1: 7435              1: 7267             1: 5701          
##                                                            
##                                                            
##                                                            
##                                                            
##                                                            
##  weekday_is_saturday weekday_is_sunday     LDA_00       
##  0:37191             0:36907           Min.   :-0.7020  
##  1: 2453             1: 2737           1st Qu.:-0.6067  
##                                        Median :-0.5750  
##                                        Mean   : 0.0000  
##                                        3rd Qu.: 0.2143  
##                                        Max.   : 2.8231  
##                                                         
##      LDA_01             LDA_02            LDA_03            LDA_04       
##  Min.   :-0.64293   Min.   :-0.7667   Min.   :-0.7581   Min.   :-0.8093  
##  1st Qu.:-0.52908   1st Qu.:-0.6654   1st Qu.:-0.6613   1st Qu.:-0.7105  
##  Median :-0.49116   Median :-0.6249   Median :-0.6225   Median :-0.6684  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04358   3rd Qu.: 0.4179   3rd Qu.: 0.5149   3rd Qu.: 0.5739  
##  Max.   : 3.57153   Max.   : 2.4940   Max.   : 2.3807   Max.   : 2.3970  
##                                                                          
##  global_subjectivity global_sentiment_polarity global_rate_positive_words
##  Min.   :-3.79973    Min.   :-5.293054         Min.   :-2.27354          
##  1st Qu.:-0.40454    1st Qu.:-0.635010         1st Qu.:-0.64495          
##  Median : 0.08645    Median :-0.001986         Median :-0.03454          
##  Mean   : 0.00000    Mean   : 0.000000         Mean   : 0.00000          
##  3rd Qu.: 0.55674    3rd Qu.: 0.603761         3rd Qu.: 0.61132          
##  Max.   : 4.77038    Max.   : 6.278010         Max.   : 6.64784          
##                                                                          
##  global_rate_negative_words rate_positive_words rate_negative_words
##  Min.   :-1.5342            Min.   :-3.5864     Min.   :-1.84388   
##  1st Qu.:-0.6462            1st Qu.:-0.4319     1st Qu.:-0.65799   
##  Median :-0.1177            Median : 0.1492     Median :-0.05081   
##  Mean   : 0.0000            Mean   : 0.0000     Mean   : 0.00000   
##  3rd Qu.: 0.4735            3rd Qu.: 0.6196     3rd Qu.: 0.61914   
##  Max.   :15.5451            Max.   : 1.6711     Max.   : 4.55997   
##                                                                    
##  avg_positive_polarity min_positive_polarity max_positive_polarity
##  Min.   :-3.38452      Min.   :-1.33837      Min.   :-3.0540      
##  1st Qu.:-0.45514      1st Qu.:-0.63725      1st Qu.:-0.6325      
##  Median : 0.04716      Median : 0.06386      Median : 0.1746      
##  Mean   : 0.00000      Mean   : 0.00000      Mean   : 0.0000      
##  3rd Qu.: 0.55100      3rd Qu.: 0.06386      3rd Qu.: 0.9818      
##  Max.   : 6.18100      Max.   :12.68394      Max.   : 0.9818      
##                                                                   
##  avg_negative_polarity min_negative_polarity max_negative_polarity
##  Min.   :-5.79739      Min.   :-1.64683      Min.   :-9.35799     
##  1st Qu.:-0.53911      1st Qu.:-0.61337      1st Qu.:-0.18349     
##  Median : 0.04847      Median : 0.07559      Median : 0.07864     
##  Mean   : 0.00000      Mean   : 0.00000      Mean   : 0.00000     
##  3rd Qu.: 0.56856      3rd Qu.: 0.76456      3rd Qu.: 0.60290     
##  Max.   : 2.03189      Max.   : 1.79801      Max.   : 1.12716     
##                                                                   
##  title_subjectivity title_sentiment_polarity abs_title_subjectivity
##  Min.   :-0.8708    Min.   :-4.0363          Min.   :-1.8107       
##  1st Qu.:-0.8708    1st Qu.:-0.2691          1st Qu.:-0.9279       
##  Median :-0.4082    Median :-0.2691          Median : 0.8377       
##  Mean   : 0.0000    Mean   : 0.0000          Mean   : 0.0000       
##  3rd Qu.: 0.6712    3rd Qu.: 0.2960          3rd Qu.: 0.8377       
##  Max.   : 2.2133    Max.   : 3.4981          Max.   : 0.8377       
##                                                                    
##  abs_title_sentiment_polarity     shares       data_channel_is_other
##  Min.   :-0.6896              Min.   :     1   0:37545              
##  1st Qu.:-0.6896              1st Qu.:   946   1: 2099              
##  Median :-0.6896              Median :  1400                        
##  Mean   : 0.0000              Mean   :  3395                        
##  3rd Qu.: 0.4151              3rd Qu.:  2800                        
##  Max.   : 3.7294              Max.   :843300                        
##                                                                     
##   division      
##  Mode :logical  
##  FALSE:25499    
##  TRUE :14145    
##                 
##                 
##                 
## 

Let's keep the 14145 rows marked TRUE and compare their summary statistics with those of the original dataset.

newdatascale<-datascale[datascale$division==TRUE,]
summary(newdatascale)
##                                                            url       
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ :    1  
##  http://mashable.com/2013/01/07/chuck-hagel-website/         :    1  
##  http://mashable.com/2013/01/07/cosmic-events-doomsday/      :    1  
##  http://mashable.com/2013/01/07/earth-size-planets-milky-way/:    1  
##  http://mashable.com/2013/01/07/felt-audio-pulse-speaker/    :    1  
##  http://mashable.com/2013/01/07/ford-glympse/                :    1  
##  (Other)                                                     :14139  
##    timedelta     n_tokens_title      n_tokens_content   
##  Min.   :  8.0   Min.   :-3.972849   Min.   :-1.160064  
##  1st Qu.:163.0   1st Qu.:-0.661648   1st Qu.:-0.637890  
##  Median :342.0   Median :-0.188620   Median :-0.294019  
##  Mean   :354.5   Mean   : 0.002766   Mean   :-0.003818  
##  3rd Qu.:545.0   3rd Qu.: 0.757438   3rd Qu.: 0.357637  
##  Max.   :731.0   Max.   : 4.068638   Max.   :15.320251  
##                                                         
##  n_unique_tokens       num_hrefs         num_self_hrefs     
##  Min.   :-0.155712   Min.   :-0.960437   Min.   :-0.854350  
##  1st Qu.:-0.021780   1st Qu.:-0.607455   1st Qu.:-0.594956  
##  Median :-0.002379   Median :-0.254473   Median :-0.335562  
##  Mean   :-0.004774   Mean   : 0.006377   Mean   : 0.000027  
##  3rd Qu.: 0.017442   3rd Qu.: 0.275000   3rd Qu.: 0.183226  
##  Max.   : 0.128322   Max.   :25.866207   Max.   :18.340797  
##                                                             
##     num_imgs           num_videos        average_token_length
##  Min.   :-0.546866   Min.   :-0.304264   Min.   :-5.386321   
##  1st Qu.:-0.426520   1st Qu.:-0.304264   1st Qu.:-0.081199   
##  Median :-0.426520   Median :-0.304264   Median : 0.137250   
##  Mean   : 0.001753   Mean   :-0.002848   Mean   : 0.000181   
##  3rd Qu.:-0.065485   3rd Qu.:-0.060828   3rd Qu.: 0.363095   
##  Max.   :12.811446   Max.   :17.953439   Max.   : 3.162214   
##                                                              
##   num_keywords       data_channel_is_lifestyle
##  Min.   :-3.260001   0:13374                  
##  1st Qu.:-0.641007   1:  771                  
##  Median :-0.117209                            
##  Mean   : 0.005252                            
##  3rd Qu.: 0.930389                            
##  Max.   : 1.454187                            
##                                               
##  data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
##  0:11627                       0:11942             0:13347               
##  1: 2518                       1: 2203             1:  798               
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  data_channel_is_tech data_channel_is_world   kw_min_min       
##  0:11550              0:11104               Min.   :-0.389280  
##  1: 2595              1: 3041               1st Qu.:-0.389280  
##                                             Median :-0.389280  
##                                             Mean   :-0.004804  
##                                             3rd Qu.:-0.317475  
##                                             Max.   : 3.847204  
##                                                                
##    kw_max_min         kw_avg_min         kw_min_max       
##  Min.   :-0.29911   Min.   :-0.50479   Min.   :-0.234752  
##  1st Qu.:-0.18454   1st Qu.:-0.27766   1st Qu.:-0.234752  
##  Median :-0.12855   Median :-0.12391   Median :-0.210609  
##  Mean   : 0.01112   Mean   : 0.01002   Mean   : 0.005899  
##  3rd Qu.:-0.03990   3rd Qu.: 0.07109   3rd Qu.:-0.098513  
##  Max.   :77.04685   Max.   :68.48678   Max.   :14.308406  
##                                                           
##    kw_max_max          kw_avg_max         kw_min_avg        
##  Min.   :-3.507303   Min.   :-1.91915   Min.   :-0.9830232  
##  1st Qu.: 0.424126   1st Qu.:-0.64013   1st Qu.:-0.9821441  
##  Median : 0.424126   Median :-0.11348   Median :-0.0858464  
##  Mean   : 0.000872   Mean   :-0.00275   Mean   :-0.0007414  
##  3rd Qu.: 0.424126   3rd Qu.: 0.53249   3rd Qu.: 0.8282670  
##  Max.   : 0.424126   Max.   : 4.32279   Max.   : 2.1917123  
##                                                             
##    kw_max_avg         kw_avg_avg        self_reference_min_shares
##  Min.   :-0.92758   Min.   :-2.378984   Min.   :-0.20258         
##  1st Qu.:-0.34260   1st Qu.:-0.575329   1st Qu.:-0.17097         
##  Median :-0.21287   Median :-0.197807   Median :-0.14179         
##  Mean   : 0.00423   Mean   : 0.005212   Mean   : 0.00193         
##  3rd Qu.: 0.06113   3rd Qu.: 0.353306   3rd Qu.:-0.06580         
##  Max.   :47.99950   Max.   :30.673132   Max.   :34.77444         
##                                                                  
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :-0.251763         Min.   :-0.264409          0:11752          
##  1st Qu.:-0.227389         1st Qu.:-0.224816          1: 2393          
##  Median :-0.183516         Median :-0.173543                           
##  Mean   : 0.008241         Mean   : 0.006025                           
##  3rd Qu.:-0.056772         3rd Qu.:-0.049634                           
##  Max.   :20.302705         Max.   :28.251163                           
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  0:11468            0:11526              0:11596            
##  1: 2677            1: 2619              1: 2549            
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday
##  0:12123           0:13276             0:13129          
##  1: 2022           1:  869             1: 1016          
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##      LDA_00              LDA_01              LDA_02         
##  Min.   :-0.632822   Min.   :-0.560156   Min.   :-0.702245  
##  1st Qu.:-0.606776   1st Qu.:-0.529085   1st Qu.:-0.673544  
##  Median :-0.575079   Median :-0.491170   Median :-0.624917  
##  Mean   :-0.008645   Mean   :-0.006069   Mean   : 0.002803  
##  3rd Qu.: 0.204995   3rd Qu.: 0.040190   3rd Qu.: 0.420783  
##  Max.   : 2.796469   Max.   : 3.544347   Max.   : 2.494028  
##                                                             
##      LDA_03             LDA_04          global_subjectivity
##  Min.   :-0.69645   Min.   :-0.746401   Min.   :-3.799731  
##  1st Qu.:-0.66126   1st Qu.:-0.710468   1st Qu.:-0.399456  
##  Median :-0.62254   Median :-0.669101   Median : 0.092228  
##  Mean   : 0.01211   Mean   :-0.002534   Mean   : 0.007891  
##  3rd Qu.: 0.56121   3rd Qu.: 0.573797   3rd Qu.: 0.564676  
##  Max.   : 2.35851   Max.   : 2.394678   Max.   : 4.770378  
##                                                            
##  global_sentiment_polarity global_rate_positive_words
##  Min.   :-5.15335          Min.   :-2.27354          
##  1st Qu.:-0.64523          1st Qu.:-0.66369          
##  Median :-0.02027          Median :-0.04814          
##  Mean   :-0.01105          Mean   :-0.00753          
##  3rd Qu.: 0.60796          3rd Qu.: 0.61188          
##  Max.   : 5.21704          Max.   : 5.55056          
##                                                      
##  global_rate_negative_words rate_positive_words rate_negative_words
##  Min.   :-1.534211          Min.   :-3.58637    Min.   :-1.84388   
##  1st Qu.:-0.637561          1st Qu.:-0.43190    1st Qu.:-0.64316   
##  Median :-0.108981          Median : 0.12637    Median :-0.01421   
##  Mean   : 0.007059          Mean   :-0.01122    Mean   : 0.01374   
##  3rd Qu.: 0.480242          3rd Qu.: 0.61959    3rd Qu.: 0.61914   
##  Max.   :11.379826          Max.   : 1.67108    Max.   : 4.55997   
##                                                                    
##  avg_positive_polarity min_positive_polarity max_positive_polarity
##  Min.   :-3.384518     Min.   :-1.338367     Min.   :-3.053959    
##  1st Qu.:-0.454255     1st Qu.:-0.637251     1st Qu.:-0.632512    
##  Median : 0.043393     Median : 0.063864     Median : 0.174637    
##  Mean   : 0.001005     Mean   : 0.003956     Mean   :-0.008165    
##  3rd Qu.: 0.561257     3rd Qu.: 0.063864     3rd Qu.: 0.981786    
##  Max.   : 4.963569     Max.   : 9.879480     Max.   : 0.981786    
##                                                                   
##  avg_negative_polarity min_negative_polarity max_negative_polarity
##  Min.   :-5.79739      Min.   :-1.646826     Min.   :-9.357993    
##  1st Qu.:-0.54558      1st Qu.:-0.613375     1st Qu.:-0.183488    
##  Median : 0.04722      Median : 0.075593     Median : 0.078641    
##  Mean   :-0.00382      Mean   : 0.003193     Mean   :-0.005856    
##  3rd Qu.: 0.56390      3rd Qu.: 0.764560     3rd Qu.: 0.602899    
##  Max.   : 2.03189      Max.   : 1.798011     Max.   : 1.127156    
##                                                                   
##  title_subjectivity  title_sentiment_polarity abs_title_subjectivity
##  Min.   :-0.870796   Min.   :-4.036257        Min.   :-1.81070      
##  1st Qu.:-0.870796   1st Qu.:-0.269073        1st Qu.:-0.92788      
##  Median :-0.485287   Median :-0.269073        Median : 0.83774      
##  Mean   :-0.001223   Mean   :-0.002584        Mean   : 0.00646      
##  3rd Qu.: 0.671237   3rd Qu.: 0.248915        3rd Qu.: 0.83774      
##  Max.   : 2.213270   Max.   : 3.498112        Max.   : 0.83774      
##                                                                     
##  abs_title_sentiment_polarity     shares       data_channel_is_other
##  Min.   :-0.689649            Min.   :     1   0:13374              
##  1st Qu.:-0.689649            1st Qu.:   942   1:  771              
##  Median :-0.689649            Median :  1400                        
##  Mean   : 0.002646            Mean   :  4404                        
##  3rd Qu.: 0.415107            3rd Qu.:  2900                        
##  Max.   : 3.729377            Max.   :843300                        
##                                                                     
##  division      
##  Mode:logical  
##  TRUE:14145    
##                
##                
##                
##                
## 

The summary statistics of the reduced dataset are roughly similar to those of the original, so we will carry out the analysis with these 14145 rows.
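
Beyond eyeballing summary(), the similarity of a continuous variable across the two datasets can be checked more formally, for instance with a two-sample Kolmogorov–Smirnov test. A minimal sketch on simulated data (the values are made up; in practice one would pass the same column, e.g. n_tokens_content, from the full and the reduced data):

```r
set.seed(277)

# Simulated stand-ins for the same column in the full and reduced data
full_col   <- rnorm(10000)
sample_col <- rnorm(3500)

# A large p-value gives no evidence that the two distributions differ
result <- ks.test(sample_col, full_col)
result$p.value
```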

datascale<-newdatascale

To build the predictive models we will discretize shares into 2 classes, setting the threshold at 1400 shares: a value at or above it indicates a popular page, a lower value an unpopular one. We will create a new factor column, popularity, with the values 'NP' (not popular) for shares < 1400 and 'P' (popular) for shares >= 1400.

datascale$popularity<-ifelse(datascale$shares<1400,'NP','P')
datascale$popularity<- factor(datascale$popularity)
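
The same discretization can be expressed with cut(), which makes the 1400 threshold and the labels explicit and returns a factor directly. A quick sketch on toy share counts (shares are integers, so breaking at 1399 reproduces the < 1400 rule):

```r
shares <- c(500, 1399, 1400, 2800)

# (-Inf, 1399] -> "NP", (1399, Inf] -> "P"
popularity <- cut(shares, breaks = c(-Inf, 1399, Inf), labels = c("NP", "P"))
as.character(popularity)  # "NP" "NP" "P" "P"
```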

To perform the modeling, we will split the dataset in 2 to obtain training and test records, stratifying on the new popularity variable, which will be the target variable.

set.seed(277)

library(caTools)
datascale$division <- sample.split(datascale$popularity, SplitRatio = 0.7)

summary(datascale)
##                                                            url       
##  http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/ :    1  
##  http://mashable.com/2013/01/07/chuck-hagel-website/         :    1  
##  http://mashable.com/2013/01/07/cosmic-events-doomsday/      :    1  
##  http://mashable.com/2013/01/07/earth-size-planets-milky-way/:    1  
##  http://mashable.com/2013/01/07/felt-audio-pulse-speaker/    :    1  
##  http://mashable.com/2013/01/07/ford-glympse/                :    1  
##  (Other)                                                     :14139  
##    timedelta     n_tokens_title      n_tokens_content   
##  Min.   :  8.0   Min.   :-3.972849   Min.   :-1.160064  
##  1st Qu.:163.0   1st Qu.:-0.661648   1st Qu.:-0.637890  
##  Median :342.0   Median :-0.188620   Median :-0.294019  
##  Mean   :354.5   Mean   : 0.002766   Mean   :-0.003818  
##  3rd Qu.:545.0   3rd Qu.: 0.757438   3rd Qu.: 0.357637  
##  Max.   :731.0   Max.   : 4.068638   Max.   :15.320251  
##                                                         
##  n_unique_tokens       num_hrefs         num_self_hrefs     
##  Min.   :-0.155712   Min.   :-0.960437   Min.   :-0.854350  
##  1st Qu.:-0.021780   1st Qu.:-0.607455   1st Qu.:-0.594956  
##  Median :-0.002379   Median :-0.254473   Median :-0.335562  
##  Mean   :-0.004774   Mean   : 0.006377   Mean   : 0.000027  
##  3rd Qu.: 0.017442   3rd Qu.: 0.275000   3rd Qu.: 0.183226  
##  Max.   : 0.128322   Max.   :25.866207   Max.   :18.340797  
##                                                             
##     num_imgs           num_videos        average_token_length
##  Min.   :-0.546866   Min.   :-0.304264   Min.   :-5.386321   
##  1st Qu.:-0.426520   1st Qu.:-0.304264   1st Qu.:-0.081199   
##  Median :-0.426520   Median :-0.304264   Median : 0.137250   
##  Mean   : 0.001753   Mean   :-0.002848   Mean   : 0.000181   
##  3rd Qu.:-0.065485   3rd Qu.:-0.060828   3rd Qu.: 0.363095   
##  Max.   :12.811446   Max.   :17.953439   Max.   : 3.162214   
##                                                              
##   num_keywords       data_channel_is_lifestyle
##  Min.   :-3.260001   0:13374                  
##  1st Qu.:-0.641007   1:  771                  
##  Median :-0.117209                            
##  Mean   : 0.005252                            
##  3rd Qu.: 0.930389                            
##  Max.   : 1.454187                            
##                                               
##  data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed
##  0:11627                       0:11942             0:13347               
##  1: 2518                       1: 2203             1:  798               
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  data_channel_is_tech data_channel_is_world   kw_min_min       
##  0:11550              0:11104               Min.   :-0.389280  
##  1: 2595              1: 3041               1st Qu.:-0.389280  
##                                             Median :-0.389280  
##                                             Mean   :-0.004804  
##                                             3rd Qu.:-0.317475  
##                                             Max.   : 3.847204  
##                                                                
##    kw_max_min         kw_avg_min         kw_min_max       
##  Min.   :-0.29911   Min.   :-0.50479   Min.   :-0.234752  
##  1st Qu.:-0.18454   1st Qu.:-0.27766   1st Qu.:-0.234752  
##  Median :-0.12855   Median :-0.12391   Median :-0.210609  
##  Mean   : 0.01112   Mean   : 0.01002   Mean   : 0.005899  
##  3rd Qu.:-0.03990   3rd Qu.: 0.07109   3rd Qu.:-0.098513  
##  Max.   :77.04685   Max.   :68.48678   Max.   :14.308406  
##                                                           
##    kw_max_max          kw_avg_max         kw_min_avg        
##  Min.   :-3.507303   Min.   :-1.91915   Min.   :-0.9830232  
##  1st Qu.: 0.424126   1st Qu.:-0.64013   1st Qu.:-0.9821441  
##  Median : 0.424126   Median :-0.11348   Median :-0.0858464  
##  Mean   : 0.000872   Mean   :-0.00275   Mean   :-0.0007414  
##  3rd Qu.: 0.424126   3rd Qu.: 0.53249   3rd Qu.: 0.8282670  
##  Max.   : 0.424126   Max.   : 4.32279   Max.   : 2.1917123  
##                                                             
##    kw_max_avg         kw_avg_avg        self_reference_min_shares
##  Min.   :-0.92758   Min.   :-2.378984   Min.   :-0.20258         
##  1st Qu.:-0.34260   1st Qu.:-0.575329   1st Qu.:-0.17097         
##  Median :-0.21287   Median :-0.197807   Median :-0.14179         
##  Mean   : 0.00423   Mean   : 0.005212   Mean   : 0.00193         
##  3rd Qu.: 0.06113   3rd Qu.: 0.353306   3rd Qu.:-0.06580         
##  Max.   :47.99950   Max.   :30.673132   Max.   :34.77444         
##                                                                  
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :-0.251763         Min.   :-0.264409          0:11752          
##  1st Qu.:-0.227389         1st Qu.:-0.224816          1: 2393          
##  Median :-0.183516         Median :-0.173543                           
##  Mean   : 0.008241         Mean   : 0.006025                           
##  3rd Qu.:-0.056772         3rd Qu.:-0.049634                           
##  Max.   :20.302705         Max.   :28.251163                           
##                                                                        
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  0:11468            0:11526              0:11596            
##  1: 2677            1: 2619              1: 2549            
##                                                             
##                                                             
##                                                             
##                                                             
##                                                             
##  weekday_is_friday weekday_is_saturday weekday_is_sunday
##  0:12123           0:13276             0:13129          
##  1: 2022           1:  869             1: 1016          
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##      LDA_00              LDA_01              LDA_02         
##  Min.   :-0.632822   Min.   :-0.560156   Min.   :-0.702245  
##  1st Qu.:-0.606776   1st Qu.:-0.529085   1st Qu.:-0.673544  
##  Median :-0.575079   Median :-0.491170   Median :-0.624917  
##  Mean   :-0.008645   Mean   :-0.006069   Mean   : 0.002803  
##  3rd Qu.: 0.204995   3rd Qu.: 0.040190   3rd Qu.: 0.420783  
##  Max.   : 2.796469   Max.   : 3.544347   Max.   : 2.494028  
##                                                             
##      LDA_03             LDA_04          global_subjectivity
##  Min.   :-0.69645   Min.   :-0.746401   Min.   :-3.799731  
##  1st Qu.:-0.66126   1st Qu.:-0.710468   1st Qu.:-0.399456  
##  Median :-0.62254   Median :-0.669101   Median : 0.092228  
##  Mean   : 0.01211   Mean   :-0.002534   Mean   : 0.007891  
##  3rd Qu.: 0.56121   3rd Qu.: 0.573797   3rd Qu.: 0.564676  
##  Max.   : 2.35851   Max.   : 2.394678   Max.   : 4.770378  
##                                                            
##  global_sentiment_polarity global_rate_positive_words
##  Min.   :-5.15335          Min.   :-2.27354          
##  1st Qu.:-0.64523          1st Qu.:-0.66369          
##  Median :-0.02027          Median :-0.04814          
##  Mean   :-0.01105          Mean   :-0.00753          
##  3rd Qu.: 0.60796          3rd Qu.: 0.61188          
##  Max.   : 5.21704          Max.   : 5.55056          
##                                                      
##  global_rate_negative_words rate_positive_words rate_negative_words
##  Min.   :-1.534211          Min.   :-3.58637    Min.   :-1.84388   
##  1st Qu.:-0.637561          1st Qu.:-0.43190    1st Qu.:-0.64316   
##  Median :-0.108981          Median : 0.12637    Median :-0.01421   
##  Mean   : 0.007059          Mean   :-0.01122    Mean   : 0.01374   
##  3rd Qu.: 0.480242          3rd Qu.: 0.61959    3rd Qu.: 0.61914   
##  Max.   :11.379826          Max.   : 1.67108    Max.   : 4.55997   
##                                                                    
##  avg_positive_polarity min_positive_polarity max_positive_polarity
##  Min.   :-3.384518     Min.   :-1.338367     Min.   :-3.053959    
##  1st Qu.:-0.454255     1st Qu.:-0.637251     1st Qu.:-0.632512    
##  Median : 0.043393     Median : 0.063864     Median : 0.174637    
##  Mean   : 0.001005     Mean   : 0.003956     Mean   :-0.008165    
##  3rd Qu.: 0.561257     3rd Qu.: 0.063864     3rd Qu.: 0.981786    
##  Max.   : 4.963569     Max.   : 9.879480     Max.   : 0.981786    
##                                                                   
##  avg_negative_polarity min_negative_polarity max_negative_polarity
##  Min.   :-5.79739      Min.   :-1.646826     Min.   :-9.357993    
##  1st Qu.:-0.54558      1st Qu.:-0.613375     1st Qu.:-0.183488    
##  Median : 0.04722      Median : 0.075593     Median : 0.078641    
##  Mean   :-0.00382      Mean   : 0.003193     Mean   :-0.005856    
##  3rd Qu.: 0.56390      3rd Qu.: 0.764560     3rd Qu.: 0.602899    
##  Max.   : 2.03189      Max.   : 1.798011     Max.   : 1.127156    
##                                                                   
##  title_subjectivity  title_sentiment_polarity abs_title_subjectivity
##  Min.   :-0.870796   Min.   :-4.036257        Min.   :-1.81070      
##  1st Qu.:-0.870796   1st Qu.:-0.269073        1st Qu.:-0.92788      
##  Median :-0.485287   Median :-0.269073        Median : 0.83774      
##  Mean   :-0.001223   Mean   :-0.002584        Mean   : 0.00646      
##  3rd Qu.: 0.671237   3rd Qu.: 0.248915        3rd Qu.: 0.83774      
##  Max.   : 2.213270   Max.   : 3.498112        Max.   : 0.83774      
##                                                                     
##  abs_title_sentiment_polarity     shares       data_channel_is_other
##  Min.   :-0.689649            Min.   :     1   0:13374              
##  1st Qu.:-0.689649            1st Qu.:   942   1:  771              
##  Median :-0.689649            Median :  1400                        
##  Mean   : 0.002646            Mean   :  4404                        
##  3rd Qu.: 0.415107            3rd Qu.:  2900                        
##  Max.   : 3.729377            Max.   :843300                        
##                                                                     
##   division       popularity
##  Mode :logical   NP:6566   
##  FALSE:4244      P :7579   
##  TRUE :9901                
##                            
##                            
##                            
## 

Let's see if the division is adequate and both subsets resemble each other and the original.

"Percentage of total P and NP:"
## [1] "Percentage of total P and NP:"
tab<-table(datascale$popularity)
prop.table(tab)
## 
##        NP         P 
## 0.4641923 0.5358077
"Percentage of P and NP in training set:"
## [1] "Percentage of P and NP in training set:"
tab<-table(datascale$popularity[datascale$division==TRUE])
prop.table(tab)
## 
##        NP         P 
## 0.4641955 0.5358045
"Percentage of P and NP in test set:"
## [1] "Percentage of P and NP in test set:"
tab<-table(datascale$popularity[datascale$division==FALSE])
prop.table(tab)
## 
##        NP         P 
## 0.4641847 0.5358153

We observe that the class percentages are very consistent across the full dataset, the training set, and the test set, so we can use this division to build our models.
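The division flag was created in an earlier step; as a sketch, a stratified split like this one can be reproduced with caret's createDataPartition, which samples within each class of the target and so preserves the P/NP proportions (assuming a 70/30 split, which matches the 9901/4244 counts above).

```r
# Sketch: building a stratified train/test flag with caret.
# createDataPartition samples within each level of the target,
# so the P/NP proportions are preserved in both subsets.
library(caret)

set.seed(277)
idx <- createDataPartition(datascale$popularity, p = 0.7, list = FALSE)
datascale$division <- FALSE
datascale$division[idx] <- TRUE

# The class proportions in the training subset mirror the full dataset
prop.table(table(datascale$popularity[datascale$division]))
```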

Let's separate the training and test datasets and then remove the division variable from both. We also remove url and timedelta, because they provide no predictive information, and shares, because as the original target variable it would interfere with the result.

datascale$url<-NULL
datascale$timedelta<-NULL

train<-datascale[datascale$division == TRUE,]
test <- datascale[datascale$division == FALSE,]

train$division<-NULL
test$division<-NULL
train$shares<-NULL
test$shares<-NULL

Decision Tree

We start with a decision tree model, using the rpart (CART) algorithm via caret and applying cost-complexity pruning.

library('caret')
set.seed(277)
model_prune <- train(
  x=train[,1:(ncol(train)-1)],y=train$popularity, method = "rpart",
  trControl = trainControl("cv", number = 10,classProbs = FALSE),
  tuneLength = 10
  )
#Show accuracy against the complexity parameter
plot(model_prune)

We show the best complexity parameter achieved by the model.

model_prune$bestTune
##            cp
## 1 0.002828547
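The same pruning can be done by hand with rpart: grow a deep tree with no complexity penalty and cut it back at the cross-validated cp found above. This is a sketch (assumes the rpart package; the cp value is taken from model_prune$bestTune), not how caret builds its final model internally.

```r
# Sketch: manual equivalent of caret's cp tuning with plain rpart.
library(rpart)

# Grow an unpruned tree (cp = 0 disables the complexity penalty)...
full_tree <- rpart(popularity ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0))

# ...then prune it at the cross-validated complexity parameter.
pruned <- prune(full_tree, cp = model_prune$bestTune$cp)
```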

We show the graph and the tree rules.

library(rpart.plot)
par(xpd = NA)
prp(model_prune$finalModel)

model_prune$finalModel
## n= 9901 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 9901 4596 P (0.4641955 0.5358045)  
##    2) kw_avg_avg< -0.1696958 5104 2199 NP (0.5691614 0.4308386)  
##      4) data_channel_is_tech=0 3924 1524 NP (0.6116208 0.3883792)  
##        8) data_channel_is_socmed=0 3725 1381 NP (0.6292617 0.3707383)  
##         16) kw_max_max>=-0.4576834 3015 1021 NP (0.6613599 0.3386401) *
##         17) kw_max_max< -0.4576834 710  350 P (0.4929577 0.5070423)  
##           34) weekday_is_saturday=0 692  342 NP (0.5057803 0.4942197)  
##             68) num_hrefs< 0.8485965 636  300 NP (0.5283019 0.4716981) *
##             69) num_hrefs>=0.8485965 56   14 P (0.2500000 0.7500000) *
##           35) weekday_is_saturday=1 18    0 P (0.0000000 1.0000000) *
##        9) data_channel_is_socmed=1 199   56 P (0.2814070 0.7185930) *
##      5) data_channel_is_tech=1 1180  505 P (0.4279661 0.5720339)  
##       10) kw_max_avg< -0.3832603 270  121 NP (0.5518519 0.4481481)  
##         20) kw_min_avg< 0.04304663 195   74 NP (0.6205128 0.3794872) *
##         21) kw_min_avg>=0.04304663 75   28 P (0.3733333 0.6266667) *
##       11) kw_max_avg>=-0.3832603 910  356 P (0.3912088 0.6087912) *
##    3) kw_avg_avg>=-0.1696958 4797 1691 P (0.3525120 0.6474880)  
##      6) data_channel_is_entertainment=1 948  473 NP (0.5010549 0.4989451)  
##       12) self_reference_min_shares< -0.1189926 616  269 NP (0.5633117 0.4366883)  
##         24) weekday_is_sunday=0 555  227 NP (0.5909910 0.4090090) *
##         25) weekday_is_sunday=1 61   19 P (0.3114754 0.6885246) *
##       13) self_reference_min_shares>=-0.1189926 332  128 P (0.3855422 0.6144578) *
##      7) data_channel_is_entertainment=0 3849 1216 P (0.3159262 0.6840738)  
##       14) self_reference_avg_sharess< -0.1697771 1489  582 P (0.3908664 0.6091336)  
##         28) LDA_02>=0.8527931 155   63 NP (0.5935484 0.4064516)  
##           56) num_imgs< -0.1256575 127   41 NP (0.6771654 0.3228346) *
##           57) num_imgs>=-0.1256575 28    6 P (0.2142857 0.7857143) *
##         29) LDA_02< 0.8527931 1334  490 P (0.3673163 0.6326837) *
##       15) self_reference_avg_sharess>=-0.1697771 2360  634 P (0.2686441 0.7313559) *

We see the tree graph and the generated rules. The tree first splits on kw_avg_avg (the average of the average keyword shares). At the next level it checks the channel where the news was published: on one branch whether it is a tech channel, on the other whether it is an entertainment channel, and from there it keeps discriminating on further variables.

For example, one of the paths is: if kw_avg_avg >= -0.17, the news was published in an entertainment channel, and the minimum number of shares of the referenced articles is >= -0.12, the page is classified as popular.

Let's run the prediction with the test dataset using this model.

predicted.classes <- predict( model_prune, test[,1:(ncol(test)-1)])

print(sprintf("The accuracy of the tree is: %.4f %%",100*sum(predicted.classes == test$popularity) / length(predicted.classes)))
## [1] "The accuracy of the tree is: 64.1376 %"

We see that the predictive accuracy of the tree is 64.14%.

We obtain the confusion matrix:

table(test$popularity,predicted.classes)
##     predicted.classes
##        NP    P
##   NP 1210  760
##   P   762 1512
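Accuracy alone can hide class-specific behavior. As a sketch (pure base R, with the counts copied from the confusion matrix above and taking P as the positive class), the usual metrics can be derived directly:

```r
# Deriving standard metrics from the 2x2 confusion matrix above.
# Rows are actual classes, columns are predicted classes.
cm <- matrix(c(1210, 762, 760, 1512), nrow = 2,
             dimnames = list(actual = c("NP", "P"), predicted = c("NP", "P")))

accuracy  <- sum(diag(cm)) / sum(cm)         # (TN + TP) / total
precision <- cm["P", "P"] / sum(cm[, "P"])   # TP / predicted positives
recall    <- cm["P", "P"] / sum(cm["P", ])   # TP / actual positives
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)
```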

Unsupervised Models

We are going to use the k-medoids algorithm via the pam function (Partitioning Around Medoids), which allows us to use different distance metrics.

We start by running the algorithm with a Euclidean distance.

library(cluster)

pameucl<-pam(train[,1:(ncol(train)-1)],2,metric="euclidean")

We evaluate the algorithm on the same training dataset: since clustering algorithms are not designed for predictive tasks, they do not provide a predict function.

conf_matrix<-table(pameucl$clustering,train$popularity)
conf_matrix
##    
##       NP    P
##   1 3145 3442
##   2 1451 1863
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 50.58075

We see that it reaches only 50.58% agreement with the classes, well below the tree.
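One caveat worth noting: cluster numbers 1 and 2 carry no inherent meaning, so agreement with the class labels should be taken as the best over the two possible relabelings. A minimal base-R check on the matrix above:

```r
# Cluster labels are arbitrary, so compute the agreement for both
# possible relabelings and keep the better one
# (counts copied from the confusion matrix above).
conf_matrix <- matrix(c(3145, 1451, 3442, 1863), nrow = 2,
                      dimnames = list(cluster = 1:2, class = c("NP", "P")))

acc_as_is   <- 100 * sum(diag(conf_matrix)) / sum(conf_matrix)
acc_swapped <- 100 * sum(diag(conf_matrix[, 2:1])) / sum(conf_matrix)
max(acc_as_is, acc_swapped)  # here the original labeling is already the better one
```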

Let's evaluate the quality of the partitions.

d<-daisy(train[,1:(ncol(train)-1)])


library(cluster)
library(factoextra)
sil<-silhouette(pameucl$clustering,d)
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)

In this case we see that many silhouette widths fall below zero, so this algorithm does not seem to produce cohesive, well-separated partitions, at least when forced into 2 clusters.

We repeat with the Manhattan distance.

library(cluster)
pammanh<-pam(train[,1:(ncol(train)-1)],2,metric="manhattan")
table(pammanh$clustering)
## 
##    1    2 
## 5587 4314
conf_matrix<-table(pammanh$clustering,train$popularity)
conf_matrix
##    
##       NP    P
##   1 2779 2808
##   2 1817 2497
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 53.28755

We see that we obtain somewhat higher agreement, reaching 53.29%.

We study the silhouettes:

library(cluster)
library(factoextra)
sil<-silhouette(pammanh$clustering,d)
fviz_silhouette(sil,label=FALSE,print.summary=FALSE)

As in the previous model, we see that we do not achieve well-cohesive groups when forcing 2 partitions with these algorithms.
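Rather than forcing k = 2, the average silhouette width can be compared across several values of k to look for a more natural partition. This is a sketch (assumes the cluster package and the daisy() dissimilarity d computed above; note it is computationally heavy on ~10,000 rows):

```r
# Sketch: pick k by average silhouette width.
# pam() accepts the precomputed dissimilarity directly with diss = TRUE.
library(cluster)

avg_sil <- sapply(2:6, function(k) {
  fit <- pam(d, k, diss = TRUE)
  fit$silinfo$avg.width   # mean silhouette width of the k-cluster solution
})
names(avg_sil) <- 2:6
avg_sil  # the k with the largest average width is the best-separated choice
```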

Supervised Models

We move on to the supervised models, starting with a logistic regression on the data.

We use the tic() and toc() functions from the pracma library to measure how long the logistic model takes to fit; this will let us compare performance with and without SVD.

library(pracma)
tic("Logit:")
logit<-glm(formula = popularity ~ ., family = binomial(link = "logit"), 
    data = train[,1:ncol(train)])
toc()
## elapsed time is 0.350000 seconds
summary(logit)
## 
## Call:
## glm(formula = popularity ~ ., family = binomial(link = "logit"), 
##     data = train[, 1:ncol(train)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7641  -1.0558   0.5594   1.0134   2.1526  
## 
## Coefficients: (3 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.879545   0.122491   7.181 6.95e-13 ***
## n_tokens_title                  0.033466   0.022558   1.484 0.137928    
## n_tokens_content                0.031245   0.037785   0.827 0.408289    
## n_unique_tokens                -2.342211   1.240585  -1.888 0.059027 .  
## num_hrefs                       0.118144   0.029309   4.031 5.55e-05 ***
## num_self_hrefs                 -0.086719   0.026527  -3.269 0.001079 ** 
## num_imgs                        0.031888   0.027089   1.177 0.239140    
## num_videos                     -0.033925   0.025017  -1.356 0.175080    
## average_token_length           -0.226168   0.072230  -3.131 0.001741 ** 
## num_keywords                    0.067116   0.026710   2.513 0.011979 *  
## data_channel_is_lifestyle1     -0.242671   0.149726  -1.621 0.105068    
## data_channel_is_entertainment1 -0.317804   0.093865  -3.386 0.000710 ***
## data_channel_is_bus1           -0.278576   0.147472  -1.889 0.058890 .  
## data_channel_is_socmed1         0.762917   0.146636   5.203 1.96e-07 ***
## data_channel_is_tech1           0.477328   0.138928   3.436 0.000591 ***
## data_channel_is_world1         -0.147545   0.139054  -1.061 0.288661    
## kw_min_min                      0.183267   0.044253   4.141 3.45e-05 ***
## kw_max_min                      0.236125   0.076744   3.077 0.002092 ** 
## kw_avg_min                     -0.285507   0.073322  -3.894 9.87e-05 ***
## kw_min_max                     -0.072657   0.024510  -2.964 0.003033 ** 
## kw_max_max                     -0.027829   0.047715  -0.583 0.559733    
## kw_avg_max                     -0.051563   0.042765  -1.206 0.227923    
## kw_min_avg                     -0.090672   0.033649  -2.695 0.007047 ** 
## kw_max_avg                     -0.586559   0.067550  -8.683  < 2e-16 ***
## kw_avg_avg                      0.979314   0.079906  12.256  < 2e-16 ***
## self_reference_min_shares       0.251175   0.080886   3.105 0.001901 ** 
## self_reference_max_shares       0.052557   0.073067   0.719 0.471952    
## self_reference_avg_sharess     -0.021476   0.109253  -0.197 0.844166    
## weekday_is_monday1             -0.770710   0.100580  -7.663 1.82e-14 ***
## weekday_is_tuesday1            -0.861900   0.099141  -8.694  < 2e-16 ***
## weekday_is_wednesday1          -0.908351   0.099443  -9.134  < 2e-16 ***
## weekday_is_thursday1           -0.839257   0.099766  -8.412  < 2e-16 ***
## weekday_is_friday1             -0.667387   0.102985  -6.480 9.15e-11 ***
## weekday_is_saturday1            0.132031   0.128103   1.031 0.302699    
## weekday_is_sunday1                    NA         NA      NA       NA    
## LDA_00                          0.199159   0.045982   4.331 1.48e-05 ***
## LDA_01                         -0.106461   0.041766  -2.549 0.010803 *  
## LDA_02                         -0.134101   0.047958  -2.796 0.005170 ** 
## LDA_03                         -0.125147   0.052985  -2.362 0.018181 *  
## LDA_04                                NA         NA      NA       NA    
## global_subjectivity             0.131618   0.036899   3.567 0.000361 ***
## global_sentiment_polarity       0.038739   0.059724   0.649 0.516574    
## global_rate_positive_words     -0.053122   0.047153  -1.127 0.259916    
## global_rate_negative_words     -0.001018   0.056486  -0.018 0.985616    
## rate_positive_words             0.252735   0.097770   2.585 0.009738 ** 
## rate_negative_words             0.205334   0.090707   2.264 0.023593 *  
## avg_positive_polarity           0.046422   0.052809   0.879 0.379376    
## min_positive_polarity          -0.102391   0.030658  -3.340 0.000838 ***
## max_positive_polarity          -0.092068   0.039731  -2.317 0.020489 *  
## avg_negative_polarity          -0.043859   0.059994  -0.731 0.464743    
## min_negative_polarity           0.019014   0.050049   0.380 0.704012    
## max_negative_polarity           0.004562   0.037635   0.121 0.903513    
## title_subjectivity             -0.008470   0.033457  -0.253 0.800134    
## title_sentiment_polarity        0.061587   0.024984   2.465 0.013697 *  
## abs_title_subjectivity          0.010106   0.025644   0.394 0.693519    
## abs_title_sentiment_polarity    0.011383   0.034051   0.334 0.738155    
## data_channel_is_other1                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13675  on 9900  degrees of freedom
## Residual deviance: 12263  on 9847  degrees of freedom
## AIC: 12371
## 
## Number of Fisher Scoring iterations: 5

Looking at the resulting coefficients, several of them are highly significant. Now let's run the prediction on the test data. Since the logistic model returns probabilities, we will take values > 0.5 as pages that will be popular and the rest as unpopular pages.

predict.logit<-predict(logit,newdata=test,type='response')
glm_prediction = ifelse(predict.logit>0.5,'P','NP')
conf_matrix<-table(glm_prediction,test$popularity)
conf_matrix
##               
## glm_prediction   NP    P
##             NP 1205  714
##             P   765 1560
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 65.1508

In this case we observe that for the moment it is the best model reaching a 65.15% accuracy rate.
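Accuracy at a fixed 0.5 cutoff hides how sensitive the model is to the threshold choice. As a sketch, an ROC curve and its AUC summarize performance over all cutoffs (assumes the pROC package and the predict.logit probabilities computed above):

```r
# Sketch: threshold-free evaluation of the logistic model with pROC.
library(pROC)

# "NP" is the control level and "P" the case level.
roc_obj <- roc(response = test$popularity, predictor = predict.logit,
               levels = c("NP", "P"))
auc(roc_obj)   # area under the curve: 0.5 = random, 1 = perfect
plot(roc_obj)  # sensitivity vs. specificity across all thresholds
```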

We are now going to apply the SVD method to obtain a reduced dataset, in order to compare the effectiveness of the models.

#We create a dataset with the continuous variables
datanum <- (datascale[,c(1:9,16:27,35:55)])


#We apply SVD and display the data.
datanum.svd<-svd(datanum)
"d"
## [1] "d"
datanum.svd$d
##  [1] 2.596782e+02 2.530943e+02 2.307989e+02 2.068307e+02 1.842217e+02
##  [6] 1.791178e+02 1.760621e+02 1.575913e+02 1.526338e+02 1.398370e+02
## [11] 1.340989e+02 1.315987e+02 1.271044e+02 1.237501e+02 1.165461e+02
## [16] 1.108530e+02 1.072220e+02 1.005792e+02 9.917303e+01 9.685914e+01
## [21] 9.324016e+01 9.171836e+01 8.681378e+01 8.591147e+01 8.069453e+01
## [26] 7.924628e+01 7.614582e+01 7.035524e+01 6.634008e+01 6.043407e+01
## [31] 5.831404e+01 5.140172e+01 4.266649e+01 3.806221e+01 3.699571e+01
## [36] 3.345848e+01 2.914867e+01 2.416142e+01 2.207835e+01 1.916849e+01
## [41] 2.161844e+00 4.760031e-03
"u"
## [1] "u"
datanum.svd$u[1:5,]
##               [,1]        [,2]         [,3]        [,4]         [,5]
## [1,]  8.474179e-03 0.008079700 -0.009693002 0.009105234 -0.004559022
## [2,]  1.038957e-02 0.006289088  0.002466005 0.016794292 -0.003163306
## [3,] -6.297939e-05 0.010933419  0.004489008 0.015856386 -0.001354212
## [4,]  5.849472e-03 0.007731413  0.005256751 0.014962748  0.006906951
## [5,] -9.503979e-03 0.017174825 -0.008828758 0.007019208  0.003609519
##              [,6]         [,7]          [,8]         [,9]         [,10]
## [1,]  0.007226643 -0.001884535  0.0042915846 -0.017627939  0.0003603647
## [2,]  0.008988401  0.001297720  0.0009061188 -0.007334306  0.0053942348
## [3,]  0.003140518  0.002411026  0.0071538148 -0.004324472 -0.0040684787
## [4,]  0.007174044  0.008848170 -0.0053930446 -0.009860199  0.0068677606
## [5,] -0.018368086  0.017539816  0.0154743975  0.001335152  0.0041225923
##              [,11]        [,12]        [,13]        [,14]         [,15]
## [1,]  0.0039004145  0.004936470  0.002215434 -0.011899893 -0.0055226990
## [2,]  0.0001152835  0.011917052 -0.003835803  0.007224432  0.0016276479
## [3,]  0.0050747695  0.007461346 -0.003871376  0.004200978  0.0100594438
## [4,]  0.0047457141  0.008824711 -0.002744474  0.010127089  0.0001342847
## [5,] -0.0172380132 -0.012027651 -0.011031114 -0.003976230 -0.0120551978
##             [,16]         [,17]         [,18]         [,19]        [,20]
## [1,] -0.000015124  0.0012275351 -0.0007308813  0.0088210900  0.008710317
## [2,] -0.003993509 -0.0009475676  0.0045604629  0.0037604752  0.010817725
## [3,]  0.006271079  0.0092262992 -0.0083708614 -0.0001654949 -0.003966483
## [4,]  0.005199082 -0.0164739853 -0.0020013972  0.0029761297  0.002852818
## [5,] -0.009684398 -0.0139561749 -0.0134461229  0.0172416368  0.025383085
##              [,21]        [,22]         [,23]        [,24]        [,25]
## [1,] -0.0008102396 -0.006748125  0.0029034781  0.003591092  0.011470259
## [2,] -0.0044325270 -0.001530149 -0.0010068089  0.002261690  0.004816694
## [3,] -0.0024950028  0.003310255 -0.0083926890 -0.003564886 -0.002293547
## [4,] -0.0085145953 -0.005032175 -0.0005624262  0.003002982  0.006233457
## [5,] -0.0094081141  0.021090796 -0.0034651220  0.007640199 -0.004159618
##              [,26]        [,27]        [,28]        [,29]      [,30]
## [1,] -1.402691e-05  0.003146952  0.003066231 -0.005250015 0.02873764
## [2,]  3.316266e-03 -0.008761224  0.002712635 -0.011310352 0.01974892
## [3,]  1.117306e-02  0.009038712  0.006292482 -0.001414496 0.01282847
## [4,]  8.430057e-03 -0.012473031  0.006056042  0.006398241 0.01840958
## [5,] -3.216749e-03 -0.010951087 -0.015777807  0.004219008 0.01213607
##              [,31]         [,32]      [,33]         [,34]       [,35]
## [1,] -3.606729e-05 -0.0008944496 0.05576603 -0.0008911391 0.012876513
## [2,]  2.657058e-04  0.0075175455 0.05796864  0.0033542159 0.006767369
## [3,] -5.539388e-04 -0.0039071984 0.06088392  0.0028363995 0.011805009
## [4,]  1.651099e-03  0.0074245570 0.05944609  0.0101394976 0.006532310
## [5,]  8.440541e-03  0.0036476529 0.06057276 -0.0025637345 0.012897826
##             [,36]        [,37]        [,38]        [,39]        [,40]
## [1,] -0.001051347 -0.005638723 -0.005238395 -0.002156088 -0.004314705
## [2,] -0.006127420 -0.008578038 -0.003908962 -0.001695845 -0.007080223
## [3,] -0.001481918 -0.005834722 -0.012182896 -0.003789634 -0.001670958
## [4,] -0.012681872 -0.003496313 -0.006511559 -0.002951859  0.001895029
## [5,]  0.008860331 -0.002030300 -0.013504874 -0.001456773 -0.007647298
##              [,41]       [,42]
## [1,] -0.0009605776 0.008749506
## [2,] -0.0022078429 0.009019184
## [3,] -0.0032563586 0.009384242
## [4,]  0.0122139336 0.004732981
## [5,]  0.0019817326 0.008232328
"v"
## [1] "v"
datanum.svd$v[1:5,]
##              [,1]         [,2]         [,3]          [,4]         [,5]
## [1,]  0.025227076 -0.032697512 5.136301e-02 -0.0581870583  0.081661759
## [2,] -0.138246486  0.092560975 1.165227e-01  0.0829187899  0.027496268
## [3,] -0.006356406  0.004057403 7.890069e-05  0.0006009658 -0.003169818
## [4,] -0.173302173  0.044862694 1.021083e-01 -0.0138737246  0.026709426
## [5,] -0.118805464  0.069250555 2.470946e-02  0.0064507175 -0.020787458
##             [,6]         [,7]         [,8]         [,9]        [,10]
## [1,] -0.01886185  0.018799858 -0.149344207  0.084890438 -0.124581632
## [2,] -0.42889078  0.128305465  0.001539783 -0.187479176 -0.007797895
## [3,]  0.01346192 -0.007586421 -0.003646484  0.003714064  0.003578140
## [4,] -0.34198290  0.108884647  0.191988647  0.028107349 -0.027239147
## [5,] -0.30398631  0.123840342  0.148683175 -0.011160306 -0.033660528
##              [,11]         [,12]         [,13]        [,14]        [,15]
## [1,]  0.2561483274 -0.0523054447 -0.2796331388  0.343038139 -0.678467550
## [2,]  0.0112902831  0.0422143250 -0.0630508738  0.049522196  0.028779106
## [3,] -0.0004902339 -0.0001171761 -0.0002540196 -0.000906006 -0.003562594
## [4,] -0.1325059597  0.2305854118 -0.0082068666 -0.125667409 -0.127666930
## [5,] -0.1768113779 -0.1600904356 -0.0930413840 -0.035432880 -0.341247854
##             [,16]        [,17]        [,18]       [,19]        [,20]
## [1,]  0.341870625  0.072729442  0.154295673 -0.11981656 -0.043375251
## [2,]  0.168200283  0.166076622 -0.122681222  0.12216379 -0.125656200
## [3,] -0.008240845 -0.007719391  0.006261413 -0.00419713  0.007483101
## [4,] -0.209080808 -0.129269838 -0.007312209 -0.16870180 -0.214482025
## [5,] -0.506048124 -0.085408014  0.057905037 -0.02503298  0.127046988
##             [,21]        [,22]        [,23]         [,24]        [,25]
## [1,] -0.084778131 -0.167787318  0.057509332  0.0218303100  0.030011908
## [2,]  0.128567994  0.114270007  0.085701032  0.0035703583 -0.281922621
## [3,]  0.002061368 -0.008066023  0.004630831  0.0005913157 -0.001108898
## [4,]  0.297354415 -0.401714131 -0.271499888 -0.1336835314  0.281053005
## [5,] -0.451290382  0.174485154 -0.164054917  0.1356755053 -0.157185467
##             [,26]        [,27]        [,28]        [,29]        [,30]
## [1,]  0.040601152 -0.094620468 -0.007696249 -0.014418779  0.039604710
## [2,]  0.006356591  0.087381363  0.642047181  0.198554876 -0.013846631
## [3,] -0.002439620  0.005348887 -0.003015765 -0.008360102 -0.001697863
## [4,]  0.210135011 -0.277577107  0.005364675 -0.057518124 -0.039651802
## [5,] -0.252450718  0.075422582 -0.065022969  0.044364710  0.011956800
##             [,31]         [,32]         [,33]         [,34]         [,35]
## [1,]  0.002929870  0.0011377400 -0.0006836195 -0.0079184605 -0.0134180687
## [2,] -0.112010940  0.0488705607  0.0118542824  0.0673414947 -0.0331112205
## [3,]  0.001287894  0.0037107881  0.0003464149  0.0001536687  0.0002638498
## [4,]  0.010824662 -0.0272014611  0.0123458061  0.0337237657  0.0120722701
## [5,] -0.008062561  0.0001922159 -0.0200877547 -0.0079844699 -0.0131771982
##              [,36]        [,37]        [,38]         [,39]       [,40]
## [1,]  2.566376e-03  0.002428335  0.012217254  0.0008316503 -0.01546642
## [2,]  6.729817e-02 -0.043327542  0.017621452  0.0025713634 -0.03283003
## [3,]  1.040524e-05  0.003684227 -0.002507583 -0.0010774284  0.02784410
## [4,] -1.710710e-02  0.010112608 -0.025594205  0.0001165700  0.04204715
## [5,] -6.229666e-03  0.000290615  0.003209535  0.0348461130 -0.01666532
##              [,41]         [,42]
## [1,] -0.0002961890 -2.321305e-08
## [2,] -0.0154374966  9.294279e-06
## [3,] -0.9991890687  6.154922e-04
## [4,] -0.0006005106  2.731273e-08
## [5,]  0.0004486355 -2.361950e-07

The following graph shows the proportion of variance explained by each value of "d"; they appear in decreasing order. From roughly the 30th singular value onwards the explained variability is very close to 0, so those components contribute little. This means we can drop them without fear of losing much information when applying a model.

plot(prop.table(datanum.svd$d^2),ylab="Percent variability explained")

We can see that the first 20 columns give an explained variability of 87%, so we can dispense with the remaining singular values without losing much information.

sum(prop.table(datanum.svd$d^2)[1:20])
## [1] 0.8712191
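Instead of checking one k at a time, the cutoff can be read directly from the cumulative proportion of variance (pure base R, using the datanum.svd object from above):

```r
# Sketch: pick the number of components from the cumulative variance.
var_explained <- prop.table(datanum.svd$d^2)
cum_var <- cumsum(var_explained)

k <- which(cum_var >= 0.87)[1]  # smallest k reaching ~87% of the variance
k
```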

We obtain the reduced dataset with 20 variables instead of 42.

datareduced=datanum.svd$u[,1:20] %*% diag(datanum.svd$d[1:20]) 
datareduced[1:5,]
##             [,1]     [,2]       [,3]     [,4]       [,5]       [,6]
## [1,]  2.20055960 2.044926 -2.2371338 1.883242 -0.8398708  1.2944206
## [2,]  2.69794575 1.591732  0.5691511 3.473575 -0.5827496  1.6099829
## [3,] -0.01635437 2.767186  1.0360579 3.279588 -0.2494753  0.5625227
## [4,]  1.51898040 1.956776  1.2132522 3.094756  1.2724103  1.2849992
## [5,] -2.46797621 4.346850 -2.0376673 1.451788  0.6649517 -3.2900518
##            [,7]       [,8]       [,9]       [,10]       [,11]      [,12]
## [1,] -0.3317951  0.6763164 -2.6906192  0.05039231  0.52304129  0.6496331
## [2,]  0.2284792  0.1427964 -1.1194630  0.75431348  0.01545939  1.5682687
## [3,]  0.4244903  1.1273790 -0.6600606 -0.56892375  0.68052100  0.9819035
## [4,]  1.5578272 -0.8498969 -1.5049995  0.96036687  0.63639504  1.1613206
## [5,]  3.0880964  2.4386305  0.2037894  0.57649083 -2.31159859 -1.5828234
##           [,13]      [,14]       [,15]        [,16]      [,17]       [,18]
## [1,]  0.2815915 -1.4726128 -0.64364905 -0.001676542  0.1316188 -0.07351144
## [2,] -0.4875476  0.8940241  0.18969602 -0.442692509 -0.1016001  0.45868757
## [3,] -0.4920691  0.5198714  1.17238899  0.695168025  0.9892626 -0.84193429
## [4,] -0.3488348  1.2532282  0.01565036  0.576333988 -1.7663742 -0.20129886
## [5,] -1.4021036 -0.4920588 -1.40498634 -1.073544860 -1.4964094 -1.35239987
##            [,19]      [,20]
## [1,]  0.87481423  0.8436738
## [2,]  0.37293772  1.0477955
## [3,] -0.01641263 -0.3841901
## [4,]  0.29515180  0.2763215
## [5,]  1.70990537  2.4585839
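The reduced scores U[,1:k] %*% diag(d[1:k]) are the same thing as projecting the original matrix onto the first k right singular vectors (since X = U D V^T implies X V = U D); this is also how new observations would be mapped into the reduced space. A quick base-R check using the datanum and datanum.svd objects from above:

```r
# Sketch: the two ways of computing the reduced scores coincide.
k <- 20
projected <- as.matrix(datanum) %*% datanum.svd$v[, 1:k]

max(abs(projected - datareduced))  # numerically zero
```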

Next, we convert the result to a data frame and add back the binary variables that we had removed.

datareduced<-as.data.frame(datareduced)
summary(datareduced)
##        V1                  V2                  V3           
##  Min.   :-54.08354   Min.   :-93.43009   Min.   :-25.53550  
##  1st Qu.: -1.11052   1st Qu.: -0.42645   1st Qu.: -1.25335  
##  Median : -0.09406   Median :  0.26744   Median : -0.00099  
##  Mean   : -0.00428   Mean   : -0.02186   Mean   :  0.01753  
##  3rd Qu.:  0.88382   3rd Qu.:  0.86922   3rd Qu.:  1.25108  
##  Max.   :  9.12531   Max.   :  4.34685   Max.   : 10.79651  
##        V4                  V5                   V6            
##  Min.   :-16.59956   Min.   :-30.835767   Min.   :-11.158572  
##  1st Qu.: -0.95477   1st Qu.: -0.942126   1st Qu.: -0.786552  
##  Median :  0.08090   Median : -0.165464   Median : -0.019286  
##  Mean   :  0.00674   Mean   : -0.005707   Mean   :  0.006638  
##  3rd Qu.:  1.05642   3rd Qu.:  0.966642   3rd Qu.:  0.824490  
##  Max.   : 35.11486   Max.   :  4.495073   Max.   : 11.946039  
##        V7                  V8                   V9          
##  Min.   :-8.752088   Min.   :-10.860565   Min.   :-7.46464  
##  1st Qu.:-0.836347   1st Qu.: -0.936993   1st Qu.:-0.71057  
##  Median :-0.115613   Median : -0.197204   Median : 0.13105  
##  Mean   : 0.005024   Mean   :  0.004595   Mean   : 0.01229  
##  3rd Qu.: 0.704222   3rd Qu.:  0.838462   3rd Qu.: 0.84251  
##  Max.   :28.052381   Max.   :  6.125843   Max.   : 5.78144  
##       V10                 V11                 V12           
##  Min.   :-9.577826   Min.   :-6.713680   Min.   :-3.474411  
##  1st Qu.:-0.623855   1st Qu.:-0.685392   1st Qu.:-0.814898  
##  Median : 0.061604   Median :-0.011178   Median : 0.041616  
##  Mean   : 0.004169   Mean   :-0.006783   Mean   : 0.004939  
##  3rd Qu.: 0.684528   3rd Qu.: 0.679685   3rd Qu.: 0.811483  
##  Max.   : 6.231708   Max.   : 6.908861   Max.   : 5.119551  
##       V13                 V14                 V15           
##  Min.   :-6.014161   Min.   :-5.001655   Min.   :-6.243757  
##  1st Qu.:-0.495247   1st Qu.:-0.688332   1st Qu.:-0.639673  
##  Median : 0.146845   Median : 0.073099   Median : 0.030930  
##  Mean   : 0.001888   Mean   : 0.001817   Mean   :-0.003354  
##  3rd Qu.: 0.687477   3rd Qu.: 0.681013   3rd Qu.: 0.671317  
##  Max.   : 5.524902   Max.   : 7.026784   Max.   : 3.304626  
##       V16                  V17                 V18           
##  Min.   :-13.280350   Min.   :-4.643394   Min.   :-8.033320  
##  1st Qu.: -0.475241   1st Qu.:-0.500606   1st Qu.:-0.447538  
##  Median :  0.048955   Median : 0.059384   Median : 0.016957  
##  Mean   : -0.000829   Mean   :-0.001992   Mean   :-0.001881  
##  3rd Qu.:  0.552918   3rd Qu.: 0.530947   3rd Qu.: 0.528577  
##  Max.   :  7.709455   Max.   : 4.902544   Max.   : 7.259807  
##       V19                 V20          
##  Min.   :-7.575508   Min.   :-6.50378  
##  1st Qu.:-0.514284   1st Qu.:-0.45468  
##  Median : 0.021577   Median : 0.01933  
##  Mean   :-0.005453   Mean   : 0.00600  
##  3rd Qu.: 0.516713   3rd Qu.: 0.47236  
##  Max.   : 5.017422   Max.   : 4.97175
datareduced$popularity<-datascale$popularity
datareduced$division<-datascale$division

datareduced$weekday_is_monday<-datascale$weekday_is_monday
datareduced$weekday_is_tuesday<-datascale$weekday_is_tuesday
datareduced$weekday_is_wednesday<-datascale$weekday_is_wednesday
datareduced$weekday_is_thursday<-datascale$weekday_is_thursday
datareduced$weekday_is_friday<-datascale$weekday_is_friday
datareduced$weekday_is_saturday<-datascale$weekday_is_saturday
datareduced$weekday_is_sunday<-datascale$weekday_is_sunday

datareduced$data_channel_is_lifestyle<-datascale$data_channel_is_lifestyle
datareduced$data_channel_is_entertainment<-datascale$data_channel_is_entertainment
datareduced$data_channel_is_bus<-datascale$data_channel_is_bus
datareduced$data_channel_is_socmed<-datascale$data_channel_is_socmed
datareduced$data_channel_is_tech<-datascale$data_channel_is_tech
datareduced$data_channel_is_world<-datascale$data_channel_is_world
datareduced$data_channel_is_other<-datascale$data_channel_is_other


summary(datareduced)
##        V1                  V2                  V3           
##  Min.   :-54.08354   Min.   :-93.43009   Min.   :-25.53550  
##  1st Qu.: -1.11052   1st Qu.: -0.42645   1st Qu.: -1.25335  
##  Median : -0.09406   Median :  0.26744   Median : -0.00099  
##  Mean   : -0.00428   Mean   : -0.02186   Mean   :  0.01753  
##  3rd Qu.:  0.88382   3rd Qu.:  0.86922   3rd Qu.:  1.25108  
##  Max.   :  9.12531   Max.   :  4.34685   Max.   : 10.79651  
##        V4                  V5                   V6            
##  Min.   :-16.59956   Min.   :-30.835767   Min.   :-11.158572  
##  1st Qu.: -0.95477   1st Qu.: -0.942126   1st Qu.: -0.786552  
##  Median :  0.08090   Median : -0.165464   Median : -0.019286  
##  Mean   :  0.00674   Mean   : -0.005707   Mean   :  0.006638  
##  3rd Qu.:  1.05642   3rd Qu.:  0.966642   3rd Qu.:  0.824490  
##  Max.   : 35.11486   Max.   :  4.495073   Max.   : 11.946039  
##        V7                  V8                   V9          
##  Min.   :-8.752088   Min.   :-10.860565   Min.   :-7.46464  
##  1st Qu.:-0.836347   1st Qu.: -0.936993   1st Qu.:-0.71057  
##  Median :-0.115613   Median : -0.197204   Median : 0.13105  
##  Mean   : 0.005024   Mean   :  0.004595   Mean   : 0.01229  
##  3rd Qu.: 0.704222   3rd Qu.:  0.838462   3rd Qu.: 0.84251  
##  Max.   :28.052381   Max.   :  6.125843   Max.   : 5.78144  
##       V10                 V11                 V12           
##  Min.   :-9.577826   Min.   :-6.713680   Min.   :-3.474411  
##  1st Qu.:-0.623855   1st Qu.:-0.685392   1st Qu.:-0.814898  
##  Median : 0.061604   Median :-0.011178   Median : 0.041616  
##  Mean   : 0.004169   Mean   :-0.006783   Mean   : 0.004939  
##  3rd Qu.: 0.684528   3rd Qu.: 0.679685   3rd Qu.: 0.811483  
##  Max.   : 6.231708   Max.   : 6.908861   Max.   : 5.119551  
##       V13                 V14                 V15           
##  Min.   :-6.014161   Min.   :-5.001655   Min.   :-6.243757  
##  1st Qu.:-0.495247   1st Qu.:-0.688332   1st Qu.:-0.639673  
##  Median : 0.146845   Median : 0.073099   Median : 0.030930  
##  Mean   : 0.001888   Mean   : 0.001817   Mean   :-0.003354  
##  3rd Qu.: 0.687477   3rd Qu.: 0.681013   3rd Qu.: 0.671317  
##  Max.   : 5.524902   Max.   : 7.026784   Max.   : 3.304626  
##       V16                  V17                 V18           
##  Min.   :-13.280350   Min.   :-4.643394   Min.   :-8.033320  
##  1st Qu.: -0.475241   1st Qu.:-0.500606   1st Qu.:-0.447538  
##  Median :  0.048955   Median : 0.059384   Median : 0.016957  
##  Mean   : -0.000829   Mean   :-0.001992   Mean   :-0.001881  
##  3rd Qu.:  0.552918   3rd Qu.: 0.530947   3rd Qu.: 0.528577  
##  Max.   :  7.709455   Max.   : 4.902544   Max.   : 7.259807  
##       V19                 V20           popularity  division      
##  Min.   :-7.575508   Min.   :-6.50378   NP:6566    Mode :logical  
##  1st Qu.:-0.514284   1st Qu.:-0.45468   P :7579    FALSE:4244     
##  Median : 0.021577   Median : 0.01933              TRUE :9901     
##  Mean   :-0.005453   Mean   : 0.00600                             
##  3rd Qu.: 0.516713   3rd Qu.: 0.47236                             
##  Max.   : 5.017422   Max.   : 4.97175                             
##  weekday_is_monday weekday_is_tuesday weekday_is_wednesday
##  0:11752           0:11752            0:11752             
##  1: 2393           1: 2393            1: 2393             
##                                                           
##                                                           
##                                                           
##                                                           
##  weekday_is_thursday weekday_is_friday weekday_is_saturday
##  0:11752             0:11752           0:11752            
##  1: 2393             1: 2393           1: 2393            
##                                                           
##                                                           
##                                                           
##                                                           
##  weekday_is_sunday data_channel_is_lifestyle data_channel_is_entertainment
##  0:11752           0:13374                   0:11627                      
##  1: 2393           1:  771                   1: 2518                      
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  data_channel_is_bus data_channel_is_socmed data_channel_is_world
##  0:11942             0:13347                0:11104              
##  1: 2203             1:  798                1: 3041              
##                                                                  
##                                                                  
##                                                                  
##                                                                  
##  data_channel_is_other
##  0:13374              
##  1:  771              
##                       
##                       
##                       
## 

We separate the training and testing sets and remove the division column from the reduced dataset.

traindr <- datareduced[datareduced$division == TRUE, ]
testdr  <- datareduced[datareduced$division == FALSE, ]
traindr$division <- NULL
testdr$division <- NULL
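The split above reuses a precomputed logical `division` column. As a minimal alternative sketch, the same kind of reproducible split can be built directly with `sample()`; the helper `split_data` and the 70/30 proportion are illustrative, not part of the original analysis.

```r
# A reproducible random train/test split, as an alternative to a
# precomputed division column (70/30 proportion is illustrative)
split_data <- function(df, train_frac = 0.7, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(df), size = round(train_frac * nrow(df)))
  list(train = df[idx, , drop = FALSE], test = df[-idx, , drop = FALSE])
}

# Usage: parts <- split_data(datareduced); traindr <- parts$train; testdr <- parts$test
```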

We fit the logistic model, also measuring the time taken by the logit fit.

tic("Logit start:")
logitsvd<-glm(formula = popularity ~ ., family = binomial(link = "logit"), 
    data = traindr)
toc(echo=TRUE)
## elapsed time is 0.150000 seconds
summary(logitsvd)
## 
## Call:
## glm(formula = popularity ~ ., family = binomial(link = "logit"), 
##     data = traindr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2854  -1.0930   0.6587   1.0366   1.7874  
## 
## Coefficients: (7 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     0.51244    0.04817  10.639  < 2e-16 ***
## V1                             -0.10284    0.01145  -8.978  < 2e-16 ***
## V2                             -0.04157    0.01133  -3.670 0.000242 ***
## V3                             -0.02350    0.01206  -1.949 0.051273 .  
## V4                             -0.12063    0.01501  -8.035 9.39e-16 ***
## V5                             -0.04888    0.02018  -2.422 0.015433 *  
## V6                             -0.02929    0.01539  -1.903 0.057029 .  
## V7                              0.10828    0.02059   5.260 1.44e-07 ***
## V8                              0.10901    0.02020   5.398 6.75e-08 ***
## V9                             -0.07211    0.02237  -3.224 0.001266 ** 
## V10                            -0.06537    0.01894  -3.452 0.000556 ***
## V11                            -0.05343    0.02089  -2.557 0.010545 *  
## V12                            -0.06214    0.02613  -2.378 0.017417 *  
## V13                             0.13936    0.02406   5.792 6.95e-09 ***
## V14                            -0.16460    0.02761  -5.963 2.48e-09 ***
## V15                            -0.05259    0.02503  -2.101 0.035663 *  
## V16                             0.19763    0.02428   8.141 3.92e-16 ***
## V17                             0.01803    0.02433   0.741 0.458451    
## V18                             0.16194    0.02572   6.295 3.07e-10 ***
## V19                            -0.11853    0.02643  -4.485 7.29e-06 ***
## V20                            -0.16759    0.02699  -6.208 5.36e-10 ***
## weekday_is_monday1             -0.07794    0.05673  -1.374 0.169501    
## weekday_is_tuesday1                  NA         NA      NA       NA    
## weekday_is_wednesday1                NA         NA      NA       NA    
## weekday_is_thursday1                 NA         NA      NA       NA    
## weekday_is_friday1                   NA         NA      NA       NA    
## weekday_is_saturday1                 NA         NA      NA       NA    
## weekday_is_sunday1                   NA         NA      NA       NA    
## data_channel_is_lifestyle1     -0.41973    0.10334  -4.062 4.87e-05 ***
## data_channel_is_entertainment1 -0.66880    0.07741  -8.639  < 2e-16 ***
## data_channel_is_bus1           -0.65253    0.10965  -5.951 2.66e-09 ***
## data_channel_is_socmed1         0.49320    0.11696   4.217 2.48e-05 ***
## data_channel_is_world1         -0.54174    0.09919  -5.462 4.72e-08 ***
## data_channel_is_other1               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13675  on 9900  degrees of freedom
## Residual deviance: 12648  on 9874  degrees of freedom
## AIC: 12702
## 
## Number of Fisher Scoring iterations: 4
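The seven `NA` coefficients reported above are a consequence of the dummy-variable trap: the seven weekday indicators always sum to one (every article is published on exactly one weekday), and likewise the channel indicators including `data_channel_is_other`, so `glm` drops the aliased columns. A minimal toy illustration of the same effect (the data and variable names here are hypothetical, not from the dataset):

```r
# Toy data reproducing the aliasing: three dummies that always sum to 1
set.seed(1)
d <- data.frame(day = sample(c("mon", "tue", "wed"), 100, replace = TRUE),
                y   = rbinom(100, 1, 0.5))
d$is_mon <- as.integer(d$day == "mon")
d$is_tue <- as.integer(d$day == "tue")
d$is_wed <- as.integer(d$day == "wed")  # = 1 - is_mon - is_tue

fit <- glm(y ~ is_mon + is_tue + is_wed, family = binomial, data = d)
coef(fit)  # is_wed is NA: it is a linear combination of the intercept and the others
```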

We run the prediction in the same way as before.

predict.logitsvd<-predict(logitsvd,newdata=testdr,type='response')
glm_prediction = ifelse(predict.logitsvd>0.5,'P','NP')
conf_matrix<-table(glm_prediction,testdr$popularity)
conf_matrix
##               
## glm_prediction   NP    P
##             NP 1149  700
##             P   821 1574
porcentaje_correcto<-100 * sum(diag(conf_matrix)) / sum(conf_matrix)
porcentaje_correcto
## [1] 64.16117
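Beyond overall accuracy, the confusion matrix above also yields per-class metrics. A minimal sketch using the counts printed above (the metric names and variable names are ours, not from the original code):

```r
# Confusion matrix counts taken from the output above
# (rows: prediction, columns: actual)
conf <- matrix(c(1149, 821, 700, 1574), nrow = 2,
               dimnames = list(prediction = c("NP", "P"),
                               actual     = c("NP", "P")))

accuracy    <- sum(diag(conf)) / sum(conf)        # overall hit rate
recall_P    <- conf["P", "P"] / sum(conf[, "P"])  # sensitivity for class P
precision_P <- conf["P", "P"] / sum(conf["P", ])  # precision for class P

round(100 * c(accuracy = accuracy, recall = recall_P, precision = precision_P), 2)
```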

We see that the accuracy of the model with SVD applied drops slightly, to 64.16%. In this case, then, the dimensionality reduction does not improve the predictive capacity of the logistic model; however, it yields similar results using far fewer variables and, as the timings of both runs show, in much less time.

Discussion

Different models have been run to try to predict the popularity of a published news item using data from the mashable.com portal. We have seen that the generated decision tree exceeds 60% accuracy while remaining fairly compact, giving special importance to the fields related to keywords and to the channel where the news was published.

With regard to the unsupervised models, in this case they show little predictive ability. This may be because we forced them to generate two classes, and the model used (k-means) does not seem to produce good partitions under these conditions and with these data. We also observed that the distance used to run the model (Euclidean or Manhattan) can yield somewhat different results.

As for the supervised models, logistic regression with all fields achieved a somewhat better result than the decision tree. Applying the SVD decomposition to the fields brought no improvement in predictive results; however, we obtained similar results faster and with considerably less data, which can be a great advantage when dealing with large amounts of data. In the Pr(>|z|) column of the regression summaries, we observe that with the complete data many variables are not statistically significant (Pr(>|z|) > 0.05), while in the model with SVD applied nearly all the variables are statistically significant (V17 and the remaining weekday indicator being the exceptions).
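The significance comparison above can be checked programmatically by counting coefficients with Pr(>|z|) below 0.05 in each `summary`. A minimal sketch: the helper `signif_count` is ours, and the demonstration uses the built-in `mtcars` dataset as a stand-in, since it does not depend on the news data being loaded.

```r
# Count coefficients with Pr(>|z|) < alpha in a fitted glm
signif_count <- function(model, alpha = 0.05) {
  p <- summary(model)$coefficients[, "Pr(>|z|)"]
  c(significant = sum(p < alpha), total = length(p))
}

# Demonstration on a built-in dataset; on our models one would call
# signif_count(logitsvd) and compare with the full-variable fit
demo_fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)
signif_count(demo_fit)
```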