Using Scikit-Surprise to Create a Simple Recipe Collaborative Filtering Recommender System.

Companies all over the world are increasingly utilizing recommender systems. These algorithms can be used by online stores, streaming services, or social networks to recommend items to users based on their previous behavior (either consumed items or searched items).

There are several approaches to developing recommendation systems. We can build a recommender system based on the content of the item so that the system recommends similar items to the ones the user usually likes (Content-Based recommender systems), or we can use user similarity to recommend items that other users have rated highly (Collaborative-filtering recommender systems).

In this post we will build a simple Collaborative Filtering recommender system based on user ratings, using the standard functions of the Surprise package. The dataset I chose for this exercise is the Recipes from Food.com dataset, which is available on Kaggle and contains over 180K recipes and 700K recipe reviews. It’s a massive dataset that’s ideal for experimenting with recommender systems.

The dataset consists of several files containing the raw data and the processed data, which is great for our purpose (thanks to the authors of the paper; you can find the citation at the end of the post).

I also want to experiment with different evaluation metrics for the recommender system. Evaluating a recommender system is difficult because user behavior changes over time, but we must stick to the metrics we have. I’ll experiment with MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error). Both metrics average the errors (the differences between the predicted and the actual ratings), but RMSE squares the errors before averaging, so large errors are given more weight: if our predictions contain a few large errors, the RMSE will be noticeably higher.
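To make the difference concrete, here is a tiny NumPy illustration with made-up ratings (not part of the Surprise workflow itself):

```python
import numpy as np

actual = np.array([5.0, 4.0, 5.0, 3.0])
# One prediction is off by 2 points; the rest are close.
predicted = np.array([4.8, 4.1, 3.0, 3.1])

errors = predicted - actual
mae = np.mean(np.abs(errors))         # every error weighted equally
rmse = np.sqrt(np.mean(errors ** 2))  # squaring amplifies the large error

print(f"MAE:  {mae:.3f}")   # MAE:  0.600
print(f"RMSE: {rmse:.3f}")  # RMSE: 1.007
```

The single large error (2.0) pushes the RMSE well above the MAE, even though the other three predictions are almost perfect.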

This notebook was created in a DeepNote environment, following the tutorials in the scikit-surprise documentation.

Let’s start!

The first step is importing the necessary libraries.

In [1]:
import pandas as pd
import difflib
import numpy as np
import pickle

Loading Data

The dataset is split across several files covering recipes, users, and interactions (ratings). Some contain RAW data, while others contain processed data. We will use the processed user ratings and the raw recipe data for this recommender system. It simply works best for our needs.

Let's load and check the data

In [2]:
recipe_data = pd.read_csv('/work/RAW_recipes.csv',header=0,sep=",")
recipe_data.head()
Out[2]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
0 arriba baked winter squash mexican style 137739 55 47892 2005-09-16 ['60-minutes-or-less', 'time-to-make', 'course... [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] 11 ['make a choice and proceed with recipe', 'dep... autumn is my favorite time of year to cook! th... ['winter squash', 'mexican seasoning', 'mixed ... 7
1 a bit different breakfast pizza 31490 30 26278 2002-06-17 ['30-minutes-or-less', 'time-to-make', 'course... [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] 9 ['preheat oven to 425 degrees f', 'press dough... this recipe calls for the crust to be prebaked... ['prepared pizza crust', 'sausage patty', 'egg... 6
2 all in the kitchen chili 112140 130 196586 2005-02-25 ['time-to-make', 'course', 'preparation', 'mai... [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] 6 ['brown ground beef in large pot', 'add choppe... this modified version of 'mom's' chili was a h... ['ground beef', 'yellow onions', 'diced tomato... 13
3 alouette potatoes 59389 45 68585 2003-04-14 ['60-minutes-or-less', 'time-to-make', 'course... [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] 11 ['place potatoes in a large pot of lightly sal... this is a super easy, great tasting, make ahea... ['spreadable cheese with garlic and herbs', 'n... 11
4 amish tomato ketchup for canning 44061 190 41706 2002-10-25 ['weeknight', 'time-to-make', 'course', 'main-... [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] 5 ['mix all ingredients& boil for 2 1 / 2 hours ... my dh's amish mother raised him on this recipe... ['tomato juice', 'apple cider vinegar', 'sugar... 8
In [3]:
user_data = pd.read_csv('/work/PP_users.csv',header=0,sep=",")
user_data.head()
Out[3]:
u techniques items n_items ratings n_ratings
0 0 [8, 0, 0, 5, 6, 0, 0, 1, 0, 9, 1, 0, 0, 0, 1, ... [1118, 27680, 32541, 137353, 16428, 28815, 658... 31 [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, ... 31
1 1 [11, 0, 0, 2, 12, 0, 0, 0, 0, 14, 5, 0, 0, 0, ... [122140, 77036, 156817, 76957, 68818, 155600, ... 39 [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, ... 39
2 2 [13, 0, 0, 7, 5, 0, 1, 2, 1, 11, 0, 1, 0, 0, 1... [168054, 87218, 35731, 1, 20475, 9039, 124834,... 27 [3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, ... 27
3 3 [498, 13, 4, 218, 376, 3, 2, 33, 16, 591, 10, ... [163193, 156352, 102888, 19914, 169438, 55772,... 1513 [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 5.0, 5.0, ... 1513
4 4 [161, 1, 1, 86, 93, 0, 0, 11, 2, 141, 0, 16, 0... [72857, 38652, 160427, 55772, 119999, 141777, ... 376 [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 5.0, 4.0, 5.0, ... 376

Okay, we can see the data on each file. The column names are self-explanatory, so we can get started.

Data Preparation and Exploration

To build this simple recommender system, we must first prepare the data in a Surprise-compatible dataset. We're only interested in user ratings, so we'll pull them from the recipe ratings dataset.

The first step is to write a function that reads the items (recipes) and user ratings.

In [4]:
def getRecipeRatings(idx):
  # The 'items' and 'ratings' columns are stringified lists, so strip the
  # brackets and commas before converting each element
  user_items = [int(s) for s in user_data.loc[idx]['items'].replace('[','').replace(']','').replace(',','').split()]
  user_ratings = [float(s) for s in user_data.loc[idx]['ratings'].replace('[','').replace(']','').replace(',','').split()]
  df = pd.DataFrame(list(zip(user_items,user_ratings)),columns = ['Item','Rating'])
  df.insert(loc=0,column='User',value = user_data.loc[idx].u)
  return df
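As a side note, since those columns hold stringified Python lists, `ast.literal_eval` from the standard library is a slightly more robust way to parse them than chained `replace` calls. A small sketch, assuming the same string format used in PP_users.csv:

```python
import ast
import pandas as pd

# Example of the stringified lists stored in the users file
items_str = '[1118, 27680, 32541]'
ratings_str = '[5.0, 5.0, 4.0]'

# literal_eval safely evaluates the string as a Python literal
user_items = ast.literal_eval(items_str)      # [1118, 27680, 32541]
user_ratings = ast.literal_eval(ratings_str)  # [5.0, 5.0, 4.0]

df = pd.DataFrame({'Item': user_items, 'Rating': user_ratings})
print(df)
```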

Then, create a dataset with one row for each User, Item, and Rating.

In [5]:
#recipe_ratings = pd.DataFrame(columns = ['User','Item','Rating'])
#for idx,row in user_data.iterrows():
#  recipe_ratings = recipe_ratings.append(getRecipeRatings(row['u']),ignore_index=True)
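A note on the commented-out loop above: `DataFrame.append` is deprecated in recent pandas releases, and calling it inside a loop is slow because each call copies the whole frame. Collecting the per-user frames in a list and calling `pd.concat` once is the idiomatic alternative. A sketch with made-up stand-in frames in place of the `getRecipeRatings` output:

```python
import pandas as pd

# Stand-ins for two per-user frames, shaped like getRecipeRatings' output
frames = [
    pd.DataFrame({'User': 0, 'Item': [1118, 27680], 'Rating': [5.0, 4.0]}),
    pd.DataFrame({'User': 1, 'Item': [122140], 'Rating': [5.0]}),
]

# A single concat at the end instead of appending inside the loop
recipe_ratings = pd.concat(frames, ignore_index=True)
print(recipe_ratings)
```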

Because the dataset is large and the previous code takes time to execute, we only want to create it once so that we can use pickle to save it to disk and read it back whenever we need to. This saves us a significant amount of time.

In [6]:
#recipe_ratings.to_pickle('/work/recipe_ratings.pkl')
recipe_ratings = pd.read_pickle('/work/recipe_ratings.pkl')

It's a good idea to do some data exploration, so let's get started. We know this is high-quality data, so we'll just make a bar chart to see how the ratings are distributed.

In [7]:
import seaborn as sns
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Out[7]:
<AxesSubplot:ylabel='Rating'>

Good, we see that the majority of the ratings are 5.0, indicating that a lot of users are satisfied with the recipes.

Because the dataset is large, we will reduce it to save time and avoid running out of memory. Let's only keep the recipes with the most ratings, getting rid of those with 30 ratings or fewer.

In [8]:
recipe_counts = recipe_ratings.groupby(['Item']).size()
filtered_recipes = recipe_counts[recipe_counts>30]
filtered_recipes_list = filtered_recipes.index.tolist()
len(filtered_recipes_list)
Out[8]:
2349
In [9]:
recipe_ratings = recipe_ratings[recipe_ratings['Item'].isin(filtered_recipes_list)]
In [10]:
recipe_ratings.count()
Out[10]:
User      174359
Item      174359
Rating    174359
dtype: int64

Let's take a look at the new rating distribution. As we can see, it is similar to the distribution of the entire dataset.

In [11]:
sns.barplot(x=recipe_ratings.Rating.value_counts().index, y=recipe_ratings.Rating.value_counts())
Out[11]:
<AxesSubplot:ylabel='Rating'>

Okay, we now have a dataset with over 174,000 ratings and approximately 2,300 recipes. Enough for our purposes and manageable in our notebook environment. Let's get to work on the model!

Model Creation

If the Surprise package is not already available in our notebook environment, we have to install it (pip install scikit-surprise) before we can start working with it.

The package surprise includes a number of prediction algorithms that will assist us in developing the recommendation system and selecting a number of recipes that a given user might enjoy. We have the option of using basic collaborative filtering algorithms (KNN) or Matrix Factorization algorithms such as SVD or SVDpp.

KNN-based algorithms choose user or item neighbors based on similarity (taking into account the mean or z-score normalization of each item or user rating). We can specify whether we want to run the user-based or item-based algorithm using the user_based parameter.
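As an intuition for item-based similarity, here is what a plain cosine similarity between the rating vectors of two items looks like, with made-up ratings from three users (Surprise's implementation computes this over the users who rated both items):

```python
import numpy as np

# Ratings that the same three users gave to two recipes (made-up numbers)
item_a = np.array([5.0, 4.0, 5.0])
item_b = np.array([4.0, 4.0, 5.0])

# Cosine similarity: dot product divided by the product of the norms
cosine_sim = item_a @ item_b / (np.linalg.norm(item_a) * np.linalg.norm(item_b))
print(f"cosine similarity: {cosine_sim:.3f}")
```

A value close to 1 means the two recipes tend to be rated similarly by the same users, so one is a good neighbor for predicting ratings of the other.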

Matrix Factorization algorithms translate the user-item matrix into a lower-dimensional space and predict ratings from there.
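For intuition, an SVD-style model predicts a rating as the global mean plus a user bias plus an item bias plus the dot product of the learned latent factors. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

global_mean = 4.4  # mu: average rating in the trainset
user_bias = 0.2    # b_u: this user rates above average
item_bias = 0.1    # b_i: this recipe is rated above average
p_u = np.array([0.3, -0.1, 0.5])  # latent factors learned for the user
q_i = np.array([0.4, 0.2, 0.1])   # latent factors learned for the item

# r_hat = mu + b_u + b_i + q_i . p_u
r_hat = global_mean + user_bias + item_bias + q_i @ p_u
print(f"predicted rating: {r_hat:.2f}")  # predicted rating: 4.85
```

During training, the biases and factor vectors are fitted by minimizing the regularized squared error on the known ratings; prediction is then just this cheap dot product.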

More information on the definition and behavior of the algorithms can be found on the surprise documentation site.

We'll run some of them through cross-validation to compare the metrics (RMSE and MAE) and see how they work with this dataset.

As a baseline, let's run the most basic algorithm (NormalPredictor), which makes random predictions, and then see how the other algorithms improve the evaluation metrics.

In [12]:
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import SVDpp
from surprise import KNNBasic
from surprise.model_selection import cross_validate
In [13]:
reader = Reader(rating_scale=(0, 5))

data = Dataset.load_from_df(recipe_ratings[['User', 'Item', 'Rating']], reader)
In [14]:
trainSet = data.build_full_trainset()

algo = NormalPredictor()

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm NormalPredictor on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2205  1.2199  1.2267  1.2288  1.2277  1.2247  0.0037  
MAE (testset)     0.7915  0.7909  0.7889  0.7919  0.7920  0.7910  0.0011  
Fit time          0.45    0.49    0.48    0.47    0.47    0.47    0.01    
Test time         0.87    0.64    0.88    0.67    0.62    0.73    0.12    
Out[14]:
{'test_rmse': array([1.22054598, 1.21993152, 1.2266604 , 1.22876997, 1.22772276]),
 'test_mae': array([0.79151228, 0.79090318, 0.78889573, 0.79188567, 0.79197489]),
 'fit_time': (0.4522593021392822,
  0.4864497184753418,
  0.48149585723876953,
  0.47049403190612793,
  0.4713156223297119),
 'test_time': (0.8718886375427246,
  0.6412787437438965,
  0.879267692565918,
  0.6652803421020508,
  0.6168420314788818)}

Let's see the predictions this algorithm yields for a given user. We need to fit the algorithm with the whole trainset, then make predictions with a test set that contains the user-item pairs that do not exist in the training set. This testSet can be easily built with the function build_anti_testset(), but in this case in order to save resources and time we are going to build a testset for just one user. We need to iterate over all the ratings in the trainSet and select the items that the user has not rated. We also need to fill a rating value for those (user,item) pairs, so we are going to use the trainSet global mean (which is the default value used by surprise).

In [15]:
anti_testset_user = []
targetUser = 0 #inner_id of the target user
fillValue = trainSet.global_mean
user_item_ratings = trainSet.ur[targetUser]
user_items = [item for (item,_) in (user_item_ratings)]
for iid in trainSet.all_items():
  if(iid not in user_items):
    anti_testset_user.append((trainSet.to_raw_uid(targetUser),trainSet.to_raw_iid(iid),fillValue))
In [16]:
predictions = algo.test(anti_testset_user)
In [17]:
predictions[0]
Out[17]:
Prediction(uid=0, iid=122140, r_ui=4.602159911447072, est=5, details={'was_impossible': False})

Let's see the 10 recipes with the highest estimated ratings for this user. I like to convert the predictions object into a DataFrame so that it is easier to work with.

In [18]:
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Out[18]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
122140 lemon raspberry yogurt muffins 125573 28 69474 2005-06-11 ['30-minutes-or-less', 'time-to-make', 'course... [194.3, 8.0, 56.0, 13.0, 8.0, 15.0, 10.0] 11 ['combine eggs , milk , yogurt , butter and va... i think i found this in a gooseberry patch coo... ['eggs', 'milk', 'lemon yogurt', 'butter', 'le... 11
161785 poached salamon with green bean salad 502339 25 39835 2013-06-25 ['weeknight', '30-minutes-or-less', 'time-to-m... [385.1, 30.0, 36.0, 23.0, 64.0, 15.0, 6.0] 12 ['in a large skillet combine lemon , tarragon ... we enjoyed this fish recipe - with some change... ['tarragon', 'lemon', 'white pearl onions', 's... 8
82121 fantastic never fail pan yorkshire pudding 146196 90 89831 2005-11-25 ['time-to-make', 'course', 'main-ingredient', ... [126.8, 4.0, 0.0, 7.0, 11.0, 5.0, 6.0] 13 ['mix the flour and salt together until well b... this recipe has never failed me yet and you ma... ['all-purpose flour', 'salt', 'eggs', 'water',... 6
136789 miso eggplant 359711 15 949477 2009-03-08 ['15-minutes-or-less', 'time-to-make', 'course... [82.6, 5.0, 35.0, 3.0, 2.0, 2.0, 4.0] 8 ['slice eggplant into 1 / 4-inch circles', 'fr... from "ofukuro no aji" cookbook. ['japanese eggplant', 'sesame seed oil', 'wate... 8
47837 chili s enchilada soup 392171 60 1366254 2009-09-28 ['weeknight', '60-minutes-or-less', 'time-to-m... [234.1, 17.0, 18.0, 38.0, 35.0, 29.0, 4.0] 10 ['add oil to a large pot over medium heat', 'a... my very pregnant best friend has been craving ... ['vegetable oil', 'boneless skinless chicken b... 12
112962 italian rolled peppers with mushrooms and ricotta 108462 30 37636 2005-01-16 ['30-minutes-or-less', 'time-to-make', 'course... [252.3, 28.0, 24.0, 21.0, 23.0, 32.0, 4.0] 18 ['preheat broiler in oven', 'place peppers on ... makes a wonderful low-carb, meatless entree or... ['bell peppers', 'white button mushrooms', 'ol... 10
106202 holyfield s ear 277137 5 330545 2008-01-07 ['15-minutes-or-less', 'time-to-make', 'course... [44.0, 0.0, 18.0, 0.0, 0.0, 0.0, 1.0] 2 ['fill a standard shot glass 1 / 2 full of cre... just another bloody red cocktail. ['coffee liqueur', 'creme de noyaux'] 2
128131 macaroni salad with dill 363181 50 1071114 2009-03-28 ['60-minutes-or-less', 'time-to-make', 'course... [328.9, 31.0, 19.0, 17.0, 8.0, 14.0, 11.0] 3 ['cook , drain , and rinse pasta', 'combine al... quick, easy to make macaroni salad with a touc... ['cooked macaroni', 'mayonnaise', 'celery', 'r... 7
8524 armadillo eggs 125025 100 191220 2005-06-07 ['time-to-make', 'course', 'main-ingredient', ... [976.1, 127.0, 10.0, 56.0, 100.0, 133.0, 1.0] 17 ['you will need toothpicks for this', "i buy t... my dad grills a lot and is always making these... ['chicken breasts', 'bacon', 'fresh jalapeno'] 3
160999 pistachio cream pie 190622 15 136511 2006-10-14 ['15-minutes-or-less', 'time-to-make', 'course... [428.8, 44.0, 89.0, 16.0, 13.0, 70.0, 12.0] 8 ['in a small mixing bowl , beat the cream chee... this pie reminds me of the watergate salad my ... ['cream cheese', 'milk', 'instant pistachio pu... 7

OK, with that baseline, let's check whether other algorithms can improve the metrics. Let's try a neighbourhood-based algorithm (KNNBasic), computing similarities between items.

In [19]:
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(sim_options=sim_options)
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Computing the cosine similarity matrix...
/root/venv/lib/python3.7/site-packages/surprise/prediction_algorithms/algo_base.py:249: RuntimeWarning: invalid value encountered in double_scalars
  sim = construction_func[name](*args)
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0297  1.0383  1.0111  1.0327  1.0371  1.0298  0.0098  
MAE (testset)     0.5512  0.5532  0.5438  0.5554  0.5558  0.5519  0.0044  
Fit time          6.14    5.84    5.38    5.73    6.00    5.82    0.26    
Test time         4.65    4.48    4.86    4.75    5.15    4.78    0.22    
Out[19]:
{'test_rmse': array([1.02966818, 1.03834889, 1.01109845, 1.03267812, 1.03710726]),
 'test_mae': array([0.5512142 , 0.55320937, 0.54383441, 0.55537928, 0.55583223]),
 'fit_time': (6.140854120254517,
  5.837968826293945,
  5.376391887664795,
  5.7278733253479,
  5.996432781219482),
 'test_time': (4.649652004241943,
  4.480146169662476,
  4.859352111816406,
  4.7482147216796875,
  5.14623761177063)}

This algorithm clearly outperformed our baseline. As can be seen, the MAE and RMSE means are better (lower) than those of the NormalPredictor. Let's see which recipes this algorithm recommends.

In [20]:
predictions = algo.test(anti_testset_user)
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Out[20]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
79996 elderberry muffins 394509 30 283251 2009-10-13 ['30-minutes-or-less', 'time-to-make', 'course... [250.1, 14.0, 78.0, 8.0, 9.0, 11.0, 12.0] 2 ['cream sugar and oleo , add additional ingred... from my recipe files. uses dried elderberries. ['sugar', 'oleo', 'milk', 'egg', 'nutmeg', 'fl... 12
39145 cheese on a pedestal 274374 40 37449 2007-12-27 ['60-minutes-or-less', 'time-to-make', 'course... [114.7, 14.0, 0.0, 9.0, 13.0, 29.0, 0.0] 5 ['cut the cheeses into wedges and cover with p... there's nothing like good cheese paired with f... ['white cheddar cheese', 'stilton cheese', 'wa... 6
4683 amazing potato salad 86277 75 58300 2004-03-11 ['time-to-make', 'course', 'main-ingredient', ... [286.0, 17.0, 17.0, 10.0, 14.0, 11.0, 13.0] 19 ['in a large saucepan , cover the eggs with wa... i found this recipe in my food and wine magazi... ['eggs', 'baking potatoes', 'salt', 'mayonnais... 13
111487 inside out burger 385856 30 574975 2009-08-17 ['weeknight', '30-minutes-or-less', 'time-to-m... [686.2, 67.0, 10.0, 19.0, 92.0, 90.0, 7.0] 7 ['heat grill to med-high heat', 'pat ground ch... i came across this recipe idea when my husband... ['ground chuck', 'black pepper', 'american che... 10
136641 minted glazed carrots 478129 20 47892 2012-04-17 ['30-minutes-or-less', 'time-to-make', 'course... [202.3, 24.0, 39.0, 9.0, 2.0, 48.0, 5.0] 9 ['lightly steam the carrots in a steamer baske... found on bunch carrots from siri produce inc. ... ['carrots', 'butter', 'sugar', 'of fresh mint'... 5
31607 burnt tongue bbq sauce 67025 1620 93190 2003-07-18 ['weeknight', 'time-to-make', 'course', 'main-... [1826.8, 3.0, 1047.0, 350.0, 18.0, 1.0, 153.0] 12 ['combine all ingredients in a large saucepan'... this bbq sauce has a nice sweet flavor combine... ['ketchup', 'distilled white vinegar', 'dark c... 14
111127 indian split pea and vegetable soup 249754 30 283251 2007-08-29 ['30-minutes-or-less', 'time-to-make', 'course... [375.3, 11.0, 34.0, 49.0, 36.0, 20.0, 21.0] 13 ['remove the spinach from the freezer', 'in a ... this is from one of my food & wine cookbooks. ... ['frozen spinach', 'split peas', 'water', 'gin... 12
72081 dijon potato salad 360260 30 653438 2009-03-11 ['30-minutes-or-less', 'time-to-make', 'course... [195.9, 10.0, 6.0, 2.0, 7.0, 4.0, 10.0] 12 ['place a steamer basket in a saucepan filled ... another recipe i found in everyday food. i ha... ['new potatoes', 'white wine vinegar', 'dijon ... 6
135848 mimi s chili 103441 230 163986 2004-11-05 ['time-to-make', 'course', 'main-ingredient', ... [325.9, 21.0, 65.0, 53.0, 54.0, 25.0, 8.0] 10 ['wash beans and place in pot with the next tw... it is the best i have had and make it often. ... ['dry pinto beans', 'dried kidney beans', 'dri... 16
172034 red devil s food cake 203780 50 122878 2007-01-07 ['60-minutes-or-less', 'time-to-make', 'course... [362.1, 16.0, 140.0, 18.0, 10.0, 13.0, 20.0] 16 ['prepare greased baking pans', 'preheat oven ... this is the best chocolate cake recipe i've ev... ['shortening', 'sugar', 'salt', 'vanilla', 'co... 9

Let's take a look at a Matrix Factorization algorithm now.

In [21]:
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9561  0.9709  0.9529  0.9460  0.9738  0.9599  0.0107  
MAE (testset)     0.5572  0.5622  0.5583  0.5558  0.5641  0.5595  0.0031  
Fit time          16.23   16.77   17.04   16.65   16.26   16.59   0.31    
Test time         0.85    0.60    0.68    0.69    0.61    0.68    0.09    
Out[21]:
{'test_rmse': array([0.95612274, 0.97089603, 0.95285402, 0.94597191, 0.97379423]),
 'test_mae': array([0.55724148, 0.56222468, 0.55830495, 0.55583515, 0.56405309]),
 'fit_time': (16.2300808429718,
  16.76616668701172,
  17.043593406677246,
  16.6538667678833,
  16.259265184402466),
 'test_time': (0.8469138145446777,
  0.6002140045166016,
  0.6778244972229004,
  0.6859123706817627,
  0.6077754497528076)}

We appear to have improved slightly on the KNNBasic algorithm. The mean MAE is similar, but the RMSE is better, meaning fewer large errors in our rating predictions.

In [22]:
predictions = algo.test(anti_testset_user)
pred = pd.DataFrame(predictions)
pred.sort_values(by=['est'],inplace=True,ascending = False)
recipe_list = pred.head(10)['iid'].to_list()
recipe_data.loc[recipe_list]
Out[22]:
name id minutes contributor_id submitted tags nutrition n_steps steps description ingredients n_ingredients
153429 passion strawberry fruit leather dehydrator ... 229712 363 11176 2007-05-23 ['course', 'main-ingredient', 'cuisine', 'prep... [59.4, 0.0, 13.0, 0.0, 1.0, 0.0, 5.0] 7 ['pure chopped strawberries with the applesauc... once you know how to plug your dehydrator in, ... ['fresh strawberries', 'passionfruit syrup', '... 4
110718 impossible tortilla casserole 95044 60 24386 2004-07-05 ['60-minutes-or-less', 'time-to-make', 'course... [324.2, 26.0, 10.0, 21.0, 34.0, 39.0, 7.0] 9 ['preheat oven to 375', 'line greased 9x13" pa... have not tried this yet, but it sounds good an... ['8-inch flour tortillas', 'ground beef', 'oni... 9
127869 macadamia bars 495 155 22015 1999-09-10 ['weeknight', 'time-to-make', 'course', 'prepa... [116.8, 13.0, 19.0, 1.0, 2.0, 19.0, 2.0] 17 ['crusts: in food processor or bowl , combine ... this is an adopted recipe. i have not made it... ['all-purpose flour', 'sugar', 'butter', 'wate... 9
9143 asian californian mexican faux crab salad 239467 15 21752 2007-07-09 ['15-minutes-or-less', 'time-to-make', 'course... [108.0, 13.0, 7.0, 0.0, 3.0, 6.0, 2.0] 15 ['in a glass bowl , toss together the ingredie... i found a recipe on a fun food blog -- singleg... ['red pepper', 'tomatoes', 'scallions', 'garli... 15
143777 nif s easy crock pot smothered roast beef 350056 365 65502 2009-01-16 ['course', 'main-ingredient', 'preparation', '... [340.9, 18.0, 11.0, 21.0, 103.0, 22.0, 2.0] 4 ['place roast in crock pot', 'mix other ingred... a really easy dish to throw together. you can ... ['beef roast', 'condensed cream of mushroom so... 6
87426 fresh veggie pockets 52572 15 41706 2003-01-28 ['15-minutes-or-less', 'time-to-make', 'course... [296.6, 10.0, 13.0, 27.0, 35.0, 6.0, 14.0] 3 ['in a bowl , combine the cream cheese , sunfl... this low-fat delicious sandwich is just right ... ['fat free cream cheese', 'sunflower seeds', '... 9
172561 red white blue berry pops kid fun 93475 420 94272 2004-06-16 ['lactose', 'time-to-make', 'course', 'main-in... [16.4, 0.0, 16.0, 0.0, 0.0, 0.0, 1.0] 12 ['in each 3-oz', 'cup , pour two tablespoons o... or, for adults who think they're kids! i've al... ['strawberry juice', 'lemonade', 'raspberry ju... 4
82291 farmgirl s funky chicken 84495 140 95567 2004-02-20 ['time-to-make', 'main-ingredient', 'preparati... [749.2, 74.0, 67.0, 60.0, 113.0, 66.0, 7.0] 10 ['preheat oven to 450', 'wash and dry chicken'... easy, tasty, hearty chicken recipe. when time ... ['chicken', 'relish', 'salt', 'oil', 'ground b... 6
31649 bush pesto 332481 5 422893 2008-10-23 ['15-minutes-or-less', 'time-to-make', 'course... [1797.2, 300.0, 7.0, 16.0, 27.0, 153.0, 2.0] 3 ['puree the nuts , parmesan and 1 / 4 cup of t... pesto with an aussie spin from bushfoodrecipes... ['unsalted macadamia nuts', 'parmesan cheese',... 7
75591 easy cheesy shepherd s pie 194724 30 166019 2006-11-09 ['30-minutes-or-less', 'time-to-make', 'course... [764.6, 66.0, 12.0, 45.0, 89.0, 116.0, 16.0] 10 ['preheat oven to 400f', 'grease an oblong bak... this is such an easy, quick and kid friendly r... ['ground beef', 'condensed cheddar cheese soup... 9

Parameter tuning with GridSearchCV

Scikit-surprise also allows us to tune the algorithms through GridSearchCV, which runs the algorithm repeatedly over a predefined grid of parameter values and returns the best combination according to the chosen error metrics.

In [23]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_factors': [100,150],
              'n_epochs': [20,25,30],
              'lr_all':[0.005,0.01,0.1],
              'reg_all':[0.02,0.05,0.1]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse','mae'], cv=3)
grid_search.fit(data)     

Let's see the scores for the best parameters found.

In [24]:
print(grid_search.best_score['rmse'])
print(grid_search.best_score['mae'])
0.95384628819102
0.5554452372062167

Because the grid search takes time to run, it's a good idea to save it to disk so we can reuse it later and save time.

In [25]:
# save the model to disk
pickle.dump(grid_search, open('/work/surprise_grid_search_svd.sav', 'wb'))
#Load the model from disk
grid_search = pickle.load(open('/work/surprise_grid_search_svd.sav', 'rb'))

Let's take a look at the best parameters found by GridSearchCV.

In [26]:
print(grid_search.best_params['rmse'])
{'n_factors': 100, 'n_epochs': 25, 'lr_all': 0.005, 'reg_all': 0.1}

We can now repeat the cross validation with the best parameters and compare the results.

In [27]:
algo = grid_search.best_estimator['rmse']

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9474  0.9523  0.9415  0.9564  0.9593  0.9514  0.0064  
MAE (testset)     0.5533  0.5551  0.5518  0.5560  0.5550  0.5542  0.0015  
Fit time          19.02   22.06   20.95   19.59   19.24   20.17   1.16    
Test time         0.83    0.96    0.87    0.86    0.73    0.85    0.07    
Out[27]:
{'test_rmse': array([0.94737198, 0.95230069, 0.94149614, 0.95642868, 0.95931833]),
 'test_mae': array([0.55329538, 0.55506764, 0.55180418, 0.55596874, 0.55499898]),
 'fit_time': (19.02249050140381,
  22.055828332901,
  20.950928211212158,
  19.59484338760376,
  19.244446277618408),
 'test_time': (0.832690954208374,
  0.9575753211975098,
  0.8661718368530273,
  0.8619349002838135,
  0.7340149879455566)}

By tuning the algorithm's parameters, we were able to slightly outperform the SVD model with its default settings.

Conclusion

With Scikit-Surprise, we learned how to create a simple Recommender System model.
The only data required to create a recommender system are a list of items and a list of user ratings for these items; for this, we downloaded an appropriate dataset.
We learned how to prepare the data and generate a dataset suitable for scikit-surprise in order to compute user or item similarity, estimate user ratings for items, and build recommendations from there.
We’ve also experimented with several algorithms to see what metrics they provide and how to fine-tune the algorithms’ parameters to improve our metrics.

More information on how to customize surprise algorithms to create more reliable recommender systems can be found in our next post.

The data for this project was obtained via Kaggle. Please see the following citation:

Generating Personalized Recipes from Historical User Preferences

Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley

EMNLP, 2019

https://www.aclweb.org/anthology/D19-1613/