Recommender system: Movielens100K example
In this notebook, we explore some popular matrix-decomposition based recommeder system algorithms on Movielens100K dataset. This is largely based on the slides of STAT8056, Spring 2021, which is taught by Prof. Xiaotong Shen.
Introduction¶
Movielens data are comprised of movie ratings and are available at http://grouplens.org/datasets/movielens/. The dataset was collected through the MovieLens web site (movielens.umn.edu) during the seven months from September 19th, 1997 through April 22nd, 1998. This dataset has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file are available at the end of README.txt on the website. The 100K MovieLens data consists of 1,000,000 anonymous ratings on a five-star scale from 1,000 users on 1,700 movies. Here are four categorical and one continuous covariate, including four user-related covariates, gender, age, occupation, and zip-code, as well as one content-related covariate, genres. For simplicity, we treat gender, age, and occupation as continuous variables, in addition to 19 different movie genres that are reparametrized into 19 binary covariates on if a movie belongs to a particular genre, where only the first digit is used for the zip-code.
Our goals:
Predict the preference scores of each user on all movies using your favorite method such as deep learning. Then compare your approach with the SVD approach. There are five pairs of training and testing data sets for, denoted by (u1.base,u1.test),· · ·,(u5.base,u5.test) there. It is natural to use five-fold cross-validation for training and testing. Compute the root mean square error based on cross-validation.
For this dataset, many ratings are not observed or missing. Do you have any evidence to suggest that missing does not occur at random? (Hint: Supply suitable plots to argue.) If the goal is to build a recommender system with high accuracy of prediction to outperform the conventional singular value decomposition method, then demonstrate that incorporating the missing pattern can improve the prediction performance.
Enviornments & requirements¶
from sys import version
from importlib.metadata import version as vs
requirements = ['numpy', 'pandas', 'seaborn', 'matplotlib', 'statsmodels', 'jupyter', 'scikit-surprise']
print('\n'.join([f'Python version: {version}']+[f'{package.capitalize()} ~= {vs(package)}' for package in requirements]))
Python version: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0] Numpy ~= 1.20.1 Pandas ~= 1.2.3 Seaborn ~= 0.11.1 Matplotlib ~= 3.3.4 Statsmodels ~= 0.12.2 Jupyter ~= 1.0.0 Scikit-surprise ~= 1.1.1
Read data and an overview¶
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
markers = ['s', '.', '^', '*', '+', '*']
line_types = ['-.', '-', ':', '-', '--', '--']
SMALL_SIZE = 12
MEDIUM_SIZE = 15
BIGGER_SIZE = 20
plt.rc('font', size=BIGGER_SIZE) # controls default text sizes
plt.rc('axes', titlesize=BIGGER_SIZE) # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE) # fontsize of the x and y labels
plt.rc('xtick', labelsize=BIGGER_SIZE) # fontsize of the tick labels
plt.rc('ytick', labelsize=BIGGER_SIZE) # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE) # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE) # fontsize of the figure title
df = pd.read_table('ml-100k/u1.base', header=None)
# row rating data contains pairs of (user, item, rating, timestamp), convert it to a matrix
ratings = np.full((943,1682), np.nan)
ratings = pd.DataFrame(ratings, columns=list(range(1,1683)), index=list(range(1,944)))
ratings = ratings.add(pd.pivot_table(df, index=0, columns=1, values=2), fill_value=0)
ratings
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 | NaN | 4.0 | 1.0 | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
939 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
940 | NaN | NaN | NaN | 2.0 | NaN | NaN | 4.0 | 5.0 | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
941 | 5.0 | NaN | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
942 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
943 rows × 1682 columns
Let's take a look at the visualization of ratings.
n_user, n_item = ratings.shape
print(f'There are {n_user} users and {n_item} items in total.')
print(f'Rating missing rate is {((1-len(df)/n_user/n_item)*100):.2f}%.')
sns.heatmap(ratings, cmap=sns.color_palette("crest", as_cmap=True), mask=ratings.isna());
There are 943 users and 1682 items in total. Rating missing rate is 94.96%.
This dataset also has user/item covariates.
user = pd.read_table('ml-100k/u.user', header=None, delimiter='|', index_col=0)
user = user.rename(columns={1: "age", 2: "gender", 3: "job", 4: "zip"})
user['zip'] = user['zip'].map(lambda x: x[0] if x[0].isdigit() else 9) # only first digit, change letter to 9, since it's minority(<10)
user['gender'] = user['gender'].map(lambda x: 0 if x=='M' else 1) # coding gender to binary
user['job'] = user['job'].astype('category').cat.codes # coding job to integer
user = user.astype('int')
user
age | gender | job | zip | |
---|---|---|---|---|
0 | ||||
1 | 24 | 0 | 19 | 8 |
2 | 53 | 1 | 13 | 9 |
3 | 23 | 0 | 20 | 3 |
4 | 24 | 0 | 19 | 4 |
5 | 33 | 1 | 13 | 1 |
... | ... | ... | ... | ... |
939 | 26 | 1 | 18 | 3 |
940 | 32 | 0 | 0 | 0 |
941 | 20 | 0 | 18 | 9 |
942 | 48 | 1 | 10 | 7 |
943 | 22 | 0 | 18 | 7 |
943 rows × 4 columns
genre = pd.read_table('ml-100k/u.item', header=None, delimiter='|', encoding='latin-1', index_col=0).iloc[:, -19:]
genre_name = pd.read_table('ml-100k/u.genre', header=None, delimiter='|')[0].values
genre.columns = genre_name
genre # 1 indicates the item belongs to that category
unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||
1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1678 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1679 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1680 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1681 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1682 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1682 rows × 19 columns
The plot of the number of ratings given by user/received by item vs average rating reveals that the data is not missing at random. Items with more ratings have higher average score. It's reasonable that a good movie is popular. Users have more ratings tend to have a lower average score. This observation motivates us to incorporate the information of missing pattern into the recommendation.
user_ave_rating = ratings.mean(axis=1).to_numpy()
user_num_rating = ratings.notna().sum(axis=1).to_numpy()
item_ave_rating = ratings.mean(axis=0).to_numpy()
item_num_rating = ratings.notna().sum(axis=0).to_numpy()
plt.figure(figsize=(12, 4))
plt.subplot(121)
sns.regplot(x=user_num_rating, y=user_ave_rating, line_kws={'color':'r'})
plt.xlabel('Number of user\'s ratings')
plt.ylabel('User\'s Averange rating')
plt.subplot(122)
sns.regplot(x=item_num_rating, y=item_ave_rating, lowess=True, line_kws={'color':'r'})
plt.xlabel('Number of item\'s ratings')
plt.ylabel('Item\'s Averange rating')
Text(0, 0.5, "Item's Averange rating")
The average ratings of different genres are similar to each other, except "unknown" category.
item_sum_rating = ratings.sum(axis=0).to_numpy()
genre_ave_rating = [item_sum_rating[genre.iloc[:,i] == 1].sum()/item_num_rating[genre.iloc[:,i] == 1].sum() for i in range(19)]
sns.barplot(x=genre_ave_rating, y=genre_name)
plt.yticks(fontsize=12);
Methods¶
In this section, we try several popular algorithms which don't consider user/content covariates and the nonigonrable missing pattern. They are
Grand mean imputation, used as baseline.
Model: $\hat{r}_{ui} = \mu$.
Simple linear regression, used as baseline.
Model: $\hat{r}_{ui} = \mu + b_{u}+ b_{i}$.
Optimization: minimize sum of square error(SSE) by alternating least square (ALS) or stochastic gradient descent (SGD) method.
Regularized sigular value decomposition (rSVD) with means (Funk, 2006; Koren, 2009).
Model: $\hat{r}_{u i}=\mu+b_{u}+b_{i}+ p_{u}^Tq_i$.
Optimization: minimize $\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{\boldsymbol{r}}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}_2+\left\|p_{u}\right\|^{2}_2\right)$, using ALS or SGD.
*Regularized sigular value decomposition (SoftImpute) (Mazumder, 2010).
Model: $\hat{r}_{u i}= p_{u}^Tq_i$.
Optimization: minimize a nuclear-norm regularized relaxtion to $\ell_0$ regularized SSE $\lVert R - \Theta \rVert_F^2 + \lambda \lVert \Theta \rVert_*$, where $\Theta = P^TQ$. Use SoftImpute algorithm.
*Regression-Based Latent FactorModels (Agarwal, 2009). Extension to 2, replacing the bias term of user and item by regression on covariates.
Model: $\hat{r}_{u i}=\mu+b_{u}+b_{i}+ p_{u}^Tq_i$, $b_u \sim N(x_u^T\alpha, \sigma^2)$, $b_i \sim N(y_i^T\beta, \sigma^2)$.
Optimization: use Bayesian way.
*Partial Latent Models (Zhu, 2015). Extension to 2, replacing the bias term of user and item by regression on covariates.
Model: $\hat{r}_{u i}=x_u^T\alpha+y_i^T\beta+ p_{u}^Tq_i$.
Optimization: use blockwise coordinate descent.
Group specific latent model (Bi, 2017). Extension to 2, add a group mean to either user or item latent factor.
Model: $\Theta = (P+S_c)^T(Q+T_c)$, where $S_c, T_c$ is the group specific mean for each user/item.
Optimization: blockwise coordinate descent.
*Smooth neighborhood recommender system (Dai, 2020).
Model: minimize $\frac{1}{n m} \sum_{u=1}^{n} \sum_{i=1}^{m}\left(\sum_{\left(u^{\prime}, i^{\prime}\right) \in \Omega} \omega_{u i, u^{\prime} i^{\prime}} r_{u^{\prime} i^{\prime}}-\boldsymbol{p}_{u}^{T} \boldsymbol{q}_{i}\right)^{2}+\lambda_{1} \sum_{u=1}^{n} J\left(\boldsymbol{p}_{u}\right)+\lambda_{2} \sum_{i=1}^{m} J\left(\boldsymbol{q}_{i}\right)$.
Optimization: ALS.
Method start with * is not implemented below. Here only method 8 is knn based, all others are matrix-decomposition based.
The data is split into five folds, and an average root mean square error(RMSE) is used as evaluation criterion.
def rmse(y, y_pred):
"""RMSE for two vectors."""
return np.sqrt(np.mean(np.square(y-y_pred)))
errors = pd.DataFrame(np.zeros(4).reshape((1, 4)), columns=['Mean', 'LR', 'rSVD', 'gSVD'], index=['RMSE'])
Grand mean¶
error = np.zeros(5)
for fold in range(1, 6):
train = pd.read_table(f'ml-100k/u{fold}.base', header=None)
test = pd.read_table(f'ml-100k/u{fold}.test', header=None)
X_train, y_train = train[[0, 1]].astype('category'), train[[2]].to_numpy().astype('float')
X_test, y = test[[0, 1]].astype('category'), test[[2]].to_numpy().astype('float')
y_pred = np.mean(y_train)
error[fold-1] = rmse(y, y_pred)
errors['Mean'] = error.mean()
f'RMSE for Grand Mean: {error.mean():.2f}'
'RMSE for Grand Mean: 1.13'
Linear regression¶
from surprise import SVD
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold
from surprise.model_selection import cross_validate
# Alternatively, if data is not downloaded yet, use this command
# data = Dataset.load_builtin('ml-100k')
# If data is already downloaded with specific train-test split, can use codes below
# path to dataset folder
files_dir = os.path.join(os.getcwd(), 'ml-100k/')
# This time, we'll use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = BaselineOnly()
res = cross_validate(algo, data, measures=['RMSE'], cv=pkf, verbose=True)
errors['LR'] = res['test_rmse'].mean()
f'RMSE for Linear Regression: {errors["LR"][0]:.2f}'
Estimating biases using als... Estimating biases using als... Estimating biases using als... Estimating biases using als... Estimating biases using als... Evaluating RMSE of algorithm BaselineOnly on 5 split(s). Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std RMSE (testset) 0.9599 0.9477 0.9405 0.9383 0.9423 0.9457 0.0077 Fit time 0.04 0.07 0.06 0.07 0.06 0.06 0.01 Test time 0.04 0.08 0.04 0.04 0.07 0.05 0.02
'RMSE for Linear Regression: 0.95'
SVD¶
algo = SVD()
res = cross_validate(algo, data, measures=['RMSE'], cv=pkf, verbose=True)
errors['rSVD'] = res['test_rmse'].mean()
f'RMSE for rSVD: {errors["rSVD"][0]:.2f}'
Evaluating RMSE of algorithm SVD on 5 split(s). Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std RMSE (testset) 0.9505 0.9389 0.9313 0.9301 0.9385 0.9379 0.0073 Fit time 2.21 2.24 2.22 2.23 2.24 2.23 0.01 Test time 0.07 0.07 0.10 0.07 0.11 0.08 0.02
'RMSE for rSVD: 0.94'
Group SVD¶
The MATLAB code is available https://sites.google.com/site/xuanbigts/software. I record the RMSE for different hyperparameter (each column) combinations below, and we use the second column at last.
res = ''' 0.9435 0.9210 0.9456
0.9287 0.9087 0.9329
0.9245 0.9077 0.9295
0.9309 0.9098 0.9316
0.9323 0.9170 0.9314'''
res = np.fromstring(res, sep='\t').reshape((5,3))[:,1].reshape(-1)
errors['gSVD'] = res.mean()
f'RMSE for gSVD: {errors["gSVD"][0]:.2f}'
'RMSE for gSVD: 0.91'
Summary¶
I tried grandient boosting as well, but the performance is as bad as grand mean, so I didn't put it here.
errors
Mean | LR | rSVD | gSVD | |
---|---|---|---|---|
RMSE | 1.125578 | 0.945736 | 0.937855 | 0.91284 |
sns.barplot(y=errors.values.reshape(-1), x=errors.columns)
<AxesSubplot:>
As we can see, gSVD utilize the non-ignorable missing pattern information and therefore performs better. The try on incorporate covariates fails, interested readers can try method 5 or 6, however I highly doubt that the improvement is neglectable. sSVD's source code is not published so I didn't realize it here, but it claims that incorporating network pattern in the user can further imporve the performance.
Overall, recommender system is a super difficult topic, especially for large scale data with high missing rate. Simple method like linear regression is only slightly worse than rSVD. However, recent research shows that if we are able to utilize the pattern information in the data, it's promising to increase the accuracy.
References¶
- A repo for examples and best practices for building recommendation systems, provided as Jupyter notebooks. Develpoed by Microsoft. https://github.com/microsoft/recommenders
- Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. "Spectral regularization algorithms for learning large incomplete matrices." The Journal of Machine Learning Research 11 (2010): 2287-2322.
- Funk, A. (2006). Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html
- Koren, Y., Bell, R. and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. IEEE, Computer, 8(42), 30–37.
- Agarwal, D. and Chen, B.-C. (2009), “Regression-Based Latent Factor Models,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 19–28.
- Zhu, Shen and Ye (2015). Personalized prediction and sparsity pursuit in latent factor models. JASA.
- Bi, X., Qu, A., Wang, J. and Shen, X. (2017). A group-specific recommender system. Journal of the American Statistical Association, 112, 1344-1353.