\documentclass[12pt]{article}
\usepackage{fullpage,enumitem,amsmath,amssymb,graphicx}
\usepackage{sectsty}
\usepackage{hyperref}
\usepackage{xcolor}
\hypersetup{
colorlinks=true,
}
\usepackage{tgpagella}
\sectionfont{\fontsize{15}{20}\selectfont}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\E}{\mathbb{E}}
\newcommand{\alex}[1]{\textcolor{cyan}{\textbf{[#1]}}}
\begin{document}
\begin{center}
{\Large \textbf{CS 330 Autumn 2023/2024 Warmup Homework 0} \\ Multitask Training for Recommender Systems
\\ Due Wednesday October 4, 11:59 PM PST}
\vspace{0.2cm}
\begin{tabular}{rl}
SUNet ID: & \\
Name: & \\
Collaborators: &
\end{tabular}
\end{center}
\section*{0\hspace{15pt}Honor Code}
\texttt{I abide by the Stanford honor code and declare that I will not view solutions online (e.g. on GitHub) or other students' solutions, and will not post solutions online. I declare that all of my submissions are my own work.}
\\ \\
\noindent Please take a moment to type the above statement and your signature. \textbf{This serves for all of your future homework as well.}
\noindent \textbf{Statement}:
\noindent \textbf{Name: Date:}
\\ \\
\noindent\textbf{Use of GPT/Codex/Copilot:} To encourage a deeper understanding of how to implement multi-task learning methods, the use of generative models to write code for this homework is prohibited.
\\
\noindent Please be aware that we will be actively monitoring adherence to these guidelines. This addition to our course policy serves not only to maintain the integrity of our academic environment but also to reduce the number of potential honor code violations. Thank you all for your dedication to maintaining the highest standards of academic integrity!
\section{Overview}
In this assignment, we will implement a multi-task movie recommender system based on the classic Matrix Factorization~\cite{Yehuda2009matrix} and Neural Collaborative Filtering~\cite{he2017neural} algorithms. In particular, we will build a model based on the \href{https://www2.seas.gwu.edu/~simhaweb/champalg/cf/papers/KorenBellKor2009.pdf}{BellKor solution} to the Netflix Grand Prize challenge and extend it to predict both likely user-movie interactions and potential scores. You will implement a multi-task neural network architecture and explore the effect of parameter sharing and loss weighting on model performance.
\vspace{0.2cm}
\noindent The main goal of these exercises is to familiarize yourself with multi-task architectures, the training pipeline, and coding in PyTorch. These skills will be important in the course. \textbf{Note: This assignment is a warmup, and is shorter than future homeworks will be.}
\vspace{0.2cm}
\noindent\textbf{Submission}: To submit your work, upload one PDF report and one zip file to Gradescope. The report contains your answers to the deliverables listed below, and the zip file contains the code with your filled-in solutions.
\vspace{0.2cm}
\noindent\textbf{Code Overview:} The code consists of several files; however, you will only need to interact with two:
\begin{itemize}
\item \texttt{main.py}: To run experiments, execute this file by passing the corresponding parameters.
\item \texttt{models.py}: This file contains our multi-task prediction model \textbf{MultiTaskNet}, which you will need to finish implementing in PyTorch.
\end{itemize}
\section{Dataset and Evaluation}
\paragraph{Dataset.} In this assignment, we will use movie ratings from the \href{https://grouplens.org/datasets/movielens/100k}{MovieLens dataset}. The dataset consists of 100K ratings of roughly 1700 movies from roughly 1000 users. Although each user interaction contains several levels of metadata, we'll only consider tuples of the type \textbf{(userID, itemID, rating)}, which contain an anonymized user ID, a movie ID, and the score from 1 to 5 that the user assigned to the movie. We randomly split the dataset into a \textbf{train} dataset, which contains 95\% of all ratings, and a \textbf{test} dataset, which contains the remaining 5\%.
\paragraph{Problem Definition.}
Given the dataset defined above, we would like to train a model $f(\text{userID}, \text{itemID})$ that predicts: 1) the probability $p$ that the user would watch the movie and 2) the score $r$ they would assign to it from 1 to 5. For some intuition on this setting, consider a user who only watches comedy and action movies. It would not make sense to recommend them a horror movie since they don't watch those. At the same time, we would want to recommend comedy or action movies that the user is likely to score highly.
\paragraph{Evaluation.}
Once we have our trained model, we evaluate it on the test set.
\vspace{0.2cm}
\noindent \emph{Score Prediction.} We will evaluate the mean-squared error of movie score prediction on the held-out user ratings, i.e.
$
\frac{1}{N}\sum_{i=1}^N ||\hat{r}_i-r_i||^2,
$
where $\hat{r}_i$ is the predicted score for user-movie pair $(\text{userID}_i,\text{itemID}_i)$. The summation is over all pairs in the test set. Better models achieve lower mean-squared errors.
\vspace{0.2cm}
\noindent \emph{Likelihood Prediction.}
% Our dataset contains ratings for movies the users have seen.
To evaluate the quality of the likelihood model, we use the \href{https://en.wikipedia.org/wiki/Mean_reciprocal_rank}{mean reciprocal rank metric}, which assigns a higher score when the movies a user has actually seen are ranked near the top. The metric is computed as follows: 1) for each user, rank all movies based on the probability that the user would watch them; 2) remove movies we know the user has watched (those in the training set); 3) compute the average reciprocal rank of the held-out movies the user has watched.
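\vspace{0.2cm}
\noindent As a small illustration of step 3 (the ranks below are made up for the example), suppose that after removing the training movies, a user's two held-out movies land at ranks $2$ and $5$. That user's score is then
\begin{equation*}
\frac{1}{2}\left(\frac{1}{2}+\frac{1}{5}\right) = 0.35,
\end{equation*}
whereas ranking the same two movies at positions $1$ and $2$ would give $\frac{1}{2}\left(1+\frac{1}{2}\right)=0.75$. We assume here that the reported metric averages these per-user scores over all test users; the evaluation code is the authority on the exact aggregation.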
\section{Problems}
To install all required packages for this assignment, run:
\texttt{pip install -r requirements.txt}.
\noindent In this problem, we will implement a multi-task model using Matrix Factorization \cite{Yehuda2009matrix} and regression-based modelling:
\vspace{0.2cm}
\noindent\textbf{Matrix Factorization}:
\label{sec:matrix_fac}
Consider an interaction matrix $M$, where $M_{ij} = 1$ if $\text{userID}_i$ has rated the movie with $\text{itemID}_j$ and $0$ otherwise. We will represent each user with a latent vector $\mathbf{u}_i\in\mathbb{R}^d$ and each item with a latent vector $\mathbf{q}_j\in\mathbb{R}^d$. We model the interaction probability $p_{ij}=\log P(M_{ij}=1)$ in the following way:
\begin{equation}
p_{ij} = \mathbf{u}_i^T\mathbf{q}_j + b_j
\label{eq:prob}
\end{equation}
where $b_j$ is a movie-specific bias term. At each training step we sample a batch of triples $(\text{userID}_i, \text{itemID}_j^+, \text{itemID}_{j'}^-)$ with size $B$, such that $M_{i, j} = 1$, while $\text{itemID}_{j'}^-$ is randomly sampled (indicating no user preference). Let
\begin{equation}
\begin{split}
p^+_{ij} = \mathbf{u}_i^T\mathbf{q}_j + b_j \\
p^-_{ij'} = \mathbf{u}_i^T\mathbf{q}_{j'} + b_{j'}
\end{split}
\label{eq:p}
\end{equation}
and optimize the Bayesian Personalised Ranking (BPR) \cite{Rendle2009BPR} pairwise loss function:
\begin{equation}
\mathcal{L}_F(\mathbf{p}^+, \mathbf{p}^-)=\frac{1}{B}\sum_{i=1}^B \left(1-\sigma(p_{ij}^+-p_{ij'}^-)\right)
\label{eq:l1}
\end{equation}
\noindent where $\sigma$ is the sigmoid function.
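\vspace{0.2cm}
\noindent To build intuition for this loss, consider a single triple with made-up scores (these numbers are purely illustrative). The per-triple loss depends only on the margin $p^+_{ij}-p^-_{ij'}$:
\begin{equation*}
\begin{split}
p^+_{ij}-p^-_{ij'} = 2 &\;\Rightarrow\; 1-\sigma(2) \approx 0.12, \\
p^+_{ij}-p^-_{ij'} = 0 &\;\Rightarrow\; 1-\sigma(0) = 0.5, \\
p^+_{ij}-p^-_{ij'} = -2 &\;\Rightarrow\; 1-\sigma(-2) = \sigma(2) \approx 0.88.
\end{split}
\end{equation*}
The loss therefore shrinks toward $0$ as the observed movie is scored further above the sampled one, and grows toward $1$ when the order is reversed.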
\vspace{0.2cm}
\noindent\textbf{Regression Model}: For training the regression model, we consider only batches of tuples $(\text{userID}_i, \text{itemID}_j^+, r_{ij})$, such that $M_{i, j} = 1$ and $r_{ij}$ is the numerical rating $\text{userID}_i$ assigned to $\text{itemID}_j^+$. Using the same latent vector representations as before, we will concatenate $[\mathbf{u}_i, \mathbf{q}_j, \mathbf{u}_i * \mathbf{q}_j]$ (where $*$ denotes element-wise multiplication) together and pass it through a neural network with a single hidden layer:
\begin{equation}
\hat{r}_{ij}=f_{\theta}([\mathbf{u}_i, \mathbf{q}_j, \mathbf{u}_i * \mathbf{q}_j])
\label{eq:reg}
\end{equation}
We train the model using the mean-squared error loss:
\begin{equation}
\mathcal{L}_R(\mathbf{\hat{r}}, \mathbf{r})= \frac{1}{B}\sum_{i=1}^B ||\hat{r}_{ij}-r_{ij}||^2
\label{eq:r}
\end{equation}
\vspace{0.2cm}
\subsection{\textbf{[14 total points (Coding)]} Your Implementation}
\noindent\textbf{A. Implement MultiTaskNet Model}:
\noindent The first part of the assignment is to implement the above model in \texttt{models.py}. First, you need to define each component when the model is initialized.
\begin{enumerate}
\item \textbf{[3 points (Coding)]} Consider the matrices $\mathbf{U} = [\mathbf{u}_1|\ldots|\mathbf{u}_{N_{\text{users}}}]\in\mathbb{R}^{N_{\text{users}}\times d}$, $\mathbf{Q} = [\mathbf{q}_1|\ldots|\mathbf{q}_{N_{\text{items}}}]\in\mathbb{R}^{N_{\text{items}}\times d}$, and $\mathbf{B} = [b_1, \ldots, b_{N_{\text{items}}}]\in \mathbb{R}^{N_{\text{items}}\times 1}$. Implement $\mathbf{U}$ and $\mathbf{Q}$ as \texttt{ScaledEmbedding} layers with parameter $d=\texttt{embedding\_dim}$ and $\mathbf{B}$ as a \texttt{ZeroEmbedding} layer with parameter $d=1$ (both classes are defined in \texttt{models.py}). These are instances of \href{https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html}{PyTorch Embedding} layers with a different weight initialization, which facilitates better convergence.
Specifically, please complete the following functions in \texttt{models.py}:
\begin{itemize}
\item \texttt{init\_shared\_user\_and\_item\_embeddings}
\item \texttt{init\_separate\_user\_and\_item\_embeddings}
\item \texttt{init\_item\_bias}
\end{itemize}
\item \textbf{[2 points (Coding)]} Next, implement $f_{\theta}([\mathbf{u}_i, \mathbf{q}_j, \mathbf{u}_i * \mathbf{q}_j])$ as an MLP network. The class \texttt{MultiTaskNet} has a \texttt{layer\_sizes} argument, which is a list of the input sizes of each dense layer. Notice that by default $\texttt{embedding\_dim}=32$, while the input size of the first layer is 96, since we concatenate $[\mathbf{u}_i, \mathbf{q}_j, \mathbf{u}_i * \mathbf{q}_j]$ before processing it through the network. Each layer (except the final layer) should be followed by a ReLU activation. The final layer should output the predicted user-item score and have an output size of 1.
Specifically, please complete the function \texttt{init\_mlp\_layers} in \texttt{models.py}.
\end{enumerate}
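\vspace{0.2cm}
\noindent As a quick shape check for the MLP, take $\texttt{embedding\_dim}=32$ and a hypothetical setting $\texttt{layer\_sizes}=[96, 64]$ (this particular list is an assumption for illustration; check the default in \texttt{models.py}). Reading \texttt{layer\_sizes} as the input size of each dense layer, the network would map
\begin{equation*}
\mathbb{R}^{96} \xrightarrow{\ \text{dense}\,+\,\text{ReLU}\ } \mathbb{R}^{64} \xrightarrow{\ \text{dense}\ } \mathbb{R}^{1},
\end{equation*}
where $96 = 3\times 32$ comes from concatenating $\mathbf{u}_i$, $\mathbf{q}_j$, and $\mathbf{u}_i * \mathbf{q}_j$, and the final dense layer has no activation.
\vspace{0.2cm}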
\noindent\textbf{B. Implement Forward} \textbf{[9 points (Coding)]}:
\noindent In the second part of the problem, you need to implement the \texttt{forward} method of the \texttt{MultiTaskNet} module. The \texttt{forward} method receives a batch of user-item pairs $(\text{userID}_i, \text{itemID}_j)$. The model should output a probability $p_{ij}$ of shape $(\texttt{batch\_size},)$ that user $i$ would watch movie $j$, given by Eq.~\ref{eq:prob}, and a predicted score $\hat{r}_{ij}$ of shape $(\texttt{batch\_size},)$ that user $i$ would assign to movie $j$, given by Eq.~\ref{eq:reg}. Note that you do not need to compute the entire user-item interaction matrix $M$ defined above. Instead, you can assume that the user and item inputs are aligned position by position, and predict the interaction and score only for the pairs (user[1], item[1]), \ldots, (user[\texttt{batch\_size}], item[\texttt{batch\_size}]).
\\\\
\noindent Moreover, the \texttt{MultiTaskNet} class has an \texttt{embedding\_sharing} attribute. Implement your model so that when \texttt{embedding\_sharing=True} a single latent vector representation is used for both the factorization and regression tasks, and when \texttt{embedding\_sharing=False} each task uses its own separate representation. \textbf{Be careful with output tensor shapes!}
\noindent Specifically, please complete the following functions in \texttt{models.py}:
\begin{itemize}
\item \texttt{forward\_with\_embedding\_sharing}
\item \texttt{forward\_without\_embedding\_sharing}
\end{itemize}
\noindent\textbf{Optional. Autograding Your Code.}
In this homework, the released code includes an autograder to help you debug and develop your code. To run the autograder, execute:
\texttt{python grader.py}
\noindent The maximum score you can get when running the autograder locally is \textbf{8 / 8 points}. A further \textbf{6 points} come from hidden test cases that are run when you submit your code to Gradescope, for a total of \textbf{14 points} for the Coding part.
\section{Write-up}
\subsection{\textbf{[8 total points (Plot)]} Plot Comparison}
To execute experiments, run the \texttt{main.py} script, which will automatically log the training MSE loss, the training BPR loss, the test set MSE loss, and the test set MRR score to \href{https://pytorch.org/docs/stable/tensorboard.html}{TensorBoard}.
Run
\texttt{tensorboard --logdir run}
\noindent to visualize the losses in TensorBoard.
Once you're done with your implementation, run the following 4 experiments:
\begin{enumerate}
\item \textbf{[2 points (Plot)]} Evaluate a model with shared representations and task weights $\lambda_F=0.99, \lambda_R=0.01$. You can run this experiment by running:
\texttt{python main.py --factorization\_weight 0.99 --regression\_weight 0.01 \\--logdir run/shared=True\_LF=0.99\_LR=0.01}
Here the \texttt{--factorization\_weight} and \texttt{--regression\_weight} arguments correspond to $\lambda_F$ and $\lambda_R$ respectively.
\item \textbf{[2 points (Plot)]} Evaluate a model with shared representations and task weights $\lambda_F=0.5, \lambda_R=0.5$. You can run this experiment by running:
\texttt{python main.py --factorization\_weight 0.5 --regression\_weight 0.5 \\--logdir run/shared=True\_LF=0.5\_LR=0.5}
\item \textbf{[2 points (Plot)]} Evaluate a model with \textbf{separate} representations and task weights $\lambda_F=0.5, \lambda_R=0.5$. You can run this experiment by running:
\texttt{python main.py --no\_shared\_embeddings --factorization\_weight 0.5 \\ --regression\_weight 0.5 --logdir run/shared=False\_LF=0.5\_LR=0.5}
\item \textbf{[2 points (Plot)]} Evaluate a model with \textbf{separate} representations and task weights $\lambda_F=0.99, \lambda_R=0.01$. You can run this experiment by running:
\texttt{python main.py --no\_shared\_embeddings --factorization\_weight 0.99 \\ --regression\_weight 0.01 --logdir run/shared=False\_LF=0.99\_LR=0.01}
\end{enumerate}
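\vspace{0.2cm}
\noindent When interpreting these runs, it may help to keep in mind how the two task weights plausibly enter training. This handout does not spell out the combination, so the following is an assumption about the training code rather than a statement of it: the total objective is taken to be the weighted sum of the two losses defined earlier,
\begin{equation*}
\mathcal{L} = \lambda_F\,\mathcal{L}_F + \lambda_R\,\mathcal{L}_R .
\end{equation*}
Under this reading, $\lambda_F=0.99, \lambda_R=0.01$ makes the gradient signal from the regression loss roughly $99$ times weaker relative to the factorization loss than in the $\lambda_F=\lambda_R=0.5$ setting, for the same loss values.
\vspace{0.2cm}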
\noindent\textbf{Your plots go here}:
For each experiment, include a screenshot of the TensorBoard graphs for the training and test set losses in your write-up.
\vspace{0.8cm}
\subsection{\textbf{[8 total points (Written)]} Analysis}
Answer the following questions:
\begin{enumerate}
\item \textbf{[2 points (Written)]} In Eqn.~\ref{eq:prob}, we only include a movie-specific bias term $b_j$. Does it make sense to also include a user-specific bias term $a_i$? Specifically, if we define $p_{ij}$ in the following way:
\begin{equation}
p_{ij} = \mathbf{u}_i^T\mathbf{q}_j + a_i + b_j,
\end{equation}
will the model capacity increase or remain the same?
\item \textbf{[2 points (Written)]} Consider the case with $\lambda_F=0.99$ and $\lambda_R=0.01$. Based on the train/test loss curves, does parameter sharing outperform having separate models? Please provide a brief justification for your answer.
\item \textbf{[2 points (Written)]} Now consider the case with $\lambda_F=0.5$ and $\lambda_R=0.5$. Based on the train/test loss curves, does parameter sharing outperform having separate models? Please provide a brief justification for your answer.
\item \textbf{[2 points (Written)]} In the \textbf{shared model setting}, compare the results for $(\lambda_F=0.99, \lambda_R=0.01)$ with those for $(\lambda_F=0.5, \lambda_R=0.5)$. Can you explain the difference in performance? Please provide a brief justification for your answer.
\end{enumerate}
\noindent\textbf{Your answers go here}:
\vspace{0.8cm}
\newpage
\bibliography{Homework0/references}
\bibliographystyle{unsrt}
\end{document}