Title: | Recommender System using Matrix Factorization |
---|---|
Description: | R wrapper of the 'libmf' library <https://www.csie.ntu.edu.tw/~cjlin/libmf/> for recommender system using matrix factorization. It is typically used to approximate an incomplete matrix using the product of two matrices in a latent space. Other common names for this task include "collaborative filtering", "matrix completion", "matrix recovery", etc. High performance multi-core parallel computing is supported in this package. |
Authors: | Yixuan Qiu, David Cortes, Chih-Jen Lin, Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang, and other contributors. See file AUTHORS for details. |
Maintainer: | Yixuan Qiu <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 0.5.1 |
Built: | 2024-11-06 04:41:58 UTC |
Source: | https://github.com/yixuan/recosystem |
Functions in this page are used to specify the source of data in the recommender system.
They are intended to provide the input argument of functions such as
$tune()
, $train()
, and $predict()
.
Currently three data formats are supported: data file (via function data_file()
),
data in memory as R objects (via function data_memory()
), and data stored as a
sparse matrix (via function data_matrix()
).
data_file(path, index1 = FALSE, ...) data_memory(user_index, item_index, rating = NULL, index1 = FALSE, ...) data_matrix(mat, ...)
data_file(path, index1 = FALSE, ...) data_memory(user_index, item_index, rating = NULL, index1 = FALSE, ...) data_matrix(mat, ...)
path |
Path to the data file. |
index1 |
Whether the user indices and item indices start with 1
( |
... |
Currently unused. |
user_index |
An integer vector giving the user indices of rating scores. |
item_index |
An integer vector giving the item indices of rating scores. |
rating |
A numeric vector of the observed entries in the rating matrix.
Can be specified as |
mat |
A |
In $tune()
and $train()
, functions in this page
are used to specify the source of training data.
data_file()
expects a text file that describes a sparse matrix
in triplet form, i.e., each line in the file contains three numbers
row col value
representing a number in the rating matrix with its location. In real applications, it typically looks like
user_index item_index rating
The ‘smalltrain.txt’ file in the ‘dat’ directory of this package shows an example of training data file.
If the sparse matrix is given as a dgTMatrix
or ngTMatrix
object
(triplets/COO format defined in the Matrix package), then the function
data_matrix()
can be used to specify the data source.
If user index, item index, and ratings are stored as R vectors in memory,
they can be passed to data_memory()
to form the training data source.
By default the user index and item index start with zeros, and the option
index1 = TRUE
can be set if they start with ones.
From version 0.4 recosystem supports two special types of matrix
factorization: the binary matrix factorization (BMF), and the one-class
matrix factorization (OCMF). BMF requires ratings to take value from
, and OCMF requires all the ratings to be positive.
In $predict()
, functions in this page provide the source of
testing data. The testing data have the same format as training data, except
that the value (rating) column is not required, and will be ignored if it is
provided. The ‘smalltest.txt’ file in the ‘dat’ directory of this
package shows an example of testing data file.
An object of class "DataSource" as required by
$tune()
, $train()
, and $predict()
.
Yixuan Qiu <https://statr.me>
This method is a member function of class "RecoSys
"
that exports the user score matrix and the item score matrix
.
Prior to calling this method, model needs to be trained using member function
$train()
.
The common usage of this method is
r = Reco() r$train(...) r$output(out_P = out_file("mat_P.txt"), out_Q = out_file("mat_Q.txt"))
r |
Object returned by |
out_P |
An object of class |
out_Q |
Ditto, but for the item matrix. |
A list with components P
and Q
. They will be filled
with user or item matrix if out_memory()
is used
in the function argument, otherwise NULL
will be returned.
Yixuan Qiu <https://statr.me>
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST, 2015.
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization. PAKDD, 2015.
W.-S. Chin, B.-W. Yuan, M.-Y. Yang, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Technical report, 2015.
train_set = system.file("dat", "smalltrain.txt", package = "recosystem") r = Reco() set.seed(123) # This is a randomized algorithm r$train(data_file(train_set), out_model = file.path(tempdir(), "model.txt"), opts = list(dim = 10, nmf = TRUE)) ## Write P and Q matrices to files P_file = out_file(tempfile()) Q_file = out_file(tempfile()) r$output(P_file, Q_file) head(read.table(P_file@dest, header = FALSE, sep = " ")) head(read.table(Q_file@dest, header = FALSE, sep = " ")) ## Skip P and only export Q r$output(out_nothing(), Q_file) ## Return P and Q in memory res = r$output(out_memory(), out_memory()) head(res$P) head(res$Q)
train_set = system.file("dat", "smalltrain.txt", package = "recosystem") r = Reco() set.seed(123) # This is a randomized algorithm r$train(data_file(train_set), out_model = file.path(tempdir(), "model.txt"), opts = list(dim = 10, nmf = TRUE)) ## Write P and Q matrices to files P_file = out_file(tempfile()) Q_file = out_file(tempfile()) r$output(P_file, Q_file) head(read.table(P_file@dest, header = FALSE, sep = " ")) head(read.table(Q_file@dest, header = FALSE, sep = " ")) ## Skip P and only export Q r$output(out_nothing(), Q_file) ## Return P and Q in memory res = r$output(out_memory(), out_memory()) head(res$P) head(res$Q)
Functions in this page are used to specify the format of output results.
They are intended to provide the argument of functions such as
$output()
and $predict()
.
Currently there are three types of output: out_file()
indicates
that the result should be written into a file, out_memory()
makes
the result to be returned as R objects, and out_nothing()
means
the result is not needed and will not be returned.
out_file(path, ...) out_memory(...) out_nothing(...)
out_file(path, ...) out_memory(...) out_nothing(...)
path |
Path to the output file. |
... |
Currently unused. |
An object of class "Output" as required by
$output()
and $predict()
.
Yixuan Qiu <https://statr.me>
This method is a member function of class "RecoSys
"
that predicts unknown entries in the rating matrix.
Prior to calling this method, model needs to be trained using member function
$train()
.
The common usage of this method is
r = Reco() r$train(...) r$predict(test_data, out_pred = out_file("predict.txt")
r |
Object returned by |
test_data |
An object of class "DataSource" that describes the source
of testing data, typically returned by function
|
out_pred |
An object of class |
Yixuan Qiu <https://statr.me>
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST, 2015.
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization. PAKDD, 2015.
W.-S. Chin, B.-W. Yuan, M.-Y. Yang, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Technical report, 2015.
$train()
## Not run: train_file = data_file(system.file("dat", "smalltrain.txt", package = "recosystem")) test_file = data_file(system.file("dat", "smalltest.txt", package = "recosystem")) r = Reco() set.seed(123) # This is a randomized algorithm opts_tune = r$tune(train_file)$min r$train(train_file, out_model = NULL, opts = opts_tune) ## Write predicted values into file out_pred = out_file(tempfile()) r$predict(test_file, out_pred) ## Return predicted values in memory pred = r$predict(test_file, out_memory()) ## If testing data are stored in memory test_df = read.table(test_file@source, sep = " ", header = FALSE) test_data = data_memory(test_df[, 1], test_df[, 2]) pred2 = r$predict(test_data, out_memory()) ## Compare results print(scan(out_pred@dest, n = 10)) head(pred, 10) head(pred2, 10) ## If testing data are stored as a sparse matrix if(require(Matrix)) { mat = Matrix::sparseMatrix(i = test_df[, 1], j = test_df[, 2], x = -1, repr = "T", index1 = FALSE) test_data = data_matrix(mat) pred3 = r$predict(test_data, out_memory()) print(head(pred3, 10)) } ## End(Not run)
## Not run: train_file = data_file(system.file("dat", "smalltrain.txt", package = "recosystem")) test_file = data_file(system.file("dat", "smalltest.txt", package = "recosystem")) r = Reco() set.seed(123) # This is a randomized algorithm opts_tune = r$tune(train_file)$min r$train(train_file, out_model = NULL, opts = opts_tune) ## Write predicted values into file out_pred = out_file(tempfile()) r$predict(test_file, out_pred) ## Return predicted values in memory pred = r$predict(test_file, out_memory()) ## If testing data are stored in memory test_df = read.table(test_file@source, sep = " ", header = FALSE) test_data = data_memory(test_df[, 1], test_df[, 2]) pred2 = r$predict(test_data, out_memory()) ## Compare results print(scan(out_pred@dest, n = 10)) head(pred, 10) head(pred2, 10) ## If testing data are stored as a sparse matrix if(require(Matrix)) { mat = Matrix::sparseMatrix(i = test_df[, 1], j = test_df[, 2], x = -1, repr = "T", index1 = FALSE) test_data = data_matrix(mat) pred3 = r$predict(test_data, out_memory()) print(head(pred3, 10)) } ## End(Not run)
This function simply returns an object of class "RecoSys
"
that can be used to construct recommender model and conduct prediction.
Reco()
Reco()
Reco()
returns an object of class "RecoSys
"
equipped with methods
$train()
, $tune()
, $output()
and $predict()
, which describe the typical process of
building and tuning model, exporting factorization matrices, and
predicting results. See their help documents for details.
Yixuan Qiu <https://statr.me>
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST, 2015.
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization. PAKDD, 2015.
W.-S. Chin, B.-W. Yuan, M.-Y. Yang, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Technical report, 2015.
$tune()
, $train()
, $output()
,
$predict()
This method is a member function of class "RecoSys
"
that trains a recommender model. It will read from a training data source and
create a model file at the specified location. The model file contains
necessary information for prediction.
The common usage of this method is
r = Reco() r$train(train_data, out_model = file.path(tempdir(), "model.txt"), opts = list())
r |
Object returned by |
train_data |
An object of class "DataSource" that describes the source
of training data, typically returned by function
|
out_model |
Path to the model file that will be created.
If passing |
opts |
A number of parameters and options for the model training. See section Parameters and Options for details. |
The opts
argument is a list that can supply any of the following parameters:
loss
Character string, the loss function. Default is "l2", see below for details.
dim
Integer, the number of latent factors. Default is 10.
costp_l1
Numeric, L1 regularization parameter for user factors. Default is 0.
costp_l2
Numeric, L2 regularization parameter for user factors. Default is 0.1.
costq_l1
Numeric, L1 regularization parameter for item factors. Default is 0.
costq_l2
Numeric, L2 regularization parameter for item factors. Default is 0.1.
lrate
Numeric, the learning rate, which can be thought of as the step size in gradient descent. Default is 0.1.
niter
Integer, the number of iterations. Default is 20.
nthread
Integer, the number of threads for parallel computing. Default is 1.
nbin
Integer, the number of bins. Must be greater than nthread
.
Default is 20.
nmf
Logical, whether to perform non-negative matrix factorization.
Default is FALSE
.
verbose
Logical, whether to show detailed information. Default is
TRUE
.
The loss
option may take the following values:
For real-valued matrix factorization,
"l2"
Squared error (L2-norm)
"l1"
Absolute error (L1-norm)
"kl"
Generalized KL-divergence
For binary matrix factorization,
"log"
Logarithmic error
"squared_hinge"
Squared hinge loss
"hinge"
Hinge loss
For one-class matrix factorization,
"row_log"
Row-oriented pair-wise logarithmic loss
"col_log"
Column-oriented pair-wise logarithmic loss
Yixuan Qiu <https://statr.me>
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST, 2015.
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization. PAKDD, 2015.
W.-S. Chin, B.-W. Yuan, M.-Y. Yang, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Technical report, 2015.
$tune()
, $output()
, $predict()
## Training model from a data file train_set = system.file("dat", "smalltrain.txt", package = "recosystem") train_data = data_file(train_set) r = Reco() set.seed(123) # This is a randomized algorithm # The model will be saved to a file r$train(train_data, out_model = file.path(tempdir(), "model.txt"), opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1) ) ## Training model from data in memory train_df = read.table(train_set, sep = " ", header = FALSE) train_data = data_memory(train_df[, 1], train_df[, 2], rating = train_df[, 3]) set.seed(123) # The model will be stored in memory r$train(train_data, out_model = NULL, opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1) ) ## Training model from data in a sparse matrix if(require(Matrix)) { mat = Matrix::sparseMatrix(i = train_df[, 1], j = train_df[, 2], x = train_df[, 3], repr = "T", index1 = FALSE) train_data = data_matrix(mat) r$train(train_data, out_model = NULL, opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1)) }
## Training model from a data file train_set = system.file("dat", "smalltrain.txt", package = "recosystem") train_data = data_file(train_set) r = Reco() set.seed(123) # This is a randomized algorithm # The model will be saved to a file r$train(train_data, out_model = file.path(tempdir(), "model.txt"), opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1) ) ## Training model from data in memory train_df = read.table(train_set, sep = " ", header = FALSE) train_data = data_memory(train_df[, 1], train_df[, 2], rating = train_df[, 3]) set.seed(123) # The model will be stored in memory r$train(train_data, out_model = NULL, opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1) ) ## Training model from data in a sparse matrix if(require(Matrix)) { mat = Matrix::sparseMatrix(i = train_df[, 1], j = train_df[, 2], x = train_df[, 3], repr = "T", index1 = FALSE) train_data = data_matrix(mat) r$train(train_data, out_model = NULL, opts = list(dim = 20, costp_l2 = 0.01, costq_l2 = 0.01, nthread = 1)) }
This method is a member function of class "RecoSys
"
that uses cross validation to tune the model parameters.
The common usage of this method is
r = Reco() r$tune(train_data, opts = list(dim = c(10L, 20L), costp_l1 = c(0, 0.1), costp_l2 = c(0.01, 0.1), costq_l1 = c(0, 0.1), costq_l2 = c(0.01, 0.1), lrate = c(0.01, 0.1)) )
r |
Object returned by |
train_data |
An object of class "DataSource" that describes the source
of training data, typically returned by function
|
opts |
A number of candidate tuning parameter values and extra options in the model tuning procedure. See section Parameters and Options for details. |
A list with two components:
min
Parameter values with minimum cross validated loss.
This is a list that can be passed to the
opts
argument in $train()
.
res
A data frame giving the supplied candidate values of tuning parameters, and one column showing the loss function value associated with each combination.
The opts
argument should be a list that provides the candidate values
of tuning parameters and some other options. For tuning parameters (dim
,
costp_l1
, costp_l2
, costq_l1
, costq_l2
,
and lrate
), users can provide a numeric vector for each one, so that
the model will be evaluated on each combination of the candidate values.
For other non-tuning options, users should give a single value. If a parameter
or option is not set by the user, the program will use a default one.
See below for the list of available parameters and options:
dim
Tuning parameter, the number of latent factors.
Can be specified as an integer vector, with default value
c(10L, 20L)
.
costp_l1
Tuning parameter, the L1 regularization cost for user factors.
Can be specified as a numeric vector, with default value
c(0, 0.1)
.
costp_l2
Tuning parameter, the L2 regularization cost for user factors.
Can be specified as a numeric vector, with default value
c(0.01, 0.1)
.
costq_l1
Tuning parameter, the L1 regularization cost for item factors.
Can be specified as a numeric vector, with default value
c(0, 0.1)
.
costq_l2
Tuning parameter, the L2 regularization cost for item factors.
Can be specified as a numeric vector, with default value
c(0.01, 0.1)
.
lrate
Tuning parameter, the learning rate, which can be thought
of as the step size in gradient descent.
Can be specified as a numeric vector, with default value
c(0.01, 0.1)
.
loss
Character string, the loss function. Default is "l2", see
section Parameters and Options in $train()
for details.
nfold
Integer, the number of folds in cross validation. Default is 5.
niter
Integer, the number of iterations. Default is 20.
nthread
Integer, the number of threads for parallel computing. Default is 1.
nbin
Integer, the number of bins. Must be greater than nthread
.
Default is 20.
nmf
Logical, whether to perform non-negative matrix factorization.
Default is FALSE
.
verbose
Logical, whether to show detailed information. Default is
FALSE
.
progress
Logical, whether to show a progress bar. Default is TRUE
.
Yixuan Qiu <https://statr.me>
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems. ACM TIST, 2015.
W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate Schedule for Stochastic Gradient Methods to Matrix Factorization. PAKDD, 2015.
W.-S. Chin, B.-W. Yuan, M.-Y. Yang, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems. Technical report, 2015.
$train()
## Not run: train_set = system.file("dat", "smalltrain.txt", package = "recosystem") train_src = data_file(train_set) r = Reco() set.seed(123) # This is a randomized algorithm res = r$tune( train_src, opts = list(dim = c(10, 20, 30), costp_l1 = 0, costq_l1 = 0, lrate = c(0.05, 0.1, 0.2), nthread = 2) ) r$train(train_src, opts = res$min) ## End(Not run)
## Not run: train_set = system.file("dat", "smalltrain.txt", package = "recosystem") train_src = data_file(train_set) r = Reco() set.seed(123) # This is a randomized algorithm res = r$tune( train_src, opts = list(dim = c(10, 20, 30), costp_l1 = 0, costq_l1 = 0, lrate = c(0.05, 0.1, 0.2), nthread = 2) ) r$train(train_src, opts = res$min) ## End(Not run)