The xgbAnalysis package is a tool to find suitable reaction coordinates of biomolecular systems, e.g. proteins, using the XGBoost algorithm that can be found on github. Given a trajectory from a molecular dynamics simulation and suitable corresponding states, it evaluates, which of the input coordinates describe the system best, resulting in a low dimensional reaction coordinate of directly interpretable original coordinates. To obtain states for a given trajectorie the prodyna package can be used.
Install the xgbAnalysis package, using the R-devtools package
if ( ! ("devtools" %in% installed.packages())) {
install.packages("devtools")
}
devtools::install_github("moldyn/xgbAnalysis")
library(xgbAnalysis)
To this end internal data of HP-35 with every 100th frame for a quick testrun:
Path: /u3/rc_classification/testrun/
Including test files:
trajectory from MD simulation converted to dihedral angles villin_360K_every100.dih
states obtained from maxgap shifted density based clustering macrostates_every100
You can find an introduction to boosted decision trees here.
While XGBoost has many parameter to define, how the decision trees are build, the following are the most important and are usually used improve the learining process.
eta
learning rate (default=0.3
): defines how fast each training round will improve the missclassification done by the model so far. A low value will need more training rounds to get a good prediction accuracy, while a high value will lead to fast overfitting.
max_depth
maximum tree depth (default=6
): defines the maximum depth of the single decision trees, e.g. the maximum number of decision nodes, until a prediction value is assigned.
nrounds
number of training rounds: defines how many rounds the model is trained. Usually, more training rounds result in a better prediction accuracy, but too many training rounds will lead to overfitting, when the prediction error on the training set diverges from the prediction error on the test set.
nthread
number of CPU cores to use (default=0
): nthread=0
uses all cores
When training a model to precisely predict states for unseen data, a good parameter choice is crucial. For the feature analysis however, the parameter choice is not that important, as the XGBoost algorithm is strong against overfitting and comes with a good default choice of training parameter. Here, it is most important, that the build model is “complex” enough with enough training rounds (nrounds
) of trees that are not too small (max_depth
), so that enough features (coordinates) are used in the evaluation process for split candidates in the split nodes. Usually the default parameter suit that requirement.
In a first step the trajectory and states will be separated into a training set trainsplit = 0.7 => 70%
and a test set.
getwd()
import.data(output_dir = "./data",
coords = "villin_360K_every100.dih",
states = "macrostates_every100",
labels = "dihedrals",
trainsplit = 0.7)
This will create the ouput directory, here data
, containing the following files
import.data.parameter
with the parameter used in the import function, as some future functions need to reload the original trajectory.
train.parameter
with the default training parameter for the XGBoost algorithm and the number of classes (states) that are needed for the training process. Parameters can be changed using the set.parameter
function.
train.index
as the training set is choosen randomly from the data, this file contains the indices of the training set to ensure that the same set can be taken again.
test.index
with the indices of the test set.
feature.names
with the names (labels
) of the coordinates of the trajectory file.
train.xgb.Dmatrix
the training set in the XGBoost dense matrix format.
test.xgb.Dmatrix
the test set in the XGBoost dense matrix format.
To simply train a model that predicts the corresponding states of the trajectory with the default parameter and 2 training rounds, use
train.model(data_dir = "./data",
output_dir = "./model",
nrounds = 2
)
As the data set is small and the number of training rounds is small, the accuracy will not be good. For better accuracy, try using nrounds = 20
training rounds for which the computation time will be 10x larger.
This will produce the following files in the output directory
xgb.model
the trained model that now can be used to predict states.
xgb.dump.model
containing the tree structure, that has been trained.
importance
with the feature importance from the training process.
TODO description
feature.selection(output_dir = "./featureSelection",
data = "./data",
decreasing = F,
eta = 0.3,
max_depth = 6,
nrounds = 20,
nthread = 0,
savemode = T)
This will produce the following files in the ouput directory
feature.selection
containing the list of features that are dismissed, the overall accuracy and the accuracy for every state for every iteration round.
parameter
containing the parameter values used for feature selection.
If savemode = TRUE
the following files will be produced every iteration round
selectround<n>.importance
containing the feature importance of the iteration round.
selectround<n>.model
the trained model.
selectround<n>.prediction
with the state prediction made by the model of this iteration.
The index n
gives the number of coordinates (features) that are dismissed. In round 0, the model is trained with all coordinates and in round 1 the most/least important coordinate from round 0 has been removed from the data set. Given 66 dihedral angles for HP-35, at selectround 65 the model is trained using just the last remaining coordinate.
The prediction accuracy can be plotted in dependency of the number of dismissed coordinates (features).
plt.feature.selection(dir = "./featureSelection",
saveplot = T)
If saveplot = TRUE
, the folder ./featureSelection/plot/
, containing the plot of the feature selection will be created. If FALSE
the plot will be returned.
Dismissing the coordinates decreasingly, the most important coordinate will be dismissed every round.
feature.selection(output_dir = "./featureSelection_decreasing",
data = "./data",
decreasing = T,
eta = 0.3,
max_depth = 6,
nrounds = 20,
nthread = 0,
savemode = T)
Plot the feature selection decreasing
plt.feature.selection(dir = "./featureSelection_decreasing",
saveplot = F)
If you want to continue a previous not finished feature selection or you want to dismiss certain features from the beginning, you can give a vector (fdismissed
) with the names of the features (coordinates), that will be dismissed before the first iteration. To define how many features (coordinates) should be dismissed at all, use ndismiss
feature.selection(output_dir = "./featureSelection_decreasing",
data = "./data",
decreasing = T,
ndismiss = 8,
fdismissed = c("Phi2", "Psi13", "Psi4"),
eta = 0.3,
max_depth = 6,
nrounds = 20,
nthread = 0,
savemode = T)
This will remove the coordinates “Phi2”, “Psi13” and “Psi4” from the data set before starting to remove the most important coordinate for 5 iterations.
Given a trained model, the feature importance for every state can be obtained. Here, for every single state, the individual importance of all coordinates for that state is calculated and normalized, so that the importance of coordinates for one state add up to 1. The results are shown in a plot that can either be obtained setting a minimum importance, a coordinate has to contribute for at least one state or the number of coordinates that should be shown. Hence, setting the minimum importance to colmin = 0
results in a plot showing all coordinates and setting the number of coordinates to nfeatures = 6
results in a plot showing the 6 most important coordinates, weighted by their population.
plt.single.class.importance(pre = "./singleClassImportance/sci",
model = "./model/xgb.model",
names = "./data/feature.names",
colmin = 0)
This function will produce the file sci_colmin0.png
in the directory singleClassImportance
with sci
as the prefix.
plt.single.class.importance(pre = "./singleClassImportance/sci",
model = "./model/xgb.model",
names = "./data/feature.names",
nfeatures = 6)
This function will produce the file sci_n6.png
in the directory singleClassImportance
with sci
as the prefix.
Multiple plots can be obtained as well
plt.single.class.importance(pre = "./singleClassImportance/sci",
model = "./model/xgb.model",
names = "./data/feature.names",
colmin = c(0,0.2,0.5),
nfeatures = c(4,5,6))
To test the influence of different parameter on the prediction accuracy of the trained model, the parameter test fuction can be used. The defpar
parameter defines, which training parameter should be used as base. defpar = 'default'
uses the XGBoost default parameter. defpar = NA
will take the parameter set in the train.parameter
file within the data directory. The training parameter can also be set manually with defpar = list(eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
. The parameters to test should be given as data frame in the following format: data.frame(parameter = c(start, stop, stepsize, adjust nrounds(T/F)))
. As some parameter influence the computation time, more training rounds are neccesary to get comparable results of the prediction accuracy. Therefore, T/F
parameter defines wether to adjust the training rounds by nrounds*(max_value/current_value)
.
parameter.test(data = "./data",
output_dir = "./parameterTest",
nthread = 0,
nrounds = 10,
defpar = 'default',
testpar = data.frame(eta = c(0.1, 0.5, 0.1, T),
max_depth = c(3, 10, 1, T))
)
This will produce two plots in the output directory, showing the the prediction accuracy in dependence of the learning rate eta
and the maximum tree depth max_depth
.