Waiting Times and Pathways of Markov Models¶
For a more detailed introduction to this topic, please refer to following article.
MSMPathfinder: Identification of Pathways in Markov State Models
Daniel Nagel, Anna Weber, and Gerhard Stock
J. Chem. Theory Comput. 2020 16 (12), 7874-7882
doi: 10.1021/acs.jctc.0c00774
Introduction¶
In the context of biochemical systems, understanding the kinetics of protein dynamics is critical for gaining insight into the underlying mechanisms of biological function. The Markov state model provides a powerful framework for characterizing the kinetics of a system. One can use it to estimate the waiting times $t^\text{wt}$ between transitions, the transition path times $t^\text{tt}$, and the pathways that the system takes between states.
It is important to note that waiting times $t^\text{wt}$, transition path times $t^\text{tt}$, and means first passage times $t^\text{mfpt}$ are related but distinct quantities. Waiting times are the times between individual transitions. Transition path times, on the other hand, are the times required for the system to traverse a particular path between two states. Finally, mean first passage times are the average times required for the system to reach a particular state for the first time. In general, it holds that $$ t^\text{wt} \ge t^\text{mfpt} \ge t^\text{tt}\;. $$ In general, we are interested in comparisons with experiments. For example, in biochemical experiments, one can measure the times required for a system (an ensemble) to transition from one state to another. Since this directly corresponds to the waiting time distribution, we mainly focus on it and try to relate it to the corresponding pathways.
In this section, we will focus on the waiting times and transition path times and show how to estimate them using either the Markov state model or directly the state trajectory. We will also discuss how to visualize the pathways and interpret the results in the context of the toy system being studied.
Model Systems¶
In the following we will use two simple toy models introduced and discussed by Nagel et al. 20, namely, the following 4-state and 6-state models
import msmhelper as mh
from msmhelper.utils import datasets
print(
f'4 state model:\nT =\n{datasets.nagel20_4state.tmat}\n\n'
f'6 tate model:\nT =\n{datasets.nagel20_6state.tmat}'
)
# generate random trajectories from transition matrices
n_steps = int(5e4)
traj_4state = datasets.nagel20_4state(n_steps)
traj_6state = datasets.nagel20_6state(n_steps)
4 state model: T = [[0.92 0.04 0.04 0. ] [0.1 0.7 0.1 0.1 ] [0.1 0.1 0.7 0.1 ] [0. 0.04 0.04 0.92]] 6 tate model: T = [[0.9 0.03 0.07 0. 0. 0. ] [0.05 0.63 0.15 0.07 0.1 0. ] [0.08 0.04 0.6 0.14 0.14 0. ] [0. 0.12 0.15 0.5 0.15 0.08] [0. 0.04 0.16 0.1 0.6 0.1 ] [0. 0. 0. 0.08 0.02 0.9 ]]
Estimating Timescales from MD¶
Before relying on Markov state models to analyse the expected times and pathways of a given process, we first want to show to analyse the raw data. In the following this is referred as MD due to the fact that it is the truly simulated dynamic of the MD simulation.
Let us start by analysing the $1\to4$ process of the 4 state model. Within this model, this is the most interesting one, due to the fact that these two states are not directly connected.
start, final = 1, 4
wts = mh.md.estimate_wt(traj_4state, start, final)
print(f'Identified waiting times [frames]:\n{wts}')
Identified waiting times [frames]: [ 4 39 141 33 26 18 136 34 22 17 147 123 12 62 23 68 18 70 25 110 34 42 36 14 29 21 25 15 18 60 36 34 77 29 60 5 19 89 9 18 7 29 52 37 19 60 38 8 5 14 18 19 20 26 33 51 54 74 29 9 10 8 13 7 12 68 8 13 11 14 6 3 7 10 23 22 30 15 17 6 40 16 7 11 33 21 31 16 14 50 34 12 10 20 31 4 4 6 22 7 6 5 106 46 5 63 35 6 27 4 59 20 23 93 16 62 25 6 10 2 15 26 7 30 51 7 145 157 60 57 95 46 20 9 7 5 3 98 16 44 12 26 65 19 61 4 54 66 67 66 56 32 22 47 24 66 23 18 30 3 6 34 26 51 16 15 40 100 45 28 4 56 132 19 58 64 14 12 86 34 84 5 14 90 19 72 43 5 13 9 27 35 21 64 43 16 22 8 83 34 22 93 41 23 32 2 13 64 47 12 71 84 27 23 15 21 5 17 46 107 85 27 24 15 11 7 22 82 28 57 19 22 26 13 13 16 32 58 65 60 44 14 46 25 4 30 4 127 46 97 60 32 25 3 18 12 5 40 21 7 14 14 108 11 87 74 15 22 20 9 49 47 43 32 11 12 42 18 5 101 56 29 4 13 4 36 25 35 12 28 37 4 31 87 56 60 14 18 29 24 21 21 38 13 52 50 45 18 32 11 27 24 20 8 37 138 7 23 54 9 53 14 15 23 36 19 11 26 48 31 6 20 23 12 9 9 7 21 39 11 31 15 71 51 18 45 70 11 59 5 37 9 14 35 65 16 56 37 21 45 66 30 123 5 43 17 54 87 11 13 11 16 73 93 18 3 3 18 49 18 17 55 63 22 32 132 22 11 36 41 80 2 6 36 10 12 36 5 31 8 104 25 23 15 26 33 7 18 38 38 22 22 17 65 25 22 9 18 33 31 19 132 42 56 6 82 82 62 109 3 88 80 67 133 27 55 22 66 28 21 50 11 28 22 32 8 38 49 18 38 4 40 32 29 17 41 18 31 46 34 66 41 13 12 6 133 142 18 2 26 8 53 5 77 19 67 44 11 4 56 44 28 97 17 22 29 20 30 50 22 3 99 77 10 12 56 57 41 64 27 48 39 23 41 43 52 10 20 8 77 41 22 41 16 30 29 20 21 56 91 28 22 21 16 14 9 27 2 14 28 2 60 10 74 53 63 109 9 13 35 27 128 26 25 59 11 134 11 14 53 40 10 19 11 32 14 26 21 15 16 35 32 56 37 25 21 31 22 6 42 6 37 22 152 11 20 70 28 100 125 14 36 13 13 14 80 14 46 65 5 38 11 24 11 9 24 60 35 19 75 4 11 11 85 21 9 64 13 11 14 16 17 12 28 73 6 20 2 10 94 21 16 5 61 10 16 22 8 41 24 12 3 15 13 13 25 12 14 7 41 18 14 34 6 13 32 6 28 22 109 9 73 128 18 49 10 5 4 32 10 16 76 89 53 18 85 25 63 46 18 15 38 38 2 15 27 13 26 4 17 5 21 32 31 55 23 5 16 65 78 15 74 11 23 43 15 18 32 8 31 32 22 58 72 34 14 7 19 57 30 3 26 8 14 60 14 17 56 17 54 105 13]
This list reveals directly that the waiting times are not normally distributed. To make it clearer, we visualize them results by a simple histogram
import numpy as np
import prettypyplot as pplt
from matplotlib import pyplot as plt
pplt.use_style(figsize=(5,2), latex=False, colors='paula')
fig, ax = plt.subplots()
# ensure that bins are integers.
bins = np.arange(0, wts.max() + 5, 5)
ax.hist(wts, bins=bins, density=True)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
Estimating Paths from MD¶
So let us now take a look at the pathways.
paths = mh.md.estimate_paths(traj_4state, start, final)
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, pathtimes in paths.items():
print(f'{path}: {pathtimes}')
Identified pathways with time of events given in [frames]: (1, 3, 4): [4, 39, 141, 17, 147, 123, 23, 110, 34, 36, 21, 29, 19, 89, 9, 29, 52, 37, 60, 8, 14, 20, 33, 51, 54, 8, 13, 7, 8, 13, 3, 22, 15, 6, 11, 21, 31, 16, 12, 31, 6, 6, 20, 23, 2, 26, 51, 7, 5, 3, 16, 44, 54, 66, 67, 32, 47, 23, 34, 26, 15, 100, 56, 19, 12, 86, 5, 14, 5, 13, 9, 35, 64, 43, 16, 22, 84, 23, 15, 21, 5, 85, 15, 22, 57, 19, 22, 26, 13, 60, 14, 30, 4, 5, 21, 87, 15, 9, 49, 12, 5, 101, 29, 4, 25, 12, 87, 14, 18, 18, 32, 27, 7, 11, 12, 39, 71, 51, 18, 5, 35, 16, 45, 123, 11, 13, 3, 3, 55, 63, 132, 41, 80, 36, 12, 36, 8, 15, 26, 18, 17, 25, 18, 19, 132, 56, 3, 80, 22, 28, 28, 22, 32, 66, 142, 26, 53, 5, 77, 19, 4, 44, 28, 97, 29, 20, 50, 99, 77, 10, 12, 23, 8, 22, 41, 21, 56, 91, 28, 14, 27, 2, 14, 2, 109, 9, 13, 26, 11, 14, 53, 11, 14, 21, 16, 32, 56, 31, 42, 70, 125, 14, 36, 80, 14, 46, 5, 11, 24, 9, 24, 60, 35, 19, 75, 11, 85, 21, 13, 16, 17, 12, 6, 10, 21, 22, 41, 24, 3, 14, 14, 6, 22, 109, 10, 5, 32, 10, 16, 89, 18, 63, 18, 38, 15, 5, 5, 15, 74, 8, 72, 34, 7, 3, 26, 13] (1, 2, 4): [33, 18, 12, 68, 70, 25, 36, 34, 5, 7, 38, 5, 19, 26, 74, 9, 68, 11, 14, 10, 23, 30, 17, 40, 7, 33, 50, 34, 10, 4, 4, 6, 22, 5, 5, 63, 27, 4, 59, 93, 16, 25, 15, 145, 157, 60, 57, 95, 46, 98, 12, 65, 19, 61, 4, 24, 18, 30, 3, 6, 51, 45, 28, 4, 132, 58, 14, 34, 84, 90, 19, 43, 8, 41, 23, 32, 2, 64, 12, 27, 46, 107, 27, 24, 11, 7, 82, 13, 16, 65, 44, 127, 46, 97, 60, 32, 25, 3, 40, 108, 74, 22, 20, 47, 43, 32, 11, 4, 36, 35, 28, 37, 4, 31, 24, 38, 13, 50, 45, 20, 8, 37, 23, 54, 53, 15, 36, 19, 6, 20, 9, 7, 11, 31, 15, 45, 70, 11, 59, 37, 9, 56, 66, 30, 5, 43, 11, 93, 18, 18, 17, 22, 32, 11, 36, 2, 6, 10, 5, 31, 25, 23, 38, 38, 22, 65, 9, 33, 31, 6, 62, 109, 88, 66, 11, 49, 18, 38, 40, 32, 29, 18, 46, 34, 41, 13, 18, 2, 44, 11, 17, 22, 3, 56, 57, 64, 41, 52, 10, 20, 77, 41, 16, 30, 20, 21, 9, 74, 63, 35, 27, 59, 11, 19, 32, 26, 21, 6, 6, 11, 20, 28, 100, 13, 13, 11, 11, 73, 2, 94, 16, 5, 10, 16, 8, 12, 15, 13, 25, 18, 34, 28, 73, 128, 18, 4, 76, 25, 46, 15, 38, 2, 27, 4, 32, 31, 55, 23, 65, 78, 23, 32, 32, 22, 19, 57, 60, 17, 17, 54] (1, 2, 3, 4): [26, 136, 34, 62, 18, 77, 60, 18, 18, 6, 7, 16, 20, 62, 30, 7, 20, 26, 56, 66, 16, 72, 93, 13, 47, 17, 32, 4, 12, 14, 18, 56, 56, 29, 11, 138, 14, 23, 26, 48, 23, 21, 65, 21, 17, 16, 73, 22, 104, 7, 22, 55, 50, 38, 17, 12, 133, 8, 67, 56, 41, 27, 48, 39, 22, 60, 10, 40, 10, 15, 35, 22, 37, 22, 152, 14, 65, 38, 9, 28, 13, 12, 41, 49, 26, 17, 21, 16, 58, 14, 8] (1, 3, 2, 4): [22, 18, 42, 14, 29, 25, 15, 60, 19, 29, 10, 12, 14, 7, 106, 46, 35, 6, 10, 7, 9, 66, 22, 40, 64, 27, 21, 83, 34, 22, 71, 28, 58, 46, 25, 18, 7, 14, 11, 42, 13, 60, 21, 21, 52, 24, 9, 31, 9, 14, 37, 54, 87, 49, 18, 33, 22, 42, 82, 82, 67, 133, 27, 21, 8, 4, 41, 31, 6, 22, 30, 43, 29, 16, 28, 53, 128, 25, 134, 37, 25, 4, 11, 64, 14, 20, 61, 7, 13, 32, 6, 9, 53, 85, 13, 11, 43, 15, 18, 31, 30, 14, 14, 56, 105]
If we now want to see how frequent these different pathways occur, we can simply reduce the above result via
path_frequencies = sorted(
((path, len(pathtimes)) for path, pathtimes in paths.items()),
key=lambda pathfreq: pathfreq[1],
reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
print(f'{path}:\t{freq / n_sampled_paths:.1%}')
Identified pathways with time of events given in [frames]: (1, 2, 4): 37.4% (1, 3, 4): 35.5% (1, 3, 2, 4): 14.5% (1, 2, 3, 4): 12.6%
Comparing these results to the true values of $37.5\%$ and $12.5\%$ (see Nagel et al. 20), we find a rather good agreement.
Estimating Timescales from MSM¶
Now we want to compare how well the Markov model is able to recover the true waiting time distributions and path frequencies.
nsteps = int(1e7)
lagtime = 1
wts_msm = mh.msm.estimate_wt(
trajs=traj_4state,
lagtime=lagtime,
start=start,
final=final,
steps=nsteps,
return_list=True,
)
print(f'Identified number of waiting times events: {len(wts_msm)}')
fig, ax = plt.subplots()
# ensure that bins are integers.
bins_msm = np.arange(0, wts_msm.max() + 1)
ax.hist(wts, bins=bins, color='C4', density=True, alpha=0.8)
ax.hist(wts_msm, bins=bins_msm, density=True, alpha=0.8)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
Identified number of waiting times events: 144131
So we see, that we get a smooth interpolation of the waiting time distribution.
Estimating Paths from MSM¶
Estimating pathways is straight forward and can be achieved by,
paths = mh.msm.estimate_paths(
trajs=traj_4state,
lagtime=lagtime,
start=start,
final=final,
steps=nsteps,
)
path_frequencies = sorted(
((path, len(pathtimes)) for path, pathtimes in paths.items()),
key=lambda pathfreq: pathfreq[1],
reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
print(f'{path}:\t{freq / n_sampled_paths:.1%}')
Identified pathways with time of events given in [frames]: (1, 2, 4): 38.5% (1, 3, 4): 36.8% (1, 2, 3, 4): 12.6% (1, 3, 2, 4): 12.1%
Comparing Waiting Times of MD vs MSM¶
Finally, we want to directly compare the effect of alternating the lag time $\tau$ on the waiting time distribution. Similar to the implied timescale plot, we can also compare the waiting time distribution for a given lag time to the resulting MD waiting time distribution. To circumvent the problem of poor statistics, we rely in this analysis only on statistical quantities instead of the probability distributions: Namely, the three quartiles $Q_{1,2,3}$, and the interquartile range $\mathrm{IQR}=Q_3 - Q_1$.
This can be easily achieved with the predefined functions:
wtd = mh.msm.estimate_wtd(
trajs=traj_4state,
max_lagtime=10,
start=start,
final=final,
steps=nsteps,
)
fig, ax = plt.subplots()
ax = mh.plot.plot_wtd(wtd=wtd, ax=ax)
plt.show()
An interesting observation we have made is that the waiting time $t$ shows a linear scaling relationship with the lag time $\tau_\text{lag}$. However, our analysis also revealed that when comparing the timescales to the reference simulation (referred to as "MD" in the figure), using a lag time of $\tau_\text{lag} \ge 6$ leads to overestimation of the waiting timescale. This effect is due to selecting a lag time that is longer than the fastest intrinsic timescale of the process. It is worth noting that unlike the commonly used implied timescale, which typically converges towards the true value, waiting times do not exhibit such behavior, and therefore, caution must be exercised when comparing these intrinsically different time scales.
Concluding Remarks¶
In conclusion, we have learned how to estimate waiting times and transition pathways directly from the state trajectory and via Markov state modeling. By analyzing the timescales and pathways of a system, we can gain valuable insights into its dynamics and compare them with experimental observations. We hope that this tutorial has provided a useful introduction to the estimation of waiting times and pathways in protein dynamics and will inspire further investigations into the dynamics of complex biological systems. We will continue to see how the Hummer-Szabo projection can be used to improve the general MSM prediction of lumped microstate dynamics.