Waiting Times and Pathways of Markov Models¶
For a more detailed introduction to this topic, please refer to following article.
MSMPathfinder: Identification of Pathways in Markov State Models
Daniel Nagel, Anna Weber, and Gerhard Stock
J. Chem. Theory Comput. 2020 16 (12), 7874-7882
doi: 10.1021/acs.jctc.0c00774
Introduction¶
In the context of biochemical systems, understanding the kinetics of protein dynamics is critical for gaining insight into the underlying mechanisms of biological function. The Markov state model provides a powerful framework for characterizing the kinetics of a system. One can use it to estimate the waiting times $t^\text{wt}$ between transitions, the transition path times $t^\text{tt}$, and the pathways that the system takes between states.
It is important to note that waiting times $t^\text{wt}$, transition path times $t^\text{tt}$, and means first passage times $t^\text{mfpt}$ are related but distinct quantities. Waiting times are the times between individual transitions. Transition path times, on the other hand, are the times required for the system to traverse a particular path between two states. Finally, mean first passage times are the average times required for the system to reach a particular state for the first time. In general, it holds that $$ t^\text{wt} \ge t^\text{mfpt} \ge t^\text{tt}\;. $$ In general, we are interested in comparisons with experiments. For example, in biochemical experiments, one can measure the times required for a system (an ensemble) to transition from one state to another. Since this directly corresponds to the waiting time distribution, we mainly focus on it and try to relate it to the corresponding pathways.
In this section, we will focus on the waiting times and transition path times and show how to estimate them using either the Markov state model or directly the state trajectory. We will also discuss how to visualize the pathways and interpret the results in the context of the toy system being studied.
Model Systems¶
In the following we will use two simple toy models introduced and discussed by Nagel et al. 20, namely, the following 4-state and 6-state models
import msmhelper as mh
from msmhelper.utils import datasets
print(
f'4 state model:\nT =\n{datasets.nagel20_4state.tmat}\n\n'
f'6 tate model:\nT =\n{datasets.nagel20_6state.tmat}'
)
# generate random trajectories from transition matrices
n_steps = int(5e4)
traj_4state = datasets.nagel20_4state(n_steps)
traj_6state = datasets.nagel20_6state(n_steps)
4 state model: T = [[0.92 0.04 0.04 0. ] [0.1 0.7 0.1 0.1 ] [0.1 0.1 0.7 0.1 ] [0. 0.04 0.04 0.92]] 6 tate model: T = [[0.9 0.03 0.07 0. 0. 0. ] [0.05 0.63 0.15 0.07 0.1 0. ] [0.08 0.04 0.6 0.14 0.14 0. ] [0. 0.12 0.15 0.5 0.15 0.08] [0. 0.04 0.16 0.1 0.6 0.1 ] [0. 0. 0. 0.08 0.02 0.9 ]]
Estimating Timescales from MD¶
Before relying on Markov state models to analyse the expected times and pathways of a given process, we first want to show to analyse the raw data. In the following this is referred as MD due to the fact that it is the truly simulated dynamic of the MD simulation.
Let us start by analysing the $1\to4$ process of the 4 state model. Within this model, this is the most interesting one, due to the fact that these two states are not directly connected.
start, final = 1, 4
wts = mh.md.estimate_wt(traj_4state, start, final)
print(f'Identified waiting times [frames]:\n{wts}')
Identified waiting times [frames]: [ 35 16 97 15 30 58 167 4 151 14 22 41 14 12 69 30 32 50 23 37 71 75 4 200 57 21 6 36 35 33 94 49 28 3 50 19 38 77 12 14 52 34 12 15 65 92 56 25 27 21 12 19 42 64 35 27 46 53 100 74 35 19 7 31 32 13 4 82 30 17 12 35 98 5 5 25 20 45 51 102 69 21 91 77 5 28 34 20 12 6 49 26 23 89 6 13 30 107 58 29 37 10 8 16 30 20 67 42 15 14 69 41 2 18 18 8 53 99 82 11 48 13 4 7 17 7 8 22 50 94 12 15 16 6 20 21 20 11 2 68 56 18 5 36 26 85 22 100 70 25 25 34 9 23 47 6 42 37 105 20 140 4 5 37 83 51 37 5 49 22 38 79 27 11 119 23 9 26 17 27 46 16 28 22 49 5 88 91 8 33 67 52 21 28 12 9 39 16 75 4 29 46 75 15 61 151 96 12 23 58 8 68 9 58 30 30 14 74 13 68 67 13 56 19 49 11 24 98 39 20 11 51 20 73 22 53 21 26 244 13 11 78 31 148 22 10 15 18 19 37 12 16 59 3 38 67 27 29 15 154 4 18 12 44 89 20 45 13 55 50 11 22 27 28 17 85 27 12 23 24 103 28 32 5 60 103 81 5 20 37 24 57 172 14 23 55 50 69 44 131 123 14 6 8 5 25 10 37 104 24 12 36 18 41 100 44 8 81 14 11 28 37 80 20 51 13 87 46 34 19 17 30 50 74 10 9 17 149 113 22 93 74 18 26 88 53 10 36 30 23 41 40 43 5 18 16 47 24 5 15 18 34 16 46 48 24 7 6 33 28 10 24 10 45 8 26 35 24 36 10 10 7 25 55 8 40 3 33 12 12 58 9 38 20 6 9 21 64 92 40 18 33 3 17 33 11 117 28 63 10 11 3 6 20 37 24 11 20 27 13 12 26 9 13 10 23 82 28 40 33 11 14 23 12 117 6 18 23 10 15 8 50 10 55 37 9 37 11 137 23 6 11 7 29 64 29 17 74 16 9 4 72 11 50 7 33 44 58 102 125 3 30 6 82 94 37 112 28 53 113 104 6 45 34 8 45 56 11 9 39 28 16 53 23 24 5 27 5 23 28 12 27 50 9 15 10 14 33 11 18 31 5 9 36 12 40 7 143 78 15 60 106 20 56 18 49 20 31 13 21 53 17 13 12 24 9 19 32 12 38 10 50 52 40 63 13 47 25 39 127 32 5 63 22 150 24 8 122 5 6 77 18 61 30 41 115 20 62 16 5 38 57 91 41 46 35 20 86 54 68 13 17 6 44 67 15 5 12 17 43 33 91 22 16 5 13 10 2 25 9 19 7 44 54 14 24 38 9 67 4 79 97 7 120 169 39 96 25 31 18 21 53 6 45 81 21 4 9 37 33 28 14 31 27 51 59 122 15 81 9 8 13 16 62 95 63 18 14 47 13 21 45 52 94 10 25 4 14 18 24 78 94 63 90 11 51 41 31 15 34 6 29 18 70 50 10 134 3 22 50 15 12 8 10 21 41 19 8 58 78 22 52 43 37 30 9 64 15 9 13 87 80 96 37 18 40 9]
This list reveals directly that the waiting times are not normally distributed. To make it clearer, we visualize them results by a simple histogram
import numpy as np
import prettypyplot as pplt
from matplotlib import pyplot as plt
pplt.use_style(figsize=(5,2), latex=False, colors='paula')
fig, ax = plt.subplots()
# ensure that bins are integers.
bins = np.arange(0, wts.max() + 5, 5)
ax.hist(wts, bins=bins, density=True)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
Estimating Paths from MD¶
So let us now take a look at the pathways.
paths = mh.md.estimate_paths(traj_4state, start, final)
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, pathtimes in paths.items():
print(f'{path}: {pathtimes}')
Identified pathways with time of events given in [frames]: (1, 2, 4): [35, 97, 167, 4, 151, 22, 14, 12, 30, 23, 71, 200, 57, 6, 28, 19, 12, 14, 34, 92, 25, 21, 12, 42, 46, 100, 19, 32, 98, 51, 5, 28, 12, 26, 6, 58, 37, 30, 20, 67, 69, 2, 18, 11, 48, 7, 22, 50, 94, 12, 16, 6, 11, 34, 9, 140, 4, 5, 51, 37, 38, 27, 23, 9, 27, 16, 28, 49, 5, 88, 8, 33, 28, 4, 46, 75, 58, 30, 14, 67, 56, 19, 11, 26, 13, 78, 22, 19, 37, 3, 27, 4, 55, 11, 22, 27, 17, 27, 5, 81, 5, 20, 24, 23, 55, 69, 6, 10, 37, 24, 12, 18, 44, 8, 14, 11, 51, 19, 30, 17, 26, 88, 41, 43, 5, 16, 47, 24, 16, 6, 33, 28, 10, 10, 7, 8, 40, 3, 33, 12, 92, 18, 11, 117, 10, 24, 27, 10, 14, 117, 18, 23, 10, 8, 50, 55, 37, 9, 37, 11, 137, 23, 6, 11, 7, 17, 9, 4, 72, 50, 33, 102, 6, 82, 37, 34, 28, 16, 5, 5, 27, 50, 15, 10, 11, 31, 5, 36, 15, 20, 20, 31, 17, 12, 9, 38, 52, 13, 47, 32, 22, 77, 16, 5, 46, 35, 20, 15, 43, 22, 16, 5, 10, 2, 25, 9, 19, 7, 44, 9, 67, 7, 96, 18, 21, 45, 81, 21, 9, 33, 28, 14, 15, 9, 13, 62, 95, 63, 21, 52, 10, 24, 11, 51, 31, 6, 29, 50, 3, 8, 19, 8, 58, 43, 9] (1, 2, 3, 4): [16, 15, 41, 36, 49, 50, 65, 27, 35, 53, 82, 12, 35, 5, 102, 69, 53, 70, 49, 52, 16, 15, 23, 8, 68, 11, 16, 59, 29, 15, 154, 89, 13, 12, 23, 60, 37, 14, 104, 80, 34, 74, 9, 149, 93, 15, 18, 34, 7, 24, 55, 64, 11, 6, 20, 26, 23, 11, 64, 58, 94, 45, 9, 39, 53, 40, 78, 56, 12, 50, 40, 63, 25, 127, 24, 122, 62, 38, 67, 5, 38, 97, 39, 51, 59, 18, 78, 15, 18, 10, 15, 12, 10, 15, 18] (1, 3, 4): [30, 14, 69, 32, 50, 37, 75, 21, 35, 94, 3, 38, 77, 52, 12, 15, 56, 19, 27, 35, 7, 31, 4, 17, 25, 20, 45, 77, 20, 6, 49, 23, 89, 13, 30, 29, 10, 8, 16, 42, 15, 14, 41, 18, 8, 99, 82, 13, 4, 7, 17, 8, 20, 21, 20, 2, 68, 56, 5, 36, 85, 22, 100, 23, 6, 37, 105, 20, 37, 83, 5, 22, 79, 11, 119, 17, 22, 91, 67, 21, 12, 9, 29, 61, 151, 96, 9, 58, 30, 13, 13, 49, 24, 39, 20, 11, 51, 20, 73, 22, 53, 21, 244, 148, 15, 18, 12, 38, 18, 12, 44, 20, 45, 50, 28, 85, 24, 103, 28, 103, 57, 172, 131, 123, 14, 5, 25, 36, 100, 81, 28, 37, 13, 46, 17, 10, 22, 74, 18, 10, 36, 30, 23, 40, 18, 5, 46, 48, 10, 24, 8, 26, 35, 36, 10, 25, 58, 9, 38, 20, 3, 17, 33, 28, 63, 3, 37, 11, 20, 12, 9, 13, 82, 40, 23, 6, 10, 29, 29, 74, 16, 11, 7, 44, 3, 53, 113, 104, 6, 8, 11, 23, 24, 27, 23, 28, 12, 9, 14, 33, 18, 9, 12, 7, 60, 106, 18, 49, 13, 21, 53, 13, 24, 19, 32, 10, 5, 63, 150, 5, 6, 18, 61, 30, 115, 20, 57, 91, 41, 86, 54, 68, 13, 17, 44, 12, 17, 91, 13, 54, 14, 24, 4, 79, 120, 169, 25, 31, 53, 6, 4, 31, 27, 122, 81, 8, 14, 47, 45, 94, 4, 14, 94, 63, 70, 134, 22, 50, 21, 41, 78, 22, 52, 37, 64, 13, 87, 37, 9] (1, 3, 2, 4): [58, 4, 33, 64, 74, 13, 30, 5, 21, 91, 34, 107, 15, 18, 26, 25, 25, 47, 42, 26, 46, 39, 75, 12, 68, 74, 98, 31, 10, 67, 32, 50, 44, 8, 41, 20, 87, 50, 113, 53, 24, 45, 12, 6, 9, 21, 40, 33, 13, 28, 33, 12, 15, 125, 30, 112, 28, 45, 56, 143, 39, 8, 41, 6, 33, 37, 16, 13, 25, 18, 90, 41, 34, 30, 9, 80, 96, 40]
If we now want to see how frequent these different pathways occur, we can simply reduce the above result via
path_frequencies = sorted(
((path, len(pathtimes)) for path, pathtimes in paths.items()),
key=lambda pathfreq: pathfreq[1],
reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
print(f'{path}:\t{freq / n_sampled_paths:.1%}')
Identified pathways with time of events given in [frames]: (1, 3, 4): 39.5% (1, 2, 4): 36.1% (1, 2, 3, 4): 13.4% (1, 3, 2, 4): 11.0%
Comparing these results to the true values of $37.5\%$ and $12.5\%$ (see Nagel et al. 20), we find a rather good agreement.
Estimating Timescales from MSM¶
Now we want to compare how well the Markov model is able to recover the true waiting time distributions and path frequencies.
nsteps = int(1e7)
lagtime = 1
wts_msm = mh.msm.estimate_wt(
trajs=traj_4state,
lagtime=lagtime,
start=start,
final=final,
steps=nsteps,
return_list=True,
)
print(f'Identified number of waiting times events: {len(wts_msm)}')
fig, ax = plt.subplots()
# ensure that bins are integers.
bins_msm = np.arange(0, wts_msm.max() + 1)
ax.hist(wts, bins=bins, color='C4', density=True, alpha=0.8)
ax.hist(wts_msm, bins=bins_msm, density=True, alpha=0.8)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
Identified number of waiting times events: 139654
So we see, that we get a smooth interpolation of the waiting time distribution.
Estimating Paths from MSM¶
Estimating pathways is straight forward and can be achieved by,
paths = mh.msm.estimate_paths(
trajs=traj_4state,
lagtime=lagtime,
start=start,
final=final,
steps=nsteps,
)
path_frequencies = sorted(
((path, len(pathtimes)) for path, pathtimes in paths.items()),
key=lambda pathfreq: pathfreq[1],
reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])
# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
print(f'{path}:\t{freq / n_sampled_paths:.1%}')
Identified pathways with time of events given in [frames]: (1, 3, 4): 39.0% (1, 2, 4): 36.1% (1, 2, 3, 4): 12.7% (1, 3, 2, 4): 12.2%
Comparing Waiting Times of MD vs MSM¶
Finally, we want to directly compare the effect of alternating the lag time $\tau$ on the waiting time distribution. Similar to the implied timescale plot, we can also compare the waiting time distribution for a given lag time to the resulting MD waiting time distribution. To circumvent the problem of poor statistics, we rely in this analysis only on statistical quantities instead of the probability distributions: Namely, the three quartiles $Q_{1,2,3}$, and the interquartile range $\mathrm{IQR}=Q_3 - Q_1$.
This can be easily achieved with the predefined functions:
wtd = mh.msm.estimate_wtd(
trajs=traj_4state,
max_lagtime=10,
start=start,
final=final,
steps=nsteps,
)
fig, ax = plt.subplots()
ax = mh.plot.plot_wtd(wtd=wtd, ax=ax)
plt.show()
An interesting observation we have made is that the waiting time $t$ shows a linear scaling relationship with the lag time $\tau_\text{lag}$. However, our analysis also revealed that when comparing the timescales to the reference simulation (referred to as "MD" in the figure), using a lag time of $\tau_\text{lag} \ge 6$ leads to overestimation of the waiting timescale. This effect is due to selecting a lag time that is longer than the fastest intrinsic timescale of the process. It is worth noting that unlike the commonly used implied timescale, which typically converges towards the true value, waiting times do not exhibit such behavior, and therefore, caution must be exercised when comparing these intrinsically different time scales.
Concluding Remarks¶
In conclusion, we have learned how to estimate waiting times and transition pathways directly from the state trajectory and via Markov state modeling. By analyzing the timescales and pathways of a system, we can gain valuable insights into its dynamics and compare them with experimental observations. We hope that this tutorial has provided a useful introduction to the estimation of waiting times and pathways in protein dynamics and will inspire further investigations into the dynamics of complex biological systems. We will continue to see how the Hummer-Szabo projection can be used to improve the general MSM prediction of lumped microstate dynamics.