Waiting Times and Pathways of Markov Models¶

For a more detailed introduction to this topic, please refer to following article.

MSMPathfinder: Identification of Pathways in Markov State Models
Daniel Nagel, Anna Weber, and Gerhard Stock
J. Chem. Theory Comput. 2020 16 (12), 7874-7882
doi: 10.1021/acs.jctc.0c00774

Introduction¶

In the context of biochemical systems, understanding the kinetics of protein dynamics is critical for gaining insight into the underlying mechanisms of biological function. The Markov state model provides a powerful framework for characterizing the kinetics of a system. One can use it to estimate the waiting times $t^\text{wt}$ between transitions, the transition path times $t^\text{tt}$, and the pathways that the system takes between states.

It is important to note that waiting times $t^\text{wt}$, transition path times $t^\text{tt}$, and means first passage times $t^\text{mfpt}$ are related but distinct quantities. Waiting times are the times between individual transitions. Transition path times, on the other hand, are the times required for the system to traverse a particular path between two states. Finally, mean first passage times are the average times required for the system to reach a particular state for the first time. In general, it holds that $$ t^\text{wt} \ge t^\text{mfpt} \ge t^\text{tt}\;. $$ In general, we are interested in comparisons with experiments. For example, in biochemical experiments, one can measure the times required for a system (an ensemble) to transition from one state to another. Since this directly corresponds to the waiting time distribution, we mainly focus on it and try to relate it to the corresponding pathways.

In this section, we will focus on the waiting times and transition path times and show how to estimate them using either the Markov state model or directly the state trajectory. We will also discuss how to visualize the pathways and interpret the results in the context of the toy system being studied.

Model Systems¶

In the following we will use two simple toy models introduced and discussed by Nagel et al. 20, namely, the following 4-state and 6-state models

In [1]:

Copied!





import msmhelper as mh
from msmhelper.utils import datasets

print(
    f'4 state model:\nT =\n{datasets.nagel20_4state.tmat}\n\n'
    f'6 tate model:\nT =\n{datasets.nagel20_6state.tmat}'
)

# generate random trajectories from transition matrices
n_steps = int(5e4)
traj_4state = datasets.nagel20_4state(n_steps)
traj_6state = datasets.nagel20_6state(n_steps)
import msmhelper as mh
from msmhelper.utils import datasets

print(
    f'4 state model:\nT =\n{datasets.nagel20_4state.tmat}\n\n'
    f'6 tate model:\nT =\n{datasets.nagel20_6state.tmat}'
)

# generate random trajectories from transition matrices
n_steps = int(5e4)
traj_4state = datasets.nagel20_4state(n_steps)
traj_6state = datasets.nagel20_6state(n_steps)

4 state model:
T =
[[0.92 0.04 0.04 0.  ]
 [0.1  0.7  0.1  0.1 ]
 [0.1  0.1  0.7  0.1 ]
 [0.   0.04 0.04 0.92]]

6 tate model:
T =
[[0.9  0.03 0.07 0.   0.   0.  ]
 [0.05 0.63 0.15 0.07 0.1  0.  ]
 [0.08 0.04 0.6  0.14 0.14 0.  ]
 [0.   0.12 0.15 0.5  0.15 0.08]
 [0.   0.04 0.16 0.1  0.6  0.1 ]
 [0.   0.   0.   0.08 0.02 0.9 ]]

Estimating Timescales from MD¶

Before relying on Markov state models to analyse the expected times and pathways of a given process, we first want to show to analyse the raw data. In the following this is referred as MD due to the fact that it is the truly simulated dynamic of the MD simulation.

Let us start by analysing the $1\to4$ process of the 4 state model. Within this model, this is the most interesting one, due to the fact that these two states are not directly connected.

In [2]:

Copied!

start, final = 1, 4
wts = mh.md.estimate_wt(traj_4state, start, final)

print(f'Identified waiting times [frames]:\n{wts}')
start, final = 1, 4
wts = mh.md.estimate_wt(traj_4state, start, final)

print(f'Identified waiting times [frames]:\n{wts}')

Identified waiting times [frames]:
[  4  39 141  33  26  18 136  34  22  17 147 123  12  62  23  68  18  70
  25 110  34  42  36  14  29  21  25  15  18  60  36  34  77  29  60   5
  19  89   9  18   7  29  52  37  19  60  38   8   5  14  18  19  20  26
  33  51  54  74  29   9  10   8  13   7  12  68   8  13  11  14   6   3
   7  10  23  22  30  15  17   6  40  16   7  11  33  21  31  16  14  50
  34  12  10  20  31   4   4   6  22   7   6   5 106  46   5  63  35   6
  27   4  59  20  23  93  16  62  25   6  10   2  15  26   7  30  51   7
 145 157  60  57  95  46  20   9   7   5   3  98  16  44  12  26  65  19
  61   4  54  66  67  66  56  32  22  47  24  66  23  18  30   3   6  34
  26  51  16  15  40 100  45  28   4  56 132  19  58  64  14  12  86  34
  84   5  14  90  19  72  43   5  13   9  27  35  21  64  43  16  22   8
  83  34  22  93  41  23  32   2  13  64  47  12  71  84  27  23  15  21
   5  17  46 107  85  27  24  15  11   7  22  82  28  57  19  22  26  13
  13  16  32  58  65  60  44  14  46  25   4  30   4 127  46  97  60  32
  25   3  18  12   5  40  21   7  14  14 108  11  87  74  15  22  20   9
  49  47  43  32  11  12  42  18   5 101  56  29   4  13   4  36  25  35
  12  28  37   4  31  87  56  60  14  18  29  24  21  21  38  13  52  50
  45  18  32  11  27  24  20   8  37 138   7  23  54   9  53  14  15  23
  36  19  11  26  48  31   6  20  23  12   9   9   7  21  39  11  31  15
  71  51  18  45  70  11  59   5  37   9  14  35  65  16  56  37  21  45
  66  30 123   5  43  17  54  87  11  13  11  16  73  93  18   3   3  18
  49  18  17  55  63  22  32 132  22  11  36  41  80   2   6  36  10  12
  36   5  31   8 104  25  23  15  26  33   7  18  38  38  22  22  17  65
  25  22   9  18  33  31  19 132  42  56   6  82  82  62 109   3  88  80
  67 133  27  55  22  66  28  21  50  11  28  22  32   8  38  49  18  38
   4  40  32  29  17  41  18  31  46  34  66  41  13  12   6 133 142  18
   2  26   8  53   5  77  19  67  44  11   4  56  44  28  97  17  22  29
  20  30  50  22   3  99  77  10  12  56  57  41  64  27  48  39  23  41
  43  52  10  20   8  77  41  22  41  16  30  29  20  21  56  91  28  22
  21  16  14   9  27   2  14  28   2  60  10  74  53  63 109   9  13  35
  27 128  26  25  59  11 134  11  14  53  40  10  19  11  32  14  26  21
  15  16  35  32  56  37  25  21  31  22   6  42   6  37  22 152  11  20
  70  28 100 125  14  36  13  13  14  80  14  46  65   5  38  11  24  11
   9  24  60  35  19  75   4  11  11  85  21   9  64  13  11  14  16  17
  12  28  73   6  20   2  10  94  21  16   5  61  10  16  22   8  41  24
  12   3  15  13  13  25  12  14   7  41  18  14  34   6  13  32   6  28
  22 109   9  73 128  18  49  10   5   4  32  10  16  76  89  53  18  85
  25  63  46  18  15  38  38   2  15  27  13  26   4  17   5  21  32  31
  55  23   5  16  65  78  15  74  11  23  43  15  18  32   8  31  32  22
  58  72  34  14   7  19  57  30   3  26   8  14  60  14  17  56  17  54
 105  13]

This list reveals directly that the waiting times are not normally distributed. To make it clearer, we visualize them results by a simple histogram

In [3]:

Copied!





import numpy as np
import prettypyplot as pplt
from matplotlib import pyplot as plt

pplt.use_style(figsize=(5,2), latex=False, colors='paula')

fig, ax = plt.subplots()
# ensure that bins are integers.
bins = np.arange(0, wts.max() + 5, 5)
ax.hist(wts, bins=bins, density=True)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
import numpy as np
import prettypyplot as pplt
from matplotlib import pyplot as plt

pplt.use_style(figsize=(5,2), latex=False, colors='paula')

fig, ax = plt.subplots()
# ensure that bins are integers.
bins = np.arange(0, wts.max() + 5, 5)
ax.hist(wts, bins=bins, density=True)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()

No description has been provided for this image

Estimating Paths from MD¶

So let us now take a look at the pathways.

In [4]:

Copied!





paths = mh.md.estimate_paths(traj_4state, start, final)

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, pathtimes in paths.items():
    print(f'{path}: {pathtimes}')
paths = mh.md.estimate_paths(traj_4state, start, final)

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, pathtimes in paths.items():
    print(f'{path}: {pathtimes}')

Identified pathways with time of events given in [frames]:
(1, 3, 4): [4, 39, 141, 17, 147, 123, 23, 110, 34, 36, 21, 29, 19, 89, 9, 29, 52, 37, 60, 8, 14, 20, 33, 51, 54, 8, 13, 7, 8, 13, 3, 22, 15, 6, 11, 21, 31, 16, 12, 31, 6, 6, 20, 23, 2, 26, 51, 7, 5, 3, 16, 44, 54, 66, 67, 32, 47, 23, 34, 26, 15, 100, 56, 19, 12, 86, 5, 14, 5, 13, 9, 35, 64, 43, 16, 22, 84, 23, 15, 21, 5, 85, 15, 22, 57, 19, 22, 26, 13, 60, 14, 30, 4, 5, 21, 87, 15, 9, 49, 12, 5, 101, 29, 4, 25, 12, 87, 14, 18, 18, 32, 27, 7, 11, 12, 39, 71, 51, 18, 5, 35, 16, 45, 123, 11, 13, 3, 3, 55, 63, 132, 41, 80, 36, 12, 36, 8, 15, 26, 18, 17, 25, 18, 19, 132, 56, 3, 80, 22, 28, 28, 22, 32, 66, 142, 26, 53, 5, 77, 19, 4, 44, 28, 97, 29, 20, 50, 99, 77, 10, 12, 23, 8, 22, 41, 21, 56, 91, 28, 14, 27, 2, 14, 2, 109, 9, 13, 26, 11, 14, 53, 11, 14, 21, 16, 32, 56, 31, 42, 70, 125, 14, 36, 80, 14, 46, 5, 11, 24, 9, 24, 60, 35, 19, 75, 11, 85, 21, 13, 16, 17, 12, 6, 10, 21, 22, 41, 24, 3, 14, 14, 6, 22, 109, 10, 5, 32, 10, 16, 89, 18, 63, 18, 38, 15, 5, 5, 15, 74, 8, 72, 34, 7, 3, 26, 13]
(1, 2, 4): [33, 18, 12, 68, 70, 25, 36, 34, 5, 7, 38, 5, 19, 26, 74, 9, 68, 11, 14, 10, 23, 30, 17, 40, 7, 33, 50, 34, 10, 4, 4, 6, 22, 5, 5, 63, 27, 4, 59, 93, 16, 25, 15, 145, 157, 60, 57, 95, 46, 98, 12, 65, 19, 61, 4, 24, 18, 30, 3, 6, 51, 45, 28, 4, 132, 58, 14, 34, 84, 90, 19, 43, 8, 41, 23, 32, 2, 64, 12, 27, 46, 107, 27, 24, 11, 7, 82, 13, 16, 65, 44, 127, 46, 97, 60, 32, 25, 3, 40, 108, 74, 22, 20, 47, 43, 32, 11, 4, 36, 35, 28, 37, 4, 31, 24, 38, 13, 50, 45, 20, 8, 37, 23, 54, 53, 15, 36, 19, 6, 20, 9, 7, 11, 31, 15, 45, 70, 11, 59, 37, 9, 56, 66, 30, 5, 43, 11, 93, 18, 18, 17, 22, 32, 11, 36, 2, 6, 10, 5, 31, 25, 23, 38, 38, 22, 65, 9, 33, 31, 6, 62, 109, 88, 66, 11, 49, 18, 38, 40, 32, 29, 18, 46, 34, 41, 13, 18, 2, 44, 11, 17, 22, 3, 56, 57, 64, 41, 52, 10, 20, 77, 41, 16, 30, 20, 21, 9, 74, 63, 35, 27, 59, 11, 19, 32, 26, 21, 6, 6, 11, 20, 28, 100, 13, 13, 11, 11, 73, 2, 94, 16, 5, 10, 16, 8, 12, 15, 13, 25, 18, 34, 28, 73, 128, 18, 4, 76, 25, 46, 15, 38, 2, 27, 4, 32, 31, 55, 23, 65, 78, 23, 32, 32, 22, 19, 57, 60, 17, 17, 54]
(1, 2, 3, 4): [26, 136, 34, 62, 18, 77, 60, 18, 18, 6, 7, 16, 20, 62, 30, 7, 20, 26, 56, 66, 16, 72, 93, 13, 47, 17, 32, 4, 12, 14, 18, 56, 56, 29, 11, 138, 14, 23, 26, 48, 23, 21, 65, 21, 17, 16, 73, 22, 104, 7, 22, 55, 50, 38, 17, 12, 133, 8, 67, 56, 41, 27, 48, 39, 22, 60, 10, 40, 10, 15, 35, 22, 37, 22, 152, 14, 65, 38, 9, 28, 13, 12, 41, 49, 26, 17, 21, 16, 58, 14, 8]
(1, 3, 2, 4): [22, 18, 42, 14, 29, 25, 15, 60, 19, 29, 10, 12, 14, 7, 106, 46, 35, 6, 10, 7, 9, 66, 22, 40, 64, 27, 21, 83, 34, 22, 71, 28, 58, 46, 25, 18, 7, 14, 11, 42, 13, 60, 21, 21, 52, 24, 9, 31, 9, 14, 37, 54, 87, 49, 18, 33, 22, 42, 82, 82, 67, 133, 27, 21, 8, 4, 41, 31, 6, 22, 30, 43, 29, 16, 28, 53, 128, 25, 134, 37, 25, 4, 11, 64, 14, 20, 61, 7, 13, 32, 6, 9, 53, 85, 13, 11, 43, 15, 18, 31, 30, 14, 14, 56, 105]

If we now want to see how frequent these different pathways occur, we can simply reduce the above result via

In [5]:

Copied!





path_frequencies = sorted(
    ((path, len(pathtimes)) for path, pathtimes in paths.items()),
    key=lambda pathfreq: pathfreq[1],
    reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
    print(f'{path}:\t{freq / n_sampled_paths:.1%}')
path_frequencies = sorted(
    ((path, len(pathtimes)) for path, pathtimes in paths.items()),
    key=lambda pathfreq: pathfreq[1],
    reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
    print(f'{path}:\t{freq / n_sampled_paths:.1%}')

Identified pathways with time of events given in [frames]:
(1, 2, 4):	37.4%
(1, 3, 4):	35.5%
(1, 3, 2, 4):	14.5%
(1, 2, 3, 4):	12.6%

Comparing these results to the true values of $37.5\%$ and $12.5\%$ (see Nagel et al. 20), we find a rather good agreement.

Estimating Timescales from MSM¶

Now we want to compare how well the Markov model is able to recover the true waiting time distributions and path frequencies.

In [6]:

Copied!





nsteps = int(1e7)
lagtime = 1
wts_msm = mh.msm.estimate_wt(
    trajs=traj_4state,
    lagtime=lagtime,
    start=start,
    final=final,
    steps=nsteps,
    return_list=True,
)

print(f'Identified number of waiting times events: {len(wts_msm)}')

fig, ax = plt.subplots()
# ensure that bins are integers.
bins_msm = np.arange(0, wts_msm.max() + 1)
ax.hist(wts, bins=bins, color='C4', density=True, alpha=0.8)
ax.hist(wts_msm, bins=bins_msm, density=True, alpha=0.8)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()
nsteps = int(1e7)
lagtime = 1
wts_msm = mh.msm.estimate_wt(
    trajs=traj_4state,
    lagtime=lagtime,
    start=start,
    final=final,
    steps=nsteps,
    return_list=True,
)

print(f'Identified number of waiting times events: {len(wts_msm)}')

fig, ax = plt.subplots()
# ensure that bins are integers.
bins_msm = np.arange(0, wts_msm.max() + 1)
ax.hist(wts, bins=bins, color='C4', density=True, alpha=0.8)
ax.hist(wts_msm, bins=bins_msm, density=True, alpha=0.8)
ax.set_xlabel(r'waiting time $t^\mathrm{wt}$')
ax.set_ylabel(r'$P(t^\mathrm{wt})$')
plt.show()

Identified number of waiting times events: 144131

So we see, that we get a smooth interpolation of the waiting time distribution.

Estimating Paths from MSM¶

Estimating pathways is straight forward and can be achieved by,

In [7]:

Copied!





paths = mh.msm.estimate_paths(
    trajs=traj_4state,
    lagtime=lagtime,
    start=start,
    final=final,
    steps=nsteps,
)

path_frequencies = sorted(
    ((path, len(pathtimes)) for path, pathtimes in paths.items()),
    key=lambda pathfreq: pathfreq[1],
    reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
    print(f'{path}:\t{freq / n_sampled_paths:.1%}')
paths = mh.msm.estimate_paths(
    trajs=traj_4state,
    lagtime=lagtime,
    start=start,
    final=final,
    steps=nsteps,
)

path_frequencies = sorted(
    ((path, len(pathtimes)) for path, pathtimes in paths.items()),
    key=lambda pathfreq: pathfreq[1],
    reverse=True,
)
n_sampled_paths = np.sum([freq for _, freq in path_frequencies])

# Let's format the output
print(f'Identified pathways with time of events given in [frames]:')
for path, freq in path_frequencies:
    print(f'{path}:\t{freq / n_sampled_paths:.1%}')

Identified pathways with time of events given in [frames]:
(1, 2, 4):	38.5%
(1, 3, 4):	36.8%
(1, 2, 3, 4):	12.6%
(1, 3, 2, 4):	12.1%

Comparing Waiting Times of MD vs MSM¶

Finally, we want to directly compare the effect of alternating the lag time $\tau$ on the waiting time distribution. Similar to the implied timescale plot, we can also compare the waiting time distribution for a given lag time to the resulting MD waiting time distribution. To circumvent the problem of poor statistics, we rely in this analysis only on statistical quantities instead of the probability distributions: Namely, the three quartiles $Q_{1,2,3}$, and the interquartile range $\mathrm{IQR}=Q_3 - Q_1$.

This can be easily achieved with the predefined functions:

In [8]:

Copied!





wtd = mh.msm.estimate_wtd(
    trajs=traj_4state,
    max_lagtime=10,
    start=start,
    final=final,
    steps=nsteps,
)
fig, ax = plt.subplots()
ax = mh.plot.plot_wtd(wtd=wtd, ax=ax)
plt.show()
wtd = mh.msm.estimate_wtd(
    trajs=traj_4state,
    max_lagtime=10,
    start=start,
    final=final,
    steps=nsteps,
)
fig, ax = plt.subplots()
ax = mh.plot.plot_wtd(wtd=wtd, ax=ax)
plt.show()

An interesting observation we have made is that the waiting time $t$ shows a linear scaling relationship with the lag time $\tau_\text{lag}$. However, our analysis also revealed that when comparing the timescales to the reference simulation (referred to as "MD" in the figure), using a lag time of $\tau_\text{lag} \ge 6$ leads to overestimation of the waiting timescale. This effect is due to selecting a lag time that is longer than the fastest intrinsic timescale of the process. It is worth noting that unlike the commonly used implied timescale, which typically converges towards the true value, waiting times do not exhibit such behavior, and therefore, caution must be exercised when comparing these intrinsically different time scales.

Concluding Remarks¶

In conclusion, we have learned how to estimate waiting times and transition pathways directly from the state trajectory and via Markov state modeling. By analyzing the timescales and pathways of a system, we can gain valuable insights into its dynamics and compare them with experimental observations. We hope that this tutorial has provided a useful introduction to the estimation of waiting times and pathways in protein dynamics and will inspire further investigations into the dynamics of complex biological systems. We will continue to see how the Hummer-Szabo projection can be used to improve the general MSM prediction of lumped microstate dynamics.