dcTMD via Work
Introduction¶
One way to analyze nonequilibrium force time traces from constraint pulling simulations is to first calculate the work and then estimate the free energy $\Delta G$ via:
$$ \begin{align} \Delta G(s) &= \left< W(s) \right>_\text{N} - \frac{\beta}{2}\left<\delta W(s)^2\right>_\text{N} \\ &= \left< W(s) \right>_\text{N} - W_\text{diss}(s) \end{align} $$
with $\delta W(s) = W(s) - \left< W(s) \right>_\text{N}$. This expression holds if the work distribution is Gaussian. It also allows us to formulate the friction estimate:
$$\Gamma(s) = \frac{1}{v_c}\frac{\mathrm{d}W_\text{diss}(s)}{\mathrm{d}s}$$
This approach is implemented in the WorkEstimator class and is computationally more efficient than calculating the force autocorrelation function (implemented in the ForceEstimator class), because it operates on the work data obtained by integrating the force time traces. This allows the resolution to be reduced significantly while yielding the same results.
We therefore advise using this approach for large data sets.
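To make these expressions concrete, here is a minimal NumPy sketch of the work-based estimates. The synthetic placeholder data, the variable names, and the $k_\text{B}T$ value are assumptions for illustration only; the WorkEstimator class used below performs these steps for you.
import numpy as np

# synthetic placeholder data: work time traces of shape
# (n_trajectories, n_positions) and pulled positions s in nm
rng = np.random.default_rng(42)
s = np.linspace(0, 2, 201)
work = np.cumsum(rng.normal(0.5, 1.0, size=(18, len(s))), axis=1)

kBT = 0.0083145 * 300  # 1/beta in kJ/mol at 300 K (assumed units)
v_c = 0.001            # constraint pulling velocity

W_mean = work.mean(axis=0)              # <W(s)>_N
W_diss = work.var(axis=0) / (2 * kBT)   # (beta/2) <dW(s)^2>_N = W_diss(s)
dG = W_mean - W_diss                    # free energy estimate dG(s)
Gamma = np.gradient(W_diss, s) / v_c    # friction estimate Gamma(s)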
# load packages
import numpy as np
import dcTMD
from dcTMD.storing import WorkSet
from dcTMD.dcTMD import WorkEstimator
# define variables
velocity = 0.001   # constraint pulling velocity (here in nm/ps)
res = 1            # resolution, i.e. striding of the work time traces
verbose = True     # print progress information
temperature = 300  # temperature in K
I. Create a work set¶
To calculate free energy and friction estimates, a work set is needed. It contains the integrated force time traces.
- An array containing the file names is generated. This can be done via the function dcTMD.io.load_pullf(), which takes either a glob pattern or a file listing the pullf file names as argument.
# either a file listing the pullf file names
pullf_files = '../../tests/testdata/pullf_filenames.dat'
# or a glob pattern matching the pullf files
pullf_files = '../../tests/testdata/*pullf.xvg'
filenames = dcTMD.io.load_pullf(pullf_files)
filenames
file ../../tests/testdata/*pullf.xvg not found. using glob.glob(../../tests/testdata/*pullf.xvg)
['../../tests/testdata/t_middle_32_pullf.xvg', '../../tests/testdata/t_middle_03_pullf.xvg', '../../tests/testdata/t_middle_34_pullf.xvg', '../../tests/testdata/t_middle_24_pullf.xvg', '../../tests/testdata/t_middle_21_pullf.xvg', '../../tests/testdata/t_middle_04_pullf.xvg', '../../tests/testdata/t_middle_29_pullf.xvg', '../../tests/testdata/t_middle_16_pullf.xvg', '../../tests/testdata/t_middle_30_pullf.xvg', '../../tests/testdata/t_middle_19_pullf.xvg', '../../tests/testdata/t_middle_01_pullf.xvg', '../../tests/testdata/t_middle_28_pullf.xvg', '../../tests/testdata/t_middle_26_pullf.xvg', '../../tests/testdata/t_middle_31_pullf.xvg', '../../tests/testdata/t_middle_09_pullf.xvg', '../../tests/testdata/t_middle_17_pullf.xvg', '../../tests/testdata/t_middle_25_pullf.xvg', '../../tests/testdata/t_middle_05_pullf.xvg']
- The work set is created by instantiating WorkSet and fitting it with the file names.
The resolution parameter controls the striding of the data set. The reduction is performed after integrating the force time traces. For long trajectories, e.g. 35,000,000 frames, it is advisable to use a resolution > 1000 so that the data set does not exceed your hardware's memory; see the sketch below.
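For such large data sets, the work set could be set up with a coarser resolution, for example as follows (the value 2000 is only a placeholder; choose it according to your data and hardware):
# sketch: work set for very long force time traces,
# keeping only every 2000th point after integration
workset_coarse = WorkSet(
    velocity=velocity,
    resolution=2000,
    verbose=False,
)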
# create WorkSet instance
workset = WorkSet(
    velocity=velocity,
    resolution=res,
    verbose=False,
)
workset
WorkSet(velocity=0.001)
# fit/fill workset
workset.fit(filenames)
# save workset
#dcTMD.storing.save('my_workset', workset)
Loading & integrating force files: 0%| | 0/18 [00:00<?, ?it/s]
Loading & integrating force files: 100%|██████████| 18/18 [00:00<00:00, 95.04it/s]
WorkSet(velocity=0.001)
print(vars(workset))
{'velocity': 0.001, 'resolution': 1, 'verbose': False, 'X': ['../../tests/testdata/t_middle_32_pullf.xvg', '../../tests/testdata/t_middle_03_pullf.xvg', '../../tests/testdata/t_middle_34_pullf.xvg', '../../tests/testdata/t_middle_24_pullf.xvg', '../../tests/testdata/t_middle_21_pullf.xvg', '../../tests/testdata/t_middle_04_pullf.xvg', '../../tests/testdata/t_middle_29_pullf.xvg', '../../tests/testdata/t_middle_16_pullf.xvg', '../../tests/testdata/t_middle_30_pullf.xvg', '../../tests/testdata/t_middle_19_pullf.xvg', '../../tests/testdata/t_middle_01_pullf.xvg', '../../tests/testdata/t_middle_28_pullf.xvg', '../../tests/testdata/t_middle_26_pullf.xvg', '../../tests/testdata/t_middle_31_pullf.xvg', '../../tests/testdata/t_middle_09_pullf.xvg', '../../tests/testdata/t_middle_17_pullf.xvg', '../../tests/testdata/t_middle_25_pullf.xvg', '../../tests/testdata/t_middle_05_pullf.xvg'], 'time_': array([0.0000e+00, 1.0000e-01, 2.0000e-01, ..., 1.9998e+03, 1.9999e+03, 2.0000e+03]), 'work_': array([[ 0.00000000e+00, 1.03698915e-01, 1.24836915e-01, ..., 8.45118712e+01, 8.44347098e+01, 8.44165457e+01], [ 0.00000000e+00, -1.00744350e-01, -2.14093600e-01, ..., 1.15331409e+02, 1.15265871e+02, 1.15229601e+02], [ 0.00000000e+00, 6.30896000e-02, 5.76634000e-02, ..., 5.74830869e+01, 5.75868897e+01, 5.74908294e+01], ..., [ 0.00000000e+00, -8.02650000e-03, 1.56785000e-01, ..., 5.20392642e+01, 5.21592937e+01, 5.21718456e+01], [ 0.00000000e+00, 1.79783900e-01, 3.88872900e-01, ..., 8.74668694e+01, 8.72953927e+01, 8.72269519e+01], [ 0.00000000e+00, 2.15985500e-02, -4.17557500e-02, ..., 9.99542344e+01, 9.98090948e+01, 9.98238686e+01]]), 'names_': array(['t_middle_32_pullf.xvg', 't_middle_03_pullf.xvg', 't_middle_34_pullf.xvg', 't_middle_24_pullf.xvg', 't_middle_21_pullf.xvg', 't_middle_04_pullf.xvg', 't_middle_29_pullf.xvg', 't_middle_16_pullf.xvg', 't_middle_30_pullf.xvg', 't_middle_19_pullf.xvg', 't_middle_01_pullf.xvg', 't_middle_28_pullf.xvg', 't_middle_26_pullf.xvg', 't_middle_31_pullf.xvg', 't_middle_09_pullf.xvg', 't_middle_17_pullf.xvg', 't_middle_25_pullf.xvg', 't_middle_05_pullf.xvg'], dtype='<U32'), 'position_': array([0.0000e+00, 1.0000e-04, 2.0000e-04, ..., 1.9998e+00, 1.9999e+00, 2.0000e+00])}
II. Check normality of the work distribution¶
One of the main conditions that needs to be fulfilled for dcTMD is a normally distributed work.
This can be checked via different methods, e.g. by plotting the work time traces or by normality checks at different positions x, such as the Kolmogorov-Smirnov test, the Shapiro-Wilk test, or the Anderson-Darling test.
CAUTION: if the work distribution is not normal, your results are compromised and a path separation is necessary. For the theory on path separation see...
# plot workset
import matplotlib.pyplot as plt
from dcTMD.utils import plotting
fig, ax = plt.subplots()
plotting.plot_worklines(workset, ax)
plt.show()
# check if work distribution follows a normal distribution
from scipy.stats import kstest, shapiro, anderson
from dcTMD.utils import plotting
index = [5000, 15000]
x = workset.position_
fig, axs = plotting.plot_worknormalitychecks(workset, index)
for i, p in enumerate(index):
    # Shapiro-Wilk test
    shapiro_test = shapiro(workset.work_[:, p])
    print(f'Shapiro-Wilk results at x={x[p]} is {shapiro_test}')

    # Anderson-Darling test
    # If the test statistic is larger than the critical value of a given
    # significance_level in percent, the null hypothesis that the work
    # is normally distributed has to be rejected.
    anderson_test = anderson(workset.work_[:, p], 'norm')
    print(f'Anderson-Darling results at x={x[p]} is {anderson_test}.')

    # Kolmogorov-Smirnov test (requires centering and scaling of input data)
    kstest_test = kstest(
        (workset.work_[:, p] - np.mean(workset.work_[:, p])) / np.std(workset.work_[:, p]),
        'norm',
    )
    print(f'Kolmogorov-Smirnov results at x={x[p]} is {kstest_test}')
Shapiro-Wilk results at x=0.5 is ShapiroResult(statistic=0.9809889793395996, pvalue=0.9596364498138428)
Anderson-Darling results at x=0.5 is AndersonResult(statistic=0.1889729620402001, critical_values=array([0.503, 0.573, 0.687, 0.802, 0.954]), significance_level=array([15. , 10. , 5. , 2.5, 1. ])).
Kolmogorov-Smirnov results at x=0.5 is KstestResult(statistic=0.1222997987555296, pvalue=0.920955734836253)
Shapiro-Wilk results at x=1.5 is ShapiroResult(statistic=0.9350144863128662, pvalue=0.23768189549446106)
Anderson-Darling results at x=1.5 is AndersonResult(statistic=0.38109849215264546, critical_values=array([0.503, 0.573, 0.687, 0.802, 0.954]), significance_level=array([15. , 10. , 5. , 2.5, 1. ])).
Kolmogorov-Smirnov results at x=1.5 is KstestResult(statistic=0.14917443864214264, pvalue=0.7646028245726414)
III. Derive estimates from the work set¶
- create a WorkEstimator instance
- fit the WorkEstimator instance with the previously created work set
# create WorkEstimator instance
workestimator = WorkEstimator(temperature)
# fit the existing work set,
# or load a previously saved one first via
# workset = dcTMD.storing.load('my_workset')
workestimator.fit(workset)
vars(workestimator)
{'temperature': 300, 'verbose': False, 'work_set': WorkSet(velocity=0.001), 'position_': array([0.0000e+00, 1.0000e-04, 2.0000e-04, ..., 1.9998e+00, 1.9999e+00, 2.0000e+00]), 'names_': array(['t_middle_32_pullf.xvg', 't_middle_03_pullf.xvg', 't_middle_34_pullf.xvg', 't_middle_24_pullf.xvg', 't_middle_21_pullf.xvg', 't_middle_04_pullf.xvg', 't_middle_29_pullf.xvg', 't_middle_16_pullf.xvg', 't_middle_30_pullf.xvg', 't_middle_19_pullf.xvg', 't_middle_01_pullf.xvg', 't_middle_28_pullf.xvg', 't_middle_26_pullf.xvg', 't_middle_31_pullf.xvg', 't_middle_09_pullf.xvg', 't_middle_17_pullf.xvg', 't_middle_25_pullf.xvg', 't_middle_05_pullf.xvg'], dtype='<U32'), 'W_mean_': array([0.00000000e+00, 4.41837458e-02, 4.44572742e-02, ..., 9.33136501e+01, 9.32887153e+01, 9.32516402e+01]), 'W_diss_': array([0.00000000e+00, 1.16301148e-03, 5.56596643e-03, ..., 1.03329006e+02, 1.03220317e+02, 1.03098234e+02]), 'dG_': array([ 0. , 0.04302073, 0.03889131, ..., -10.01535586, -9.93160192, -9.84659417]), 'friction_': array([ 0. , 11630.11484155, 44029.54944899, ..., 750667.95819666, -1086887.96422726, -1220828.47759316])}
Visualize results¶
The package implements a couple of simple plot functions to get an overview of the results, e.g. plot_dcTMD_results().
# plot dcTMD results
from dcTMD.utils import plotting
# plot free energy estimate
fig, ax = plt.subplots()
plotting.plot_dG(x, workestimator.dG_, ax, label='some system')
plt.legend()
plt.show()
# plot dcTMD results
fig, axs = plotting.plot_dcTMD_results(workestimator)
plt.show()
Smooth friction estimate¶
Finally, the friction estimate needs to be smoothed. This can be done via dcTMD.utils.smoothing.gaussfilter_friction() or dcTMD.WorkEstimator.smooth_friction(sigma, mode). Here, sigma is the standard deviation of the Gaussian kernel in nm, and the mode parameter determines how the input array is extended beyond its boundaries.
Caution: this can lead to long computations for large data sets and a big smoothing window.
# smooth friction and plot results
workestimator.smooth_friction(sigma=0.1)
# if the friction is smoothed that value is automatically plotted
fig, axs = plotting.plot_dcTMD_results(workestimator)
plt.show()
# but one can also specify the friction
#fig, axs = plotting.plot_dcTMD_results(
# workestimator,
# friction=workestimator.friction_
#)
Using different smoothing windows and modes changes the results significantly because of the different boundary handling of the modes: mode='nearest' leads to an overestimation at the right-hand border (end of the simulation), while mode='reflect' washes out the border on the left-hand side.
Here are some examples:
fig, axs = plt.subplots()
x = workestimator.position_
plotting.plot_Gamma(
    x,
    workestimator.friction_smooth_,
    axs,
    label='default (reflect) 0.1nm',
)
# using different smoothing windows and modes changes the results significantly
smooth_friction = dcTMD.utils.gaussfilter_friction(
    workestimator.friction_,
    x,
    sigma=0.01,
    mode='reflect',
)
axs.plot(x, smooth_friction, label='reflect .01nm')

smooth_friction = dcTMD.utils.gaussfilter_friction(
    workestimator.friction_,
    x,
    sigma=0.01,
    mode='nearest',
)
axs.plot(x, smooth_friction, label='nearest .01nm')

smooth_friction = dcTMD.utils.gaussfilter_friction(
    workestimator.friction_,
    x,
    sigma=0.2,
    mode='reflect',
)
axs.plot(x, smooth_friction, label='reflect .2nm')
axs.legend()
plt.show()
IV. Error estimation¶
The error estimation of the results is implemented via bootstrapping.
n_resamples = 1000
# bootstrapping error with mode 'std' (standard deviation)
mode = 'std'
workestimator.estimate_free_energy_errors(n_resamples, mode)
fig, ax = plt.subplots()
plotting.plot_dG_werrors(workestimator, ax)
plt.show()
Bootstrapping progress: 100%|██████████| 1000/1000 [00:00<00:00, 1119.05it/s]
# bootstrapping error with a confidence interval;
# this gives lower and upper bound estimates
confidence_interval = 0.9
mode = confidence_interval
workestimator.estimate_free_energy_errors(n_resamples, mode)
fig, ax = plt.subplots()
plotting.plot_dG_werrors(workestimator, ax)
plt.show()
Bootstrapping progress: 100%|██████████| 1000/1000 [00:00<00:00, 1097.81it/s]
V. Save and load results¶
# save workestimator instance
dcTMD.storing.save('my_workestimator', workestimator)
# save data as .npz and .dat file
outname = 'my_workestimator_results'
dcTMD.io.write_output(outname, workestimator)
# load results
results = np.load(f'{outname}_N{len(workestimator.names_)}_dG.npz')
results.files
save file my_workestimator_results_N18_dG.dat
save file my_workestimator_results_N18_dG.npz
['x', 'Wmean', 'Wdiss', 'dG', 'Gamma', 's_W_mean', 's_W_diss', 's_dG', 'Gamma_smooth']
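The stored arrays can then be accessed by key, for example to replot the free energy estimate from the saved results (a minimal sketch using the keys listed above; the axis labels are placeholders):
# replot the free energy estimate from the saved .npz results
fig, ax = plt.subplots()
ax.plot(results['x'], results['dG'], label='dG from saved results')
ax.set_xlabel('position')
ax.set_ylabel('free energy estimate')
ax.legend()
plt.show()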