Snakemake Workflow Usage Guide

The MPP package includes a Snakemake workflow that automates lumping and plot generation across multiple molecular systems and kernel configurations.


Prerequisites

  • Snakemake installed in the environment
  • MPP installed (the workflow uses the mpp.yml conda environment spec in workflow/)
  • Input data organized as described below

Directory Structure

Copy the workflow/ directory into your working directory alongside your data:

<workdir>/
├── workflow/
│   ├── Snakefile
│   ├── lumpings.yml
│   ├── mpp.yml
│   └── ...
└── example/                    # data root (configurable)
    └── <SystemName>/
        ├── input/
        │   ├── config.yml
        │   ├── traj            # microstate trajectory
        │   └── feature_traj   # multi-feature trajectory
        └── results/
            └── <lumping>/
                ├── Z.npy
                └── *.pdf

The data_root defaults to "example" and can be overridden from the command line without editing the Snakefile:

snakemake --config data_root=mydata ...
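
Values passed with --config are merged into the config dictionary that Snakemake exposes inside the Snakefile. The sketch below shows how a default of this kind is commonly resolved, using a plain dict in place of Snakemake's config object. The helper name resolve_data_root is hypothetical, and the actual Snakefile may implement the lookup differently.

```python
def resolve_data_root(config, default="example"):
    """Return the data root, falling back to the documented default."""
    return config.get("data_root", default)


# No override on the command line: the default applies.
print(resolve_data_root({}))

# Equivalent of `snakemake --config data_root=mydata ...`.
print(resolve_data_root({"data_root": "mydata"}))
```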

Configuration Files

Per-system config: <data_root>/<SystemName>/input/config.yml

Specifies input file paths and lumping parameters. See docs/usage_cli.md for the full key reference.

source: example/SampleSystem/input

microstate_trajectory: traj
multi_feature_trajectory: feature_traj

lagtime: 20
pop_thr: 0.15
q_min: 0.5
frame_length: 0.2  # in ns / frame — required

Lumping definitions: workflow/lumpings.yml

Defines named lumping configurations. Each entry maps a lumping key to kernel_similarity (d) and feature_kernel (g) values.

t:
  kernel_similarity: T
  feature_kernel: none

kl:
  kernel_similarity: KL
  feature_kernel: none

t_js:
  kernel_similarity: T
  feature_kernel: JS

kl_js:
  kernel_similarity: KL
  feature_kernel: JS

js:
  kernel_similarity: none
  feature_kernel: JS

These keys become the {lumping} wildcard in Snakemake rules.
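
As an illustration of that mapping, the following sketch resolves a lumping key to its (d, g) pair. The dictionary mirrors the entries listed above as they would look after loading the YAML file; the function name kernel_args is hypothetical and is not part of the workflow.

```python
# Mirror of the entries shown in workflow/lumpings.yml.
# YAML reads the bare word `none` as the string "none", not as null.
LUMPINGS = {
    "t": {"kernel_similarity": "T", "feature_kernel": "none"},
    "kl": {"kernel_similarity": "KL", "feature_kernel": "none"},
    "t_js": {"kernel_similarity": "T", "feature_kernel": "JS"},
    "kl_js": {"kernel_similarity": "KL", "feature_kernel": "JS"},
    "js": {"kernel_similarity": "none", "feature_kernel": "JS"},
}


def kernel_args(lumping):
    """Return the (d, g) pair for a lumping key such as 't_js'."""
    try:
        entry = LUMPINGS[lumping]
    except KeyError:
        known = ", ".join(sorted(LUMPINGS))
        raise KeyError(f"unknown lumping {lumping!r}; known keys: {known}") from None
    return entry["kernel_similarity"], entry["feature_kernel"]


d, g = kernel_args("kl_js")
print(d, g)
```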


Workflow Rules

gen_Z — Run lumping and save Z matrix

Produces <data_root>/{system}/results/{lumping}/Z.npy.

Invokes:

python -m MPP.run {config} {d} {g} -Z {output}

plot — Generate a single plot

Produces <data_root>/{system}/results/{lumping}/{plot}.{ext} where {ext} is pdf or png.

Invokes:

python -m MPP.run {config} {d} {g} -Z {Z} -p {plot} -o {output}
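
The gen_Z and plot invocations differ only in their trailing options. The sketch below assembles both as argument lists, which is convenient for reproducing a rule by hand. It assumes that {config} is the path to the per-system config.yml and that d and g are the values from lumpings.yml; the function names are hypothetical.

```python
import shlex


def gen_z_command(config, d, g, output):
    """Argument list matching the documented gen_Z invocation."""
    return ["python", "-m", "MPP.run", config, d, g, "-Z", output]


def plot_command(config, d, g, z_file, plot, output):
    """Argument list matching the documented plot invocation."""
    return gen_z_command(config, d, g, z_file) + ["-p", plot, "-o", output]


# Example paths follow the directory layout described in this guide.
results = "example/SampleSystem/results/t"
cmd = plot_command(
    "example/SampleSystem/input/config.yml",
    "T",
    "none",
    f"{results}/Z.npy",
    "dendrogram",
    f"{results}/dendrogram.pdf",
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```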

plot_all — Generate all standard plots

Collects the following plots in both pdf and png formats, plus macrostate_trajectory.txt: sankey, dendrogram, ck_test, timescales, contacts, macrotraj, state_network, transition_matrix, transition_time. Most plots require only the microstate trajectory and the feature trajectory. The contacts plot additionally requires cluster_file to be set in the system config; if it is absent, the rule fails for that plot.

plot_all_rmsd — Generate RMSD plots

Collects: rmsd, delta_rmsd in both pdf and png formats. Requires topology_file and xtc_file in the system config.

rmsd_CA / rmsd_feature — Compute RMSD

Produces RMSD .npy and mean frame index .ndx files. Requires topology_file and xtc_file to be set in the system config.

draw_random (checkpoint) — Draw random frame indices

Produces per-macrostate .ndx files in a random_frames/ directory. Requires topology_file and xtc_file.


Running the Workflow

Generate Z matrices for selected lumpings of a system (list one target per system and lumping):

snakemake --use-conda -j 4 \
    example/SampleSystem/results/t/Z.npy \
    example/SampleSystem/results/kl/Z.npy
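
Target paths follow the pattern <data_root>/<SystemName>/results/<lumping>/Z.npy, so a longer target list can be generated rather than typed. A small sketch, assuming the default data root; z_targets is a hypothetical helper and not part of the workflow.

```python
from itertools import product


def z_targets(systems, lumpings, data_root="example"):
    """Build one Z.npy target path per (system, lumping) pair."""
    return [
        f"{data_root}/{system}/results/{lumping}/Z.npy"
        for system, lumping in product(systems, lumpings)
    ]


targets = z_targets(["SampleSystem"], ["t", "kl"])
# Paste the printed paths after `snakemake --use-conda -j 4`.
print(" ".join(targets))
```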

Generate a specific plot:

snakemake --use-conda -j 1 \
    example/SampleSystem/results/t/dendrogram.pdf

Generate all plots for a system and lumping:

snakemake --use-conda -j 4 \
    example/SampleSystem/results/t/plotted

The plotted sentinel file is created by the plot_all rule when all standard (non-structural) plots exist. To also generate RMSD plots, request the plotted_rmsd sentinel:

snakemake --use-conda -j 4 \
    example/SampleSystem/results/t/plotted_rmsd

Dry run (preview without execution):

snakemake -n example/SampleSystem/results/t/dendrogram.pdf

Relationship to the CLI

Each Snakemake rule calls python -m MPP.run directly. The {lumping} wildcard resolves to a named entry in lumpings.yml, which provides the d and g arguments. This is equivalent to running the CLI commands manually (see docs/usage_cli.md).


Caching

The gen_Z rule is marked cache: True. Snakemake will reuse an existing Z matrix if the inputs have not changed. The CLI also skips recomputation if the output Z file already exists (unless overwrite=True is passed in the Python API).
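
The skip-if-exists behaviour described for the CLI can be pictured with the sketch below. It only illustrates the decision logic and is not MPP's implementation; the function name needs_compute is hypothetical. Snakemake's own decision additionally depends on the rule inputs, which this sketch does not model.

```python
import tempfile
from pathlib import Path


def needs_compute(z_path, overwrite=False):
    """Recompute when forced, or when no Z file exists yet."""
    return overwrite or not Path(z_path).exists()


with tempfile.TemporaryDirectory() as tmp:
    z_file = Path(tmp) / "Z.npy"
    first = needs_compute(z_file)                   # file missing: compute
    z_file.touch()                                  # stand-in for a saved Z matrix
    second = needs_compute(z_file)                  # file present: skip
    forced = needs_compute(z_file, overwrite=True)  # explicit overwrite: compute

print(first, second, forced)
```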


Stochastic Lumpings

Stochastic lumpings (multiple independent runs with randomized neighbor selection) are triggered via a stochastic block in the per-system config.yml. No changes to lumpings.yml or any Snakemake rule are needed — the stochastic configuration is embedded in the system config.

stochastic:
  method: n     # 'n': consider the N most similar neighbors
  param: 10     # N = 10 most similar candidates
  n: 100        # number of independent stochastic runs

With this block present, gen_Z will perform 100 stochastic runs and store the full (100, n_states-1, 4) Z matrix. All downstream plot rules work unchanged — plots will use the first run by default.
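
The shape bookkeeping can be sketched as follows. The single-run shape (n_states-1, 4) is an assumption inferred from the stochastic shape quoted above, the state count of 50 is purely illustrative, and z_shape is a hypothetical helper.

```python
def z_shape(n_states, n_runs=None):
    """Expected Z shape for one run, or for a stack of stochastic runs."""
    per_run = (n_states - 1, 4)  # assumed single-run shape
    if n_runs is None:
        return per_run
    return (n_runs,) + per_run


full = z_shape(n_states=50, n_runs=100)  # stochastic block with n: 100
first_run = full[1:]                     # shape of Z[0], the run used for plots
print(full, first_run)
```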


Current Limitations

  • Structural rules (rmsd_CA, rmsd_feature, draw_random, get_random_pdb_frames, get_mean_frames_pdb_*) require a topology file (.pdb) and an XTC trajectory file. These must be specified in the per-system config.yml as topology_file and xtc_file. If these are absent, the affected rules cannot run.
  • PyMol rendering rules (render_pdb_files, draw_cluster_in_states) require PyMol and the pymol conda environment (workflow/pymol.yml).
  • GPCCA lumping (gpcca entry in lumpings.yml) is supported but produces a mock Z matrix and is intended for comparison only.
  • The contacts plot requires a cluster_file entry in the system config.
  • Stochastic workflows (multiple runs via the stochastic config block) are covered only briefly in the Stochastic Lumpings section; downstream plots use the first run by default.
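
The config-dependent requirements listed in this guide can be checked before launching a run. The sketch below encodes only the requirements stated here (cluster_file for the contacts plot; topology_file and xtc_file for the RMSD plots and structural rules) and works on an already loaded config dict. The names REQUIREMENTS and blocked are hypothetical.

```python
STRUCTURAL_KEYS = ["topology_file", "xtc_file"]

# Rule or plot name -> config keys it needs, as stated in this guide.
REQUIREMENTS = {
    "contacts": ["cluster_file"],
    "rmsd": STRUCTURAL_KEYS,
    "delta_rmsd": STRUCTURAL_KEYS,
    "rmsd_CA": STRUCTURAL_KEYS,
    "rmsd_feature": STRUCTURAL_KEYS,
    "draw_random": STRUCTURAL_KEYS,
}


def blocked(config):
    """Map each rule or plot to the config keys it is missing."""
    result = {}
    for name, keys in REQUIREMENTS.items():
        missing = [key for key in keys if not config.get(key)]
        if missing:
            result[name] = missing
    return result


# A config with structural inputs but no cluster_file.
print(blocked({"topology_file": "system.pdb", "xtc_file": "system.xtc"}))
```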

Expected Outputs

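After a successful gen_Z + plot_all run, the results directory contains the files below; the RMSD plots and the plotted_rmsd sentinel appear only after plot_all_rmsd has also run: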

example/<SystemName>/results/<lumping>/
├── Z.npy                         # Z matrix (lumping tree)
├── macrostate_map.npy            # microstate → macrostate index map
├── dendrogram.pdf/.png
├── sankey.pdf/.png
├── ck_test.pdf/.png
├── timescales.pdf/.png
├── contacts.pdf/.png
├── macrotraj.pdf/.png
├── state_network.pdf/.png
├── transition_matrix.pdf/.png
├── transition_time.pdf/.png
├── macrostate_trajectory.txt
├── plotted                       # sentinel: plot_all complete
├── rmsd.pdf/.png                 # requires topology/XTC
├── delta_rmsd.pdf/.png           # requires topology/XTC
└── plotted_rmsd                  # sentinel: plot_all_rmsd complete
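
To confirm that a plot_all run is complete, the listing above can be turned into a simple file check. This sketch covers only the standard (non-RMSD) outputs from the listing; the helper names are hypothetical.

```python
from pathlib import Path

STANDARD_PLOTS = [
    "sankey", "dendrogram", "ck_test", "timescales", "contacts",
    "macrotraj", "state_network", "transition_matrix", "transition_time",
]


def expected_files(result_dir):
    """Standard outputs listed in this guide for one results directory."""
    root = Path(result_dir)
    files = [
        root / "Z.npy",
        root / "macrostate_map.npy",
        root / "macrostate_trajectory.txt",
        root / "plotted",
    ]
    files += [root / f"{plot}.{ext}" for plot in STANDARD_PLOTS for ext in ("pdf", "png")]
    return files


def missing_files(result_dir):
    """Return the expected files that do not exist yet."""
    return [path for path in expected_files(result_dir) if not path.exists()]


for path in missing_files("example/SampleSystem/results/t"):
    print("missing:", path)
```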