Damn, I finally managed to use this feature… I’m really stupid…
In real workflows, our pipelines sometimes have “dynamic” requirements. For example, when aligning fastq files or calling variants, the aligner or caller may not have a well-developed multi-threading mechanism, and on top of that the raw data files can be so large that processing them directly is very time-consuming. In such cases, splitting the source files -> processing the pieces in parallel across multiple processes -> merging the result files is a very practical strategy.

How do you split? One option is a fixed scheme, such as splitting bam files by chromosome for variant calling (see the sketch below). But a fixed scheme is not always efficient: different chromosomes carry vastly different amounts of data, so chromosome-level parallelism is inevitably bounded by the data volume of the largest chromosome. If instead you abandon the fixed scheme, the number of split files produced becomes unpredictable, which clashes with snakemake’s logic of first parsing what needs to be done and then allocating resources. That is why the developers introduced a dynamic parsing mechanism (data-dependent conditional execution) in newer versions.
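For the fixed scheme the chunk set is known up front, so no special machinery is needed: a plain `expand()` covers it. A minimal sketch (the tool choice, `bcftools`, and all file names are placeholders of mine):

```python
# One variant-calling job per chromosome: the chunk set is fixed and known
# before the workflow runs, so expand() can enumerate every target up front.
CHROMS = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

rule all:
    input:
        expand("calls/{chrom}.vcf", chrom=CHROMS)

rule call_by_chrom:
    input:
        bam="sample.bam",
        ref="genome.fa"
    output:
        "calls/{chrom}.vcf"
    shell:
        "bcftools mpileup -r {wildcards.chrom} -f {input.ref} {input.bam}"
        " | bcftools call -mv -o {output}"
```

The weakness is visible right here: the chr1 job carries far more data than the chr21 job, and the whole stage finishes only when the largest chromosome does.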
The core of this dynamic mechanism is the `checkpoint`, a special kind of `rule`. While running it behaves much like a regular `rule`, but once the workflow passes a checkpoint, the entire dependency graph (DAG) is re-parsed. The designers’ intent is clear: declare the rules whose outputs are unpredictable as checkpoints, and then re-decide the necessary steps based on the files that were actually produced.
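Syntactically, the only visible difference from a regular `rule` is the keyword itself; a minimal declaration (all names here are placeholders) looks like this:

```python
# Written exactly like a rule, but after it finishes
# snakemake re-evaluates the whole DAG.
checkpoint somestep:
    input:
        "input.txt"
    output:
        "output.txt"
    shell:
        "somecommand {input} > {output}"
```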
Because of this special dynamic mechanism, checkpoints are also more troublesome to use than regular `rule`s. A checkpoint must be paired with a function as input: even with the new mechanism, the program’s execution logic still has to decide what to do before doing it, and that decision cannot be made by the program on its own. It has to be designed by hand, so we use a function to collect the files generated by the checkpoint and pass them to the downstream `rule`s for parsing.
I have an example that is slightly easier to understand than the official one:
```python
from os import path
```
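Starting from that import, the whole Snakefile looks roughly like the following minimal sketch (the fastq path and the `split -l 4000` chunk size are placeholders; `split`, `get_split_files()` and `cat` are the names discussed below):

```python
from os import path

rule all:
    input:
        "all_counts.txt"

# Checkpoint: split the fastq into chunks. The number of chunks depends on
# the input size, so downstream jobs cannot be enumerated before it runs.
# The declared output is only a flag file; the truly useful products are
# the chunk_* files dropped into split/.
checkpoint split:
    input:
        "sample.fastq"
    output:
        touch("split/split.done")
    shell:
        "split -l 4000 {input} split/chunk_"

# Count the lines of a single chunk.
rule count:
    input:
        "split/chunk_{i}"
    output:
        "counts/chunk_{i}.txt"
    shell:
        "wc -l < {input} > {output}"

# Input function for `cat`: the checkpoints.split.get() call defers
# evaluation until checkpoint `split` has finished, after which the chunks
# that were actually produced can be globbed.
def get_split_files(wildcards):
    checkpoints.split.get(**wildcards)
    chunks = glob_wildcards(path.join("split", "chunk_{i}")).i
    return expand("counts/chunk_{i}.txt", i=chunks)

# Merge all the per-chunk counts into one file.
rule cat:
    input:
        get_split_files
    output:
        "all_counts.txt"
    shell:
        "cat {input} > {output}"
```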
This example splits a fastq file into multiple sub-files, counts the number of lines in each split file, and then merges the counts into a single file. The `checkpoint` is responsible for the splitting; note that the output it declares is not what the downstream rules actually consume, since the truly useful products are the split files the checkpoint generates. The `get_split_files()` function uses the `checkpoints` object to access the results of the `split` checkpoint, and this function serves as the input of the `cat` rule, which therefore depends on the checkpoint `split`. That dependency, however, is not resolved up front: parsing waits until the checkpoint `split` has finished (i.e., dynamic parsing).
The example in the official documentation follows essentially the same principle, except that its `checkpoint` outputs a directory, which makes re-parsing the dynamically generated files more convenient.
Finally, I’ll add another BWA example:
```python
from os import path
```
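Again starting from that import, a minimal sketch of the BWA version (the sample names, reference path and `split -l 4000000` chunk size are placeholders of mine; this time the checkpoint output is a directory, which matters for the cleanup problem below):

```python
from os import path

rule all:
    input:
        "merged/sample.bam"

# Checkpoint: split the paired fastq files into equally sized chunks.
# -l 4000000 keeps read records intact (4 lines per read). The chunk count
# is unknown in advance, hence the directory() output; `split` gives both
# mates matching suffixes as long as R1 and R2 have the same read order.
checkpoint split_fastq:
    input:
        r1="data/sample_R1.fastq",
        r2="data/sample_R2.fastq"
    output:
        directory("split/sample")
    shell:
        "mkdir -p {output}"
        " && split -l 4000000 {input.r1} {output}/R1_"
        " && split -l 4000000 {input.r2} {output}/R2_"

# Align one pair of chunks; the per-chunk bams can safely be temp().
rule bwa_map:
    input:
        r1="split/sample/R1_{i}",
        r2="split/sample/R2_{i}",
        ref="genome.fa"
    output:
        temp("mapped/sample_{i}.bam")
    threads: 4
    shell:
        "bwa mem -t {threads} {input.ref} {input.r1} {input.r2}"
        " | samtools sort -o {output} -"

# Collect the per-chunk bams once the checkpoint has finished.
def get_chunk_bams(wildcards):
    ckpt_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    chunks = glob_wildcards(path.join(ckpt_dir, "R1_{i}")).i
    return expand("mapped/sample_{i}.bam", i=chunks)

# Merge the per-chunk alignments back into a single bam.
rule merge_bams:
    input:
        get_chunk_bams
    output:
        "merged/sample.bam"
    shell:
        "samtools merge {output} {input}"
```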
I have been using this feature in several of my projects. The actual experience is that it looks simple and general-purpose, but different tools still raise different issues to think about…

For example, the BWA example above still has one unsolved problem: the generated directories are really temporary directories, and I would like to delete them once they have served their purpose. However, the `temp` and `directory` keywords conflict with each other, and the files inside the temporary folder are dynamic and cannot be listed explicitly, so for now there is no way to use snakemake’s own features to delete them automatically.
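One crude way around it, outside snakemake itself, is to delete the directory explicitly once the merge has succeeded; a sketch of what I mean (not a snakemake feature, just a shell-level hack):

```python
# Manual cleanup: plain rm -rf after a successful merge. Caveat: since this
# removes the checkpoint's declared output, re-running the workflow later
# will trigger the split step again.
rule merge_bams:
    input:
        get_chunk_bams
    output:
        "merged/sample.bam"
    shell:
        "samtools merge {output} {input} && rm -rf split/sample"
```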