Quick-Start
NanoPypes is a Python package for managing and analysing ONT sequence data in distributed computing environments.
*Coming soon*
- parallel_variant_calling (with samtools and bcftools)
- guppy_cpu
- guppy_gpu
- Kubernetes cluster support
- Slurm cluster support
Installation Instructions
NanoPypes requires Albacore. Install Albacore before installing NanoPypes.
Install from source
NanoPypes Source:
$ git clone https://github.com/kforti/NanoPypes
$ cd NanoPypes
$ python3 setup.py install --user
Parallel basecalling with ONT’s Albacore
Run Albacore (replace all < > with their appropriate value):
$ albacore_basecaller <path/to/yaml/config> --kit <name> --flowcell <name> --cluster-name <name> --save-path <path> --input-path <path> --output-format <fastq or fast5>
albacore_basecaller options:
config The path to the cluster configuration YAML file (<path/to/config>).
-n --cluster-name The name of the cluster, located directly under computes in the config file. required=True
-s --save-path An empty save location for the basecalled data. If the directory does not exist it will be created, but its parent directory must exist. required=True
-i --input-path The path to a directory that contains batches of raw sequencing data, likely titled pass. required=True
-k --kit The type of ONT kit used in the sequencing run. required=True
-f --flowcell The type of ONT flowcell used in the sequencing run. required=True
-o --output-format The output format, either fastq or fast5. required=True
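When the basecaller is launched from a Python script, the argument list can be assembled programmatically. The sketch below is only an illustration and is not part of NanoPypes: the helper name `build_albacore_command` is hypothetical, and the kit, flowcell, and path values are placeholders. Only the executable name and the flag spellings are taken from the options list above.

```python
import shlex


def build_albacore_command(config, kit, flowcell, cluster_name,
                           save_path, input_path, output_format):
    """Return the albacore_basecaller argument list (hypothetical helper)."""
    # The options list allows only these two output formats.
    if output_format not in ("fastq", "fast5"):
        raise ValueError("output_format must be 'fastq' or 'fast5'")
    return [
        "albacore_basecaller", config,
        "--kit", kit,
        "--flowcell", flowcell,
        "--cluster-name", cluster_name,
        "--save-path", save_path,
        "--input-path", input_path,
        "--output-format", output_format,
    ]


# All values below are placeholders; substitute your own.
cmd = build_albacore_command(
    config="config.yml",
    kit="SQK-LSK108",
    flowcell="FLO-MIN106",
    cluster_name="cluster1",
    save_path="/path/to/basecalled",
    input_path="/path/to/raw/pass",
    output_format="fastq",
)
print(shlex.join(cmd))  # shell-quoted form, useful for logging
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.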
Building the YAML config file
A YAML file is used to pass cluster configuration information to NanoPypes. Multiple clusters can be described in one file. In the example below, there is one cluster listed and its name is ‘cluster1’.
The .yml file should have the following parameters.
computes:
  cluster1:
    job_time: 04:00
    mem: 2048
    umassmem: 2048
    ncpus: 10
    project: /path/to/project/space
    queue: short
    workers: 10
    cores: 10
    memory: 2 GB
    scale_value: 200
    cluster_type: LSF
YAML options:
-job_time: #The wall-clock time limit for each job (e.g. 04:00 for four hours) ##BSUB -W
-mem: #The amount of memory in bytes required by each job ##BSUB -M
-umassmem: #Should be None if not using the UMass LSF cluster. Memory is requested as rusage[mem=umassmem] ##BSUB -R 'rusage[mem=2048]'
-ncpus: #The number of physical cores per job ##BSUB -n
-project: #The project space path on the cluster ##BSUB -P
-queue: #The queue that the worker jobs should be submitted to ##BSUB -q
-workers: #The number of workers per job
-cores: #The number of cores per worker ##cores * workers == ncpus
-memory: #The amount of memory per worker ##memory * workers == mem
-scale_value: #The total number of workers that you would like in your cluster ##scale_value / workers == total number of jobs to be created
-cluster_type: #The type of job scheduler on your HPC cluster ##currently only LSF is supported
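The arithmetic in the comments above can be checked before any jobs are submitted. The sketch below is an illustration under stated assumptions, not NanoPypes code: the function names `jobs_needed` and `cpu_layout_is_consistent` are hypothetical, the values in the dict are made up so that the stated rules hold, and rounding up when the division is not exact is an assumption of this sketch.

```python
import math


def jobs_needed(scale_value, workers):
    """Jobs required to reach scale_value workers at `workers` per job."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Rounding up for inexact divisions is an assumption of this sketch.
    return math.ceil(scale_value / workers)


def cpu_layout_is_consistent(cluster):
    """Check the stated rule cores * workers == ncpus."""
    return cluster["cores"] * cluster["workers"] == cluster["ncpus"]


# Hypothetical values chosen so that the stated rule holds.
cluster = {"ncpus": 10, "workers": 10, "cores": 1, "scale_value": 200}

print(jobs_needed(cluster["scale_value"], cluster["workers"]))  # 20
print(cpu_layout_is_consistent(cluster))  # True
```

The memory rule is left out because `mem` and `memory` are written in different units in the config file.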
NanoPypes comes with a pre-made config file for running Albacore on an LSF cluster. You only need to add your project path to the file.
Build a config file:
$ get_config_template --save-path <path> --cluster-type <name>
A config file for your cluster will be saved to the path given by --save-path.
Move your data with parallel rsync
When selecting the number of channels, be careful not to overwhelm the data source or destination. The default is --nchannels 4.
Run parallel_rsync (replace all < > with their appropriate value):
$ parallel_rsync --nchannels <default=4> --local-path <path> --remote-path <path> --password <password> --direction <push or pull> --options <rsync options default='-vcr'>
parallel_rsync options:
-n --nchannels The number of parallel rsync channels.
-l --local-path The path to the data on your local machine.
-r --remote-path The path to where your data should be saved remotely, must include username.
-p --password The password for the remote location.
-d --direction Use "push" for local to remote. Use "pull" for remote to local. The default is push.
-o --options A string containing the rsync options you would like to use; it must include the appropriate flag(s). The default options are -vcr.
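Argument mistakes are cheaper to catch before a long transfer starts. The following sketch validates the options described above and assembles the argument list. It is not part of NanoPypes: the helper name `build_parallel_rsync_command` is hypothetical, the paths are placeholders, and the check that the remote path looks like `user@host:path` is an assumption about what "must include username" means. Optional flags are omitted when not given, so the tool's own defaults apply.

```python
def build_parallel_rsync_command(local_path, remote_path, password,
                                 nchannels=None, direction=None, options=None):
    """Return the parallel_rsync argument list (hypothetical helper)."""
    if direction not in (None, "push", "pull"):
        raise ValueError("direction must be 'push' or 'pull'")
    if nchannels is not None and nchannels < 1:
        raise ValueError("nchannels must be at least 1")
    # Assumption: the username is given in the usual user@host:path form.
    if "@" not in remote_path:
        raise ValueError("remote_path must include a username")

    cmd = [
        "parallel_rsync",
        "--local-path", local_path,
        "--remote-path", remote_path,
        "--password", password,
    ]
    # Leave optional flags out so the documented defaults are used.
    if nchannels is not None:
        cmd += ["--nchannels", str(nchannels)]
    if direction is not None:
        cmd += ["--direction", direction]
    if options is not None:
        cmd += ["--options", options]
    return cmd


# Placeholder values; substitute your own.
rsync_cmd = build_parallel_rsync_command(
    local_path="/data/run1",
    remote_path="user@host:/data/run1",
    password="<password>",
    nchannels=8,
    direction="pull",
)
print(rsync_cmd)
```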
Parallel Read Mapping with Minimap2
Run Minimap2 (replace all < > with their appropriate value):
$ parallel_minimap2 <path/to/config> --command <name> --cluster-name <name> --input-path <path> --reference <path> --save-path <path>
parallel_minimap2 options:
config The path to the cluster configuration YAML file (<path/to/config>).
-n --cluster-name The name of the cluster, located directly under computes in the config file.
-i --input-path The path to a directory containing multiple fastq files.
-s --save-path The path to where the output should be saved.
-r --reference The path to your fasta reference file.
-c --command The minimap2 command that you would like to use. One of "splice", "genomic", "rna", or "overlap".
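A short pre-flight check can confirm that the inputs match the options above before work is sent to the cluster. The sketch below is an illustration only and does not call NanoPypes: the function name `check_minimap2_inputs` is hypothetical, and treating files ending in `.fastq` or `.fq` as fastq input is an assumption. The demonstration runs on a throwaway directory.

```python
import tempfile
from pathlib import Path

# The four choices listed for --command.
MINIMAP2_COMMANDS = ("splice", "genomic", "rna", "overlap")
# Assumed file extensions for fastq input.
FASTQ_SUFFIXES = (".fastq", ".fq")


def check_minimap2_inputs(input_path, reference, command):
    """Validate the arguments and return the sorted fastq files found."""
    if command not in MINIMAP2_COMMANDS:
        raise ValueError(f"command must be one of {MINIMAP2_COMMANDS}")
    if not Path(reference).is_file():
        raise FileNotFoundError(f"reference not found: {reference}")
    fastqs = sorted(
        p for p in Path(input_path).iterdir() if p.suffix in FASTQ_SUFFIXES
    )
    if not fastqs:
        raise ValueError(f"no fastq files in {input_path}")
    return fastqs


# Demonstration on a temporary directory with two fastq files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ref.fa").write_text(">chr1\nACGT\n")
    reads = root / "reads"
    reads.mkdir()
    for name in ("batch_0.fastq", "batch_1.fastq", "notes.txt"):
        (reads / name).write_text("")
    found = check_minimap2_inputs(reads, root / "ref.fa", "splice")
    found_names = [p.name for p in found]

print(found_names)  # ['batch_0.fastq', 'batch_1.fastq']
```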