HiFi DeepVariant + WhatsHap workflow

Workflow steps

align HiFi reads to reference with pbmm2
call small variants with DeepVariant, using a two-pass method (DeepVariant ➡️ WhatsHap phase ➡️ WhatsHap haplotag ➡️ DeepVariant)
phase small variants with WhatsHap
haplotag aligned BAMs with WhatsHap and merge
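The order of the stages above, including the two-pass DeepVariant loop, can be sketched as a list of command lines. This is only an illustration: every file name is a hypothetical placeholder, `run_deepvariant` stands in for however the workflow actually invokes DeepVariant, and the VCF indexing, per-movie BAM merging, and any haplotype-specific DeepVariant options that the real rules need are left out.

```python
from typing import List


def two_pass_commands(sample: str, ref: str = "reference/reference.fasta") -> List[List[str]]:
    """Return one argv list per stage, in execution order.

    All output names below are hypothetical placeholders, not the
    paths the workflow itself writes.
    """
    aligned = f"{sample}.aligned.bam"
    pass1_vcf = f"{sample}.pass1.vcf.gz"
    pass1_phased = f"{sample}.pass1.phased.vcf.gz"
    pass1_tagged = f"{sample}.pass1.haplotagged.bam"
    pass2_vcf = f"{sample}.pass2.vcf.gz"
    final_phased = f"{sample}.phased.vcf.gz"
    final_tagged = f"{sample}.haplotagged.bam"

    def deepvariant(reads: str, out_vcf: str) -> List[str]:
        # Assumption: the one-step run_deepvariant wrapper is used.
        return ["run_deepvariant", "--model_type=PACBIO",
                f"--ref={ref}", f"--reads={reads}", f"--output_vcf={out_vcf}"]

    def phase(vcf: str, bam: str, out_vcf: str) -> List[str]:
        return ["whatshap", "phase", "--output", out_vcf, "--reference", ref, vcf, bam]

    def haplotag(vcf: str, bam: str, out_bam: str) -> List[str]:
        return ["whatshap", "haplotag", "--output", out_bam, "--reference", ref, vcf, bam]

    return [
        # Step 1: align HiFi reads to the reference.
        ["pbmm2", "align", ref, f"{sample}.hifi_reads.bam", aligned, "--preset", "CCS", "--sort"],
        # Step 2: two-pass calling (DeepVariant, phase, haplotag, DeepVariant).
        deepvariant(aligned, pass1_vcf),
        phase(pass1_vcf, aligned, pass1_phased),
        haplotag(pass1_phased, aligned, pass1_tagged),
        deepvariant(pass1_tagged, pass2_vcf),
        # Step 3: phase the final small-variant calls.
        phase(pass2_vcf, aligned, final_phased),
        # Step 4: haplotag the aligned BAM with the final phased calls.
        haplotag(final_phased, aligned, final_tagged),
    ]


if __name__ == "__main__":
    for argv in two_pass_commands("sample1"):
        print(" ".join(argv))
```

The point of the sketch is the data flow: the second DeepVariant pass reads the haplotagged BAM produced from the first pass, while the final phasing and haplotagging run against the original alignments.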

Directory structure within basedir

.
├── cluster_logs  # slurm stderr/stdout logs
├── reference
│   ├── reference.chr_lengths.txt  # cut -f1,2 reference.fasta.fai > reference.chr_lengths.txt
│   ├── reference.fasta
│   └── reference.fasta.fai
├── samples
│   └── <sample_id>  # sample_id regex: r'[A-Za-z0-9_-]+'
│       ├── whatshap/  # phased small variants; merged haplotagged alignments
│       ├── logs/  # per-rule stdout/stderr logs
│       ├── aligned/  # intermediate
│       ├── deepvariant/  # intermediate
│       ├── deepvariant_intermediate/  # intermediate
│       └── whatshap_intermediate/  # intermediate
├── smrtcells
│   ├── done  # move folders from smrtcells/ready to smrtcells/done to prevent re-processing
│   └── ready
│       └── <sample_id>  # uBAMs or FASTQs per sample
│                        # filename regex: r'(m\d{5}[Ue]?_\d{6}_\d{6})\.(ccs|hifi_reads)\.bam' or r'(m\d{5}[Ue]?_\d{6}_\d{6})\.fastq\.gz'
└── workflow  # clone of this repo
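The naming rules in the tree can be checked before any data is copied into place. The sketch below assumes the filename patterns are meant to capture the movie name in a group and to match literal dots. The helper names `is_valid_sample_id` and `movie_name` are hypothetical and not part of the workflow.

```python
import re
from typing import Optional

# Patterns adapted from the comments in the directory tree.
SAMPLE_ID = re.compile(r"[A-Za-z0-9_-]+")
MOVIE_BAM = re.compile(r"(m\d{5}[Ue]?_\d{6}_\d{6})\.(ccs|hifi_reads)\.bam")
MOVIE_FASTQ = re.compile(r"(m\d{5}[Ue]?_\d{6}_\d{6})\.fastq\.gz")


def is_valid_sample_id(sample_id: str) -> bool:
    """True if the whole string consists of letters, digits, '_' or '-'."""
    return SAMPLE_ID.fullmatch(sample_id) is not None


def movie_name(filename: str) -> Optional[str]:
    """Return the movie name if the file name matches either pattern, else None."""
    for pattern in (MOVIE_BAM, MOVIE_FASTQ):
        match = pattern.fullmatch(filename)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(is_valid_sample_id("HG002_rep-1"))
    print(movie_name("m64012_190920_173625.hifi_reads.bam"))
    print(movie_name("reads.bam"))
```

Using `fullmatch` rather than `search` rejects names with extra prefixes or suffixes, which seems closest to the intent of the comments in the tree.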

To run the pipeline

$ conda create \
    --channel bioconda \
    --channel conda-forge \
    --prefix ./conda_env \
    python=3 snakemake mamba lockfile

$ conda activate ./conda_env

$ sbatch workflow/run_snakemake.sh <sample_id>
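Before submitting the job, it can help to confirm that basedir matches the layout described earlier. The following is a minimal pre-flight sketch, not part of the workflow: the function name `preflight_problems` is hypothetical, and the checks only cover the paths named in this README.

```python
from pathlib import Path
from typing import List, Union

# Reference files listed in the directory structure.
REQUIRED_REFERENCE_FILES = (
    "reference.fasta",
    "reference.fasta.fai",
    "reference.chr_lengths.txt",
)


def preflight_problems(basedir: Union[str, Path], sample_id: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the layout looks ready."""
    base = Path(basedir)
    problems = []

    for name in REQUIRED_REFERENCE_FILES:
        if not (base / "reference" / name).is_file():
            problems.append(f"missing reference/{name}")

    ready = base / "smrtcells" / "ready" / sample_id
    if not ready.is_dir():
        problems.append(f"missing smrtcells/ready/{sample_id}/")
    elif not any(p.name.endswith((".bam", ".fastq.gz")) for p in ready.iterdir()):
        problems.append(f"no .bam or .fastq.gz files in smrtcells/ready/{sample_id}/")

    if not (base / "workflow" / "run_snakemake.sh").is_file():
        problems.append("missing workflow/run_snakemake.sh")

    return problems


if __name__ == "__main__":
    import sys

    # Usage (assumed): python preflight.py <sample_id>, run from basedir.
    issues = preflight_problems(".", sys.argv[1] if len(sys.argv) > 1 else "")
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```

Run from basedir, a non-zero exit status signals that something is missing, so the check can gate the `sbatch` call in a wrapper script.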

Owner

William Rowell
