RoseTTAFold's three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging X-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking.
This page briefly describes the installation of RoseTTAFold on Maxwell. RoseTTAFold needs multiple genetic (sequence) databases to run. The databases are huge, and downloads can take a very long time (days). We therefore provide a central installation under /beegfs/desy/group/it/ReferenceData/rosettafold.
Samples and sources for RoseTTAFold can be found in /software/rosettafold/2021-07.
Below is a description of how to run RoseTTAFold.
Running RoseTTAFold
The RoseTTAFold distribution provides two scripts (/software/rosettafold/2021-07/run_e2e_ver.sh, /software/rosettafold/2021-07/run_pyrosetta_ver.sh) for convenience. The scripts will not work properly unless you run them from inside your own RoseTTAFold installation.
The scripts execute python $PIPEDIR/network/predict_pyRosetta.py, which tries to use $PIPEDIR/network/equivariant_attention/from_se3cnn/cache for caching and locking; that is not possible in a shared installation. To get around the issue you need to:
- copy /software/rosettafold/2021-07/network to your working directory
- make a copy of /software/rosettafold/2021-07/run_pyrosetta_ver.sh and
- change $PIPEDIR/network/predict_pyRosetta.py to <your-network-copy>/predict_pyRosetta.py (a sketch of these steps follows below)
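A minimal sketch of this workaround, assuming a hypothetical working directory $HOME/rosettafold-run (adjust names and paths to your own setup):
RF_SRC=/software/rosettafold/2021-07
WORK=$HOME/rosettafold-run                  # hypothetical working directory
mkdir -p $WORK && cd $WORK
cp -r $RF_SRC/network .                     # private, writable copy of the network folder
cp $RF_SRC/run_pyrosetta_ver.sh .
# point the copied script at the local copy of predict_pyRosetta.py
sed -i "s|\$PIPEDIR/network/predict_pyRosetta.py|$WORK/network/predict_pyRosetta.py|" run_pyrosetta_ver.sh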
/software/rosettafold/2021-07/sbatch-rosettafold.sh gives an example of a fully runnable script, using the example distributed with RoseTTAFold:
# Usage:
Usage: sbatch sbatch-rosettafold.sh <fasta-sequence>
Environments:
RF_BASE=<rosettafold installation folder> [/software/rosettafold/2021-07]
RF_NET=<location of network> [/home/schluenz/RosettaFold2/network]
RF_DB=<database> [/software/rosettafold/2021-07/pdb100_2021Mar03/pdb100_2021Mar03]
RF_WDIR=<workdir> [/home/schluenz/RosettaFold2]
# run rosettafold on T1050 with current DB
mkdir -p $HOME/rosettafold.t1050; cd $HOME/rosettafold.t1050
sbatch /software/rosettafold/2021-07/sbatch-rosettafold.sh /software/alphafold/2.0/T1050.fasta
# run rosettafold on T1050 with last year's DB (CASP14)
mkdir -p $HOME/rosettafold.casp14.t1050; cd $HOME/rosettafold.casp14.t1050
export RF_DB=/beegfs/desy/group/it/ReferenceData/rosettafold/pdb100_2020Mar11/pdb100_2020Mar11
sbatch --partition=maxgpu --constraint=V100 --time=0-08:00 /software/rosettafold/2021-07/sbatch-rosettafold.sh /software/alphafold/2.0/T1050.fasta
The batch script is a modified version of run_pyrosetta_ver.sh and should give an idea of how to use it:
#!/bin/bash
#SBATCH --partition=allgpu
#SBATCH --constraint='A100|V100'
#SBATCH --time=0-08:00
unset LD_PRELOAD
# execute as sbatch sbatch-rosettafold.sh <fasta-sequence>
if [ "$#" -ne 1 ]; then
echo "Usage: sbatch sbatch-rosettafold.sh <fasta-sequence>"
echo "Environments:"
echo " RF_BASE=<rosettafold installation folder> [/software/rosettafold/2021-07]"
echo " RF_NET=<location of network> [$PWD/network]"
echo " RF_DB=<database> [/software/rosettafold/2021-07/pdb100_2021Mar03/pdb100_2021Mar03]"
echo " RF_WDIR=<workdir> [$PWD]"
exit 1
else
fafile=$1
echo "Running RoseTTAFold on $fafile"
fi
# make the script stop on error (non-true exit code)
set -e
#
# conda setup
#
source /etc/profile.d/modules.sh
module purge
which conda > /dev/null 2>&1 || module load anaconda3
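# initialize conda's shell functions so that 'conda activate' works in this non-interactive batch shell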
__conda_setup="$('conda' 'shell.bash' 'hook' 2> /dev/null)"
eval "$__conda_setup"
unset __conda_setup
#
# RoseTTAFold installation folder
#
export PIPEDIR="${RF_BASE:-/software/rosettafold/2021-07}"
export NETWORK="${RF_NET:-$PWD/network}"
export DB="${RF_DB:-$PIPEDIR/pdb100_2021Mar03/pdb100_2021Mar03}"
export WDIR="${RF_WDIR:-$PWD}"
mkdir -p $WDIR/log
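# use all logical cores and the currently free memory (in GB) minus 50 GB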
CPU=$(nproc)
MEM=$(free -g | grep Mem: | awk '{print $4 - 50}')
IN=$fafile
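# sequence length, assuming the FASTA sequence is on a single line (the last line of the file)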
LEN=`tail -n1 $IN | wc -m`
cat <<EOF
RoseTTAFold Setup:
----------------------------------------------------------------------------------------------------
PIPEDIR....: $PIPEDIR
NETWORK....: $NETWORK
DB.........: $DB
WDIR.......: $WDIR
TARGET.....: $IN
Length.....: $LEN
----------------------------------------------------------------------------------------------------
Hardware Setup
----------------------------------------------------------------------------------------------------
Host.......: $(hostname)
CPU........: $(grep "model name" /proc/cpuinfo | head -1 | cut -d: -f2 | grep -o '[A-Za-z].*')
GPU........: $(nvidia-smi -L |cut -d'(' -f1 | tr '\n' ' ')
Cores......: $CPU out of $(nproc)
Memory.....: $MEM out of $(free -g | grep Mem | awk '{print $2}')
Time.......: $(date)
EOF
#
# you need a local copy of the network folder!
#
if [[ ! -e $NETWORK ]]; then
mkdir -p $NETWORK
cp -r $PIPEDIR/network/* $NETWORK
fi
conda activate /software/rosettafold/2021-07/RoseTTAFold
############################################################
# 1. generate MSAs
############################################################
if [ ! -s $WDIR/t000_.msa0.a3m ]
then
echo "Running HHblits"
$PIPEDIR/input_prep/make_msa.sh $IN $WDIR $CPU $MEM > $WDIR/log/make_msa.stdout 2> $WDIR/log/make_msa.stderr
fi
############################################################
# 2. predict secondary structure for HHsearch run
############################################################
if [ ! -s $WDIR/t000_.ss2 ]
then
echo "Running PSIPRED"
$PIPEDIR/input_prep/make_ss.sh $WDIR/t000_.msa0.a3m $WDIR/t000_.ss2 > $WDIR/log/make_ss.stdout 2> $WDIR/log/make_ss.stderr
fi
############################################################
# 3. search for templates
############################################################
if [ ! -s $WDIR/t000_.hhr ]
then
echo "Running hhsearch"
HH="hhsearch -b 50 -B 500 -z 50 -Z 500 -mact 0.05 -cpu $CPU -maxmem $MEM -aliw 100000 -e 100 -p 5.0 -d $DB"
cat $WDIR/t000_.ss2 $WDIR/t000_.msa0.a3m > $WDIR/t000_.msa0.ss2.a3m
$HH -i $WDIR/t000_.msa0.ss2.a3m -o $WDIR/t000_.hhr -atab $WDIR/t000_.atab -v 0 > $WDIR/log/hhsearch.stdout 2> $WDIR/log/hhsearch.stderr
fi
############################################################
# 4. predict distances and orientations
############################################################
# crucial: use local copy of network/predict_pyRosetta.py
if [ ! -s $WDIR/t000_.3track.npz ]
then
echo "Predicting distance and orientations"
python $NETWORK/predict_pyRosetta.py \
-m $PIPEDIR/weights \
-i $WDIR/t000_.msa0.a3m \
-o $WDIR/t000_.3track \
--hhr $WDIR/t000_.hhr \
--atab $WDIR/t000_.atab \
--db $DB 1> $WDIR/log/network.stdout 2> $WDIR/log/network.stderr
fi
############################################################
# 5. perform modeling
############################################################
mkdir -p $WDIR/pdb-3track
conda deactivate
conda activate /software/rosettafold/2021-07/folding
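# build a list of folding jobs (3 values of -m x 5 values of -pd = 15 runs) and execute them below with GNU parallel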
for m in 0 1 2
do
for p in 0.05 0.15 0.25 0.35 0.45
do
for ((i=0;i<1;i++))
do
if [ ! -f $WDIR/pdb-3track/model${i}_${m}_${p}.pdb ]; then
echo "python -u $PIPEDIR/folding/RosettaTR.py --roll -r 3 -pd $p -m $m -sg 7,3 $WDIR/t000_.3track.npz $IN $WDIR/pdb-3track/model${i}_${m}_${p}.pdb"
fi
done
done
done > $WDIR/parallel.fold.list
N=`cat $WDIR/parallel.fold.list | wc -l`
if [ "$N" -gt "0" ]; then
echo "Running parallel RosettaTR.py - Using $CPU cores - Number of runs: $N "
parallel -j $CPU < $WDIR/parallel.fold.list > $WDIR/log/folding.stdout 2> $WDIR/log/folding.stderr
else
echo "Nothing found to run parallel RosettaTR.py"
fi
############################################################
# 6. Pick final models
############################################################
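# one error-estimate .npz per model is expected in pdb-3track; run DeepAccNet-msa only if fewer than 15 are present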
count=$(find $WDIR/pdb-3track -maxdepth 1 -name '*.npz' | grep -v 'features' | wc -l)
if [ "$count" -lt "15" ]; then
# run DeepAccNet-msa
echo "Running DeepAccNet-msa - count: $count"
python $PIPEDIR/DAN-msa/ErrorPredictorMSA.py --roll -p $CPU $WDIR/t000_.3track.npz $WDIR/pdb-3track $WDIR/pdb-3track 1> $WDIR/log/DAN_msa.stdout 2> $WDIR/log/DAN_msa.stderr
else
echo "Skipping DeepAccNet-msa - count: $count"
fi
if [ ! -s $WDIR/model/model_5.crderr.pdb ]
then
echo "Picking final models"
python -u -W ignore $PIPEDIR/DAN-msa/pick_final_models.div.py \
$WDIR/pdb-3track $WDIR/model $CPU > $WDIR/log/pick.stdout 2> $WDIR/log/pick.stderr
echo "Final models saved in: $WDIR/model"
fi
echo "Done"
Runtime
| # | CPU | Cores | Memory | GPU | #GPU | Elapsed (hh:mm:ss) |
|---|-----|-------|--------|-----|------|--------------------|
| 1 | AMD EPYC 7302 | 2x16 (+HT) | 512 GB | NVIDIA A100-PCIE-40GB | 4 | 03:56:45 |
|   | AMD EPYC 7302 | 2x16 | 512 GB | NVIDIA A100-PCIE-40GB | 4 | 04:25:27 |
| 2 | Intel(R) Xeon(R) Silver 4210 | 2x10 (+HT) | 384 GB | NVIDIA Tesla V100S-PCIE-32GB | 1 | 04:27:00 |
|   | Intel(R) Xeon(R) Silver 4210 | 2x10 | 384 GB | NVIDIA Tesla V100S-PCIE-32GB | 1 | 05:05:47 |
- The memory of P100 GPUs is too small; only A100 and V100 will work.
- Runs using all cores (physical + hyper-threads) are significantly faster than runs using only the physical cores, so use all of them (sbatch-rosettafold.sh does).