RoseTTAFold's three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging X-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking.
This page briefly describes the installation of RoseTTAFold on Maxwell. RoseTTAFold needs multiple genetic (sequence) databases to run. The databases are huge, and downloads can take a very long time (days). We therefore provide a central installation under /beegfs/desy/group/it/ReferenceData/rosettafold.
Samples and sources for RoseTTAFold can be found in /software/rosettafold/2021-07.
Below is a description of how to run RoseTTAFold.
Running RoseTTAFold
The RoseTTAFold distribution provides two scripts (/software/rosettafold/2021-07/run_e2e_ver.sh, /software/rosettafold/2021-07/run_pyrosetta_ver.sh) for convenience. The scripts will not work properly unless you run them from inside your own RoseTTAFold installation.
The scripts execute python $PIPEDIR/network/predict_pyRosetta.py, which tries to use $PIPEDIR/network/equivariant_attention/from_se3cnn/cache for caching and locking; that is not possible in a shared installation. To get around the issue you need to:
- copy /software/rosettafold/2021-07/network to your working directory
- make a copy of /software/rosettafold/2021-07/run_pyrosetta_ver.sh and
- change $PIPEDIR/network/predict_pyRosetta.py to <your-network-copy>/predict_pyRosetta.py (a sketch of these steps follows below)
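A minimal sketch of this workaround, assuming a hypothetical working directory $HOME/rosettafold-run (adjust names and paths to your own setup):
RF_SRC=/software/rosettafold/2021-07
WORK=$HOME/rosettafold-run                  # hypothetical working directory
mkdir -p $WORK && cd $WORK
cp -r $RF_SRC/network .                     # private, writable copy of the network folder
cp $RF_SRC/run_pyrosetta_ver.sh .
# point the copied script at the local copy of predict_pyRosetta.py
sed -i "s|\$PIPEDIR/network/predict_pyRosetta.py|$WORK/network/predict_pyRosetta.py|" run_pyrosetta_ver.sh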
/software/rosettafold/2021-07/sbatch-rosettafold.sh gives an example of a fully runnable script, using the example distributed with RoseTTAFold:
# Usage:
Usage: sbatch sbatch-rosettafold.sh <fasta-sequence>
Environments:
RF_BASE=<rosettafold installation folder> [/software/rosettafold/2021-07]
RF_NET=<location of network> [/home/schluenz/RosettaFold2/network]
RF_DB=<database> [/software/rosettafold/2021-07/pdb100_2021Mar03/pdb100_2021Mar03]
RF_WDIR=<workdir> [/home/schluenz/RosettaFold2]
# run rosettafold on T1050 with current DB
mkdir -p $HOME/rosettafold.t1050; cd $HOME/rosettafold.t1050
sbatch /software/rosettafold/2021-07/sbatch-rosettafold.sh /software/alphafold/2.0/T1050.fasta
# run rosettafold on T1050 with last year's DB (CASP14)
mkdir -p $HOME/rosettafold.casp14.t1050; cd $HOME/rosettafold.casp14.t1050
export RF_DB=/beegfs/desy/group/it/ReferenceData/rosettafold/pdb100_2020Mar11/pdb100_2020Mar11
sbatch --partition=maxgpu --constraint=V100 --time=0-08:00 /software/rosettafold/2021-07/sbatch-rosettafold.sh /software/alphafold/2.0/T1050.fasta
The batch script is a modified version of run_pyrosetta_ver.sh and should give an idea of how to use it:
#!/bin/bash
#SBATCH --partition=allgpu
#SBATCH --constraint='A100|V100'
#SBATCH --time=0-08:00
unset LD_PRELOAD
# execute as sbatch sbatch-rosettafold.sh <fasta-sequence>
if [ "$#" -ne 1 ]; then
echo "Usage: sbatch sbatch-rosettafold.sh <fasta-sequence>"
echo "Environments:"
echo " RF_BASE=<rosettafold installation folder> [/software/rosettafold/2021-07]"
echo " RF_NET=<location of network> [$PWD/network]"
echo " RF_DB=<database> [/software/rosettafold/2021-07/pdb100_2021Mar03/pdb100_2021Mar03]"
echo " RF_WDIR=<workdir> [$PWD]"
exit 1
else
fafile=$1
echo "Running RoseTTAFold on $fafile"
fi
# make the script stop on error (non-true exit code)
set -e
#
# conda setup
#
source /etc/profile.d/modules.sh
module purge
which conda > /dev/null 2>&1 || module load anaconda3
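# initialize conda's shell functions so that 'conda activate' works in this non-interactive batch shell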
__conda_setup="$('conda' 'shell.bash' 'hook' 2> /dev/null)"
eval "$__conda_setup"
unset __conda_setup
#
# RoseTTAFold installation folder
#
export PIPEDIR="${RF_BASE:-/software/rosettafold/2021-07}"
export NETWORK="${RF_NET:-$PWD/network}"
export DB="${RF_DB:-$PIPEDIR/pdb100_2021Mar03/pdb100_2021Mar03}"
export WDIR="${RF_WDIR:-$PWD}"
mkdir -p $WDIR/log
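# use all logical cores and the currently free memory (in GB) minus 50 GB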
CPU=$(nproc)
MEM=$(free -g | grep Mem: | awk '{print $4 - 50}')
IN=$fafile
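# sequence length, assuming the FASTA sequence is on a single line (the last line of the file)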
LEN=`tail -n1 $IN | wc -m`
cat <<EOF
RoseTTAFold Setup:
----------------------------------------------------------------------------------------------------
PIPEDIR....: $PIPEDIR
NETWORK....: $NETWORK
DB.........: $DB
WDIR.......: $WDIR
TARGET.....: $IN
Length.....: $LEN
----------------------------------------------------------------------------------------------------
Hardware Setup
----------------------------------------------------------------------------------------------------
Host.......: $(hostname)
CPU........: $(grep "model name" /proc/cpuinfo | head -1 | cut -d: -f2 | grep -o '[A-Za-z].*')
GPU........: $(nvidia-smi -L |cut -d'(' -f1 | tr '\n' ' ')
Cores......: $CPU out of $(nproc)
Memory.....: $MEM out of $(free -g | grep Mem | awk '{print $2}')
Time.......: $(date)
EOF
#
# you need a local copy of the network folder!
#
if [[ ! -e $NETWORK ]]; then
mkdir -p $NETWORK
cp -r $PIPEDIR/network/* $NETWORK
fi
conda activate /software/rosettafold/2021-07/RoseTTAFold
############################################################
# 1. generate MSAs
############################################################
if [ ! -s $WDIR/t000_.msa0.a3m ]
then
echo "Running HHblits"
$PIPEDIR/input_prep/make_msa.sh $IN $WDIR $CPU $MEM > $WDIR/log/make_msa.stdout 2> $WDIR/log/make_msa.stderr
fi
############################################################
# 2. predict secondary structure for HHsearch run
############################################################
if [ ! -s $WDIR/t000_.ss2 ]
then
echo "Running PSIPRED"
$PIPEDIR/input_prep/make_ss.sh $WDIR/t000_.msa0.a3m $WDIR/t000_.ss2 > $WDIR/log/make_ss.stdout 2> $WDIR/log/make_ss.stderr
fi
############################################################
# 3. search for templates
############################################################
if [ ! -s $WDIR/t000_.hhr ]
then
echo "Running hhsearch"
HH="hhsearch -b 50 -B 500 -z 50 -Z 500 -mact 0.05 -cpu $CPU -maxmem $MEM -aliw 100000 -e 100 -p 5.0 -d $DB"
cat $WDIR/t000_.ss2 $WDIR/t000_.msa0.a3m > $WDIR/t000_.msa0.ss2.a3m
$HH -i $WDIR/t000_.msa0.ss2.a3m -o $WDIR/t000_.hhr -atab $WDIR/t000_.atab -v 0 > $WDIR/log/hhsearch.stdout 2> $WDIR/log/hhsearch.stderr
fi
############################################################
# 4. predict distances and orientations
############################################################
# crucial: use local copy of network/predict_pyRosetta.py
if [ ! -s $WDIR/t000_.3track.npz ]
then
echo "Predicting distance and orientations"
python $NETWORK/predict_pyRosetta.py \
-m $PIPEDIR/weights \
-i $WDIR/t000_.msa0.a3m \
-o $WDIR/t000_.3track \
--hhr $WDIR/t000_.hhr \
--atab $WDIR/t000_.atab \
--db $DB 1> $WDIR/log/network.stdout 2> $WDIR/log/network.stderr
fi
############################################################
# 5. perform modeling
############################################################
mkdir -p $WDIR/pdb-3track
conda deactivate
conda activate /software/rosettafold/2021-07/folding
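# build a list of folding jobs (3 values of -m x 5 values of -pd = 15 runs) and execute them below with GNU parallel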
for m in 0 1 2
do
for p in 0.05 0.15 0.25 0.35 0.45
do
for ((i=0;i<1;i++))
do
if [ ! -f $WDIR/pdb-3track/model${i}_${m}_${p}.pdb ]; then
echo "python -u $PIPEDIR/folding/RosettaTR.py --roll -r 3 -pd $p -m $m -sg 7,3 $WDIR/t000_.3track.npz $IN $WDIR/pdb-3track/model${i}_${m}_${p}.pdb"
fi
done
done
done > $WDIR/parallel.fold.list
N=`cat $WDIR/parallel.fold.list | wc -l`
if [ "$N" -gt "0" ]; then
echo "Running parallel RosettaTR.py - Using $CPU cores - Number of runs: $N "
parallel -j $CPU < $WDIR/parallel.fold.list > $WDIR/log/folding.stdout 2> $WDIR/log/folding.stderr
else
echo "Nothing found to run parallel RosettaTR.py"
fi
############################################################
# 6. Pick final models
############################################################
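# one error-estimate .npz per model is expected in pdb-3track; run DeepAccNet-msa only if fewer than 15 are present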
count=$(find $WDIR/pdb-3track -maxdepth 1 -name '*.npz' | grep -v 'features' | wc -l)
if [ "$count" -lt "15" ]; then
# run DeepAccNet-msa
echo "Running DeepAccNet-msa - count: $count"
python $PIPEDIR/DAN-msa/ErrorPredictorMSA.py --roll -p $CPU $WDIR/t000_.3track.npz $WDIR/pdb-3track $WDIR/pdb-3track 1> $WDIR/log/DAN_msa.stdout 2> $WDIR/log/DAN_msa.stderr
else
echo "Skipping DeepAccNet-msa - count: $count"
fi
if [ ! -s $WDIR/model/model_5.crderr.pdb ]
then
echo "Picking final models"
python -u -W ignore $PIPEDIR/DAN-msa/pick_final_models.div.py \
$WDIR/pdb-3track $WDIR/model $CPU > $WDIR/log/pick.stdout 2> $WDIR/log/pick.stderr
echo "Final models saved in: $WDIR/model"
fi
echo "Done"
Runtime
| # | CPU | Cores | Memory | GPU | #GPU | Elapsed (hh:mm:ss) |
|---|-----|-------|--------|-----|------|--------------------|
| 1 | AMD EPYC 7302 | 2x16 (+HT) | 512 GB | NVIDIA A100-PCIE-40GB | 4 | 03:56:45 |
|   | AMD EPYC 7302 | 2x16 | 512 GB | NVIDIA A100-PCIE-40GB | 4 | 04:25:27 |
| 2 | Intel(R) Xeon(R) Silver 4210 | 2x10 (+HT) | 384 GB | NVIDIA Tesla V100S-PCIE-32GB | 1 | 04:27:00 |
|   | Intel(R) Xeon(R) Silver 4210 | 2x10 | 384 GB | NVIDIA Tesla V100S-PCIE-32GB | 1 | 05:05:47 |
- The memory of P100 GPUs is too small; only A100 and V100 will work.
- Runs using all cores (physical + hyper-threads) are significantly faster than runs using only the physical cores, so use all of them (sbatch-rosettafold.sh does).