HPC Facility

The HPC Facility (SURYA) has a 16-node CPU cluster (640 cores) and a 4-node GPU cluster (160 CPU cores, with 2 Nvidia Tesla V100 GPUs per node for 40,960 CUDA cores in total), along with 8.5 TB of RAM. The facility uses a ~200 TB DDN GRIDScaler parallel file system delivering 15 GB/s throughput over a 100 Gbps interconnect network.

Queuing Systems

When a job is submitted, it is placed in a queue. Different queues are available for different purposes. The user must select the one queue from the list below that matches his/her computational needs.

  • Queue for submitting CPU jobs: available to all users. A single user can
    submit only 1 job at a time.

  Name of Queue = CORE160
  No of nodes = 4
  No of x86 Processors = 160
  Name of node = Any CPU node {1-16}
  Walltime = 360 hrs
  MaxJob = 1 per user

  • Queue for submitting CPU jobs: available to all users. A single user can
    submit only 1 job at a time.

  Name of Queue = CORE320
  No of nodes = 8
  No of x86 Processors = 320
  Name of node = Any CPU node {1-16}
  Walltime = 24 hrs
  MaxJob = 1 per user

  • Queue for submitting GPU jobs: available to all users. Only jobs that
    utilize GPUs are allowed. A single user can submit only 1 job at a time.

  Name of Queue = GPU
  No of nodes = 1
  No of x86 Processors = 40
  CUDA cores = 10,240
  Name of node = Any GPU node {17-20}
  Walltime = 360 hrs
  MaxJob = 1 per user
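As a rough illustration of the choice above, a CPU queue can be picked from the core count and expected runtime. This is a sketch using the limits from the list; the `cores` and `hours` variables are hypothetical example values, not part of the cluster software:

```shell
# Sketch: choose a CPU queue from cores needed and expected runtime.
# Limits are taken from the queue list above; cores/hours are example values.
cores=320
hours=12
if [ "$cores" -le 160 ]; then
  queue=CORE160        # up to 160 cores, 360 hr walltime
elif [ "$cores" -le 320 ] && [ "$hours" -le 24 ]; then
  queue=CORE320        # up to 320 cores, but only 24 hr walltime
else
  queue=none           # no CPU queue fits; reduce cores or split the run
fi
echo "Selected queue: ${queue}"
```

Note the trade-off the sketch encodes: CORE320 doubles the cores but cuts the walltime limit from 360 hrs to 24 hrs.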

Node Configuration

Based on the queuing system given above, the node configurations can be summarized as follows:

  Queue Type   Queue Name   Node Configuration
  CPU   CORE160   CPU : 160, RAM : 1,536 GB
  CPU   CORE320   CPU : 320, RAM : 3,072 GB
  GPU   GPU   CPU : 40, RAM : 384 GB, 2x Tesla V100 (16 GB each)
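The per-queue totals in the table follow directly from the per-node figures (40 x86 cores and 384 GB RAM per CPU node). A quick sanity check of that arithmetic:

```shell
# Sanity check: each CPU node has 40 cores and 384 GB RAM (per the table above),
# so the queue totals are just node-count multiples.
cores_per_node=40
ram_per_node=384
echo "CORE160: $((4 * cores_per_node)) cores, $((4 * ram_per_node)) GB RAM"
echo "CORE320: $((8 * cores_per_node)) cores, $((8 * ram_per_node)) GB RAM"
```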

Sample scripts to submit jobs to the various queues:

CPU Queue : CORE160

      #!/bin/bash

      #PBS -u USER_NAME
      #PBS -N STUDENT_NAME
      #PBS -q core160
      #PBS -l nodes=4:ppn=40
      #PBS -o out.log
      #PBS -j oe
      #PBS -V

      module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition
      cd $PBS_O_WORKDIR
      mpiexec.hydra -f $PBS_NODEFILE -np 160 "/SCRIPT_PATH/"
      ./Job_script.sh
      exit;
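In the script above, the -np value passed to mpiexec.hydra should equal nodes x ppn from the #PBS -l line (4 x 40 = 160). A sketch that derives the value instead of hard-coding it:

```shell
# Derive the MPI rank count from the requested nodes and processes per node,
# so -np always matches the "#PBS -l nodes=4:ppn=40" request above.
nodes=4
ppn=40
np=$((nodes * ppn))
echo "mpiexec.hydra -np ${np}"
```

The same arithmetic gives -np 320 for the CORE320 script (8 x 40).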

CPU Queue : CORE320

      #!/bin/bash

      #PBS -u USER_NAME
      #PBS -N STUDENT_NAME
      #PBS -q core320
      #PBS -l nodes=8:ppn=40
      #PBS -o out.log
      #PBS -j oe
      #PBS -V

      module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition
      cd $PBS_O_WORKDIR
      mpiexec.hydra -f $PBS_NODEFILE -np 320 "/SCRIPT_PATH/"
      ./Job_script.sh
      exit;

GPU Queue : GPU

      #!/bin/bash

      #PBS -u USER_NAME
      #PBS -N STUDENT_NAME
      #PBS -q GPU
      #PBS -l select=1:ncpus=20:ngpus=1
      ## For TWO GPUs, replace the line above with:   #PBS -l select=2:ncpus=20:ngpus=1
      ## For a specific node, replace it with:        #PBS -l nodes={17-20}:ncpus=20:ngpus=1
      #PBS -o out.log
      #PBS -j oe
      #PBS -V

      module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition
      export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
      export CUDA_VISIBLE_DEVICES=0,1
      cd $PBS_O_WORKDIR
      python your_script_name.py
      mpiexec.hydra -np 2 your_script_name.sh
      ./Job_script.sh
      exit;
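The alternative #PBS -l lines in the script above differ only in the select count (one resource chunk per GPU). A sketch that emits the right request line for a desired GPU count; `want_gpus` is a hypothetical variable, not a PBS setting:

```shell
# Emit the GPU resource-request line for the desired number of GPUs,
# following the "select=<chunks>:ncpus=20:ngpus=1" pattern shown above.
want_gpus=2
req="#PBS -l select=${want_gpus}:ncpus=20:ngpus=1"
echo "$req"
```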


Useful Commands

  • Accessing a user account: ssh <username>@172.20.70.12
  • For submitting a job: qsub submit_script.sh
  • For checking queue status: qstat {-a, -s, -n}
  • For checking node status: ssh node{1-20}
  • For cancelling the job: qdel <job-id>
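The node{1-20} shorthand above expands to individual hosts. For example, to check each GPU node (17-20) in turn — a sketch in which the ssh line assumes you are already logged in to the master node:

```shell
# Expand the node{17-20} range and check each GPU node in turn.
for n in $(seq 17 20); do
  echo "checking node${n}"
  # ssh "node${n}" uptime   # run this on the cluster; commented out in the sketch
done
```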


Usage Guidelines

  • Users must submit jobs only through the scheduler.
  • Users must not run any job on the master node.
  • Users are not allowed to run jobs by logging in directly to any compute node.
  • Users must report any weaknesses in computer security, and any incidents of possible misuse or violation of the account policies, to the HPC administrators, or write regarding the same to ccf@iiita.ac.in.
  • There is no system backup for data in /home or any other partition. It is the user's responsibility to regularly back up his/her data to a location outside the cluster. We cannot recover data in any location, including files lost to system crashes or hardware failure, so make copies of your important data regularly.
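Since nothing on the cluster is recoverable after a crash, users should copy important data off the cluster themselves. A minimal local sketch using tar — the temp directories here are stand-ins for your data and an off-cluster destination; in practice the archive would be transferred out with scp or rsync:

```shell
# Minimal backup sketch: archive a directory and verify the archive's contents.
# mktemp dirs stand in for your data and an off-cluster destination.
src=$(mktemp -d)
dest=$(mktemp -d)
echo "important results" > "$src/results.txt"          # stand-in data file
tar -czf "$dest/backup_$(date +%F).tar.gz" -C "$src" .
tar -tzf "$dest/backup_$(date +%F).tar.gz"             # list archive contents
```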