Computers

Computational Environment

Yokota Lab provides access to multiple supercomputers.

In addition, a private cluster "Hinadori" is maintained within the lab.

Hinadori Cluster

The Hinadori cluster (Hinadori) is designed and operated to provide the latest environments, including ones not yet available on supercomputers, so that cutting-edge research can be conducted within the lab.
Students take the initiative in determining specifications, procurement, and operation.

Hardware

Yokota Lab conducts forefront research in HPC and deep learning, so GPUs are necessary for daily research.
Hence, we provide multiple GPU servers (28 GPUs in total):

CPU                     Host Memory  GPU (units per node)             Nodes
Intel Xeon Silver 4215  96GB         NVIDIA GeForce GTX 1080Ti (2)    4
                                     NVIDIA GeForce RTX 2080 (2)      2
                                     NVIDIA GeForce RTX 2080Ti (2)    1
                                     NVIDIA TITAN V (2)               1
                                     NVIDIA Tesla V100 PCIe 16GB (1)  1
Intel Xeon E5-2630v3    64GB         NVIDIA TITAN RTX (1)             1
Intel Core i9-7940X     64GB         NVIDIA A6000 (2)                 1
AMD EPYC 7742           1TB          NVIDIA A100 SXM4 (8)             1

(As of 2021.02.18)
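The stated total of 28 GPUs can be checked against the table with a few lines of arithmetic. The following is only an illustrative sketch: the list is transcribed from the table, and the variable names are made up for this example.

```python
# (GPU model, GPUs per node, number of nodes), transcribed from the table.
gpu_servers = [
    ("NVIDIA GeForce GTX 1080Ti", 2, 4),
    ("NVIDIA GeForce RTX 2080", 2, 2),
    ("NVIDIA GeForce RTX 2080Ti", 2, 1),
    ("NVIDIA TITAN V", 2, 1),
    ("NVIDIA Tesla V100 PCIe 16GB", 1, 1),
    ("NVIDIA TITAN RTX", 1, 1),
    ("NVIDIA A6000", 2, 1),
    ("NVIDIA A100 SXM4", 8, 1),
]

# Total GPUs = sum over rows of (GPUs per node) x (nodes).
total_gpus = sum(per_node * nodes for _, per_node, nodes in gpu_servers)
total_nodes = sum(nodes for _, _, nodes in gpu_servers)

print(total_gpus, total_nodes)  # prints "28 12"
```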

Other features of Hinadori include:

  • Login node as a bastion for SSH
  • 96TB file server
  • Private VPN service

These features enable students to conduct experiments remotely.

Hinadori also supports parallel computing across multiple nodes using MPI over 10GbE.
In addition, research is conducted using special processors, such as Intel KNL nodes and PEZY SC2.




Software

Job Scheduling System

Hinadori uses a customized job scheduling system based on the Slurm Workload Manager, which lets users submit jobs with little effort.

It includes a feature that makes it easy to use Slurm's ability to assign multiple jobs to a single node.
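The lab's customized wrapper is not described here, so the following is only a rough sketch under the assumption of plain Slurm: a job that requests just part of a node's resources leaves room for other jobs on the same node. The helper `make_batch_script`, its default values, and the example command are hypothetical; `--job-name`, `--nodes`, `--gres`, and `--cpus-per-task` are standard `sbatch` options.

```python
def make_batch_script(job_name, command, gpus=1, cpus=4):
    """Return the text of an sbatch script that requests part of one node.

    Hypothetical helper for illustration; not the lab's actual wrapper.
    """
    if gpus < 1 or cpus < 1:
        raise ValueError("gpus and cpus must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --nodes=1",                  # stay on a single node
        f"#SBATCH --gres=gpu:{gpus}",         # request only some of the node's GPUs
        f"#SBATCH --cpus-per-task={cpus}",    # and only some of its CPU cores
        command,
    ]
    return "\n".join(lines) + "\n"


# Example: a one-GPU job; the command is a placeholder.
script = make_batch_script("train", "python train.py", gpus=1)
print(script)
```

Whether two such jobs actually run side by side on one node depends on how the cluster's Slurm instance is configured, which this page does not specify.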

Monitoring System

Hinadori uses Prometheus to aggregate metrics and Grafana to visualize them.

The monitored metrics include the utilization of each CPU and GPU, which can be used for performance optimization, as well as usage history, GPU temperature, and power consumption, which are used for administration.

All users can access this information via a browser.
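Besides the browser, Prometheus also exposes an HTTP API for instant queries at `/api/v1/query`. As a minimal sketch, the code below builds such a query URL and parses a response of the documented instant-vector shape. The server address, the metric name `gpu_temperature_celsius`, and the `instance` and `gpu` labels are assumptions for illustration; the actual names depend on which exporters Hinadori runs.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build a URL for Prometheus's instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Map (instance, gpu) label pairs to float sample values."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    readings = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        key = (labels.get("instance", ""), labels.get("gpu", ""))
        readings[key] = float(value)
    return readings


# Hypothetical server address and metric name.
url = build_query_url("http://prometheus.example:9090", "gpu_temperature_celsius")

# Hand-written sample response in the instant-vector format (not real data).
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"instance": "node01", "gpu": "0"},
                      "value": [1613606400, "54"]}]}}
""")
readings = parse_instant_vector(sample_response)
print(url)
print(readings)
```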

Development Environment

Users can select the appropriate versions of CUDA libraries and compilers, which are managed with Environment Modules.

Besides standard applications, Hinadori provides in-house applications, such as one that records GPU temperature, power consumption, and other metrics during program execution.
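The in-house recording tool itself is not shown here. As a rough idea of how such a recorder might work, the sketch below samples temperature and power draw through `nvidia-smi --query-gpu` and parses its CSV output. The function names and the sample text are hypothetical, and this is not the lab's actual implementation.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "temperature.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, temperature, power = record
        rows.append({
            "index": int(index),
            "temperature_c": float(temperature),
            # Assumes the GPU reports power; some models print "[N/A]" instead.
            "power_w": float(power),
        })
    return rows


def sample_gpus():
    """Take one reading from every GPU on this machine (requires nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi",
         f"--query-gpu={','.join(QUERY_FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Made-up sample text in the same format, so the parser runs without a GPU.
example_rows = parse_gpu_csv("0, 54, 71.32\n1, 61, 243.10\n")
print(example_rows)
```

Calling `sample_gpus()` periodically while a program runs, and appending each reading to a file, would give a simple per-run log.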

Operation

Operations follow one simple but important rule: "Don't waste time managing."
Because the cluster is operated by students on a voluntary basis, it is essential that operation does not cut into research time.
Hence, to minimize maintenance time, we have introduced Ansible for configuration management, IPMI for remote management, and a SaaS LDAP service for user management.
Setting up a new node is also automated, so users can start using it within 30 minutes of OS installation.