The Harvard High-Availability Cloud for Research Computing
As part of the Faculty of Arts & Sciences (FAS) Division of Science at Harvard University, Research Computing (RC) facilitates the advancement of complex research by providing leading-edge computing services for high-performance and scientific computing, bioinformatics analysis, visualization, and data storage. Since 2008, Harvard FAS RC has met a significant scaling challenge, increasing its available High Performance Computing (HPC) and storage capacity from 200 cores and 20TB to over 70,000 cores and 35PB.
Recently, FAS RC designed a strategy to convert its legacy internal KVM (Kernel-based Virtual Machine) infrastructure from a homemade virtualized cluster dependent on scripted tools to a more robust, reliable, and automated private-cloud system. FAS RC then configured the system to integrate with the public cloud to improve agility, increase resource utilization, and allow implementation of further advanced services.
The first step was designing a cloud reference architecture with high availability, multitenancy, orchestration, and provisioning to yield a common frame of reference for all private-cloud instances. This provided a foundation for further development and innovation, and helped the team make well-founded strategic and technical decisions. The architecture provides APIs and features that help serve users more efficiently, and it makes it easier to test new configurations and dynamically increase resources for continuous integration and deployment.
The reference architecture adopts a classical cluster-like organization: a controller host, a set of hypervisors on which VMs are hosted, a storage cluster, and at least one physical network joining all of the hosts. The architecture has been implemented with OpenNebula as the orchestration manager, Ceph as the storage cluster, KVM as the hypervisor, and Microsoft Azure and Amazon Web Services as two public clouds for cloud bursting. The architecture was designed to span two datacenters, enabling live migration of running VMs between them for load balancing, during maintenance, or in case of issues.
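To make the OpenNebula/Ceph/KVM combination concrete, the sketch below shows how a KVM hypervisor and a Ceph-backed datastore might be registered with an OpenNebula front end. This is an illustrative fragment, not FAS RC's actual configuration: the hostnames, pool name, Ceph monitors, user, and secret UUID are all hypothetical placeholders, though the attribute names follow the OpenNebula Ceph datastore driver.

```shell
# Register a KVM hypervisor with the OpenNebula front end.
# "hv01" is a placeholder hostname; --im/--vm select the KVM
# information and virtualization drivers.
onehost create hv01 --im kvm --vm kvm

# Define a Ceph-backed image datastore. Values are illustrative.
cat > ceph-ds.conf <<'EOF'
NAME        = "ceph_images"
DS_MAD      = "ceph"          # datastore driver
TM_MAD      = "ceph"          # transfer manager driver
DISK_TYPE   = "RBD"           # VM disks stored as Ceph RBD images
POOL_NAME   = "one"           # Ceph pool holding the images
CEPH_HOST   = "mon1 mon2 mon3"                        # Ceph monitors
CEPH_USER   = "libvirt"                               # CephX user
CEPH_SECRET = "00000000-0000-0000-0000-000000000000"  # libvirt secret UUID
BRIDGE_LIST = "hv01"          # host(s) used to stage image operations
EOF
onedatastore create ceph-ds.conf
```

Because every hypervisor mounts the same Ceph-backed datastore, a running VM's disk is visible from all hosts, which is what makes live migration between hypervisors (and datacenters) possible without copying disk images.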
This article describes the lessons learned, challenges faced, and innovations made in the design and implementation of this system to guide other organizations transitioning to a private cloud and automated infrastructure.