We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data parallelism approach and uses partitioning methods to distribute the workload in FV primitives evenly across available GPUs. The design is put to address space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures, and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly as user intervention is not required for task decomposition,
scheduling or load balancing. We implement and evaluate the performance of our design on two homogeneous and heterogeneous NVIDIA GPU clusters: K80, and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU Levelled-FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs) to perform homomorphically image classification on encrypted images from the MNIST and CIFAR 10 datasets. Our implementation provides 1 to 3 orders of magnitude speedup compared with the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected.