As the collection of publicly available pre-trained vision foundation models (VFMs) such as CLIP, DINOv2, and SAM grows, users face storage, memory, and compute challenges when deploying several of them concurrently. To address these concerns, we introduce an approach that merges the capabilities of multiple VFMs into a single, efficient multi-task model. Our method, termed “joint distillation,” integrates teacher-student learning with self-distillation, operates only on unlabeled image data, and requires substantially less compute than traditional multi-task learning. As a practical demonstration, we merge CLIP and SAM and show that the resulting model, SAM-CLIP, not only retains the core strengths of both parent models but also exhibits synergistic capabilities, such as text-prompted zero-shot segmentation. Given the growing availability of VFMs, our approach promises to streamline model deployment and operations.
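To make the training scheme described above concrete, the snippet below is a minimal sketch of one joint-distillation step: a shared student backbone is optimized on unlabeled images to match a frozen external teacher (teacher-student distillation) while a second head matches a frozen copy of the student's original model (self-distillation). The toy module sizes, the cosine-style losses, and the weighting factor `lam` are illustrative assumptions, not the exact SAM-CLIP architecture or hyperparameters.

```python
# Illustrative joint-distillation step on unlabeled images (assumptions, not the
# paper's exact recipe): one loss distills a frozen external teacher, the other
# preserves the student's original behaviour via a frozen copy of itself.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for an image backbone: maps 3x32x32 images to a feature vector."""

    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        return self.net(x)


# Frozen external teacher (standing in for, e.g., the CLIP image encoder) and a
# frozen model standing in for the student's pre-merge network (e.g., SAM).
teacher = ToyEncoder(512).eval()
frozen_self = ToyEncoder(256).eval()
for p in list(teacher.parameters()) + list(frozen_self.parameters()):
    p.requires_grad_(False)

# Student: a shared backbone plus one light head per inherited capability.
backbone = ToyEncoder(256)
head_teacher_task = nn.Linear(256, 512)  # mimics the external teacher
head_self_task = nn.Linear(256, 256)     # preserves the original behaviour

params = (
    list(backbone.parameters())
    + list(head_teacher_task.parameters())
    + list(head_self_task.parameters())
)
opt = torch.optim.AdamW(params, lr=1e-4)
lam = 1.0  # relative weight of the self-distillation term (assumed)


def joint_distillation_step(images):
    """Run one optimization step of the combined distillation objective."""
    feats = backbone(images)
    with torch.no_grad():
        target_teacher = teacher(images)   # teacher-student target
        target_self = frozen_self(images)  # self-distillation target
    # Cosine-embedding style losses computed on unlabeled images only.
    loss_teacher = 1 - F.cosine_similarity(head_teacher_task(feats), target_teacher, dim=-1).mean()
    loss_self = 1 - F.cosine_similarity(head_self_task(feats), target_self, dim=-1).mean()
    loss = loss_teacher + lam * loss_self
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Example usage: one step on a batch of random "unlabeled" images.
print(joint_distillation_step(torch.randn(8, 3, 32, 32)))
```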