In this paper, we introduce a novel approach to automatically assign entity labels to images from existing noisy image-text pairs. The approach employees a named entity recognition model to extract entities from text, and uses a CLIP model to select the right entities as the labels of the paired image. The approach is simple, and can be readily scaled up to billions of image-text pairs mined from the web, through which we have successfully created a dataset with 2 millions of distinct entities. We study new training approaches on the collected new dataset with large scale entity labels, including supervised pre-training, contrastive pre-training, and mulit-task learning. Experiments show that supervised pre-training with large scale entity labels is very effective for image retrieval tasks, and multi-task training can further improve the performance. The final model, named \textbf{MOFI}, achieves 83.59% mAP on the challenging GPR1200 dataset, compared to the previous state-of-the-art 67.33% from OpenAI’s CLIP model. Further experiments on zero-shot and linear probe image classification tasks also show that our MOFI model outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the new dataset for learning general-purpose image representations.