Blog / Admirer: Open-Ended VQA Requiring Outside Knowledge
December 23rd, 2022 - The Transformees: Andrew Hinh and Aleks Hiidenhovi (Guest post) - 4 min read
Last updated: December 23, 2022.
For the ZenML Month of MLOps Competition, we created Admirer, a full-stack ML-powered website that uses users’ webcam feeds to answer open-ended questions requiring outside knowledge. Andrew built the MVP of the website as a top-25 final project for the FSDL 2022 course, writing only the deployment code. We continued the project for the ZenML Month of MLOps Competition and won the Most Promising Entry prize in the closing ceremony. During the competition, we wrote the data management, model development, testing, and continuous deployment scripts.
Here is a video summary of our submission.
The visual question-answering pipeline is inspired by this paper from Microsoft. In short, we prompt GPT-3 with a generated image caption and object-tag list, the question to be answered, and in-context examples that demonstrate the task, in a few-shot learning setup as shown in the diagram above. As a result, we achieve a BERTScore-computed F1 of around 0.989 on a randomly selected test set.
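As a rough illustration, a prompt in this style can be assembled as below. This is a hypothetical sketch, not our actual code: the function name, example caption, tags, and answers are all made up for demonstration.

```python
def build_prompt(caption, tags, question, examples):
    """Assemble a few-shot VQA prompt: in-context examples first, then the
    current image's caption, object tags, and question, leaving the answer
    for GPT-3 to complete."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Context: {ex['caption']}. Objects: {', '.join(ex['tags'])}.\n"
            f"Q: {ex['question']}\nA: {ex['answer']}"
        )
    # The final block has no answer; the model completes it.
    blocks.append(
        f"Context: {caption}. Objects: {', '.join(tags)}.\n"
        f"Q: {question}\nA:"
    )
    return "\n\n".join(blocks)

# One illustrative in-context example.
examples = [{
    "caption": "a man riding a bicycle down a street",
    "tags": ["person", "bicycle", "street"],
    "question": "What is the man doing?",
    "answer": "riding a bike",
}]

prompt = build_prompt(
    caption="two people standing in a kitchen",
    tags=["person", "hat", "glasses"],
    question="How many people are wearing hats?",
    examples=examples,
)
```

The resulting string ends with a dangling `A:`, so GPT-3's completion is the answer to the user's question.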
Because the image data is never fed directly to GPT-3, the best queries ask descriptive, counting, or similar questions about one or more objects visible in the background. For example, if there are two people in the image, one wearing a hat and the other wearing glasses, questions that would work well could include the following:
Since we had finished an MVP of the project for the Full Stack Deep Learning course, with a model deployed on AWS Lambda and served with Gradio, the main features we decided to build for the competition were the surrounding infrastructure. However, because of how the inference pipeline is structured and the limited time for the competition, we had to build that infrastructure around just one of the models in the pipeline. We chose the caption model, the model that seemed to fail most often. As a result, the features we built support for during the competition are the following:
As a result of the limited time and money we had during the competition, there are two major points of improvement. The first is improving the performance of the separate models used in the inference pipeline. Since we are using pre-trained HuggingFace models, we could collect more data from sources such as the GI4E project to fine-tune the models on. However, this is extremely costly, and something we are unlikely to do. Another option could be to train an end-to-end model, such as a ViT + GPT-2 encoder-decoder model, from scratch, as inspired by Sachin Abeywardana. Although still costly, this is probably the better and cheaper of the two options for improving the inference pipeline’s performance, since the image data itself is incorporated into GPT’s answers.
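For reference, HuggingFace's `transformers` library can stitch such an encoder-decoder together in a few lines. This is only a sketch under assumed defaults: the checkpoint names below are common starting points, not necessarily the ones we would train with, and the assembled model would still need end-to-end fine-tuning on image-text pairs.

```python
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Tie a pre-trained ViT encoder to a pre-trained GPT-2 decoder; the
# cross-attention weights start randomly initialized, so the combined
# model must be fine-tuned end to end before it is useful.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no pad token by default, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

From there, training reduces to feeding `pixel_values` from the processor and tokenized target text as `labels`, which is what makes this route attractive: the image itself, not just a caption of it, informs the generated answer.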
The second is creating data, training, and testing pipelines with both Makefiles and ZenML pipelines. This allows us to iterate on ideas for improving the pipeline’s performance more easily and to add features such as continual training, a feature of MLOps projects that ZenML discusses in detail here. In addition, it allows others to more easily reproduce and test our work.
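A minimal Makefile along these lines might look like the following; the target names and script paths are purely illustrative, not the repo's actual layout.

```make
# Hypothetical targets wrapping the ZenML pipelines so that each stage
# can be run (and reproduced) with a single command.
.PHONY: data train test

data:   ## run the data-management pipeline
	python pipelines/data_pipeline.py

train:  ## run the training pipeline
	python pipelines/train_pipeline.py

test:   ## run the test suite
	pytest tests/
```

Wrapping each ZenML pipeline in a Make target keeps the entry points discoverable, so a new contributor can reproduce any stage without reading the pipeline code first.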
Having recently built an MVP of the project for the Full Stack Deep Learning course, we were excited to build out the data management and training pipelines for this competition. We were not disappointed: we got a chance to get our hands dirty building out the pipelines, showcase our work to numerous experts in the field and receive their feedback, and win the “Most Promising Entry” award! We’d like to thank the ZenML team for organizing this competition and the judges for their invaluable feedback.
Below are some resources for learning more about the project: