In December, we launched Gemini 1.0, our first natively multimodal model, in three sizes: Ultra, Pro and Nano. Just a few months later we released 1.5 Pro, with improved performance and a breakthrough long context window of 1 million tokens.
Developers and enterprise customers have been putting 1.5 Pro to use in remarkable ways, and are finding its long context window, multimodal reasoning capabilities and strong overall performance incredibly useful.
We've heard from users that some applications need lower latency and a lower cost to serve. This inspired us to keep innovating, so today we're introducing Gemini 1.5 Flash: a model that's lighter-weight than 1.5 Pro and designed to be fast and efficient to serve at scale.
Both 1.5 Pro and 1.5 Flash are available in public preview with a 1 million token context window in Google AI Studio and Vertex AI. And 1.5 Pro is now also available with a 2 million token context window, via waitlist, to developers using the API and to Google Cloud customers.
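For developers trying the public preview, a minimal sketch of calling 1.5 Flash through the Gemini API's official Python SDK might look like the following; the API key placeholder, the prompt and the exact model identifier strings are assumptions to verify against the current API documentation.

    import google.generativeai as genai

    # Authenticate with an API key from Google AI Studio (placeholder value).
    genai.configure(api_key="YOUR_API_KEY")

    # "gemini-1.5-flash" is the assumed model identifier; swapping in
    # "gemini-1.5-pro" selects the higher-capability model instead.
    model = genai.GenerativeModel("gemini-1.5-flash")

    # The long context window makes it practical to pass very large inputs,
    # such as long documents, directly in the prompt.
    response = model.generate_content("Summarize the key points of this document: ...")
    print(response.text)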
Alongside these updates to the Gemini family, we're also announcing our next generation of open models, Gemma 2, and sharing progress on the future of AI assistants with Project Astra.
Updates to the Gemini family of models
The new 1.5 Flash, optimized for speed and efficiency
1.5 Flash is the newest addition to the Gemini model family and the fastest Gemini model served in the API. It's optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve and features our breakthrough long context window.
While it's a lighter-weight model than 1.5 Pro, it's highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size.
1.5 Flash excels at summarization, chat applications, image and video captioning, data extraction from long documents and tables, and more. This is because it was trained by 1.5 Pro through a process called "distillation," in which the most essential knowledge and skills from a larger model are transferred to a smaller, more efficient model.
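The specifics of how 1.5 Flash was distilled from 1.5 Pro haven't been published. As a rough, hypothetical illustration of the general technique, the classic knowledge-distillation objective trains a smaller "student" to match the softened output distribution of a larger "teacher"; the Python sketch below (including the temperature and alpha values) is illustrative only, not Google's training recipe.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soften both distributions with a temperature, then push the student
        # toward the teacher with KL divergence (scaled by T^2).
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2
        # Ordinary cross-entropy against the ground-truth labels.
        ce = F.cross_entropy(student_logits, labels)
        # Blend the "match the teacher" and "match the labels" objectives.
        return alpha * kd + (1.0 - alpha) * ce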
Significantly improving 1.5 Pro
Over the past few months, we've significantly improved 1.5 Pro, our best model for general performance across a wide range of tasks.
Beyond extending its context window to 2 million tokens, we've enhanced its code generation, logical reasoning and planning, multi-turn conversation, and audio and image understanding through data and algorithmic advances. We're seeing strong improvements on both public and internal benchmarks for each of these tasks.
1.5 Pro can now follow increasingly complex and nuanced instructions, including ones that specify product-level behavior involving role, format and style. We've improved control over the model's responses for specific use cases, like crafting the persona and response style of a chat agent or automating workflows through multiple function calls. And we've enabled users to steer model behavior by setting system instructions.
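As a minimal sketch of steering model behavior with a system instruction via the Gemini API's Python SDK (the persona text, prompt and model identifier below are illustrative assumptions, not part of this announcement):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # A system instruction fixes the role, format and style for every turn.
    model = genai.GenerativeModel(
        "gemini-1.5-pro",
        system_instruction=(
            "You are a concise travel concierge. Answer in two short sentences "
            "and end with one follow-up question."
        ),
    )

    chat = model.start_chat()
    reply = chat.send_message("What should I see in Lisbon in one day?")
    print(reply.text)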
We've added audio understanding to the Gemini API and Google AI Studio, so 1.5 Pro can now reason across both image and audio for videos uploaded in Google AI Studio. And we're now integrating 1.5 Pro into Google products, including Gemini Advanced and Workspace apps.
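Here is a rough sketch of how reasoning over a video's frames and audio track might look with the Gemini API's Python SDK and its File API; the file name is a hypothetical placeholder, and the upload-and-poll pattern should be checked against the current documentation.

    import time
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")

    # Upload a local video (hypothetical file name); processing makes both
    # the visual frames and the audio track available to the model.
    video = genai.upload_file("product_demo.mp4")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)

    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content(
        [video, "Summarize what is said and shown in this video."]
    )
    print(response.text)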
Gemini Nano understands multimodal inputs
Gemini Nano is expanding beyond text-only inputs to accept images as well. Starting with Pixel, applications using Gemini Nano with Multimodality will be able to understand the world the way people do: not just through text, but also through sight, sound and spoken language.
The next generation of open models
Today we're also sharing a series of updates to Gemma, our family of open models built from the same research and technology used to create the Gemini models.
We're announcing Gemma 2, our next generation of open models for responsible AI innovation. Gemma 2 has a new architecture designed for breakthrough performance and efficiency, and it will be available in new sizes.
The Gemma family is also expanding with PaliGemma, our first vision-language model, inspired by PaLI-3. And we've upgraded our Responsible Generative AI Toolkit with the LLM Comparator for evaluating the quality of model responses.
Making progress on universal AI agents
As part of Google DeepMind's mission to build AI responsibly to benefit humanity, we've always wanted to develop universal AI agents that can be helpful in everyday life. That's why today we're sharing our progress in building the future of AI assistants with Project Astra (advanced seeing and talking responsive agent).
To be truly useful, an agent needs to understand and respond to the complex and dynamic world just as people do, and to take in and remember what it sees and hears so it can understand context and take action. It also needs to be proactive, teachable and personal, so users can talk to it naturally and without lag or delay.
While we've made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge. Over the past few years, we've been working to improve how our models perceive, reason and converse so that the pace and quality of interaction feel more natural.