Top News

Gemini, developed collaboratively by Google teams, is a multimodal model, able to process and combine text, code, audio, images, and video. It is designed to be flexible and efficient, running on various platforms from data centers to mobile devices. Gemini comes in three versions: Ultra, Pro, and Nano, each optimized for different scales and complexities of tasks.

Gemini Ultra has surpassed human experts on MMLU (massive multitask language understanding) and outperformed previous models on various benchmarks, including image understanding and complex reasoning. It is also highly capable in coding: AlphaCode 2, a code-generation system built on Gemini, outperforms its predecessor in programming competitions.

Gemini's development utilized Google's Tensor Processing Units (TPUs), ensuring efficiency and scalability. With its multimodal capabilities, Gemini can analyze complex information across various formats, aiding in fields like science and finance.

Gemini is being integrated into various Google products like Bard, Pixel, Search, and others. It will be available to developers and enterprise customers through the Gemini API, with Gemini Ultra set to be released after extensive safety checks and refinements. 


Despite the excitement, the Gemini benchmarks appear to be heavily gamed. They seem to be carefully presented and to use different prompting techniques to show Gemini beating GPT-4. Since no one can actually use Gemini Ultra to confirm the results, I think it’s safe to assume this is fluff for shareholders, and that once the public gets its hands on the model it will be apparent that GPT-4 is still the better model, better aligned and with fewer hallucinations.

The process of having an AI narrate what a camera sees involves three main components:

  1. Vision Model: This model uses a computer camera to "see" and describe images. The author suggests two options:

    • LLaVA 13B: An open-source, cost-effective model that gives basic descriptions of images.

    • GPT-4-Vision: A more advanced and slightly more expensive model that provides detailed, nuanced descriptions.

  2. Language Model: This model writes the script for narration. Mistral 7B is used to generate a nature documentary-style script based on the image description provided by the vision model. Alternatively, GPT-4-Vision can be used to both describe the image and generate the script in one step.

  3. Text-to-Speech Model: This converts the written script into spoken audio. The author recommends ElevenLabs's voice cloning feature for high-quality output or XTTS-v2 as an open-source alternative. These tools can mimic specific voices, enhancing the narration's realism.
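The three components above might be chained as in the following minimal sketch, which assumes each model is wrapped in a plain Python callable. The names `NarrationPipeline`, `fake_vision`, `fake_script`, and `fake_tts` are hypothetical placeholders, not the author's code or any library's API. Real calls to LLaVA, Mistral 7B, GPT-4-Vision, ElevenLabs, or XTTS-v2 would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable so that any backend can be swapped in.
DescribeImage = Callable[[bytes], str]  # vision model: image bytes -> description
WriteScript = Callable[[str], str]      # language model: description -> script
Synthesize = Callable[[str], bytes]     # text-to-speech: script -> audio bytes


@dataclass
class NarrationPipeline:
    """Chains vision, language, and text-to-speech stages (illustrative only)."""

    describe_image: DescribeImage
    write_script: WriteScript
    synthesize: Synthesize

    def narrate(self, frame: bytes) -> bytes:
        description = self.describe_image(frame)  # step 1: "see" the frame
        script = self.write_script(description)   # step 2: write the narration
        return self.synthesize(script)            # step 3: speak it


# Hypothetical stand-ins for the real models; they only make the sketch runnable.
def fake_vision(frame: bytes) -> str:
    return f"a person holding a cup ({len(frame)} bytes of image data)"


def fake_script(description: str) -> str:
    return f"Here we observe {description}, a remarkable sight."


def fake_tts(script: str) -> bytes:
    # A real text-to-speech model would return encoded audio, not UTF-8 text.
    return script.encode("utf-8")


pipeline = NarrationPipeline(fake_vision, fake_script, fake_tts)
audio = pipeline.narrate(b"\x00" * 4)
```

For the one-step variant, where a single model both describes the image and writes the script, the same structure would still work: pass a vision callable that returns the finished script and use `lambda s: s` as `write_script`.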

The author demonstrates this process with a video in which an AI clone of Sir David Attenborough describes the author drinking from a cup, humorously framing it as part of a “mating display.” The video went viral, showcasing the potential of these AI tools.

It was a strange Thanksgiving for Sam Altman. Normally, the CEO of OpenAI flies home to St. Louis to visit family. But this time the holiday came after an existential struggle for control of a company that some believe holds the fate of humanity in its hands. Altman was weary. He went to his Napa Valley ranch for a hike, then returned to San Francisco to spend a few hours with one of the board members who had just fired and reinstated him in the span of five frantic days. He put his computer away for a few hours to cook vegetarian pasta, play loud music, and drink wine with his fiancé Oliver Mulherin. “This was a 10-out-of-10 crazy thing to live through,” Altman tells TIME on Nov. 30. “So I’m still just reeling from that.”

Other stuff

