ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI in the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists and prompt spellcasters on Twitter Spaces, as we discuss everything major and important that happened in the world of AI in the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news

  1. 4 DAYS AGO

    📆 ThursdAI - Jan 9th - NVIDIA's Tiny Supercomputer, Phi-4 is back, Kokoro TTS & Moondream gaze, ByteDance SOTA lip sync & more AI news

Hey everyone, Alex here 👋 This week's ThursdAI was a whirlwind of announcements, from Microsoft finally dropping Phi-4's official weights on Hugging Face (a month late, but who's counting?) to Sam Altman casually mentioning that OpenAI's got AGI in the bag and is now setting its sights on superintelligence. Oh, and NVIDIA? They're casually releasing a $3,000 supercomputer that can run 200B parameter models on your desktop. No big deal. We had some amazing guests this week too, with Oliver joining us to talk about a new foundation model in genomics and biosurveillance (yes, you read that right - think wastewater and pandemic monitoring!), and then, we've got some breaking news! Vik returned to the show with a brand new Moondream release that can do some pretty wild things. Ever wanted an AI to tell you where someone's looking in a photo? Now you can, thanks to a tiny model that runs on edge devices. 🤯 So buckle up, folks, because we've got a ton to cover. Let's dive into the juicy details of this week's AI madness, starting with open source. 03:10 TL;DR 03:10 Deep Dive into Open Source LLMs 10:58 MetaGene: A New Frontier in AI 20:21 PHI4: The Latest in Open Source AI 27:46 R Star Math: Revolutionizing Small LLMs 34:02 Big Companies and AI Innovations 42:25 NVIDIA's Groundbreaking Announcements 43:49 AI Hardware: Building and Comparing Systems 46:06 NVIDIA's New AI Models: LLAMA Neumatron 47:57 Breaking News: Moondream's Latest Release 50:19 Moondream's Journey and Capabilities 58:41 Weights & Biases: New Evals Course 01:08:29 NVIDIA's World Foundation Models 01:08:29 ByteDance's LatentSync: State-of-the-Art Lip Sync 01:12:54 Kokoro TTS: High-Quality Text-to-Speech As always, TL;DR section with links and show notes below 👇 Open Source AI & LLMs Phi-4: Microsoft's "Small" Model Finally Gets its Official Hugging Face Debut Finally, after a month, we're getting Phi-4 14B on Hugging Face. So far we've had bootlegged copies of it, but it's now officially uploaded by Microsoft. Not only is it official, it's also officially MIT licensed, which is great! So, what's the big deal? Well, besides the licensing, it's a 14B parameter, dense decoder-only Transformer with a 16K token context length, trained on a whopping 9.8 trillion tokens. It scored 80.4 on math and 80.6 on MMLU, making it about 10% better than its predecessor Phi-3, and better than Qwen 2.5's 79. What's interesting about Phi-4 is that the training data consisted of 40% synthetic data (almost half!). The vibes are always interesting with Phi models, so we'll keep an eye out. Also notable: the base models weren't released due to "safety issues", and this model was trained for single-turn use-cases rather than multi-turn chat applications. MetaGene-1: AI for Pandemic Monitoring and Pathogen Detection Now, this one's a bit different. We usually talk about LLMs in this section, but this is more about the "open source" than the "LLM." Prime Intellect, along with folks from USC, released MetaGene-1, a metagenomic foundation model. That's a mouthful, right? Thankfully, we had Oliver Liu, a PhD student at USC and an author on this paper, join us to explain. Oliver clarified that the goal is to use AI for "biosurveillance, pandemic monitoring, and pathogen detection." They trained a 7B parameter model on 1.5 trillion base pairs of DNA and RNA sequences from wastewater, creating a model surprisingly capable of zero-shot embedding.
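To make "zero-shot embedding" concrete, here's a minimal sketch of pulling sequence embeddings out of the released checkpoint with vanilla transformers. The repo id and the mean-pooling choice are my assumptions, not something from the show, so check the MetaGene-1 model card before relying on this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed repo id - verify on the MetaGene-1 Hugging Face page.
MODEL_ID = "metagene-ai/METAGENE-1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16
)  # 7B params - needs a GPU with roughly 16GB+ in bf16

# A toy metagenomic read (DNA base pairs as plain text)
read = "ACGTACGGATCGATTACA"
inputs = tokenizer(read, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Mean-pool token states into one vector usable for clustering / retrieval
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```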
Oliver pointed out that while using genomics to pretrain foundation models is not new, MetaGene-1 is, "in its current state, the largest model out there" and is "one of the few decoder only models that are being used". They have also collected 15T base pairs but trained on only 10% of them due to grant and compute constraints. I really liked this one, and though the science behind it is complex, I couldn't help but get excited about the potential of transformer models catching, or helping catch, the next COVID 👏 rStar-Math: Making Small LLMs Math Whizzes with Monte Carlo Tree Search Alright, this one blew my mind. A paper from Microsoft (yeah, them again) called "rStar-Math" basically found a way to make small LLMs do math better than o1 using Monte Carlo Tree Search (MCTS). I know, I know, it sounds wild. They took models like Phi-3-mini (a tiny 3.8B parameter model) and Qwen 2.5 3B and 7B, slapped some MCTS magic on top, and suddenly these models are acing the AIME 2024 competition math benchmark and scoring 90% on general math problems. For comparison, OpenAI's o1-preview scores 85.5% on math and o1-mini scores 90%. This is WILD, as just 5 months ago it was unimaginable that any LLM could solve math of this complexity; then reasoning models could, and now small LLMs with some MCTS can! Even crazier, they observed an "emergence of intrinsic self-reflection capability" in these models during problem-solving, something they weren't designed to do. LDJ chimed in saying "we're going to see more papers showing these things emerging and caught naturally." So, is 2025 the year of not just AI agents, but also emergent reasoning in LLMs? It's looking that way. The code isn't out yet (the GitHub link in the paper is currently a 404), but when it drops, you can bet we'll be all over it. In the meantime, there's a toy sketch of the core MCTS loop below.
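This is not the paper's code (which, as mentioned, 404s right now); it's just a minimal, self-contained sketch of the generic MCTS loop the paper builds on, with stub functions standing in for the small policy LLM that proposes reasoning steps and for the process reward model that scores them:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # the partial chain of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    """Upper Confidence Bound: balance exploiting good branches vs exploring new ones."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def propose_steps(state):
    """Stub for the policy LLM proposing candidate next reasoning steps."""
    return [state + [f"step{len(state)}.{i}"] for i in range(3)]

def score(state):
    """Stub for the process reward model / verifier scoring a partial solution."""
    return random.random()

def mcts(root_state, iterations=200, max_depth=4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                       # 1. selection: walk down by UCB
            node = max(node.children, key=ucb)
        if len(node.state) < max_depth:            # 2. expansion: add candidate steps
            node.children = [Node(s, parent=node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        reward = score(node.state)                 # 3. evaluation of the new node
        while node is not None:                    # 4. backpropagation up to the root
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state  # most-visited first step

print(mcts([]))
```

The real system swaps propose_steps for a small LLM, swaps score for a trained process reward model, and runs many more iterations; that's the whole trick that lets a 3.8B model punch far above its weight on math.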
Big Companies and LLMs OpenAI: From AGI to ASI Okay, let's talk about the elephant in the room: Sam Altman's blog post. While reflecting on getting fired from his job on like a casual Friday, he dropped this bombshell: "We are now confident that we know how to build AGI as we have traditionally understood it." And then, as if that wasn't enough, he added, "We're beginning to turn our aim beyond that to superintelligence in the true sense of the word." So basically, OpenAI is saying, "AGI? Done. Next up: ASI." This feels like a big shift in how openly folks at OpenAI are talking about superintelligence, and while AGI is yet to be properly defined (LDJ read out the original OpenAI definition on the live show, but the definition in Microsoft's contract with OpenAI is a system that generates $100B in profits), they are already talking about superintelligence, which surpasses all humans who ever lived, in all domains. NVIDIA @ CES - Home SuperComputers, 3 scaling laws, new Models There was a lot happening at CES, the largest consumer electronics show, but the AI focus was on NVIDIA, namely on Jensen Huang's keynote speech! He talked about a lot of stuff - really, it's a show, and a very interesting watch. NVIDIA is obviously at the forefront of this whole AI wave, and when Jensen tells you we're at the height of the 3rd scaling law, he knows what he's talking about (because he's fueling all of it with his GPUs) - the third one is of course test-time scaling or "reasoning", the thing that powers o1 and the coming-soon o3 model and other reasoners. Project Digits - supercomputer at home? Jensen also announced Project Digits: a compact AI supercomputer priced at a relatively modest $3,000. Under the hood, it wields a Grace Blackwell "GB10" superchip that supposedly offers 1 petaflop of AI compute and can support LLMs up to 200B parameters (or you can link 2 of them to run Llama 405B at home!). This thing seems crazy, but we don't know details like the power requirements for this beast! Nemotrons again? Also announced was a family of NVIDIA Llama Nemotron foundation models, but.. weirdly, we already have Nemotron Llamas (from 3 months ago), so those are... new ones? I didn't really understand what was announced here, as we didn't get new models, but the announcement was made nonetheless. We're due to get 3 new versions of Nemotron on the NVIDIA NeMo platform (and open), sometime soon. NVIDIA did release new open source models with COSMOS, a whole platform that includes pretrained world foundation models to help simulate world environments to train robots (among other things). They have released text2world and video2world pretrained diffusion and autoregressive models in 7B and 14B sizes, which generate videos to simulate visual worlds with strong alignment to physics. If you believe Elon when he says that humanoid robots are going to be the biggest category of products (every human will want 1 or 3, so we're looking at 20 billion of them), then COSMOS is a platform to generate synthetic data to train these robots to do things in the real world! This week's buzz - Weights & Biases corner The wait is over, our LLM Evals course is now LIVE, featuring speakers Graham Neubig (who we had on the pod before, back when OpenHands was still called OpenDevin) and Paige Bailey, and Anish and Ayush from my team at W&B! If you're building with LLMs in production and don't have a robust evaluation setup, or don't even know where to start with one, this course is definitely for you! Sign up today. You'll learn from examples of Imagen and Veo from Paige, agentic examples using Weave from Graham, and basic and advanced evaluation from Anish and Ayush. The workshop in Seattle next week filled up super quick, and since we didn't want to waitlist tons of folks, we've extended it to another night, so those of you who couldn't get in will have another opportunity on Tuesday! (Workshop page) While working on it, I came up with this distillation of what I'm going to deliver and wanted to share it with you. Vision & Video New Moondream 01-09 can tell where you look (among other things) (blog, HF) We had some breaking news on the show! Vik Korrapati, the creator of Moondream, joined us to announce a new version of his tiny vision language model. This new release has some incredible capabilities, including pointing, object detection, structured output (like JSON), and even gaze detection. Yes, you read that right: Moondream can now tell you where someone (or even a pet!) is looking in an image. Vik explained how they achieved this: "We took one of the training datasets that Gazelle trained on and added it to the Moondream fine tun
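(The show notes cut off mid-sentence above.) For a rough idea of how these new capabilities look in code, here's a hedged sketch using the moondream2 checkpoint on Hugging Face; the revision tag and the method names are assumptions based on the release announcement, not verified API, so check the model card before using:

```python
from PIL import Image
from transformers import AutoModelForCausalLM

# Revision tag assumed to match the 01-09 release - verify on the model card.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", revision="2025-01-09", trust_remote_code=True
)

image = Image.open("photo.jpg")

print(model.query(image, "What is the person doing?"))  # free-form VQA
print(model.detect(image, "face"))    # object detection: bounding boxes
print(model.point(image, "person"))   # pointing: normalized (x, y) centers
# The 01-09 release also adds gaze detection; the announcement suggests an
# API along the lines of model.detect_gaze(image) - verify before using.
```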

    1h 20m
  2. 2 JAN

📆 ThursdAI - Jan 2 - is '25 the year of AI agents?

Hey folks, Alex here 👋 Happy new year! On our first episode of this year, and the second quarter of this century, there wasn't a lot of AI news to report on (most AI labs were on a well deserved break). So this week, I'm very happy to present a special ThursdAI episode, an interview with João Moura, CEO of Crew.ai, all about AI agents! We first chatted with João a year ago, back in January of 2024, when CrewAI was blowing up while still just an open source project; it got to be the #1 trending project on GitHub and the #1 project on Product Hunt. (You can either listen to the podcast or watch it in the embedded Youtube above) 00:36 Introduction and New Year Greetings 02:23 Updates on Open Source and LLMs 03:25 Deep Dive: AI Agents and Reasoning 03:55 Quick TLDR and Recent Developments 04:04 Medical LLMs and Modern BERT 09:55 Enterprise AI and Crew AI Introduction 10:17 Interview with João Moura: Crew AI 25:43 Human-in-the-Loop and Agent Evaluation 33:17 Evaluating AI Agents and LLMs 44:48 Open Source Models and Fin to OpenAI 45:21 Performance of Claude's Sonnet 3.5 48:01 Different parts of an agent topology, brain, memory, tools, caching 53:48 Tool Use and Integrations 01:04:20 Removing LangChain from Crew 01:07:51 The Year of Agents and Reasoning 01:18:43 Addressing Concerns About AI 01:24:31 Future of AI and Agents 01:28:46 Conclusion and Farewell --- Is 2025 "the year of AI agents"? AI agents as a concept started for me a few months after I started ThursdAI, when AutoGPT exploded. It was such a novel idea at the time: run LLM requests in a loop. (In fact, back then I came up with a retry-with-AI concept and called it TrAI/Catch: upon an error, I would feed that error back into the GPT API and ask it to correct itself. It feels so long ago!) AutoGPT became the fastest ever GitHub project to reach 100K stars, and while exciting, it did not work. Since then we've seen multiple attempts at agentic frameworks, like babyAGI and AutoGen. CrewAI was one of them, and it keeps being a favorite among many folks. So, what is an AI agent? Simon Willison, friend of the pod, has a mission: to ask everyone who announces a new agent what they mean when they say it, because everyone seems to "share" a common understanding of AI agents, yet it's different for everyone. We'll start with João's explanation and go from there. But let's assume the basics: a set of LLM calls running in a self-correcting loop, with access to planning, external tools (via function calling) and a memory of sorts, making decisions (a minimal sketch of that loop follows below). Though, as we go into detail, you'll see that since the very basic "run LLM in the loop" days, the agents of 2025 have evolved and gained a lot of complexity.
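To ground that definition, here's a minimal sketch of the "LLM in a self-correcting loop with tools" pattern, using the OpenAI Python client's function calling. The toy get_weather tool, the prompt, and the 5-iteration cap are all made up for illustration; this is the bare-bones pattern, not CrewAI's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def get_weather(city: str) -> str:
    """Toy tool - a real agent would call an actual API here."""
    return f"Sunny and 5C in {city}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "Plan my afternoon in Vancouver; check the weather first."}]

for _ in range(5):  # hard cap so a confused agent can't loop forever
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # the most-used CrewAI model of 2024, per the interview
        messages=messages,
        tools=TOOLS,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:        # no tool requested -> the agent considers itself done
        print(msg.content)
        break
    messages.append(msg)          # keep the tool request in the running context
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)   # dispatch (only one tool in this sketch)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

Everything frameworks like CrewAI add, such as planning, memory tiers, delegation between agents and guardrails, layers on top of this basic loop.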
My takeaways from the conversation: I encourage you to listen / watch the whole interview; João is deeply knowledgeable about the field and we go into a lot of topics, but here are my main takeaways from our chat * Enterprises are adopting agents, starting with internal use-cases * Crews have 4 different kinds of memory: Long Term (across runs), Short Term (each run), Entity (company names, entities), and pre-existing knowledge (DNA?) * TIL about a "do all links respond with 200" guardrail * Some of the agent tools we mentioned * Stripe Agent API - for agent payments and access to payment data (blog) * Okta Auth for Gen AI - agent authentication and role management (blog) * E2B - code execution platform for agents (e2b.dev) * BrowserBase - programmatic web-browser for your AI agent * Exa - search grounding for agents for real time understanding * Crew has 13 crews that run 24/7 to automate their own company * Crews like Onboarding User Enrichment Crew, Meetings Prep, Taking Phone Calls, Generate Use Cases for Leads * GPT-4o mini was the most used model for CrewAI in 2024, with the main factors being speed / cost * The speed of AI development makes it hard to standardize and solidify common integrations * Reasoning models like o1 still haven't seen a lot of success, partly due to speed, partly due to the different way of prompting they require. This week's Buzz We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI (previously Open Devin). We've distilled a lot of what we learned about evaluating LLM applications while building Weave, our LLM Observability and Evaluation tooling, and are excited to share this with you all! Get on the list Also, 2 workshops (also about Evals) from us are upcoming, one in SF on Jan 11th and one in Seattle on Jan 13th (which I'm going to lead!) so if you're in those cities at those times, would love to see you! And that's it for this week; there wasn't a LOT of news as I said. The interesting thing is, even in the very short week, the news that we did get was all about agents and reasoning, so it looks like 2025 is agents and reasoning, agents and reasoning! See you all next week 🫡 TL;DR with links: * Open Source LLMs * HuatuoGPT-o1 - medical LLM designed for medical reasoning (HF, Paper, Github, Data) * Nomic - modernbert-embed-base - first embed model on top of ModernBERT (HF) * HuggingFace - SmolAgents lib to build agents (Blog) * SmallThinker-3B-Preview - a Qwen 2.5 3B "reasoning" finetune (HF) * Wolfram's new benchmarks including DeepSeek v3 (X) * Big CO LLMs + APIs * Newcomer Rubik's AI Sonus-1 family - Mini, Air, Pro and Reasoning (X, Chat) * Microsoft "estimated" GPT-4o-mini is ~8B (X) * Meta plans to bring AI profiles to their social networks (X) * This Week's Buzz * W&B Free Evals Course with Paige Bailey and Graham Neubig - Free Sign Up * SF evals event - January 11th * Seattle evals workshop - January 13th This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    1h 31m
  3. 27/12/2024

    📆 ThursdAI - Dec 26 - OpenAI o3 & o3 mini, DeepSeek v3 658B beating Claude, Qwen Visual Reasoning, Hume OCTAVE & more AI news

Hey everyone, Alex here 👋 I was hoping for a quiet holiday week, but whoa - while the last newsletter was only a week ago, what a looong week it has been. Just Friday after the last newsletter, it felt like OpenAI changed the world of AI once again with o3 and left everyone asking "was this AGI?" over the X-mas break (hope Santa brought you some great gifts!), and then, not to be outdone, DeepSeek open sourced basically a Claude Sonnet 3.5 level behemoth, DeepSeek v3, just this morning! Since the breaking news from DeepSeek took us by surprise, the show went a bit longer (3 hours today!) than expected, so as a bonus, I'm going to release a separate episode with a yearly recap + our predictions from last year and for next year in a few days (soon in your inbox!) TL;DR * Open Source LLMs * CogAgent-9B (Project, Github) * Qwen QvQ 72B - open weights visual reasoning (X, HF, Demo, Project) * GoodFire Ember - MechInterp API - GoldenGate LLama 70B * 🔥 DeepSeek v3 658B MoE - Open Source Claude level model at $6M (X, Paper, HF, Chat) * Big CO LLMs + APIs * 🔥 OpenAI reveals o3 and o3 mini (Blog, X) * X.ai raises ANOTHER 6B dollars - on their way to 200K H200s (X) * This week's Buzz * Two W&B workshops upcoming in January * SF - January 11 * Seattle - January 13 (workshop by yours truly!) * New Evals course with Paige Bailey and Graham Neubig - pre-sign up for free * Vision & Video * Kling 1.6 update (Tweet) * Voice & Audio * Hume OCTAVE - 3B speech-language model (X, Blog) * Tools * OpenRouter added Web Search Grounding to 300+ models (X) Open Source LLMs DeepSeek v3 658B - frontier level open weights model for ~$6M (X, Paper, HF, Chat) This was absolutely the top of the open source / open weights news for the past week, and honestly maybe for the past month. DeepSeek, the AI lab that grew out of a Chinese quant firm, has dropped a behemoth model: a 658B parameter MoE (37B active) that you'd need 8xH200 to even run, and that beats Llama 405B and GPT-4o on most benchmarks, and even Claude Sonnet 3.5 on several evals! The vibes seem to be very good with this one, and while it's not beating Claude all the way yet, it's nearly up there already. But the kicker is, they trained it with very restricted compute: per the paper, ~2K H800s (like the H100 but with less bandwidth) for 14.8T tokens (that's 15x cheaper than Llama 405B, for comparison). For evaluations, this model excels at coding and math, which is not surprising given how excellent DeepSeek Coder has been, but still, very very impressive! On the architecture front, the very interesting thing is that this feels like Mixture of Experts v2: a LOT of experts (256, with 8 routed experts plus 1 shared expert active per token), multi-token prediction, and a lot of optimization tricks outlined in the impressive paper (here's a great recap of the technical details; a toy sketch of the top-k routing pattern is below).
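To illustrate the routed-plus-shared-expert idea, here's a tiny, generic top-k MoE layer in PyTorch. This is a naive, loop-based sketch of the general pattern, not DeepSeek's actual router (which uses its own scoring and load-balancing scheme), and the dimensions are shrunk way down for readability:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k routed MoE with one always-on shared expert.
    DeepSeek v3 scales this pattern to 256 routed experts with 8 active per token."""

    def __init__(self, dim=64, n_experts=16, k=4):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()  # the "+1" expert every token goes through
        self.k = k

    def forward(self, x):  # x: (n_tokens, dim)
        weights, idx = torch.topk(self.router(x).softmax(-1), self.k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = self.shared(x)
        for t in range(x.size(0)):  # naive per-token dispatch (real impls batch this)
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[e](x[t])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

The point of the pattern: only k experts (plus the shared one) run per token, so a 658B-parameter model only activates ~37B parameters per forward pass.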
The highlight for me was that DeepSeek is distilling their recent R1 version into this model, which likely increases its performance on math and code, where it absolutely crushes (51.6 on Codeforces and 90.2 on MATH-500). The additional aspect of this is the API cost, and while they are going to raise the prices come February (they literally just swapped v2.5 for v3 in their APIs without telling a soul lol), the price/performance of this model is just absurd. Just a massive massive release from the WhaleBros; now I just need a quick 8xH200 to run this and I'm good 😅 Other open source news - Qwen QvQ, CogAgent-9B and GoldenGate Llama In other open source news this week, our friends from Qwen have released a very interesting preview called Qwen QvQ, a visual reasoning model. It uses the same reasoning techniques we got from them in QwQ 32B, but built on the excellent Qwen VL, to reason about images, and frankly, it's really fun to see it think about an image. You can try it here. We also got a new update to CogAgent-9B (page), an agent that claims to understand and control your computer, and to beat Claude 3.5 Sonnet Computer Use with just a 9B model! I haven't tried it just yet, but I'm excited to see such impressive numbers from open source VLMs driving your computer and doing tasks for you! A super quick word from ... Weights & Biases! We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI. We've distilled a lot of what we learned about evaluating LLM applications while building Weave, our LLM Observability and Evaluation tooling, and are excited to share this with you all! Get on the list Also, 2 workshops (also about Evals) from us are upcoming, one in SF on Jan 11th and one in Seattle on Jan 13th (which I'm going to lead!) so if you're in those cities at those times, would love to see you! Big Companies - APIs & LLMs OpenAI - introduces o3 and o3-mini - breaking the ARC-AGI challenge, GPQA, and teasing AGI? On the last day of the 12 days of OpenAI, we got the evals of their upcoming o3 reasoning model (and o3-mini) and whoah. I think I speak on behalf of most of my peers when I say we were all shaken by how fast the jump in capabilities happened from o1-preview and o1 full (released fully just two weeks prior, on day 1 of the 12 days). Almost all the evals shared with us are insane, from 96.7 on AIME (from 13.4 with GPT-4o earlier this year) to 87.7 on GPQA Diamond (which is... PhD level science questions). But two evals stand out the most, and one of course is the ARC-AGI eval/benchmark. It was designed to be very difficult for LLMs and easy for humans, and o3 solved it with an unprecedented 87.5% (on the high compute setting). This benchmark was long considered impossible for LLMs, and the absolute crushing of it over the past 6 months is something to behold. The other thing I want to highlight is the FrontierMath benchmark, which was released just two months ago by Epoch, collaborating with top mathematicians to create a set of very challenging math problems. At the time of release (Nov 12), the top LLMs solved only 2% of this benchmark. With o3 solving 25% of it just weeks later, it's quite incredible to see how fast these models are increasing in capabilities. Is this AGI? This release absolutely started (or restarted) a debate about what AGI is, given that these goalposts move all the time. Some folks are freaking out and saying that if you're a software engineer, you're "cooked" (o3 solved 71.7% of SWE-bench Verified and gets 2727 Elo on Codeforces, which is competitive programming - the 175th global rank among human coders!), and some have calculated its IQ and estimate it at 157 based on the above Codeforces rating. So the obvious question being asked (among the people who follow the news; most people who don't could not care less) is.. is this AGI?
Or is something else AGI? Well, today we got a very interesting answer to this question, from a leaked agreement between Microsoft and OpenAI that contains a very clear definition of AGI: "a system generating $100 billion in profits". A reminder: per their previous agreement, if OpenAI builds AGI, Microsoft will lose access to OpenAI's technologies. o3-mini and test-time compute as the new scaling law While I personally was as shaken as most of my peers by these incredible breakthroughs, I was also looking at the more practical and upcoming o3-mini release, which is supposed to come in January per Sam Altman. Per their evaluations, o3-mini is going to be significantly cheaper and faster than o3, while offering 3 levels of reasoning effort to developers (low, medium and high), and on the medium level it would beat the current best model (o1) while being cheaper than o1-mini. All of these updates and improvements in the span of less than 6 months are a testament to just how impressive test-time compute is as our additional new scaling law. Not to mention that the existing scaling laws still hold: we're waiting for Orion or GPT-4.5 or whatever it's called, and that underlying model will probably significantly improve the reasoning models built on top of it. Also, if the above results from DeepSeek are anything to go by (and they should be), the ability of these reasoning models to generate incredible synthetic training data for the next models is also quite something, so... the flywheel is upon us: models get better and make better models. Other AI news from this week: The most impressive other news came from Hume, showcasing OCTAVE - their new 3B speech-language model, which is able to not only clone someone's voice from 5 seconds of audio, but also take on their personality, style of speaking and mannerisms. This is not only a voice model, mind you, but a 3B LLM as well, so it can mimic a voice and even create new voices from a prompt. While they mentioned the size, the model has not been released yet and will be coming to their API soon, and when I asked about open source, it seems that Hume's CEO did not think it's safe to open up this kind of tech to the world yet. I also loved a new little X-mas experiment from OpenRouter and Exa, wherein on the actual OpenRouter interface, you can now chat with the 300+ models they serve and ground answers in search. This is it for this week, which, again, I thought was going to be a very chill one, and.. nope! The second part of the show/newsletter, in which we did a full recap of the last year, talked about our predictions from last year and made predictions for this next year, is going to drop in a few days 👀 So keep your eyes peeled. (I decided to separate the two, as a 3 hour podcast about AI is... long; I'm no Lex Fridman lol) As always, if you found any of this interesting, please share with a friend, and comment on social media, or right here on Substack - I love getting feedback on what works and what doesn't. Thank you for being part of the ThursdAI community 👋 ThursdAI - Recaps of the

    1h 36m
  4. 20/12/2024

🎄ThursdAI - Dec 19 - o1 vs Gemini reasoning, VEO vs SORA, and a holiday season full of AI surprises

For the full show notes and links visit https://sub.thursdai.news 🔗 Subscribe to our show on Spotify: https://thursdai.news/spotify 🔗 Apple: https://thursdai.news/apple Ho, ho, holy moly, folks! Alex here, coming to you live from a world where AI updates are dropping faster than Santa down a chimney! 🎅 It's been another absolutely BANANAS week in the AI world, and if you thought last week was wild and we were due for a break, buckle up, because this one's a freakin' rollercoaster! 🎢 In this episode of ThursdAI, we dive deep into the recent innovations from OpenAI, including their 1-800-ChatGPT phone service and new advancements in voice mode and API functionalities. We discuss the latest updates on o1 model capabilities, including reasoning effort settings, and highlight the introduction of WebRTC support by OpenAI. Additionally, we explore the groundbreaking Veo 2 model from Google, the generative physics engine Genesis, and new developments in open source models like Cohere's Command R7B. We also provide practical insights on using tools like Weights & Biases Weave for evaluating AI models, and share tips on leveraging the GitHub GG project. Tune in for a comprehensive overview of the latest in AI technology and innovation. 00:00 Introduction and OpenAI's 12 Days of Releases 00:48 Advanced Voice Mode and Public Reactions 01:57 Celebrating Tech Innovations 02:24 Exciting New Features in AVMs 03:08 TLDR - ThursdAI December 19 12:58 Voice and Audio Innovations 14:29 AI Art, Diffusion, and 3D 16:51 Breaking News: Google Gemini 2.0 23:10 Meta Apollo 7b Revisited 33:44 Google's Sora and Veo2 34:12 Introduction to Veo2 and Sora 34:59 First Impressions of Veo2 35:49 Comparing Veo2 and Sora 37:09 Sora's Unique Features 38:03 Google's MVP Approach 43:07 OpenAI's Latest Releases 44:48 Exploring OpenAI's 1-800 CHAT GPT 47:18 OpenAI's Fine-Tuning with DPO 48:15 OpenAI's Mini Dev Day Announcements 49:08 Evaluating OpenAI's O1 Model 54:39 Weights & Biases Evaluation Tool - Weave 01:03:52 ArcAGI and O1 Performance 01:06:47 Introduction and Technical Issues 01:06:51 Efforts on Desktop Apps 01:07:16 ChatGPT Desktop App Features 01:07:25 Working with Apps and Warp Integration 01:08:38 Programming with ChatGPT in IDEs 01:08:44 Discussion on Warp and Other Tools 01:10:37 GitHub GG Project 01:14:47 OpenAI Announcements and WebRTC 01:24:45 Modern BERT and Smaller Models 01:27:37 Genesis: Generative Physics Engine 01:33:12 Closing Remarks and Holiday Wishes TL;DR - Show notes and Links * Open Source LLMs * Meta Apollo 7B – LMM w/ SOTA video understanding (Page, HF) * Microsoft Phi-4 – 14B SLM (Blog, Paper) * Cohere Command R 7B – (Blog) * Falcon 3 – series of models (X, HF, web) * IBM updates Granite 3.1 + embedding models (HF, Embedding) * Big CO LLMs + APIs * OpenAI releases new o1 + API access (X) * Microsoft makes CoPilot Free!
(X) * Google - Gemini 2.0 Flash Thinking experimental reasoning model (X, Studio) * This week's Buzz * W&B Weave Playground now has Trials (and o1 compatibility) (try it) * Alex's evaluation of o1 and Gemini Thinking experimental (X, Colab, Dashboard) * Vision & Video * Google releases Veo 2 – SOTA text2video model - beating SORA by most vibes (X) * HunyuanVideo distilled with FastHunyuan down to 6 steps (HF) * Kling 1.6 (X) * Voice & Audio * OpenAI realtime audio improvements (docs) * 11labs new Flash 2.5 model – 75ms generation (X) * Nexa OmniAudio – 2.6B – multimodal local LLM (Blog) * Moonshine Web – real time speech recognition in the browser (X) * Sony MMAudio - open source video2audio model (Blog, Demo) * AI Art & Diffusion & 3D * Genesis – open source generative 3D physics engine (X, Site, Github) * Tools * CerebrasCoder – extremely fast app creation (Try It) * RepoPrompt to chat with o1 Pro – (download) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    1h 36m
  5. 13/12/2024

    📆 ThursdAI - Dec 12 - unprecedented AI week - SORA, Gemini 2.0 Flash, Apple Intelligence, LLama 3.3, NeurIPS Drama & more AI news

Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conference of the year, and let me tell you, this was one hell of a week to not be glued to the screen. After last week's banger, with OpenAI kicking off their 12 days of releases by releasing o1 full and pro mode during ThursdAI, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays? 🎅 A day after our show, on Friday, Google announced a new Gemini 1206 that became the #1 leading model on LMArena, and Meta released Llama 3.3; then on Saturday xAI released their new image model, code named Aurora. On a regular week, the above Fri-Sun news would be enough for a full 2 hour ThursdAI show on its own, but not this week. This week, that was barely a 15 minute segment 😅 because SO much happened starting Monday that we were barely able to catch our breath, so let's dive into it! As always, the TL;DR and full show notes at the end 👇 and this newsletter is sponsored by W&B Weave: if you're building with LLMs in production and want to switch to the new Gemini 2.0 today, how will you know if your app is not going to degrade? Weave is the best way! Give it a try for free. Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs Believe it or not, Google has absolutely taken the crown away from OpenAI this week with this incredible Gemini 2.0 release. All of us on the show were in agreement that this is a phenomenal release from Google for the 1 year anniversary of Gemini. Gemini 2.0 Flash is beating Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having a 1M token context window, and being fully multimodal! Multimodality on input and output This model was announced to be fully multimodal on inputs AND outputs, which means it can natively understand text, images, audio, video and documents, and output text, text + images, and audio (so it can speak!). Some of these capabilities are restricted to beta users for now, but we know they exist. If you remember Project Astra, this is what powers that project. In fact, we had Matt Wolfe join the show; he had early access to Project Astra and demoed it live on the show (see above), powered by Gemini 2.0 Flash. The most amazing thing is that this functionality, presented to us just 8 months ago at Google I/O in a premium booth experience, is now available to all, in Google AI Studio, for free! Really, you can try it out right now yourself at https://aistudio.google.com/live, but here's a demo of it helping me proofread this exact paragraph by watching the screen and talking me through it. Performance out of the box This model beating Sonnet 3.5 on SWE-bench Verified completely blew away the narrative on my timeline; nobody was ready for that. This is a Flash model that's outperforming o1 on code!? So having a Flash multimodal input/output model with 1M context, accessible with a real-time streaming option via APIs from release time, is honestly quite amazing to begin with. Not to mention that during the preview phase this is currently free, and if we consider the previous prices of Flash, this model is going to considerably undercut the market on the price/performance/speed matrix. You can see why this release is taking the crown this week. (A minimal API sketch follows below.)
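Here's a minimal sketch of calling the new model from Python with the google-generativeai client, with optional Weave tracing wrapped around it so you can compare outputs before switching your production app over. The model id and the project name are assumptions based on the announcement, so double-check them in AI Studio:

```python
import google.generativeai as genai
import weave  # W&B Weave, for tracing/evaluating the swap to Gemini 2.0

genai.configure(api_key="YOUR_API_KEY")  # key from https://aistudio.google.com
weave.init("gemini-2-flash-tryout")      # hypothetical project name

# Model id as exposed in AI Studio at launch (assumed) - verify before use.
model = genai.GenerativeModel("gemini-2.0-flash-exp")

@weave.op()  # every call gets logged, so you can diff against your old model
def ask(prompt: str) -> str:
    return model.generate_content(prompt).text

print(ask("In one sentence: why does a 1M token context window matter?"))
```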
👏 Agentic is coming with Project Mariner Another thing announced by Google is Project Mariner, their agentic approach: an agent in the form of a Chrome extension that completes web tasks, breaking SOTA on WebVoyager with an 83.5% score in a single-agent setup. We've seen agent attempts from Adept to Claude Computer Use to Runner H, but this SOTA-breaking result from Google seems very promising. Can't wait to give this a try. OpenAI gives us SORA, Vision and other stuff from the bag of goodies Ok, so now let's talk about the second winner of this week: OpenAI's amazing stream of innovations, which would have taken the crown, if not for, well... ☝️ SORA is finally here (for those who got in) OpenAI has FINALLY released SORA, their long promised text-to-video and image-to-video (and video-to-video) model (née world simulator), to general availability, including a new website - sora.com - and a completely amazing UI to go with it. SORA can generate videos of various quality from 480p up to 1080p and up to 20 seconds long, and they promised that those will generate fast, as what they released is actually SORA Turbo! (apparently SORA 2 is already in the works and will be even more amazing, more on this later) New accounts paused for now OpenAI seems to have severely underestimated how many people would want to generate the 50 videos per month allowed on the Plus account (the Pro account gets you 10x more for $200 + longer durations, whatever that means), and as of the time of writing these words on ThursdAI afternoon, I am still not able to create a sora.com account and try out SORA myself (as I was boarding a plane when they launched it). SORA's magical UI I've invited one of my favorite video creators, Blaine Brown, to the show; he does incredible video experiments that always go viral and had time to play with SORA, so he told us what he thinks both from a video perspective and from an interface perspective. Blaine had a great take: we all collectively got so much HYPE over the past 8 months of getting teased that many folks expected SORA to just be an incredible one-prompt text-to-video generator, and it's not really that. In fact, if you just send prompts, it's more like a slot machine (which is also confirmed by another friend of the pod, Bilawal). But the magic starts to come when the additional tools, like Blend, are taken into play. One example Blaine talked about is the Remix feature, where you can remix videos and adjust the remix strength (Strong, Mild). Another amazing insight Blaine shared is that SORA can be used to fuse two videos that were not even generated with SORA; SORA becomes a creative tool to combine them into one. And lastly, just like Midjourney (and StableDiffusion before it), SORA has a featured and recent wall of video generations that shows you videos and the prompts others used to create them, for inspiration and learning, so you can remix those videos and learn to prompt better + there are prompt extension tools that OpenAI has built in. One more thing.. this model thinks! I love this discovery and wanted to share it with you; the prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')" Advanced Voice mode now with Video!
I personally have been waiting for voice mode with video for such a long time, since that day in the spring when the first demo of advanced voice mode talked to an OpenAI employee called Rocky, in a very flirty voice that in no way resembled Scarlett Johansson, and told him to run a comb through his hair. Well, today OpenAI finally announced that they are rolling out this option soon to everyone, and in ChatGPT we're all going to have a camera button and be able to show ChatGPT what we're seeing via the camera or the screen of our phone, and have it have that context. If you're feeling a bit of deja-vu: yes, this is very similar to what Google just launched (for free, mind you) with Gemini 2.0 just yesterday, in AI Studio and via APIs as well. This is an incredible feature; it will not only see your webcam, it will also see your iOS screen, so you'd be able to reason about an email with it, or other things. I honestly can't wait to have it already! They also announced Santa mode, which is also super cool, though I don't quite know how to .. tell my kids about it? Do I… tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where does the lie end, exactly? And in one of his funniest jailbreaks (and maybe one of the toughest ones), Pliny the Liberator just posted a Santa jailbreak that will definitely make you giggle (and him get coal this X-mas). The other stuff (with 6 days to go) OpenAI has 12 days of releases, and the other amazing things we got obviously got overshadowed, but they are still cool: Canvas can now run code and work with custom GPTs, ChatGPT in Apple Intelligence is now widely supported with the public release of iOS 18.2, and they have announced fine-tuning with reinforcement learning, allowing you to fine-tune o1-mini to outperform o1 on specific tasks with a few examples. There are 6 more work days to go, and they promised to "end with a bang" so... we'll keep you updated! This week's Buzz - Guardrails Genie Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! This week I hosted Soumik Rakshit from the Weights & Biases AI Team (the team I'm also on, btw!). Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM powered applications, filtering out inputs or LLM responses that cross a certain criterion or boundary. Guardrail types include prompt injection attacks, PII leakage, jailbreaking attempts and toxic language, but can also cover a competitor mention, selling a product at $0, or a policy your company doesn't have. As part of developing the guardrails, Soumik also developed and open sourced an app to test prompts against them, "Guardrails Genie"; we're going to host it to allow folks to test their prompts against our guardrails, and we're developing it and the guardrails in the open, so please check out our GitHub. Apple iOS 18.2 Apple Intelligence + ChatGPT integration Apple Intelligence is finally here; you can download it if you have an iPhone 15 Pro or Pro Max, or any iPhone 16. If you have one of those phones, you will get the

    1h 39m
  6. 06/12/2024

    📆 ThursdAI - Dec 5 - OpenAI o1 & o1 pro, Tencent HY-Video, FishSpeech 1.5, Google GENIE2, Weave in GA & more AI news

Well well well, December is finally here; we're about to close out this year (and have just flown past the second anniversary of ChatGPT 🎂) and it seems that all of the AI labs want to give us X-mas presents to play with over the holidays! Look, I keep saying this, but the weeks are getting crazier and crazier. This week we got the cheapest and the most expensive AI offerings all at once (the cheapest from Amazon and the most expensive from OpenAI), 2 new open weights models that beat commercial offerings, a diffusion model that predicts the weather, and 2 world building models; oh, and 2 decentralized, fully open sourced LLMs were trained LIVE across the world and finished training. I said... crazy week! And for W&B, this week started with Weave finally launching in GA 🎉, which I personally was looking forward to (read more below)! TL;DR Highlights * OpenAI o1 & Pro Tier: o1 is out of preview, now smarter, faster, multimodal, and integrated into ChatGPT. For heavy usage, ChatGPT Pro ($200/month) offers unlimited calls and o1 Pro Mode for harder reasoning tasks. * Video & Audio Open Source Explosion: Tencent's HYVideo outperforms Runway and Luma, bringing high-quality video generation to open source. FishSpeech 1.5 challenges top TTS providers, making near-human voice available for free research. * Open Source Decentralization: Nous Research's DisTrO (15B) and Prime Intellect's INTELLECT-1 (10B) prove you can train giant LLMs across decentralized nodes globally. Performance is on par with centralized setups. * Google's Genie 2 & WorldLabs: Generating fully interactive 3D worlds from a single image, pushing boundaries in embodied AI and simulation. Google's GenCast also sets a new standard in weather prediction, beating supercomputers in accuracy and speed. * Amazon's Nova FMs: Cheap, scalable LLMs with huge context and global language coverage. Perfect for cost-conscious enterprise tasks, though not top on performance. * 🎉 Weave by W&B: Now in GA, it's your dashboard and tool suite for building, monitoring, and scaling GenAI apps. Get started with 1 line of code. OpenAI's 12 Days of Shipping: o1 & ChatGPT Pro The biggest splash this week came from OpenAI. They're kicking off "12 days of launches," and day 1 brought the long-awaited full version of o1. The main complaint about o1 for many people was how slow it was! Well, now it's not only smarter but significantly faster (60% faster than preview!), and officially multimodal: it can see images and text together. Better yet, OpenAI introduced a new ChatGPT Pro tier at $200/month. It offers unlimited usage of o1, advanced voice mode, and something called o1 pro mode — where o1 thinks even harder and longer about your hardest math, coding, or science problems. For power users—maybe data scientists, engineers, or hardcore coders—this might be a no-brainer. For others, 200 bucks might be steep, but hey, someone's gotta pay for those GPUs. Given that OpenAI recently confirmed there are now 300 million monthly active users on the platform, and many of my friends already upgraded, this is for sure going to boost the bottom line at OpenAI! Quoting Sam Altman from the stream, "This is for the power users who push the model to its limits every day." For those who complained o1 took forever just to say "hi," rejoice: trivial requests will now be answered quickly, while super-hard tasks get that legendary deep reasoning, including a new progress bar and a notification when a task is complete.
Friend of the pod Ray Fernando gave Pro a prompt that took 7 minutes to think through! I've tested the new o1 myself, and while I've gotten dangerously close to my 50 messages per week quota, I've gotten some incredible results already, and very fast as well. This ice-cubes question failed o1-preview and o1-mini (and took both of them significantly longer), while o1 got it in just 4 seconds. Open Source LLMs: Decentralization & Transparent Reasoning Nous Research DisTrO & DeMo Optimizer We've talked about decentralized training before, but the folks at Nous Research are making it a reality at scale. This week, Nous Research wrapped up the training of a new 15B-parameter LLM—codename "Psyche"—using a fully decentralized approach called "Nous DisTrO." Picture a massive AI model trained not in a single data center, but across GPU nodes scattered around the globe. According to Alex Volkov (host of ThursdAI), "This is crazy: they're literally training a 15B param model using GPUs from multiple companies and individuals, and it's working as well as centralized runs." The key to this success is "DeMo" (Decoupled Momentum Optimization), a paper co-authored by none other than Diederik Kingma (yes, the Kingma behind the Adam optimizer and VAEs). DeMo drastically reduces communication overhead while maintaining stability and speed. The training loss curve they've shown looks just as good as a normal centralized run, proving that decentralized training isn't just a pipe dream. The code and paper are open source, and soon we'll have the fully trained Psyche model. It's a huge step toward democratizing large-scale AI—no more waiting around for Big Tech to drop their weights. Instead, we can all chip in and train together. Prime Intellect INTELLECT-1 10B: Another Decentralized Triumph But wait, there's more! Prime Intellect also finished training their 10B model, INTELLECT-1, using a similar decentralized setup. INTELLECT-1 was trained with a custom framework that reduces inter-GPU communication by 400x. It's essentially a global team effort, with nodes from all over the world contributing compute cycles. The result? A model hitting performance similar to older Meta models like Llama 2—but fully decentralized. Ruliad DeepThought 8B: Reasoning You Can Actually See If that's not enough, we've got yet another open-source reasoning model: Ruliad's DeepThought 8B. This 8B parameter model (finetuned from LLaMA-3.1) comes from friends of the show FarEl, Alpin and Sentdex 👏 DeepThought attempts to match or exceed the performance of much larger models on reasoning tasks, and beating several 72B param models while being 8B itself is very impressive. Google is firing on all cylinders this week Google didn't stay quiet this week either, and while we all wait for the Gemini team to release the next Gemini after the myriad of very good experimental models recently, we got some amazing things this week. Google's PaliGemma 2 - finetunable SOTA VLM using Gemma PaliGemma 2, a new vision-language family of models (3B, 10B and 28B, at 224px, 448px and 896px resolutions), is a suite of base models that include image segmentation and detection capabilities and are great at OCR, which makes them very versatile for fine-tuning on specific tasks (a quick loading sketch is below). They claim to achieve SOTA on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation!
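As a rough illustration of trying one of these base checkpoints with transformers, here's a sketch; the repo id and the task-prefix prompt format are my assumptions from the release naming, so check the model card before using:

```python
import torch
from PIL import Image
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

# Assumed repo id for the 3B / 448px pretrained checkpoint - verify on HF.
MODEL_ID = "google/paligemma2-3b-pt-448"

processor = PaliGemmaProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)

image = Image.open("chest_xray.png")
# PaliGemma models are prompted with task prefixes such as "caption en",
# "ocr" or "detect <object>" - exact format per the model card.
inputs = processor(text="caption en", images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```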
Google GenCast: SOTA weather prediction with... diffusion!? More impressively, Google DeepMind released GenCast, a diffusion-based model that beats the state-of-the-art ENS system on 97% of weather predictions. Did we say weather predictions? Yup. Generative AI is now better at weather forecasting than dedicated physics-based deterministic algorithms running on supercomputers. GenCast can predict 15 days in advance in just 8 minutes on a single TPU v5, instead of hours on a monstrous cluster. This is mind-blowing. As Yam said on the show, "Predicting the world is crazy hard," and now diffusion models handle it with ease. W&B Weave: Observability, Evaluation and Guardrails now in GA Speaking of building and monitoring GenAI apps, we at Weights & Biases (the sponsor of ThursdAI) announced that Weave is now GA. Weave is a developer tool for evaluating, visualizing, and debugging LLM calls in production. If you're building GenAI apps—like a coding agent or a tool that processes thousands of user requests—Weave helps you track costs, latency, and quality systematically. We showcased two internal apps: Open UI (a website builder from a prompt) and Winston (an AI agent that checks emails, Slack, and more). Both rely on Weave to iterate, tune prompts, measure user feedback, and ensure stable performance. With o1 and other advanced models coming to APIs soon, tools like Weave will be crucial to keep those applications under control. If you follow this newsletter and develop with LLMs, now is a great time to give Weave a try. Open Source Audio & Video: Challenging Proprietary Models Tencent's HY Video: Beating Runway & Luma in Open Source Tencent came out swinging with their open-source model, HYVideo. It's a video model that generates incredibly realistic footage, camera cuts, and even audio—yep, Foley and lip-synced character speech. Just a single model doing text-to-video, image-to-video, puppeteering, and more. It even outperforms closed-source giants like Runway Gen 3 and Luma 1.6 on over 1,500 prompts. This is the kind of thing we dreamed about when we first heard of video diffusion models. Now it's here, open-sourced, ready for tinkering. "It's near SORA-level," as I mentioned, referencing OpenAI's yet-to-be-fully-released SORA model. The future of generative video just got more accessible, and competitors should be sweating right now. We may just get SORA as one of the 12 days of OpenAI releases! FishSpeech 1.5: Open Source TTS Rivaling the Big Guns Not just video—audio too. FishSpeech 1.5 is a multilingual, zero-shot voice cloning model that ranks #2 overall on TTS benchmarks, just behind 11Labs. This is a 500M-parameter model, trained on a million hours of audio, achieving near-human quality with fast inference, and open for research. This puts high-quality text-to-speech capabilities in the open-source community's hands. You can now run a top-tier TTS system locally, clone voices, and generate spe

    1h 32m
  7. 28/11/2024

    🦃 ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news

Hey y'all, Happy Thanksgiving to everyone who celebrates, and thank you for being a subscriber; I truly appreciate each and every one of you! We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! It's the first open weights reasoning model that you can run on your machine; it beats a 405B model and comes close to o1 on some metrics 🤯 We also chatted about a new hybrid approach from NVIDIA called Hymba 1.5B (Paper, HF) that beats Qwen 1.5B with 6-12x less training, and Allen AI releasing Olmo 2, which became the best fully open source LLM 👏 (Blog, HF, Demo); though they didn't release WandB logs this time, they did release data! I encourage you to watch today's show (or listen to the show, I don't judge); there's not going to be a long writeup like I usually do, as I want to go and enjoy the holiday too, but of course, the TL;DR and show notes are right here so you won't miss a beat if you want to use the break to explore and play around with a few things! ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. TL;DR and show notes * Qwen QwQ 32B preview - the first open weights reasoning model (X, Blog, HF, Try it) * Allen AI - Olmo 2 the best fully open language model (Blog, HF, Demo) * NVIDIA Hymba 1.5B - hybrid smol model beating Qwen, SmolLM w/ 6-12x less training (X, Paper, HF) * Big CO LLMs + APIs * Anthropic MCP - Model Context Protocol (X, Blog, Spec, Explainer) * Cursor, JetBrains now integrate with the ChatGPT macOS app (X) * xAI is going to be a gaming company?! (X) * H company shows Runner H - WebVoyager agent (X, Waitlist) * This week's Buzz * Interview w/ Thomas Capelle about Weave scorers and guardrails (Guide) * Vision & Video * OpenAI SORA API was "leaked" on HuggingFace (here) * Runway launches video Expand feature (X) * Rhymes Allegro-TI2V - updated image to video model (HF) * Voice & Audio * OuteTTS v0.2 - 500M smol TTS with voice cloning (Blog, HF) * AI Art & Diffusion & 3D * Runway launches an image model called Frames (X, Blog) * ComfyUI Desktop app was released 🎉 * Chat * 24 hours of AI hate on 🦋 (thread) * Tools * Cursor agent (X thread) * Google Generative Chess toy (Link) See you next week and Happy Thanksgiving 🦃 Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it. Full subtitles for convenience [00:00:00] Alex Volkov: let's get it going. [00:00:10] Alex Volkov: Welcome, welcome everyone to ThursdAI November 28th Thanksgiving special. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. You're on ThursdAI. We are live [00:00:30] on ThursdAI. Everywhere pretty much. [00:00:32] Alex Volkov: [00:00:32] Hosts and Guests Introduction [00:00:32] Alex Volkov: I'm joined here with two of my co hosts. [00:00:35] Alex Volkov: Wolfram, welcome. [00:00:36] Wolfram Ravenwolf: Hello everyone! Happy Thanksgiving! [00:00:38] Alex Volkov: Happy Thanksgiving, man. [00:00:39] Alex Volkov: And we have Junyang here. Junyang, welcome, man. [00:00:42] Junyang Lin: Yeah, hi everyone. Happy Thanksgiving. Great to be here. [00:00:46] Alex Volkov: You had a busy week. We're going to chat about what you had. I see Nisten joining us as well at some point. [00:00:51] Alex Volkov: Yam Peleg joining us as well. Hey, how, Hey Yam. Welcome.
Welcome, as well. Happy Thanksgiving. It looks like we're assembled, folks. We're across streams, across [00:01:00] countries, but we are. [00:01:01] Overview of Topics for the Episode [00:01:01] Alex Volkov: For November 28th, we have a bunch of stuff to talk about. Like really a big list of stuff to talk about. So why don't we just, we'll just dive in. We'll just dive in. So obviously I think the best and the most important. [00:01:13] DeepSeek and Qwen Open Source AI News [00:01:13] Alex Volkov: Open source kind of AI news to talk about this week is going to be, and I think I remember last week, Junyang, I asked you about this and you were like, you couldn't say anything, but I asked because last week, folks, if you remember, we talked about R1 from DeepSeek, a reasoning model from [00:01:30] DeepSeek, which really said, Oh, maybe it comes as, as open source and maybe it doesn't. [00:01:33] Alex Volkov: And I hinted about, and I asked, Junyang, what about some reasoning from you guys? And you couldn't say anything. So this week, I'm going to do a TLDR. So we're going to actually talk about the stuff that, you know, in depth a little bit later, but this week, obviously one of the biggest kind of open source, or sorry, open weights, news is coming from our friends at Qwen as well, as we always celebrate. [00:01:56] Alex Volkov: So one of the biggest things that we get as. [00:02:00] is, Qwen releases, I will actually have you tell me what's the pronunciation here, Junyang, what is, I say Q W Q or maybe quick, what is the pronunciation of this? [00:02:12] Junyang Lin: I mentioned it in the blog, it is just like the word quill. Yeah. yeah, because for the 'Qw' you can, like, work, and for the 'Q' you just, like, the 'U', so I just combine it together and create a new pronunciation called quill. [00:02:28] Junyang Lin: Yeah. [00:02:28] Alex Volkov: So we're saying it's Qwen [00:02:30] Quill 32B. Is that the right pronunciation to say this? [00:02:33] Junyang Lin: Yeah, it's okay. I would just call it quill. It is something funny because the characters look very funny. Oh, we have a subculture, for these things. Yeah. Just to express some, yeah. [00:02:46] Junyang Lin: our. feelings. [00:02:49] Alex Volkov: Amazing. Qwen Quill 32B, and it's typed, the name is typed QwQ-32B-Preview. This is the first open weights reasoning model. This [00:03:00] model is not only predicting tokens, it's actually doing reasoning behind this. What this means is, we're going to tell you what this means after we get to this. [00:03:07] Alex Volkov: So we're still in the, we're still in the TLDR area. We also had another drop from Allen Institute for AI. If you guys remember, last week we chatted with Nathan, our dear friend Nathan, from Allen Institute about Tülu 3, about their efforts for post training, and he gave us all the details about post training, so they released Tülu [00:03:30] 3, and this week they released Olmo 2. We also talked about Olmo with the friends from Allen Institute a couple of months ago, and now they released Olmo 2, which they claim is the best fully open sourced language model, from Allen Institute for AI. And so we're going to chat about Olmo a little bit as well. [00:03:46] Alex Volkov: And a last minute addition we have is NVIDIA Hymba, which is a hybrid small model from NVIDIA, a very tiny one, 1.5 billion parameters, a small model beating Qwen and beating SmolLM as well. This is in the area [00:04:00] of open source.
[00:04:01] Alex Volkov: Okay, in the big companies, LLMs and APIs, I want to run through a few things.
[00:04:06] Anthropic's MCP and ChatGPT macOS Integrations
[00:04:06] Alex Volkov: So first of all, Anthropic released something called MCP. It's something they call the Model Context Protocol. We're going to briefly run through this. It's a kind of release from them that's aimed at developers. It is a protocol that enables secure connections between a host application, like Claude Desktop, for example, and outside tools.
[00:04:24] Alex Volkov: There's also a bunch of new integrations for the ChatGPT macOS app. If you guys remember, a couple of weeks ago we actually caught this live. I refreshed my macOS app and, ta-da, there's a new thing. And we discovered this live. It was very fun. The macOS app for ChatGPT integrates with VS Code, et cetera. And so we tried to run this with Cursor. It didn't work. So now it works with Cursor.
[00:04:43] Alex Volkov: So the next thing we're going to look at, I don't know if it's worth mentioning, but you guys know xAI, the company that Elon Musk is raising another 6 billion for, that tries to compete with OpenAI. Did you guys hear that it's going to be a gaming company as well? I don't know if it's worth talking about, but we'll at least mention this. And the one thing that I wanted to chat about is H, the French company, H, that showed a runner that looks three times as fast and as good as the Claude computer use runner, and we're definitely going to show examples of this video live, because that looks just incredible.
[00:05:18] Alex Volkov: This out of nowhere company, with the biggest fundraise or the biggest seed round that Europe has ever seen, at least France has ever seen, just showed an agent that controls your computer that's tiny, ridiculously tiny, I think it's like three billion parameters, two billion parameters or something. And it runs way better than Claude computer use. Something definitely worth talking about. After which, in this week's Buzz, we're going to talk with Thomas Capelle from my team at Weights & Biases about LLM guardrails, that's gonna be fun. And in the vision and video category, we're gonna cover that OpenAI Sora, quote unquote, "leaked" this week.
[00:05:56] Alex Volkov: And this leak wasn't really a leak, but definitely we saw some stuff. And then there's also a new Expand feature that we saw in Runway. And we saw another video model from Rhymes called Allegro-TI2V, which is pretty cool. In voice and audio, if we get there, in voice and audio we saw OuteTTS version 0.2, which is a new TTS, a 500 million parameter small TTS you can run in your browser, and it sounds pretty decent.
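Since QwQ is the kind of model you can actually run at home, here's a minimal sketch of trying it with Hugging Face transformers. It assumes the published Qwen/QwQ-32B-Preview checkpoint and enough GPU memory for a 32B model (or a quantized variant), so treat it as a starting point rather than an official recipe.

```python
# Minimal sketch: running QwQ-32B-Preview locally with transformers.
# Assumes enough VRAM for a 32B model (or swap in a quantized build).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many r's are in the word strawberry?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models "think out loud", so leave plenty of room for new tokens.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

And since MCP came up above: conceptually, an MCP server is a small process that speaks JSON-RPC 2.0 (over stdio, in the local case) and exposes tools that a host app like Claude Desktop can list and call. Here's a toy illustration of that shape; it is deliberately simplified and is not the official MCP SDK (the handshake, capability negotiation, and error handling are all omitted).

```python
# Toy sketch of the MCP idea: a stdio process answering JSON-RPC requests
# like "tools/list" and "tools/call". NOT the official SDK; handshake,
# capability negotiation, and error handling are omitted for brevity.
import json
import sys
from datetime import datetime, timezone

def handle(request: dict) -> dict:
    if request["method"] == "tools/list":
        result = {"tools": [{"name": "get_time", "description": "Current UTC time"}]}
    elif request["method"] == "tools/call":
        # A real server would dispatch on params["name"] and validate arguments.
        now = datetime.now(timezone.utc).isoformat()
        result = {"content": [{"type": "text", "text": now}]}
    else:
        result = {}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}

for line in sys.stdin:
    sys.stdout.write(json.dumps(handle(json.loads(line))) + "\n")
    sys.stdout.flush()
```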

    1h 46m
  8. 22/11/2024

    📆 ThursdAI - Nov 21 - The fight for the LLM throne, OSS SOTA from AllenAI, Flux new tools, DeepSeek R1 reasoning & more AI news

    Hey folks, Alex here, and oof, what a 🔥🔥🔥 show we had today! I got to use my new breaking news button 3 times this show! And not only that, some of you may know that one of the absolute biggest pleasures as a host is to feature the folks who actually make the news on the show! And now that we're in video format, you actually get to see who they are! So this week I was honored to welcome back our friend and co-host Junyang Lin, a Dev Lead from the Alibaba Qwen team, who came back after launching the incredible Qwen Coder 2.5, and Qwen 2.5 Turbo with 1M context. We also had breaking news on the show that AI2 (Allen Institute for AI) has fully released SOTA Llama post-trained models, and I was very lucky to get the core contributor on the paper, Nathan Lambert, to join us live and tell us all about this amazing open source effort! You don't want to miss this conversation! Lastly, we chatted with the CEO of StackBlitz, Eric Simons, about the absolutely incredible lightning-in-a-bottle success of their latest bolt.new product, and how it opens a new category of code generation related tools. 00:00 Introduction and Welcome 00:58 Meet the Hosts and Guests 02:28 TLDR Overview 03:21 Tl;DR 04:10 Big Companies and APIs 07:47 Agent News and Announcements 08:05 Voice and Audio Updates 08:48 AR, Art, and Diffusion 11:02 Deep Dive into Mistral and Pixtral 29:28 Interview with Nathan Lambert from AI2 30:23 Live Reaction to Tulu 3 Release 30:50 Deep Dive into Tulu 3 Features 32:45 Open Source Commitment and Community Impact 33:13 Exploring the Released Artifacts 33:55 Detailed Breakdown of Datasets and Models 37:03 Motivation Behind Open Source 38:02 Q&A Session with the Community 38:52 Summarizing Key Insights and Future Directions 40:15 Discussion on Long Context Understanding 41:52 Closing Remarks and Acknowledgements 44:38 Transition to Big Companies and APIs 45:03 Weights & Biases: This Week's Buzz 01:02:50 Mistral's New Features and Upgrades 01:07:00 Introduction to DeepSeek and the Whale Giant 01:07:44 DeepSeek's Technological Achievements 01:08:02 Open Source Models and API Announcement 01:09:32 DeepSeek's Reasoning Capabilities 01:12:07 Scaling Laws and Future Predictions 01:14:13 Interview with Eric from Bolt 01:14:41 Breaking News: Gemini Experimental 01:17:26 Interview with Eric Simons - CEO @ Stackblitz 01:19:39 Live Demo of Bolt's Capabilities 01:36:17 Black Forest Labs AI Art Tools 01:40:45 Conclusion and Final Thoughts As always, the show notes and TL;DR with all the links I mentioned on the show and the full news roundup are below the main news recap 👇

Google & OpenAI fighting for the LMArena crown 👑

I wanted to open with this, as last week I reported that Gemini Exp 1114 had taken over #1 in the LMArena. In less than a week, we saw a new ChatGPT release, called GPT-4o-2024-11-20, reclaim the arena's #1 spot! Focusing specifically on creative writing, this new model, now deployed on chat.com and in the API, is definitely more creative according to many folks who've tried it, with OpenAI employees saying "expect qualitative improvements with more natural and engaging writing, thoroughness and readability", and indeed that's what my feed was reporting as well. I also wanted to mention that we've seen this happen once before: the last time Gemini took the #1 spot in the LMArena, it took less than a week for OpenAI to release and test a model that beat it. But not this time, this time Google came prepared with an answer!
Just as we were wrapping up the show (again, Logan apparently loves dropping things at the end of ThursdAI), we got breaking news that there is YET another experimental model from Google, called Gemini Exp 1121, and apparently, it reclaims the stolen #1 position that ChatGPT reclaimed from Gemini... yesterday! Or at least joins it at #1.

LMArena Fatigue?

Many folks in my DMs are getting a bit frustrated with these marketing tactics, not only with the fact that we're getting experimental models faster than we can test them, but also with the fact that, if you think about it, this was probably a calculated move by Google. Release a very powerful checkpoint, knowing that this will trigger a response from OpenAI, but don't release your most powerful one. OpenAI predictably releases their own "ready to go" checkpoint to show they are ahead, then folks at Google wait and release what they wanted to release in the first place. The other frustration point is the over-indexing of the major labs on the LMArena human metrics as the closest approximation for "best". For example, here's some analysis from Artificial Analysis showing that while the latest ChatGPT is indeed better at creative writing (and #1 in the Arena, where humans vote answers against each other), it's gotten actively worse at math and coding than the August version (which could be a result of being a distilled, much smaller version). In summary, maybe one arena is no longer all you need, but the competition at the top scores of the Arena has never been hotter.

DeepSeek R-1 preview - reasoning from the Chinese Whale

While the American labs fight for the LM titles, the real interesting news may be coming from the Chinese whale, DeepSeek, a company known for their incredibly cracked team, which resurfaced once again and showed us that they are indeed, well, super cracked. They have trained and released R-1 preview with Reinforcement Learning, a reasoning model that beats o1 at AIME and other benchmarks! We don't know many details yet, besides them confirming that this model is coming to the open source! But we do know that this model, unlike o1, shows the actual reasoning it uses to reach its answers (reminder: o1 hides its actual reasoning, and what we see is actually another model summarizing the reasoning). The other notable thing is, DeepSeek all but confirmed the claim that we have a new scaling law for test-time (inference-time) compute: like with o1, the more time (and tokens) you give a model to think, the better it gets at answering hard questions. Which is a very important confirmation, and a VERY exciting one if this is coming to the open source! Right now you can play around with R1 in their demo chat interface. (A toy sketch of the test-time compute idea follows right after this recap.)

In other Big Co and API news

Mistral becomes a Research/Product company, with a host of new additions to Le Chat, including Browse, PDF upload, Canvas and Flux 1.1 Pro integration (for free! I think this is the only place where you can get Flux Pro for free!). Qwen released a new 1M context window model in their API called Qwen 2.5 Turbo, making it not only the 2nd ever 1M+ model (after Gemini) to be available, but also reducing TTFT (time to first token) significantly and slashing costs. This is available via their APIs, with a demo here.
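To make that test-time compute claim concrete, here's a toy sketch of one classic way to spend extra inference compute: sample several reasoning chains and majority-vote the final answer (self-consistency). To be clear, this is my illustration of the general idea, not DeepSeek's or OpenAI's actual method, and `generate_answer` is a hypothetical stand-in for a real model call.

```python
# Toy sketch of the test-time compute idea: sample k answers and majority-vote
# (self-consistency). Illustrative only -- not DeepSeek's or OpenAI's recipe.
import random
from collections import Counter

def generate_answer(question: str) -> str:
    # Stand-in for a real (sampled, temperature > 0) model call. Here it's a
    # noisy oracle that's right 60% of the time, so voting can recover truth.
    return "42" if random.random() < 0.6 else random.choice(["41", "43"])

def self_consistency(question: str, k: int = 16) -> str:
    # More samples = more tokens = more test-time compute. On hard but
    # checkable questions, accuracy tends to climb as k grows.
    answers = [generate_answer(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 times 7?"))  # almost always "42" at k=16
```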
Open Source is catching up

AI2 open sources Tulu 3 - SOTA 8B, 70B Llama post-trained, FULLY open sourced (Blog, Demo, HF, Data, GitHub, Paper)

Allen AI folks have joined the show before, and this time we got Nathan Lambert, the core contributor on the Tulu paper, to join and talk to us about post training and how they made the best performing SOTA Llama 3.1 finetunes with careful data curation (which they also open sourced), preference optimization, and a new methodology they call RLVR (Reinforcement Learning with Verifiable Rewards). Simply put, RLVR modifies the RLHF approach by using a verification function instead of a reward model. This method is effective for tasks with verifiable answers, like math problems or specific instructions. It improves performance on certain benchmarks (e.g., GSM8K) while maintaining capabilities in other areas (see the toy sketch at the end of this recap). The most notable thing is just how MUCH is open source: again, like the last time we had AI2 folks on the show, the amount they release is staggering. In the show, Nathan had me pull up the paper and we went through the deluge of models, code and datasets they released, not to mention the 73 page paper full of methodology and techniques. Just absolute ❤️ to the AI2 team for this release!

🐝 This week's buzz - Weights & Biases corner

This week, I want to invite you to a live stream announcement that I am working on behind the scenes to produce, on December 2nd. You can register HERE (it's on LinkedIn, I know, I'll have the YT link next week, promise!) We have some very exciting news to announce, and I would really appreciate the ThursdAI crew showing up for that! It's like 5 minutes and I helped produce 🙂

Pixtral Large is making VLMs cool again

Mistral had quite the week this week, not only adding features to Le Chat, but also releasing Pixtral Large, their updated multimodal model, which they claim is state of the art on multiple benchmarks. It's really quite good, not to mention that it's also included, for free, as part of the Le Chat platform, so now when you upload documents or images to Le Chat, you get Pixtral Large. The backbone for this model is Mistral Large (not the new one they also released), and this makes this 124B model a really, really good image model, albeit a VERY chonky one that's hard to run locally. The thing I loved most about the Pixtral release is that they used the new image understanding to ask about Weights & Biases charts 😅 and Pixtral did a pretty good job! Some members of the community, though, reacted to the SOTA claims by Mistral in a very specific meme-y way: this meme has become a very standard one when labs tend not to include Qwen VL 72B or other Qwen models in the evaluation results, all while claiming SOTA. I decided to put these models to a head-to-head test myself, only to find out that, ironically, both models say the other one is better, while both hallucinate some numbers.

BFL is putting the ART in Artificial Intelligence with FLUX.1 Tools (blog)

With an absolutely bombastic breaking news release, the folks at BFL (Black Forest Labs) have released FLUX.1 Tools, which will allow AI artists to use these models in all kinds of creative, inspiring ways.
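For those who want the RLVR idea in code: here's a minimal sketch of a verifiable reward, assuming the description above. Instead of a learned reward model scoring completions, a deterministic check does. The function names are mine for illustration; this is not AI2's actual training code, and a real setup would plug this reward into the usual RL loop (e.g., PPO).

```python
# Minimal sketch of the RLVR idea: replace the learned reward model in RLHF
# with a verification function for tasks whose answers can be checked.
# Names are illustrative -- this is not AI2's actual training code.

def verify_math(completion: str, gold_answer: str) -> float:
    # Binary verifiable reward: 1.0 if the final stated answer matches the
    # gold answer, else 0.0 (think GSM8K-style problems).
    final = completion.split("The answer is")[-1].strip().rstrip(".")
    return 1.0 if final == gold_answer else 0.0

def rlvr_reward(prompt: str, completion: str, gold_answer: str) -> float:
    # In vanilla RLHF this would be reward_model(prompt, completion);
    # RLVR swaps in the deterministic check above. The policy is then
    # optimized against this reward with standard RL machinery.
    return verify_math(completion, gold_answer)

print(rlvr_reward("Q: 2 + 2 = ?", "Two plus two is four. The answer is 4.", "4"))  # 1.0
```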

    1h 45m
