This vision paper presents a new generation of multimodal streaming systems that embed Multimodal Large Language Models (MLLMs) as first-class operators, enabling real-time query processing across multiple modalities. While recent work has integrated MLLMs into databases for multimodal queries, streaming systems demand fundamentally different approaches because of their strict latency and throughput requirements. Novel optimizations at the logical, physical, and semantic query transformation levels reduce the load placed on the model, improving throughput while preserving accuracy. The prototype, Samsara, leverages these optimizations to improve performance by more than an order of magnitude.