Hanami 3.0: In Full Bloom
10 by PuercoPop | 0 comments on Hacker News.
Word News
World News - Find latest world news and headlines today based on politics, crime, entertainment, sports, lifestyle, technology and many
Wednesday, 1 July 2026
New top story on Hacker News: Show HN: Morph Reflexes – Multi-head classifiers for agent traces
Show HN: Morph Reflexes – Multi-head classifiers for agent traces
11 by bhaktatejas922 | 1 comments on Hacker News.
The most common failures for production agents are behavioral: looping, reasoning leakage, user frustration, and more. Using a frontier model like GPT or Sonnet to judge every turn is too expensive and too slow to run at scale. To solve this, we built Reflexes: semantic signals from agent traces, served fast and cheap over an API. It is built on custom kernels and a custom inference engine forked from vLLM. Under the hood, it is a small LLM architected around multi-head inference. Small models need to be trained for specific tasks, but running 50 separate small models on the same input for 50 tasks makes no sense.

How it works: We use a modern LLM with hybrid attention and remove the decode step. One shared backbone reads the trace once, then many heads classify different signals. This is similar in spirit to 2019-era BERT/HYDRA and other older multi-head techniques; we took the same high-level idea and did the hard work to make it work with a modern architecture and attention. Our inference engine reuses the KV cache across inputs and the compute across all reflexes, so 99% of prefill compute is reused from reflex to reflex, with less than 0.1% overhead for each additional reflex. We can run inference in under 30ms and serve the full request in under 90ms. Whether you run 4 reflexes or 100, the extra overhead is less than 2ms.

Why does optimizing this matter? If you're even a medium-sized startup, you're dealing with tens of thousands of agent runs and millions of turns. If you want to track things like user frustration rates over time, frontier LLM-as-judge does not scale. I built a similar stack at Tesla. When ML engineers needed to sample data across petabytes for signals like `is_camera_obfuscated=true`, along with 200 other things, they needed to 1) spin them up quickly and 2) run them efficiently at scale.

What it is not: A dashboard. 99% of dashboards go unused. Reflexes is 100% API first and made for devs who want to use it to trigger their own stuff. Vibetrain a custom reflex in our dashboard, and then let it self-improve in production: https://ift.tt/EDGbzvJ Docs: https://ift.tt/OAu70C5

I'd love feedback from people running agents in prod: what sorts of things do you wish you could track over time across 100% of turns but can't right now?

TLDR: semantic signals from agent traces, super fast, cheap via API
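The "one shared backbone, many heads" idea in the post can be illustrated with a toy sketch. This is not Morph's implementation: there is no transformer, KV cache, or custom kernel here, and `encode`, `Head`, `classify`, and `DIM` are hypothetical names made up for illustration. It only shows why adding a head is cheap once the expensive pass over the trace has been done a single time.

```python
import math
from dataclasses import dataclass

DIM = 8  # toy feature width (assumption); a real backbone is far larger


def encode(trace: str) -> list[float]:
    """Hypothetical stand-in for the shared backbone: one pass over the trace.

    Buckets tokens into a small normalized bag-of-words vector.
    """
    features = [0.0] * DIM
    tokens = trace.lower().split()
    for token in tokens:
        # Deterministic bucket: sum of the token's UTF-8 bytes modulo DIM.
        features[sum(token.encode()) % DIM] += 1.0
    total = max(len(tokens), 1)
    return [value / total for value in features]


@dataclass
class Head:
    """One lightweight classifier (a "reflex") over the shared features."""
    name: str
    weights: list[float]
    bias: float = 0.0

    def score(self, features: list[float]) -> float:
        # Linear layer followed by a sigmoid, giving a score in (0, 1).
        logit = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-logit))


def classify(trace: str, heads: list[Head]) -> dict[str, float]:
    """Run the backbone once, then apply every head to the same features."""
    features = encode(trace)  # the expensive step, done once per trace
    return {head.name: head.score(features) for head in heads}


if __name__ == "__main__":
    # Untrained placeholder weights, purely for demonstration.
    heads = [
        Head("looping", [0.5] * DIM),
        Head("user_frustration", [-0.5] * DIM, bias=0.1),
    ]
    print(classify("retry tool call failed retry tool call failed", heads))
```

In this sketch each extra head costs one small dot product, while `encode` runs once per trace regardless of how many heads are attached, which mirrors the shape of the overhead claim in the post.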
Tuesday, 30 June 2026
Monday, 29 June 2026
Sunday, 28 June 2026
New top story on Hacker News: I used Claude Code to get a second opinion on my MRI
I used Claude Code to get a second opinion on my MRI
106 by engmarketer | 144 comments on Hacker News.
Saturday, 27 June 2026
Friday, 26 June 2026
New top story on Hacker News: Ask HN: Is "no source code was copied" still a sufficient copyright defense?
Ask HN: Is "no source code was copied" still a sufficient copyright defense?
17 by oscgam1 | 13 comments on Hacker News.
We are all familiar with the Corgi event: https://ift.tt/HtAY4cl With the barrier to creating new apps having dropped significantly thanks to LLMs, I am seeing more cases about copyright and unfair competition. I've seen and participated in some of these cases. Usually expert witnesses are required. Curious to hear the community's stance on this one. "Now software developers are feeling what authors and artist felt". https://ift.tt/PhfCA9Q There are several claims along the lines of: copying UI is OK; your product is not differentiated enough. Here is a legal assessment of the situation: https://ift.tt/SFn8Hhp