Wednesday, 1 July 2026

New top story on Hacker News: Hanami 3.0: In Full Bloom

Hanami 3.0: In Full Bloom
10 by PuercoPop | 0 comments on Hacker News.


New top story on Hacker News: Show HN: Pglayers – PostgreSQL extensions as stackable Docker layers

Show HN: Pglayers – PostgreSQL extensions as stackable Docker layers
15 by iemejia | 2 comments on Hacker News.


New top story on Hacker News: Show HN: Morph Reflexes – Multi-head classifiers for agent traces

Show HN: Morph Reflexes – Multi-head classifiers for agent traces
11 by bhaktatejas922 | 1 comments on Hacker News.
The most common failures for production agents are behavioral: looping, reasoning leakage, user frustration, and more. Using a frontier model like GPT or Sonnet to judge every turn is too expensive and slow to run at scale. To solve this, we built Reflexes: semantic signals from agent traces, served fast and cheap over API. Built on custom kernels and a custom inference engine forked from vLLM. Under the hood, it is a small LLM architected around multi-head inference. Small models need to be trained for specific tasks, but running 50 separate small models on the same input for 50 tasks makes no sense. How it works: We use a modern LLM with hybrid attention and remove the decode step. We built an inference engine that lets prefill compute be 99% reused from reflex to reflex, similar in spirit to older 2019-era BERT/HYDRA and older multiple-head techniques. we built the inference engine to reuse the KV/cache across inputs and compute across all reflexes. One shared backbone reads the trace once, then many heads classify different signals. Our inference engine reuses the same KV/cache and compute across all reflexes, giving us sub-30ms inference with less than 0.1% overhead for each additional reflex. We took the same high-level idea and did the hard work to make it work with a modern architecture and attention. On it, we can run inference in under 30ms and serve the full request in under 90ms. If you run 4 reflexes or 100, the extra overhead is less than 2ms. Why does optimizing this matter? If you’re even a medium-sized startup, you’re dealing with tens of thousands of agent runs and millions of turns. If you want to track things like user frustration rates over time, frontier LLM-as-judge does not scale. I built a similar stack at Tesla. When ML engineers needed to sample data across petabytes for signals like `is_camera_obfuscated=true`, along with 200 other things, you need to 1) spin them up quickly 2) run at scale efficiently What it is not: A dashboard. 99% of dashboards go unused. 100% API first and made for devs who want to use this to trigger their own stuff. vibetrain a custom reflex in our dashboard, and/or then let it self improve in production: https://ift.tt/EDGbzvJ Docs: https://ift.tt/OAu70C5 I’d love feedback from people running agents in prod: what sorts of things do you wish you could track over time across 100% of turns but cant right now? TLDR: semantic signals from agent traces, super fast, cheap via API