Monday, 23 March 2026

Show HN: OpenCastor Agent Harness Evaluator Leaderboard https://bit.ly/4bGGUc3

Show HN: OpenCastor Agent Harness Evaluator Leaderboard I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does. So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.) using local LLM calls via Ollama. The search space is 263,424 configs (8 dimensions: model routing, context budget, retry logic, drift detection, etc.). The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards). The current champion config is free to download as a YAML and apply to any robot. P66 safety parameters are stripped on apply — no harness config can touch motor limits or ESTOP logic. Looking for feedback on: (1) whether the benchmark tasks are representative, (2) whether the hardware tier breakdown is useful, and (3) anyone who's run fleet-wide distributed evals of agent configs for robotics or otherwise. https://bit.ly/4c1pica March 23, 2026 at 11:13PM
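The "P66 safety parameters are stripped on apply" guarantee amounts to merging a downloaded config while ignoring protected keys. A minimal sketch in Python; the key names (`estop`, `motor_limits`, `p66`) are hypothetical placeholders, not OpenCastor's actual schema:

```python
# Hypothetical protected keys; OpenCastor's real schema may differ.
SAFETY_KEYS = {"motor_limits", "estop", "p66"}

def apply_harness_config(current: dict, downloaded: dict) -> dict:
    """Merge a downloaded harness config over the current one,
    silently dropping any safety-related keys it tries to set."""
    safe = {k: v for k, v in downloaded.items() if k not in SAFETY_KEYS}
    merged = dict(current)
    merged.update(safe)
    return merged

# A champion config that (accidentally or maliciously) tries to touch ESTOP:
champion = {"context_budget": 8192, "thinking_budget": 2048, "estop": "disabled"}
robot = {"context_budget": 4096, "estop": "enabled", "motor_limits": 100}

applied = apply_harness_config(robot, champion)
# Tuning parameters are applied; "estop" and "motor_limits" stay untouched.
```

The point of doing the strip on apply (rather than trusting the leaderboard) is that no downloaded YAML, however it was produced, can reach the safety layer.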

Show HN: Cq – Stack Overflow for AI coding agents https://bit.ly/47gYJgx

Show HN: Cq – Stack Overflow for AI coding agents Hi all, I'm Peter, a Staff Engineer at Mozilla.ai, and I want to share our idea for a standard for shared agent learning; conceptually it fit easily into my mental model as a Stack Overflow for agents. The project is trying to see if we can get agents (any agent, any model) to propose 'knowledge units' (KUs) in a standard schema, based on gotchas they run into during use, and to proactively query for existing KUs to get insights they can verify and confirm if they prove useful. It's currently very much a PoC with a more lofty proposal in the repo; we're trying to iterate from local use, up to team level, and ideally eventually to some kind of public commons. At the team level (see our Docker Compose example), you run a shared API and configure your coding agent to point at the team's API address so KUs are sent there instead, where they can be reviewed by a human in the loop (HITL) via a UI in the browser before they're allowed to appear in queries by other agents on your team. We're learning a lot even from using it locally on various repos internally, not just in the kind of KUs it generates, but also from a UX perspective on making it easy to get started and to approve KUs in the browser dashboard. There are bigger, complex problems to solve in the future around data privacy, governance, etc., but for now we're focused on building something that people can get real value from quickly in their day-to-day. 
Tech stack: * Skills - markdown * Local Python MCP server (FastMCP) - managing a local SQLite knowledge store * Optional team API (FastAPI, Docker) for sharing knowledge across an org * Installs as a Claude Code plugin or OpenCode MCP server * Local-first by default; your knowledge stays on your machine unless you opt into team sync by setting the address in config * OSS (Apache 2.0 licensed) Here's an example of something that seemed straightforward: when asked to write a GitHub Action, Claude Code often used actions that were multiple major versions out of date because of its training data. In this case I told the agent what I saw when I reviewed the GitHub Action YAML file it created, and it proposed a knowledge unit to be persisted. Next time, in a completely different repo using OpenCode and an OpenAI model, the cq skill was used up front before the task started; the agent got the information about the gotcha on major versions in training data, checked GitHub proactively, and used the correct, latest major versions. It then confirmed the KU, increasing its confidence score. I guess some folks might say: well, there's a CLAUDE.md in your repo, or in ~/.claude/. But we're looking further than that: we want this to be available to all agents and all models, and maybe more importantly we don't want to stuff AGENTS.md or CLAUDE.md with loads of rules that lead to unpredictable behaviour; this is targeted information on a particular task and seems a lot more useful. Right now it can be installed locally as a plugin for Claude Code and OpenCode: claude plugin marketplace add mozilla-ai/cq claude plugin install cq This allows you to capture data in your local ~/.cq/local.db (the data doesn't get sent anywhere else). We'd love feedback on this; the repo is open and public, so GitHub issues are welcome. 
We've posted on some of our social media platforms with a link to the blog post (below), so feel free to reply to us if you found it useful or ran into friction; we want to make this something that's accessible to everyone. Blog post with the full story: https://bit.ly/41ukHZX GitHub repo: https://bit.ly/4soBZ6I Thanks again for your time. https://bit.ly/41ukHZX March 23, 2026 at 05:11PM
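For readers curious what a knowledge unit might look like in practice, here is a minimal sketch of a KU record backed by a local SQLite store, matching the tech stack described above; the field names and confidence mechanics are assumptions for illustration, not Cq's actual schema:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class KnowledgeUnit:
    topic: str        # what the KU is about, used as the query key
    gotcha: str       # the pitfall the agent ran into
    resolution: str   # what to do instead
    confidence: float = 0.5  # raised each time another agent confirms the KU

def store(db: sqlite3.Connection, ku: KnowledgeUnit) -> None:
    db.execute("CREATE TABLE IF NOT EXISTS kus "
               "(topic TEXT, gotcha TEXT, resolution TEXT, confidence REAL)")
    db.execute("INSERT INTO kus VALUES (?, ?, ?, ?)",
               (ku.topic, ku.gotcha, ku.resolution, ku.confidence))

def query(db: sqlite3.Connection, topic: str) -> list[tuple]:
    """What an agent would call proactively before starting a task."""
    return db.execute("SELECT gotcha, resolution, confidence FROM kus "
                      "WHERE topic = ?", (topic,)).fetchall()

# The GitHub Actions example from the post, persisted and queried locally:
db = sqlite3.connect(":memory:")
store(db, KnowledgeUnit(
    "github-actions",
    "training data suggests outdated action versions",
    "check the latest major version on GitHub before pinning"))
results = query(db, "github-actions")
```

Confirming a KU after it proves useful would then be a single UPDATE bumping `confidence`, which is how a shared store can rank duplicate or competing KUs over time.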

Sunday, 22 March 2026

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/4srsrrA

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/47WxiJ2 March 23, 2026 at 02:48AM

Show HN: Quillium, Git for Writers https://bit.ly/4c0H92U

Show HN: Quillium, Git for Writers This is a tool that lets you easily manage different versions of ideas, helpful for writing essays. I've found myself wanting this every single time I go through the drafting process when writing, and I've been frustrated every time I find myself accidentally working on an old draft just because there was a paragraph I liked better. This solves it. I hope the community likes this as much as I enjoyed working on it! Note that it's currently a beta waitlist because there are some bugs with the undo/redo state management, so I want to dogfood it for a bit for reliability. It says April 2nd, but I may allow earlier beta testers. https://bit.ly/4bFReRH March 23, 2026 at 01:22AM

Show HN: Plot-Hole.com a daily movie puzzle I made https://bit.ly/47C1U2H

Show HN: Plot-Hole.com a daily movie puzzle I made https://bit.ly/4brdZd9 March 23, 2026 at 01:15AM

Show HN: Refrax – my Arc Browser replacement I made from scratch https://bit.ly/4ssbdKD

Show HN: Refrax – my Arc Browser replacement I made from scratch Open the same tab in two browser windows. In Chrome or Safari, you get two unconnected pages. In Arc, one window shows a placeholder. In Zen, it silently creates a duplicate. In Refrax, the browser I built, both windows show the same page updating live. The same web page, in as many windows as you want. This shouldn't be possible. WebKit's WKWebView can exist in exactly one view hierarchy at a time. With macOS 26, Apple added a SwiftUI API separating WebView from WebPage, so you can end up with multiple views referencing the same page. But if you try it, your app crashes. WebKit source code has a precondition with this comment: "We can't have multiple owning pages regardless, but we'll want to decide if it's an error, if we can handle it gracefully, and how deterministic it might even be..." So here's how I did it. CAPortalLayer is an undocumented private class that's been in macOS since 10.12. It mirrors a layer's composited output by referencing the same GPU memory, not copying it. Every scroll, animation, or repaint reflects instantly. This is what powers Liquid Glass effects, the iOS text selection magnifier, and ghost images during drag and drop. Apple uses portals for effects. I use them to put the same web page in two windows. Refrax keeps one real WKWebView per tab and displays a CAPortalLayer mirror everywhere else. When you click a different window, the coordinator moves the real view there and the old window gets a portal. You can't tell which is which. This sounds simple in theory, but making this actually work seamlessly took quite a lot of effort. Each macOS window has its own rendering context, and the context ID updates asynchronously, so creating a portal immediately captures a stale ID and renders nothing. The portal creation needs to be delayed, but delaying creates a visual gap. 
I capture a GPU snapshot using a private CoreGraphics function and place it behind the portal as a fallback. Another hard part is that none of it is documented. Portals are capricious and will crash the app if used incorrectly. I had to inspect the headers and then disassemble the binaries to work out exactly how they behave in order to build something robust. I had never worked on a browser before this; I'd only been a user. I started using Arc in 2022. I remember asking for an invite, learning the shortcuts, slowly getting used to it. I didn't like it at first, as it had too much Google Chrome in it for my taste, and I'd been using Safari at the time. But it grew on me, and by the time it was essentially abandoned and sold to Atlassian, I couldn't go back to Safari anymore. I tried everything: Zen, SigmaOS, Helium. None felt right, and I didn't want another Chromium fork. WebKit ships with the OS, but all you get is the rendering engine. Tabs, history, bookmarks, passwords, extensions, everything else has to be built separately. And so, being a very reasonable person, I decided to make my own Arc replacement from scratch. And I did. Refrax is built in Swift and Objective-C with no external dependencies. The app itself is less than 30 MB. I have 393 tabs open right now using 442 MB of RAM; 150 tabs in Safari was already over 1 GB. I've been using it daily for over a month, and so have some of my friends. The portal mirror is just one feature. The same approach, finding what Apple built for themselves and using it to create something they didn't think about, runs through the entire browser. You can tint your glass windows with adjustable blend modes and transparency. The sidebar in compact mode samples the page and matches the colors. And it has support for Firefox and Chrome extensions. The alpha is public. Download from the linked website, enter REFRAX-ALPHA-HACKERNEWS to activate. No account needed. 
Telemetry is crash reports and a daily active-user ping, nothing else. And if you find a bug – I built this alone, so I'll actually read your report. https://bit.ly/4bs6AdM March 22, 2026 at 11:52PM

Saturday, 21 March 2026

Show HN: An event loop for asyncio written in Rust https://bit.ly/4sBBVR2

Show HN: An event loop for asyncio written in Rust Honestly, nothing special about this implementation: just another event loop written in Rust, for educational purposes and joy. In tests it shows seamless migration from uvloop for my scraping framework https://bit.ly/4lL0CIq. With APIs (FastAPI) it shows only one advantage: better p99; uvloop is about 10-20% faster in the synthetic run. Currently I am working on the win branch to give it the Windows support that uvloop lacks. https://bit.ly/4v2jgQn March 21, 2026 at 11:12PM
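Migrating between loop implementations in asyncio is a one-line policy change at startup, which is what makes "seamless migration from uvloop" plausible. A minimal sketch; the Rust loop's module name below is hypothetical, with uvloop shown the same way for comparison:

```python
import asyncio

def install_loop(policy_factory=None):
    """Install an alternative event loop policy; keep the stdlib loop if None."""
    if policy_factory is not None:
        asyncio.set_event_loop_policy(policy_factory())

# Drop-in replacements are installed the same way, before asyncio.run:
#   import uvloop;   install_loop(uvloop.EventLoopPolicy)
#   import rustloop; install_loop(rustloop.EventLoopPolicy)  # hypothetical name

async def main():
    await asyncio.sleep(0)  # application coroutines run unchanged on any loop
    return "ok"

install_loop()  # stdlib loop here; swap the factory to migrate
result = asyncio.run(main())
```

Because the coroutine code never touches the loop directly, benchmarking different loops (as the post does for p99) only requires changing that one install line.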

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI https://bit.ly/3PlmMF2

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI I use points and miles for most of my travel. Every booking comes down to the same decision: use points or pay cash? To answer that, you need award availability across multiple programs, cash prices, your current balances, transfer partner ratios, and the math to compare them. I got tired of doing it manually across a dozen tabs. This toolkit teaches Claude Code and OpenCode how to do it. 7 skills (markdown files with API docs and curl examples) and 6 MCP servers (real-time tools the AI calls directly). It searches award flights across 25+ mileage programs (Seats.aero), compares cash prices (Google Flights, Skiplagged, Kiwi.com, Duffel), pulls your loyalty balances (AwardWallet), searches hotels (Trivago, LiteAPI, Airbnb, Booking.com), finds ferry routes across 33 countries, and looks up weird hidden gems near your destination (Atlas Obscura). Reference data is included: transfer partner ratios for Chase UR, Amex MR, Bilt, Capital One, and Citi TY. Point valuations sourced from TPG, Upgraded Points, OMAAT, and View From The Wing. Alliance membership, sweet spot redemptions, booking windows, hotel chain brand lookups. 5 of the 6 MCP servers need zero API keys. Clone, run setup.sh, start searching. Skills are, as usual, plain markdown. They work in OpenCode and Claude Code automatically (I added a tiny setup script), and they'll work in anything else that supports skills. PRs welcome! Help me expand the toolkit! :) https://bit.ly/47ObeAl March 21, 2026 at 10:25PM
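The core use-points-or-pay-cash decision is simple arithmetic once the inputs are gathered. A minimal sketch; the 1.5 cents-per-point baseline is an illustrative valuation, not a figure from the toolkit:

```python
def cents_per_point(cash_price: float, award_fees: float, points: int) -> float:
    """Redemption value of an award booking, in cents per point."""
    return (cash_price - award_fees) / points * 100

def use_points(cash_price: float, award_fees: float, points: int,
               baseline_cpp: float = 1.5) -> bool:
    """Book with points only if the redemption beats your baseline valuation
    for those points (1.5 cpp here is just an example threshold)."""
    return cents_per_point(cash_price, award_fees, points) >= baseline_cpp

# A $640 cash fare vs. 30,000 points + $11.20 in award fees:
cpp = cents_per_point(640.00, 11.20, 30_000)  # ≈ 2.1 cents/point → use points
```

The hard part the toolkit automates is not this formula but collecting its inputs (award availability, cash prices, balances, transfer ratios) from a dozen sources.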

Friday, 20 March 2026

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/4rJtaDi

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/47WxiJ2 March 21, 2026 at 02:25AM

Show HN: Rover – turn any web interface into an AI agent with one script tag https://bit.ly/4blbIAg

Show HN: Rover – turn any web interface into an AI agent with one script tag https://bit.ly/3NAOc9a March 21, 2026 at 01:58AM

Show HN: Vibefolio – a place to showcase your vibecoded projects https://bit.ly/47h4FGh

Show HN: Vibefolio – a place to showcase your vibecoded projects Over the last few months, more people have been shipping small apps, experiments, and side projects at a much higher pace. I'm one of them; I initially created a showcase page for myself to track them, but this week I decided to create something for others. Happy to read feedback on how to improve it further! https://bit.ly/47fd3pN March 20, 2026 at 09:53PM

Show HN: Cybertt – Cybersecurity Tabletop https://bit.ly/47x7hQH

Show HN: Cybertt – Cybersecurity Tabletop https://bit.ly/3PmIIzx March 20, 2026 at 10:29AM

Thursday, 19 March 2026

Show HN: Download entire/partial Substack to ePub for offline reading https://bit.ly/4uGIhQO

Show HN: Download entire/partial Substack to ePub for offline reading Hi HN, this is a small Python app with an optional web UI. It is intended to be run locally. It can be run with Docker (though cookie autodetection will not work there). It allows you to download a single Substack, either entirely or partially, and saves the output to an ePub file, which can be easily transferred to a Kindle or other reading device. This is admittedly a "vibe coded" app made with Claude Code and a few hours of iterating, but I've already found it very useful for myself. It supports both free and paywalled posts (if you are a paid subscriber to that creator). You can order the entries in the ePub by popularity, newest first, or oldest first, and also limit it to a specific number of entries if you don't want all of them. You can either provide your substack.sid cookie manually, or have it autodetected from most browsers/operating systems. https://bit.ly/4uwnXRY March 20, 2026 at 04:36AM
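A minimal sketch of the manual-cookie path: attaching a substack.sid value to requests against a publication's archive listing. The `/api/v1/archive` endpoint path is an assumption based on Substack's public post-listing API and may change; treat it as illustrative only:

```python
import urllib.request

def build_archive_request(publication: str, sid: str, offset: int = 0,
                          limit: int = 12) -> urllib.request.Request:
    """Request one page of a publication's post archive, authenticated with
    the substack.sid session cookie (which unlocks paywalled posts you have
    a paid subscription for)."""
    url = (f"https://{publication}.substack.com/api/v1/archive"
           f"?sort=new&offset={offset}&limit={limit}")
    req = urllib.request.Request(url)
    req.add_header("Cookie", f"substack.sid={sid}")
    return req

# Build (but don't send) a request for the newest posts of a publication:
req = build_archive_request("example", "YOUR_SID_VALUE")
```

Paging with `offset`/`limit` and downloading each post body, then packaging the HTML into ePub chapters, is essentially all the app's pipeline needs on top of this.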

Show HN: Screenwriting Software https://bit.ly/3Phmteo

Show HN: Screenwriting Software I’ve spent the last year getting back into film and testing a bunch of screenwriting software. After a while I realized I wanted something different, so I started building it myself. The core text engine is written in Rust/wasm-bindgen. https://bit.ly/47cYh2P March 20, 2026 at 03:07AM

Wednesday, 18 March 2026

Show HN: Browser grand strategy game for hundreds of players on huge maps https://bit.ly/41cC0i3

Show HN: Browser grand strategy game for hundreds of players on huge maps Hi HN, I've been building a browser-based multiplayer strategy game called Borderhold. Matches run on large maps designed for hundreds of players. Players expand territory, attack neighbors, and adapt as borders shift across the map. You can put buildings down, build ships, and launch nukes. The main thing I wanted to explore was scale: most strategy games have small matches, modest maps, or modest player counts, but here the maps are large and the game works well with hundreds of players. Matches are relatively short, so you can jump in and see a full game play out. Curious what people think. https://bit.ly/4uDPCAC Gameplay: https://youtu.be/nrJTZEP-Cw8 Discord: https://bit.ly/4uEbuvu https://bit.ly/4uDPCAC March 16, 2026 at 09:51AM

Show HN: Fitness MCP https://bit.ly/4sr8Jwo

Show HN: Fitness MCP There's no external MCP for your fitness (Garmin / Strava) data, so we built one. https://bit.ly/4uCviiR March 19, 2026 at 03:00AM

Show HN: ATO – a GUI to see and fix what your LLM agents configured https://bit.ly/476fStf

Show HN: ATO – a GUI to see and fix what your LLM agents configured https://bit.ly/476fSJL March 19, 2026 at 01:28AM

Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training https://bit.ly/4bGv6H0

Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training I replicated David Ng's RYS method ( https://bit.ly/4ll5ILb ) on consumer AMD GPUs (RX 7900 XT + RX 6950 XT) and found something I didn't expect. Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer. The results on standard benchmarks (lm-evaluation-harness, n=50): Devstral-24B, layers 12-14 duplicated once: - BBH Logical Deduction: 0.22 → 0.76 - GSM8K (strict): 0.48 → 0.64 - MBPP (code gen): 0.72 → 0.78 - Nothing degraded Qwen2.5-Coder-32B, layers 7-9 duplicated once: - Reasoning probe: 76% → 94% The weird part: different duplication patterns create different cognitive "modes" from the same weights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling (13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing. The circuit boundaries are sharp — shift by one layer and the effect disappears or inverts. Smaller models (24B) have tighter circuits (3 layers) than larger ones (Ng found 7 layers in 72B). Tools to find circuits in any GGUF model and apply arbitrary layer routing are in the repo. The whole thing — sweep, discovery, validation — took one evening. Happy to answer questions. https://bit.ly/4rEg2PM March 18, 2026 at 10:31PM
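The routing patterns described (a double pass over a contiguous block, and interleaved doubling like 13,13,14,14,15,15,16) are just index arithmetic over layer execution order. A minimal sketch of generating those orders, independent of the repo's actual GGUF tooling:

```python
def double_pass(n_layers: int, block: range) -> list[int]:
    """Execution order with `block` run twice in sequence,
    e.g. ... 11, 12, 13, 14, 12, 13, 14, 15 ..."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == block[-1]:
            order.extend(block)  # replay the whole block once more
    return order

def interleaved(n_layers: int, block: range) -> list[int]:
    """Each layer in `block` duplicated back-to-back,
    e.g. ... 13, 13, 14, 14, 15, 15, 16 ..."""
    return [i for i in range(n_layers) for _ in ((0, 1) if i in block else (0,))]

# The Devstral-24B result from the post: layers 12-14 duplicated once
# (40 layers assumed here for illustration).
order = double_pass(40, range(12, 15))
```

Since no weights change, applying such an order to a real model is purely a routing change: the same layer tensors are referenced at multiple positions in the forward pass.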

Tuesday, 17 March 2026

Show HN: Sonder – self-hosted AI social simulation engine https://bit.ly/4rE8hcG

Show HN: Sonder – self-hosted AI social simulation engine https://bit.ly/4bhXvEi March 18, 2026 at 01:21AM

Show HN: CodeLedger – deterministic context and guardrails for AI https://bit.ly/4saYs7c

Show HN: CodeLedger – deterministic context and guardrails for AI We've been working on a tool called CodeLedger to solve a problem we kept seeing with AI coding agents (Claude Code, Cursor, Codex). They're powerful, but on real codebases they: - read too much irrelevant code - edit outside the intended scope - get stuck in loops (fix → test → fail) - drift away from the task - introduce architectural issues that linters don't catch The root issue isn't the model; it's: - poor context selection - lack of execution guardrails - no visibility at team/org level --- What CodeLedger does: it sits between the developer and the agent and: 1) Gives the agent the right files first 2) Keeps the agent inside the task scope 3) Validates output against architecture + constraints It works deterministically (no embeddings, no cloud, fully local). --- Example: instead of an agent scanning 100–500 files, CodeLedger narrows it down to ~10–25 relevant files before the first edit. --- What we're seeing so far: - ~40% faster task completion - ~50% fewer iterations - significant reduction in token usage --- Works with: Claude Code, Cursor, Codex, Gemini CLI --- Repo + setup: https://bit.ly/4bxAhJd Quick start: npm install -g @codeledger/cli cd your-project codeledger init codeledger activate --task "Fix null handling in user service" --- Would love feedback from folks using AI coding tools on larger codebases. Especially curious: - where agents break down for you today - whether context selection or guardrails is the bigger issue - what other issues you are seeing. https://bit.ly/47F3l01 March 18, 2026 at 12:22AM
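For a sense of what deterministic, embedding-free context selection can look like, here is a minimal sketch that ranks repository files for a task by keyword overlap; CodeLedger's actual scoring is not described in the post, so this only illustrates the "no embeddings, fully local" narrowing idea:

```python
import re

def score(task: str, path: str, text: str) -> int:
    """Deterministic relevance score: keyword hits in file content,
    with matches in the file path weighted higher."""
    words = set(re.findall(r"[a-z]{3,}", task.lower()))
    hits = sum(text.lower().count(w) for w in words)
    hits += sum(3 for w in words if w in path.lower())
    return hits

def select_files(task: str, files: dict[str, str], k: int = 25) -> list[str]:
    """Rank files for a task and keep the top-k with a nonzero score,
    so the agent sees a small, relevant slice instead of the whole repo."""
    ranked = sorted(files, key=lambda p: score(task, p, files[p]), reverse=True)
    return [p for p in ranked[:k] if score(task, p, files[p]) > 0]

repo = {"user_service.py": "def get_user(id): return None  # null handling",
        "billing.py": "def charge(amount): ...",
        "readme.md": "project docs"}
picked = select_files("Fix null handling in user service", repo, k=2)
# → ["user_service.py"]
```

Because the ranking is pure string arithmetic, the same task against the same repo always yields the same file set, which is what makes the selection auditable.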