Thursday, 26 March 2026

Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam) https://bit.ly/4sAK3Bo

Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)
Hi HN. I'm Ken, a 20-year-old Stanford CS student. I built Sup AI.
I started working on this because no single AI model is right all the time, but their errors don’t strongly correlate. In other words, models often make unique mistakes relative to other models. So I run multiple models in parallel and synthesize the outputs by weighting segments based on confidence. Low entropy in the output token probability distributions correlates with accuracy. High entropy is often where hallucinations begin.
My dad Scott (AI Research Scientist at TRI) is my research partner on this. He sends me papers at all hours, we argue about whether they actually apply and what modifications make sense, and then I build and test things. The entropy-weighting approach came out of one of those conversations.
In our eval on Humanity's Last Exam, Sup scored 52.15%. The best individual model in the same evaluation run got 44.74%. The relative gap is statistically significant (p < 0.001).
Methodology, eval code, data, and raw results:
- https://sup.ai/research/hle-white-paper-jan-9-2026
- https://github.com/supaihq/hle
Limitations:
- We evaluated 1,369 of the 2,500 HLE questions (details in the above links)
- Not all APIs expose token logprobs; we use several methods to estimate confidence when they don't
We tried offering free access and it got abused so badly it nearly killed us. Right now the sustainable option is a $5 starter credit with card verification (no auto-charge). If you don't want to sign up, drop a prompt in the comments and I'll run it myself and post the result.
Try it at https://sup.ai . My dad Scott (@scottmu) is in the thread too. Would love blunt feedback, especially where this really works for you and where it falls short.
Here's a short demo video: https://www.youtube.com/watch?v=DRcns0rRhsg
https://sup.ai March 26, 2026 at 04:45PM
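To make the entropy idea above concrete, here is a minimal Python sketch. It is not Sup AI's actual pipeline (that is described in the linked white paper): it votes on whole answers rather than segments, and the function names, the 1 / (1 + entropy) weighting, and the toy probabilities are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over all token positions of one model's output."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def confidence_weighted_vote(candidates):
    """Pick the answer with the highest total confidence weight.

    candidates: list of (answer, distributions) pairs, one per model.
    The weight 1 / (1 + mean entropy) is an assumed, illustrative choice.
    """
    scores = defaultdict(float)
    for answer, distributions in candidates:
        scores[answer] += 1.0 / (1.0 + mean_entropy(distributions))
    return max(scores, key=scores.get)

# Toy data (hypothetical): two hesitant models say "A", one confident model says "B".
flat = [[0.25, 0.25, 0.25, 0.25]] * 3      # high entropy at every token
peaked = [[0.99, 0.005, 0.005]] * 3        # low entropy at every token
candidates = [("A", flat), ("A", flat), ("B", peaked)]
winner = confidence_weighted_vote(candidates)
```

With these toy numbers the single low-entropy answer outweighs the two high-entropy ones, which a plain majority vote would not do. When an API does not expose logprobs, the distributions would have to come from some other confidence estimate, as the post notes.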

Show HN: Veil – Dark mode PDFs without destroying images, runs in the browser https://bit.ly/4c6B3OC

Show HN: Veil – Dark mode PDFs without destroying images, runs in the browser
Hi HN! Here's a tool I just deployed that renders PDFs in dark mode without destroying the images. Internal and external links stay intact, and I decided to implement export since I'm not a fan of platform lock-in: you can view your dark PDF in your preferred reader, on any device.
It's a side project born from a personal need first and foremost. When I was reading in the factory the books that eventually helped me get out of it, I had the problem that many study materials and books contained images and charts that forced me, with the dark readers available at the time, to always keep the original file in multitasking since the images became, to put it mildly, strange. I hope it can help some of you who have this same need. I think it could be very useful for researchers, but only future adoption will tell.
With that premise, I'd like to share the choices that made all of this possible. To do so, I'll walk through the three layers that veil creates from the original PDF:
- Layer 1: CSS filter. I use invert(0.86) hue-rotate(180deg) on the main canvas. I use 0.86 instead of 1.0 because I found that full inversion produces a pure black and pure white that are too aggressive for prolonged reading. 0.86 yields a soft dark grey (around #242424, though it depends on the document's white) and a muted white (around #DBDBDB) for the text, which I found to be the most comfortable value for hours of reading.
- Layer 2: image protection. A second canvas is positioned on top of the first, this time with no filters. Through PDF.js's public API getOperatorList(), I walk the PDF's operator list and reconstruct the CTM stack, that is, the save, restore and transform operations the PDF uses to position every object on the page. When I encounter a paintImageXObject (opcode 85 in PDF.js v5), the current transformation matrix gives me the exact bounds of the image. At that point I copy those pixels from a clean render onto the overlay. I didn't fork PDF.js because it would have become a maintenance nightmare given the length of the codebase and the frequent updates. Images also receive OCR treatment: text contained in charts and images becomes selectable, just like any other text on the page. At this point we have the text inverted and the images intact. But what if the page is already dark? Maybe the chapter title pages are black with white text? The next layer takes care of that.
- Layer 3: already-dark page detection. After rendering, the background brightness is measured by sampling the edges and corners of the page (where you're most likely to find pure background, without text or images in the way). The BT.601 formula is used to calculate perceived brightness by weighting the three color channels as the human eye sees them: green at 58.7%, red at 29.9%, blue at 11.4%. These weights reflect biology: the eye evolved in natural environments where distinguishing shades of green (vegetation, predators in the grass) was a matter of survival, while blue (sky, water) was less critical. If the average luminance falls below 40%, the page is flagged as already dark and the inversion is skipped, returning the original page. Presentation slides with dark backgrounds stay exactly as they are, instead of being inverted into something blinding.
Scanned documents are detected automatically and receive OCR via Tesseract.js, making text selectable and copyable even on PDFs that are essentially images.
Everything runs locally, no framework was used, just vanilla JS, which is why it's an installable PWA that works offline too.
Here's the link to the app along with the repository: https://bit.ly/40Z98Kh | https://bit.ly/4uVGXth
I hope veil can make your reading more pleasant. I'm open to any feedback. Thanks everyone
https://bit.ly/40Z98Kh March 26, 2026 at 12:47PM
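The numbers in Layer 1 and Layer 3 can be checked with a few lines of arithmetic. The Python sketch below is only an illustration, not Veil's code (which is vanilla JS): the function names are hypothetical, the invert mapping follows the CSS filter definition (0 maps to the amount, 1 maps to 1 minus the amount), and hue-rotate is ignored because it leaves neutral greys essentially unchanged.

```python
def css_invert(channel, amount=0.86):
    """Apply CSS invert(amount) to one 0-255 channel value."""
    c = channel / 255.0
    return round(255 * (amount + c * (1 - 2 * amount)))

def bt601_luma(r, g, b):
    """Perceived brightness in [0, 1] using the BT.601 weights."""
    return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0

def is_already_dark(samples, threshold=0.40):
    """samples: (r, g, b) tuples taken from the page's edges and corners."""
    mean = sum(bt601_luma(*px) for px in samples) / len(samples)
    return mean < threshold

page_background = css_invert(255)   # white paper becomes 36, i.e. 0x24
page_text = css_invert(0)           # black ink becomes 219, i.e. 0xDB

white_page = [(255, 255, 255)] * 8  # hypothetical edge samples
dark_slide = [(20, 20, 20)] * 8
skip_white = is_already_dark(white_page)   # False, so the page gets inverted
skip_slide = is_already_dark(dark_slide)   # True, so the page is left alone
```

This reproduces the #242424 and #DBDBDB values quoted above for a pure white page with pure black text; documents with an off-white background will land on slightly different greys.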

Wednesday, 25 March 2026

Show HN: Optio – Orchestrate AI coding agents in K8s to go from ticket to PR https://bit.ly/4bxWNTl

Show HN: Optio – Orchestrate AI coding agents in K8s to go from ticket to PR
I think like many of you, I've been jumping between many Claude Code/Codex sessions at a time, managing multiple lines of work and worktrees in multiple repos. I wanted a way to easily manage multiple lines of work and reduce the amount of input I need to give, allowing the agents to remove me as a bottleneck from as much of the process as I can. So I built an orchestration tool for AI coding agents:
Optio is an open-source orchestration system that turns tickets into merged pull requests using AI coding agents. You point it at your repos, and it handles the full lifecycle:
- Intake: pull tasks from GitHub Issues, Linear, or create them manually
- Execution: spin up isolated K8s pods per repo, run Claude Code or Codex in git worktrees
- PR monitoring: watch CI checks, review status, and merge readiness every 30s
- Self-healing: auto-resume the agent on CI failures, merge conflicts, or reviewer change requests
- Completion: squash-merge the PR and close the linked issue
The key idea is the feedback loop. Optio doesn't just run an agent and walk away: when CI breaks, it feeds the failure back to the agent. When a reviewer requests changes, the comments become the agent's next prompt. It keeps going until the PR merges or you tell it to stop.
Built with Fastify, Next.js, BullMQ, and Drizzle on Postgres. Ships with a Helm chart for production deployment.
https://bit.ly/3PyeSYX March 25, 2026 at 06:10PM
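The feedback loop described above boils down to a decision made on each poll of a pull request. Here is a minimal Python sketch of that decision, assuming a hypothetical PrStatus snapshot and next_action function; Optio itself is built on a different stack and its real data model is not shown in the post.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PrStatus:
    """Snapshot of one pull request as a poller might collect it (hypothetical shape)."""
    ci_failed: bool = False
    ci_log: str = ""
    has_conflicts: bool = False
    review_comments: Tuple[str, ...] = ()
    ci_passed: bool = False
    approved: bool = False

def next_action(status: PrStatus) -> Tuple[str, Optional[str]]:
    """Return (action, prompt): resume the agent with a prompt, merge, or wait."""
    if status.ci_failed:
        return "resume", "CI failed. Fix the failure below:\n" + status.ci_log
    if status.has_conflicts:
        return "resume", "Resolve the merge conflicts with the base branch."
    if status.review_comments:
        comments = "\n".join("- " + c for c in status.review_comments)
        return "resume", "Address these review comments:\n" + comments
    if status.ci_passed and status.approved:
        return "merge", None
    return "wait", None

# Example: a reviewer asked for a change, so the comment becomes the next prompt.
action, prompt = next_action(PrStatus(ci_passed=True, review_comments=("rename foo",)))
```

A scheduler would call something like this for every open PR on each 30-second poll and stop once the action is "merge" or the user cancels.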

Tuesday, 24 March 2026

Show HN: Plasmite – a lightweight IPC system that's fun https://bit.ly/4lMD6um

Show HN: Plasmite – a lightweight IPC system that's fun
At Oblong Industries one of the basic building blocks of everything we built was a homegrown C-based IPC system called Plasma. The message channel was an mmap'd file used as a ring buffer. All messages were human-readable, performance was good, configuration was trivial.
What was especially useful (and unusual in IPC systems, it seems) was the property that message channels outlive all readers and writers, and even survive reboots, because they're just files. For local IPC you don't need a broker or server process.
All the engineers who ever worked at Oblong loved Plasma, so I've recreated and updated it as Plasmite. It's written in Rust and the message format is JSON, but it's fast because it's based on lite3 ( https://bit.ly/47gEPlW ), a really cool project you should also check out.
There are bindings for Python, Go, Node, and C, but you can also get a lot done with just the CLI tools. The basic commands are:
- "feed" (to write)
- "follow" (to tail)
- "fetch" (to read one)
- "duplex" (to have a 2-way session)
I think duplex could be great for agent-agent communication, but I haven't tried this much yet. If you do, let me know!
https://bit.ly/4syvmPq March 25, 2026 at 01:10AM
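To illustrate why a channel that is "just a file" outlives its readers and writers, here is a toy mmap-backed ring buffer in Python. It is a sketch of the general idea only: the FileRing class, the fixed-size slot layout, and the header format are assumptions for illustration and do not reflect Plasmite's or Plasma's actual on-disk format, and there is no locking for concurrent writers.

```python
import json
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<Q")   # next sequence number, stored at offset 0
LENGTH = struct.Struct("<I")   # payload length prefix inside each slot
SLOT_SIZE = 256
SLOT_COUNT = 8

class FileRing:
    """Fixed-size ring of JSON messages stored in a memory-mapped file."""

    def __init__(self, path):
        size = HEADER.size + SLOT_SIZE * SLOT_COUNT
        fresh = not os.path.exists(path)
        self._file = open(path, "w+b" if fresh else "r+b")
        if fresh:
            self._file.truncate(size)   # zero-filled, so the sequence starts at 0
        self._map = mmap.mmap(self._file.fileno(), size)

    def _head(self):
        return HEADER.unpack_from(self._map, 0)[0]

    def feed(self, message):
        """Write one message, overwriting the oldest slot once the ring is full."""
        payload = json.dumps(message).encode("utf-8")
        if len(payload) > SLOT_SIZE - LENGTH.size:
            raise ValueError("message too large for one slot")
        seq = self._head()
        offset = HEADER.size + (seq % SLOT_COUNT) * SLOT_SIZE
        LENGTH.pack_into(self._map, offset, len(payload))
        start = offset + LENGTH.size
        self._map[start:start + len(payload)] = payload
        HEADER.pack_into(self._map, 0, seq + 1)
        return seq

    def fetch(self, seq):
        """Read the message with this sequence number if it is still in the ring."""
        head = self._head()
        if not (max(0, head - SLOT_COUNT) <= seq < head):
            raise IndexError("sequence number not in the ring")
        offset = HEADER.size + (seq % SLOT_COUNT) * SLOT_SIZE
        (length,) = LENGTH.unpack_from(self._map, offset)
        start = offset + LENGTH.size
        return json.loads(self._map[start:start + length].decode("utf-8"))

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()

# The writer goes away, and a later reader still finds the message in the file.
path = os.path.join(tempfile.mkdtemp(), "demo.ring")
writer = FileRing(path)
writer.feed({"hello": "world"})
writer.close()
reader = FileRing(path)
message = reader.fetch(0)
reader.close()
```

No broker is involved at any point: both sides only need the file path, which is the property the post highlights.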

Show HN: Lexplain – AI-powered Linux kernel change explanations https://bit.ly/4s3xspy

Show HN: Lexplain – AI-powered Linux kernel change explanations To understand what changed between kernel versions, you have to dig through the git repository yourself. Commit messages rarely tell you the real-world impact on your systems — you need to analyze the actual diffs with knowledge of kernel internals. For engineers who use Linux — directly or indirectly — but aren't kernel developers, that barrier is pretty high. I kept finding out about relevant changes only after an issue had already hit, and it was most frustrating when the version was too new to find similar cases online. I built lexplain with the idea that it would be nice to quickly scan through kernel changes the way you'd skim the morning news. It reads diffs, analyzes the code, and generates two types of documents: - Commit analyses: context, code breakdown, behavioral impact, risks, references - Release notes: per-version highlights, functional classification, subsystem breakdown, impact analysis Documents build on each other — individual commits first, then merge commits using child analyses, then release notes using all analyses for that version. Claims based on inference are explicitly labeled. Work in progress. Feedback welcome. https://bit.ly/4t6Sqoe March 24, 2026 at 11:24PM

Monday, 23 March 2026

Show HN: OpenCastor Agent Harness Evaluator Leaderboard https://bit.ly/4bGGUc3

Show HN: OpenCastor Agent Harness Evaluator Leaderboard I've been building OpenCastor, a runtime layer that sits between a robot's hardware and its AI agent. One thing that surprised me: the order you arrange the skill pipeline (context builder → model router → error handler, etc.) and parameters like thinking_budget and context_budget affect task success rates as much as model choice does. So I built a distributed evaluator. Robots contribute idle compute to benchmark harness configurations against OHB-1, a small benchmark of 30 real-world robot tasks (grip, navigate, respond, etc.) using local LLM calls via Ollama. The search space is 263,424 configs (8 dimensions: model routing, context budget, retry logic, drift detection, etc.). The demo leaderboard shows results so far, broken down by hardware tier (Pi5+Hailo, Jetson, server, budget boards). The current champion config is free to download as a YAML and apply to any robot. P66 safety parameters are stripped on apply — no harness config can touch motor limits or ESTOP logic. Looking for feedback on: (1) whether the benchmark tasks are representative, (2) whether the hardware tier breakdown is useful, and (3) anyone who's run fleet-wide distributed evals of agent configs for robotics or otherwise. https://bit.ly/4c1pica March 23, 2026 at 11:13PM

Show HN: Cq – Stack Overflow for AI coding agents https://bit.ly/47gYJgx

Show HN: Cq – Stack Overflow for AI coding agents
Hi all, I'm Peter, a Staff Engineer at Mozilla.ai, and I want to share our idea for a standard for shared agent learning. Conceptually, it fit easily into my mental model as a Stack Overflow for agents.
The project is trying to see if we can get agents (any agent, any model) to propose 'knowledge units' (KUs) in a standard schema based on gotchas they run into during use, and to proactively query for existing KUs to get insights which they can verify and confirm if they prove useful.
It's currently very much a PoC with a more lofty proposal in the repo; we're trying to iterate from local use, up to team level, and ideally eventually have some kind of public commons. At the team level (see our Docker compose example), your coding agent is configured to point to the team's API address and sends KUs there instead, where they can be reviewed by a human in the loop (HITL) via a UI in the browser before they're allowed to appear in queries by other agents in your team.
We're learning a lot even from using it locally on various repos internally, not just in the kind of KUs it generates, but also from a UX perspective on trying to make it easy to start using it and approving KUs in the browser dashboard. There are bigger, complex problems to solve in the future around data privacy, governance etc. but for now we're super focussed on getting something that people can see some value from really quickly in their day-to-day.
Tech stack:
* Skills - markdown
* Local Python MCP server (FastMCP) - managing a local SQLite knowledge store
* Optional team API (FastAPI, Docker) for sharing knowledge across an org
* Installs as a Claude Code plugin or OpenCode MCP server
* Local-first by default; your knowledge stays on your machine unless you opt into team sync by setting the address in config
* OSS (Apache 2.0 licensed)
Here's an example of something which seemed straightforward: when asked to write a GitHub action, Claude Code often used actions that were multiple major versions out of date because of its training data. In this case I told the agent what I saw when I reviewed the GitHub action YAML file it created, and it proposed the knowledge unit to be persisted. Next time, in a completely different repo using OpenCode and an OpenAI model, the cq skill was used up front before it started the task; it got the information about the gotcha on major versions in training data and checked GitHub proactively, using the correct, latest major versions. It then confirmed the KU, increasing the confidence score.
I guess some folks might say: well, there's a CLAUDE.md in your repo, or in ~/.claude/. But we're looking further than that: we want this to be available to all agents, to all models, and maybe more importantly we don't want to stuff AGENTS.md or CLAUDE.md with loads of rules that lead to unpredictable behaviour. This is targeted information on a particular task and seems a lot more useful.
Right now it can be installed locally as a plugin for Claude Code and OpenCode:
    claude plugin marketplace add mozilla-ai/cq
    claude plugin install cq
This allows you to capture data in your local ~/.cq/local.db (the data doesn't get sent anywhere else).
We'd love feedback on this. The repo is open and public, so GitHub issues are welcome. We've posted on some of our social media platforms with a link to the blog post (below), so feel free to reply to us if you found it useful or ran into friction; we want to make this something that's accessible to everyone.
Blog post with the full story: https://bit.ly/41ukHZX
GitHub repo: https://bit.ly/4soBZ6I
Thanks again for your time.
https://bit.ly/41ukHZX March 23, 2026 at 05:11PM
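The propose, query, and confirm cycle described in the post can be pictured with a tiny SQLite store. This Python sketch is purely illustrative: the table layout, column names, starting confidence of 0.5, and the 0.1 confirmation step are assumptions, not cq's actual schema or scoring.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a local knowledge store; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS knowledge_units (
               id INTEGER PRIMARY KEY,
               topic TEXT NOT NULL,
               insight TEXT NOT NULL,
               confidence REAL NOT NULL DEFAULT 0.5
           )"""
    )
    return db

def propose(db, topic, insight):
    """An agent records a gotcha it ran into."""
    cur = db.execute(
        "INSERT INTO knowledge_units (topic, insight) VALUES (?, ?)",
        (topic, insight),
    )
    db.commit()
    return cur.lastrowid

def query(db, topic):
    """Another agent looks up existing units before starting a task."""
    rows = db.execute(
        "SELECT id, insight, confidence FROM knowledge_units "
        "WHERE topic = ? ORDER BY confidence DESC",
        (topic,),
    )
    return rows.fetchall()

def confirm(db, unit_id, step=0.1):
    """A unit that proved useful gains confidence, capped at 1.0."""
    db.execute(
        "UPDATE knowledge_units SET confidence = MIN(1.0, confidence + ?) WHERE id = ?",
        (step, unit_id),
    )
    db.commit()

db = open_store()
unit_id = propose(db, "github-actions", "Check the latest major version before pinning an action.")
confirm(db, unit_id)
units = query(db, "github-actions")
```

In the team setup the post describes, the same three operations would go through the team API and a human review step instead of a local database file.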

Sunday, 22 March 2026

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/4srsrrA

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/47WxiJ2 March 23, 2026 at 02:48AM

Show HN: Quillium, Git for Writers https://bit.ly/4c0H92U

Show HN: Quillium, Git for Writers This is a tool which lets you easily manage different versions of ideas, helpful for writing essays. I've found myself wanting this every single time I go through the drafting process when writing, and I've been frustrated every time I find myself accidentally working on an old draft just because there was a paragraph that I liked better. This solves it. I hope the community likes this as much as I enjoyed working on it! Note that it's currently a beta waitlist because there are some bugs with the undo/redo state management, so I want to dogfood it for a bit for reliability. It says April 2nd, but I may allow earlier beta testers. https://bit.ly/4bFReRH March 23, 2026 at 01:22AM

Show HN: Plot-Hole.com a daily movie puzzle I made https://bit.ly/47C1U2H

Show HN: Plot-Hole.com a daily movie puzzle I made https://bit.ly/4brdZd9 March 23, 2026 at 01:15AM

Show HN: Refrax – my Arc Browser replacement I made from scratch https://bit.ly/4ssbdKD

Show HN: Refrax – my Arc Browser replacement I made from scratch Open the same tab in two browser windows. In Chrome or Safari, you get two unconnected pages. In Arc, one window shows a placeholder. In Zen, it silently creates a duplicate. In Refrax, the browser I built, both windows show the same page updating live. The same web page, in as many windows as you want. This shouldn't be possible. WebKit's WKWebView can exist in exactly one view hierarchy at a time. With macOS 26, Apple added a SwiftUI API separating WebView from WebPage, so you can end up with multiple views referencing the same page. But if you try it, your app crashes. WebKit source code has a precondition with this comment: "We can't have multiple owning pages regardless, but we'll want to decide if it's an error, if we can handle it gracefully, and how deterministic it might even be..." So here's how I did it. CAPortalLayer is an undocumented private class that's been in macOS since 10.12. It mirrors a layer's composited output by referencing the same GPU memory, not copying it. Every scroll, animation, or repaint reflects instantly. This is what powers Liquid Glass effects, the iOS text selection magnifier, and ghost images during drag and drop. Apple uses portals for effects. I use them to put the same web page in two windows. Refrax keeps one real WKWebView per tab and displays a CAPortalLayer mirror everywhere else. When you click a different window, the coordinator moves the real view there and the old window gets a portal. You can't tell which is which. This sounds simple in theory, but making this actually work seamlessly took quite a lot of effort. Each macOS window has its own rendering context, and the context ID updates asynchronously, so creating a portal immediately captures a stale ID and renders nothing. The portal creation needs to be delayed, but delaying creates a visual gap. 
I capture a GPU snapshot using a private CoreGraphics function and place it behind the portal as a fallback. Another hard part is that none of it is documented. Portals are very capricious and would crash the app if you use them incorrectly. I had to inspect the headers and then disassemble the binaries to explore exactly how it works in order to build something robust. I never worked on a browser before this, I've only been a user. I started using Arc in 2022. I remember asking for an invite, learning the shortcuts, slowly getting used to it. I didn't like it at first as it had too much Google Chrome in it for my taste, and I'd been using Safari at the time. But it grew on me, and by the time it was essentially abandoned and sold to Atlassian, I couldn't go back to Safari anymore. I tried everything: Zen, SigmaOS, Helium. None felt right, and I didn't want another Chromium fork. WebKit ships with the OS, but all you get is the rendering engine. Tabs, history, bookmarks, passwords, extensions, everything else has to be made separately. And so, being a very reasonable person, I decided to make my own Arc replacement from scratch. And I did. Refrax is built in Swift and Objective-C with no external dependencies. The app itself is less than 30 MB. I have 393 tabs open right now using 442 MB of RAM; 150 tabs in Safari was already over 1 GB. I've been using it daily for over a month, and so have some of my friends. The portal mirror is just one feature. The same approach, finding what Apple built for themselves and using it to create something they didn't think about, runs through the entire browser. You can tint your glass windows with adjustable blend modes and transparency. The sidebar in compact mode samples the page and matches the colors. And it has support for Firefox and Chrome extensions. The alpha is public. Download from the linked website, enter REFRAX-ALPHA-HACKERNEWS to activate. No account needed. 
Telemetry is crash reports and a daily active-user ping, nothing else. And if you find a bug – I built this alone, so I'll actually read your report. https://bit.ly/4bs6AdM March 22, 2026 at 11:52PM

Saturday, 21 March 2026

Show HN: An event loop for asyncio written in Rust https://bit.ly/4sBBVR2

Show HN: An event loop for asyncio written in Rust Actually, there's nothing special about this implementation. It's just another event loop written in Rust, for educational purposes and joy. In tests it shows seamless migration from uvloop for my scraping framework https://bit.ly/4lL0CIq . With APIs (FastAPI) it shows only one advantage: better p99. uvloop is about 10-20% faster in the synthetic run. Currently, I am working on the win branch to give it Windows support, which uvloop lacks. https://bit.ly/4v2jgQn March 21, 2026 at 11:12PM
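For readers who want to compare event loops on tail latency the way the post does, here is a minimal Python harness. It is only a sketch: the handler is a stand-in for real request work, the helper names are hypothetical, and the stdlib asyncio.new_event_loop is used as the loop factory because the Rust loop's factory name is not given in the post. Any callable that returns a loop, such as uvloop.new_event_loop, could be passed instead.

```python
import asyncio
import time

def p99(samples):
    """99th percentile by nearest rank."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100 - 1   # ceil(0.99 * n) - 1
    return ordered[rank]

async def handler():
    """Stand-in for real work; yields to the event loop once."""
    await asyncio.sleep(0)

async def measure(calls):
    """Time each awaited call individually."""
    latencies = []
    for _ in range(calls):
        start = time.perf_counter()
        await handler()
        latencies.append(time.perf_counter() - start)
    return latencies

def run_with(loop_factory, calls=1000):
    """Run the measurement on a loop produced by the given factory."""
    loop = loop_factory()
    try:
        return loop.run_until_complete(measure(calls))
    finally:
        loop.close()

latencies = run_with(asyncio.new_event_loop)
tail = p99(latencies)
```

A synthetic loop like this mostly measures scheduling overhead, so results from a real API server may rank the loops differently, which matches the mixed picture the post reports.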

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI https://bit.ly/3PlmMF2

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI I use points and miles for most of my travel. Every booking comes down to the same decision: use points or pay cash? To answer that, you need award availability across multiple programs, cash prices, your current balances, transfer partner ratios, and the math to compare them. I got tired of doing it manually across a dozen tabs. This toolkit teaches Claude Code and OpenCode how to do it. 7 skills (markdown files with API docs and curl examples) and 6 MCP servers (real-time tools the AI calls directly). It searches award flights across 25+ mileage programs (Seats.aero), compares cash prices (Google Flights, Skiplagged, Kiwi.com, Duffel), pulls your loyalty balances (AwardWallet), searches hotels (Trivago, LiteAPI, Airbnb, Booking.com), finds ferry routes across 33 countries, and looks up weird hidden gems near your destination (Atlas Obscura). Reference data is included: transfer partner ratios for Chase UR, Amex MR, Bilt, Capital One, and Citi TY. Point valuations sourced from TPG, Upgraded Points, OMAAT, and View From The Wing. Alliance membership, sweet spot redemptions, booking windows, hotel chain brand lookups. 5 of the 6 MCP servers need zero API keys. Clone, run setup.sh, start searching. Skills are, as usual, plain markdown. They work in OpenCode and Claude Code automatically (I added a tiny setup script), and they'll work in anything else that supports skills. PRs welcome! Help me expand the toolkit! :) https://bit.ly/47ObeAl https://bit.ly/47ObeAl March 21, 2026 at 10:25PM
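The "points or cash?" decision described above comes down to comparing the value you get per point against what you think a point is worth. Here is a minimal Python sketch of that math; the function names and the example fare, taxes, and 1.5 cents-per-point valuation are made up for illustration and are not taken from the toolkit.

```python
def cents_per_point(cash_price, award_taxes, points):
    """Value obtained per point, in cents, after paying the award's taxes and fees."""
    return (cash_price - award_taxes) * 100 / points

def best_option(cash_price, award_taxes, points, valuation_cpp):
    """Use points only if the redemption beats your own valuation of a point."""
    achieved = cents_per_point(cash_price, award_taxes, points)
    choice = "points" if achieved >= valuation_cpp else "cash"
    return choice, round(achieved, 2)

# Hypothetical booking: a $650 fare, or 30,000 points plus $50 in taxes.
decision = best_option(cash_price=650.00, award_taxes=50.00, points=30_000, valuation_cpp=1.5)
```

When the points would have to be transferred from a bank program first, the points figure should be the number of bank points needed after applying the transfer ratio.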

Friday, 20 March 2026

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/4rJtaDi

Show HN: AgentVerse – Open social network for AI agents (Mar 2026) https://bit.ly/47WxiJ2 March 21, 2026 at 02:25AM

Show HN: Rover – turn any web interface into an AI agent with one script tag https://bit.ly/4blbIAg

Show HN: Rover – turn any web interface into an AI agent with one script tag https://bit.ly/3NAOc9a March 21, 2026 at 01:58AM

Show HN: Vibefolio – a place to showcase your vibecoded projects https://bit.ly/47h4FGh

Show HN: Vibefolio – a place to showcase your vibecoded projects Over the last few months, more people have been shipping small apps, experiments, and side projects at a much higher pace. I'm one of them and initially created a showcase page for myself to track mine, but this week I decided to create something for others. Happy to read feedback on how to improve it further! https://bit.ly/47fd3pN March 20, 2026 at 09:53PM

Show HN: Cybertt – Cybersecurity Tabletop https://bit.ly/47x7hQH

Show HN: Cybertt – Cybersecurity Tabletop https://bit.ly/3PmIIzx March 20, 2026 at 10:29AM

Thursday, 19 March 2026

Show HN: Download entire/partial Substack to ePub for offline reading https://bit.ly/4uGIhQO

Show HN: Download entire/partial Substack to ePub for offline reading Hi HN, This is a small Python app with an optional web UI. It is intended to be run locally. It can also be run with Docker, but cookie autodetection will not work there. It allows you to download a single Substack, either entirely or partially, and saves the output to an epub file, which can be easily transferred to Kindle or other reading devices. This is admittedly a "vibe coded" app made with Claude Code and a few hours of iterating, but I've already found it very useful for myself. It supports both free and paywalled posts (if you are a paid subscriber to that creator). You can order the entries in the epub by popularity, newest first, or oldest first, and also limit to a specific number of entries, if you don't want all of them. You can either provide your substack.sid cookie manually, or you can have it be autodetected from most browsers/operating systems. https://bit.ly/4uwnXRY March 20, 2026 at 04:36AM

Show HN: Screenwriting Software https://bit.ly/3Phmteo

Show HN: Screenwriting Software I’ve spent the last year getting back into film and testing a bunch of screenwriting software. After a while I realized I wanted something different, so I started building it myself. The core text engine is written in Rust/wasm-bindgen. https://bit.ly/47cYh2P March 20, 2026 at 03:07AM

Wednesday, 18 March 2026

Show HN: Browser grand strategy game for hundreds of players on huge maps https://bit.ly/41cC0i3

Show HN: Browser grand strategy game for hundreds of players on huge maps Hi HN, I've been building a browser-based multiplayer strategy game called Borderhold. Matches run on large maps designed for hundreds of players. Players expand territory, attack neighbors, and adapt as borders shift across the map. You can put buildings down, build ships, and launch nukes. The main thing I wanted to explore was scale: most strategy games have small matches, modest maps, or modest player counts, but here the maps are large and the game works well with hundreds of players. Matches are relatively short so you can jump in and see a full game play out. Curious what people think. https://bit.ly/4uDPCAC Gameplay: https://youtu.be/nrJTZEP-Cw8 Discord: https://bit.ly/4uEbuvu https://bit.ly/4uDPCAC March 16, 2026 at 09:51AM