What the Co-founder of OpenAI Does to Automate Work
Can hacker creativity make enterprise AI safe? OpenAI just bet on it.

Most professionals who care about improving their work are running somewhere between 20 and 50 experiments per year. A new email subject line. A revised landing page headline. A tighter system prompt that produces more consistent AI output. Maybe a new CTA on a key form. You test one thing, wait for results, decide what to keep, and move on — because that's all the bandwidth you have.
Andrej Karpathy, the former Tesla AI director and OpenAI co-founder who coined the term "vibe coding," looked at that same loop in March 2026 and asked a simple question: what if an AI agent ran the experiments instead of you?
The answer became autoresearch — a 630-line open source script that racked up 42,000 GitHub stars in its first week. Fortune called it "The Karpathy Loop." Shopify CEO Tobi Lütke pointed it at Liquid — the template engine powering every Shopify store, a codebase he originally wrote 20 years ago — and got 53% faster rendering from 93 automated commits. Everyone covered the ML side. Most business professionals closed the tab. That was a mistake.
LISTEN TO THE AI ENTERPRISE ON THE ROGUE AGENTS PODCAST
This is my latest project. We do have audio summaries for each newsletter, but they are simple text-to-speech and not ideal for listening. So we created this podcast to deliver a weekly summary of the newsletters instead. It's still a work in progress: right now you get a solid recap of the previous week's newsletters, and the episodes will keep improving. That's the plan.
What happens when two AI agents break down the week's biggest AI news? You get Rogue Agents. Vera and Neuro deliver the stories that matter in enterprise AI — the deals, the tools, the breakthroughs, and the stuff everyone's getting wrong — in 15-20 minutes every week.
AI LESSON
How top AI researchers automate their work (and how you can too)
Autoresearch's three-part pattern works on anything you can score — prompts, emails, landing pages, and beyond
Karpathy's core insight is that almost all knowledge work improvement follows the same loop: modify something, measure it, keep or discard, repeat. Humans do this manually and slowly. An AI agent can do it continuously and at scale. The pattern has three parts, and none of them require a GPU or a machine learning background.
Every autoresearch loop runs on the same three primitives:
Editable asset. One file the agent is permitted to modify. In Karpathy's original implementation, it is train.py. In a business context, it could be a system prompt, an email template, a landing page, or a Claude skill instruction file.
Scalar metric. One number that determines whether the change was an improvement — computed automatically, without human judgment. Validation loss for ML training. Open rate for email. Conversion rate for landing pages. Scoring rubric pass rate for AI-generated content.
Time-boxed cycle. Each experiment runs for the same fixed duration. Karpathy uses five minutes per training run, which yields roughly 12 experiments per hour and 80 to 100 overnight.
When all three are present, the loop runs. If any one is missing, it doesn't.
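The three primitives reduce to one generic loop. The sketch below is illustrative, not autoresearch's actual code: `mutate` stands in for the AI agent proposing a change, and `score` stands in for whatever scalar metric you chose.

```python
def autoresearch_loop(asset, mutate, score, cycles=100):
    """Generic modify-measure-keep-or-discard loop over one editable asset.

    asset  -- the current version of the thing being optimized
    mutate -- proposes a changed copy (in autoresearch, an AI agent does this)
    score  -- the scalar metric; higher is better, no human judgment involved
    cycles -- each cycle would be time-boxed (e.g. five minutes) in practice
    """
    best_asset, best_score = asset, score(asset)
    for _ in range(cycles):
        candidate = mutate(best_asset)
        candidate_score = score(candidate)
        if candidate_score > best_score:   # keep the improvement
            best_asset, best_score = candidate, candidate_score
        # otherwise discard: the best version so far is untouched
    return best_asset, best_score
```

At Karpathy's five-minute cycle time, `cycles=100` is roughly what an overnight run works through.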
Use Case 1: System Prompt and AI Skill Optimization
This is the highest-leverage application for most AI Enterprise readers, and it requires no GPU at all.
Four Steps: The four-step process for system prompt and AI skill optimization.
Step 1: Identify the system prompt or skill file you want to improve. Choose one that produces inconsistent outputs or requires manual editing more than 30% of the time.
Step 2: Define 3 to 6 binary evaluation criteria. Each must be answerable with a yes or no. Examples: "Does the output include a concrete next step?" "Is the response under 200 words?" "Does it avoid the word 'delve'?" Binary criteria are critical — sliding scales give the agent room to game the checklist.
Step 3: Write a program.md file that tells the agent what to optimize, what constraints to respect, and when to stop. Karpathy's own example is 40 lines.
Step 4: Point a coding agent (Claude Code, Cursor, or Windsurf all work) at the repo and walk away.
Time: 20 minutes of setup, then hands-off overnight.
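A program.md for this use case might look like the sketch below. This is not Karpathy's actual 40-line example, and the file name `skill.md` is hypothetical; it simply carries the ingredients the steps above require: a target, binary criteria, constraints, and a stop condition.

```markdown
# Goal
Improve `skill.md` so its outputs pass every criterion in the rubric below.

# You may edit
- skill.md (nothing else)

# Rubric (binary, scored automatically, 1 point each)
1. Does the output include a concrete next step? (yes/no)
2. Is the response under 200 words? (yes/no)
3. Does it avoid the word "delve"? (yes/no)

# Constraints
- Do not change the skill's stated audience or tone.
- Make one change per cycle; commit each change with a one-line message.

# Stop condition
Stop after three consecutive cycles with a perfect rubric score.
```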
Example results: Balasubramanyam Kosuri, a technical writer, published the actual run log on Medium. He applied the loop to a documentation SEO skill and went from 24/40 to 40/40 in 14 autonomous cycles — three consecutive perfect scores triggered the stop condition.
Use Case 2: Email Subject Line and Copy Testing
Your editable asset: A single email template file.
Your metric: Open rate, click-through rate, or positive reply rate — pulled from your ESP automatically.
Marketing teams currently run 20 to 30 experiments per year on email. At roughly 100 automated experiments per night, a team running an autoresearch-style loop continuously could run 36,500 per year.
Four Steps: The four-step process for email subject line and copy testing.
Export your best-performing email template as a single file.
Define your metric. Open rate is the simplest starting point.
Configure your agent with the template file, the metric source, and a program.md that specifies what the agent is allowed to change.
Set a test segment size (500 recipients minimum for statistical significance) and let the loop run.
Time: 30 minutes to configure; experiments run overnight.
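The 500-recipient minimum exists so that the open-rate comparison is signal, not noise. Autoresearch itself does not do significance testing; the helper below is one hedged way to make the keep-or-discard call between two segments, using a one-sided two-proportion z-test with an illustrative 1.64 threshold (roughly 95% one-sided confidence).

```python
import math

def open_rate_improved(opens_a, sent_a, opens_b, sent_b, z_threshold=1.64):
    """Did variant B's open rate beat variant A's, beyond chance?

    Keep B only when its rate is higher AND the one-sided two-proportion
    z-statistic clears the threshold; otherwise discard the change.
    """
    p_a, p_b = opens_a / sent_a, opens_b / sent_b
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if se == 0:
        return False  # degenerate segments (0% or 100% everywhere)
    z = (p_b - p_a) / se
    return z > z_threshold
```

With 500-recipient segments, a jump from 20% to 30% clears the bar; a jump from 20% to 21% does not.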
Use Case 3: Landing Page Conversion Optimization
Your editable asset: The landing page HTML or a component file in your CMS.
Your metric: Conversion rate, tracked via your analytics platform.
Four Steps: The four-step process for landing page optimization.
Isolate the page element with the most conversion leverage — usually the headline or the primary CTA.
Give the agent read/write access to that element only.
Connect your analytics API so the agent can read conversion rates after each test cycle.
Let it run. Review the git log in the morning.
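The morning git-log review works because each kept experiment becomes a commit and each discarded one is reverted. Here is a minimal sketch of that keep-or-discard bookkeeping, assuming the loop runs inside a git checkout; the function name and rate arguments are hypothetical, not part of autoresearch.

```python
import subprocess

def commit_if_better(path, new_rate, best_rate, note):
    """Record one experiment's outcome in git.

    A winning change becomes a commit (so `git log --oneline` is the
    morning review); a losing change is reverted from the index.
    """
    if new_rate > best_rate:
        subprocess.run(["git", "add", path], check=True)
        subprocess.run(
            ["git", "commit", "-q", "-m",
             f"{note}: {best_rate:.2%} -> {new_rate:.2%}"],
            check=True)
        return new_rate
    subprocess.run(["git", "checkout", "--", path], check=True)  # discard
    return best_rate
```

After an overnight run, the log reads as a chronological list of only the changes that actually moved the conversion rate.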
What Autoresearch Can't Do (Yet)
No channel selection. Autoresearch optimizes within a channel. It cannot tell you whether you should be running email campaigns at all.
No brand judgment. The agent optimizes for whatever metric you give it. Constrain tone and voice explicitly in program.md.
Requires a measurable metric. If your goal is "make this feel more on-brand," autoresearch cannot help you. You must translate your goal into a number.
The ML version requires a GPU — but you have options.
The original autoresearch repo requires a single NVIDIA GPU. Here are your paths:
Run it on a Mac. Two community forks bring autoresearch to Apple Silicon:
autoresearch-macos — PyTorch with Metal Performance Shaders (MPS). Supports M1/M2/M3/M4.
autoresearch-mlx — Apple's native MLX framework. No PyTorch or CUDA dependency.
A Mac Mini M4 Pro (24GB+) handles either fork well. The base M4 (16GB) requires lowering the eval batch size.
Rent a GPU in the cloud. An overnight run costs $16–20:
| Provider | H100 Price | Best For |
|---|---|---|
| Google Colab | Free (T4) | Easiest start |
| | $1.99/hr (H100 spot) | Cost-first |
| | $2.49/hr (H100) | Best reliability |
| | ~$1.50–2.00/hr | Cheapest option |
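The $16–20 estimate falls straight out of those H100 rates and a roughly eight-hour overnight run; the helper below just makes the arithmetic explicit.

```python
def overnight_cost(rate_per_hour, hours=8):
    """Cost of an overnight run at a given hourly GPU rental rate."""
    return rate_per_hour * hours

# The table's H100 rates bracket the article's estimate:
# 8 h at $1.99/hr -> $15.92
# 8 h at $2.49/hr -> $19.92
```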
Skip the GPU entirely. For prompt optimization, email testing, and skill improvement — no GPU needed. Just a coding agent (Claude Code, Cursor, or Windsurf) and the file you want to improve.
Getting Started Today
Right now: Read the autoresearch repo README and the program.md example. 15 minutes.
This week: Pick one AI system prompt or Claude skill that produces inconsistent output. Write 3–5 binary evaluation criteria for it.
Next week: Install Claude Code and run a manual version of the loop — five iterations by hand before you automate.
When you're ready for the ML side: Mac Mini M4 Pro owners clone autoresearch-mlx. Everyone else: open Google Colab, select a T4 GPU runtime, clone the original repo, and let Claude Code handle setup.
The shift Karpathy is describing is not about GPUs or training loss curves. It is about moving from experimenter to experimental designer. You define what "better" means. The agent runs the rounds you would never have time for.
AI EXTRA CREDIT
Keep learning with these upcoming free virtual events from the All Things AI community.
May 6th | LinkedIn Live | Why Jensen Huang's Betting on Confidential Computing in the AI Factory — In this session, Mark Hinkle sits down with Aaron Fulkerson, CEO of Opaque Systems — the leading Confidential AI platform born from UC Berkeley's RISELab and backed by Intel, Accenture, and many others — for a conversation that will fundamentally change how you think about enterprise AI.
April 14th | Lunch and Learn | Automating Your Business with AI and Make.com — In this session, Ricardo Govindasamy will show how businesses can combine AI and Make.com to automate real workflows, reduce manual effort, connect disconnected systems, and move work faster with less friction. This webinar is designed for builders, operators, and automation enthusiasts who want to go beyond theory and see how AI-powered workflows can create real operational value.
I appreciate your support.

Your AI Sherpa,
Mark R. Hinkle
Publisher, The AIE Network
Connect with me on LinkedIn
Follow Me on Twitter

