AI Engineering Code Summit

Themes#

(AI-generated from the rest of the notes below)

Engineering Perspective Themes#

  • Context Window Management & Progressive Discovery

    • Skills over monolithic agents (Anthropic’s approach)
    • Keeping models in the “smart zone” vs “dumb zone” of context windows
    • Progressive discovery through skills.md files rather than exhaustive tool descriptions
    • Context compaction and subagent delegation strategies
  • RL Infrastructure & Training Optimization

    • Compute-efficient reinforcement learning as fundamental requirement
    • Async RL tools to maximize GPU utilization
    • Custom kernels for inference speedups (Cursor achieving several-fold improvements)
    • Environment-training alignment as critical success factor
  • Evaluation & Benchmarking

    • Need for non-cheatable, real-world benchmarks (cline-bench)
    • RL Environment Factory concept for automated environment creation
    • Importance of tight feedback cycles (code/coding ideal for RL)
    • Monitoring RL reward hacking
    • Gap between self-reported and actual efficacy increases
  • Agent Architecture Patterns

    • Harness-model co-optimization
    • Bash as foundational layer (“Bash is all you need”)
    • Tools > Bash > CodeGen / Skills hierarchy
    • Multi-agent systems with specialized roles (spec/implementation/tests triangle)
  • Code Quality & Production Readiness

    • Deterministic validation everywhere
    • Clear, actionable error messages
    • Framework standardization across organizations
    • CLI/API-first design for agent compatibility
    • Code simplicity enabling better reasoning
    • Comprehensive documentation requirements
  • Platform & Infrastructure

    • Model serving via gateways
    • MCP directories and hubs
    • PaaS for tool creation/deployment
    • Standardized development environments
    • Cross-hardware kernel optimization

Organizational Perspective Themes#

  • Change Management & Adoption

    • “All-in” vs 90% adoption creates 10x difference
    • Shift from written culture to demo culture
    • Structured adoption strategies:
      • Centralized communication
      • Peer Slack channels for best practice sharing
      • Role-based training
      • Specific use cases
      • Rewards programs
      • Coaching and library resources
  • Organizational Workflow Transformation

    • Moving beyond Agile to AI-native workflows
    • 2-person teams building production products
    • Organizations less productive than individuals (needs addressing)
  • Developer Experience Principles

    • “Freedom and Responsibility” with guardrails
    • “Golden Paths” with Platform Enablement
    • Democratized infrastructure via Inner Source
    • Service abstraction and strong contracts
    • Code review velocity optimization (clear ownership, not Slack groups)
  • ROI & Metrics

    • Tool access ≠ tool usage
    • Engineering outcomes easiest to measure
    • Per-line level data collection (Cursor model)
    • Clean coding environments correlate with better outcomes (tests/types/docs/modularity)
    • Vicious vs virtuous cycles in development
  • Organizational Structure Evolution

    • 15-person companies with 6 apps achieving good ARR
    • Single engineers building/maintaining complex production products
    • Support teams authoring code patches
    • Compounding engineering through prompt sharing

LLM Training#

  • Domain-Specific Training

    • Importance of tight RL feedback cycles
    • Need for domain-specific benchmarks
    • Training environment must match deployment environment exactly
    • Reward function design and monitoring for scientific tasks
  • Data Quality & Collection

    • Real-world task data collection critical
    • Clean, structured data environments
    • Comprehensive documentation as training signal
    • Tests and types as quality indicators
  • Scientific Workflow Optimization

    • Multi-modal model capabilities (MiniMax M2)
    • Interleaved thinking improving performance
    • Code execution in sandboxes gaining traction
    • Browser-based automation for testing
  • Evaluation Frameworks

    • LLM-as-judge for prompt improvement
    • Continuous evaluation loops
    • KPI gathering automation
    • Safety training automation potential

Forward Deployed Engineer Perspective#

  • Rapid Prototyping & Iteration

    • Demo-first culture
    • Quick wins establishing credibility
    • Small bets, big impact approach
    • Progressive enhancement of capabilities
  • Integration & Interoperability

    • MCP server implementation
    • Tool composition strategies
    • Internal data accessibility via MCPs
    • Cross-platform standardization
  • Production Deployment Challenges

    • Demos easy; production requires QC
    • Security and governance considerations
    • Distribution and versioning of skills/tools
    • Failure mode handling
  • User Experience Focus

    • Invisible co-pilots in existing tools (Slack/CRM/BI)
    • Tight integration with existing workflows
    • Google AI Studio-style accessibility
    • Low-friction onboarding
  • Specialized Tool Development

    • Context engines for code understanding
    • Automated ticket creation and patch authoring
    • Report retrieval automation
    • Code review automation with learning benefits
  • Infrastructure Requirements

    • LLM-ready data fabrics
    • Secure model serving
    • Standardized RL environments
    • Reproducibility platforms (Prime Intellect approach)

Cross-Cutting Themes#

  • The War on Slop

    • Quality control as primary concern
    • Need for validation at every layer
    • Code integrity vs code generation balance
    • Process-level vs code-level problems
  • Speed as Feature

    • Fast inference as competitive advantage
    • Rapid iteration cycles
    • Quick feedback loops
    • Async processing patterns
  • Specialization vs Generalization

    • Skills for domain expertise
    • Specialized agents for task classes
    • Progressive discovery vs exhaustive description
    • Context-aware tool selection

20251122 Engineering#

Composable MCP Architectures: Handling What the Protocol Can’t#

Prompt to Production with Google AI Studio#

Claude Code SDK#

  • Speaker: Thariq Shihipar / Anthropic
  • Location: Datadog 46th floor Cafe

This was a semi-interactive walkthrough of what skills are and what they can do.

“Bash is all you need” → if it’s possible via bash, the agent can construct the perfect query for the task using sharp CLI tools with very little context use. Part of the broader message about context window management.

Tools > Bash > CodeGen / Skills
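
A minimal sketch of the pattern (my illustration, not Anthropic’s implementation; the lone `run_bash` tool is hypothetical): rather than loading files into context, the agent composes one sharp CLI query and only the small result enters the window.

```python
import subprocess

def run_bash(command: str, timeout: int = 30) -> str:
    """The agent's single tool: it composes sharp CLI queries so only
    small, relevant output ever enters the context window."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

# Rather than reading whole files into context, the agent runs a
# targeted query that returns a handful of matching lines:
print(run_bash("grep -rn 'def main' --include='*.py' . | head -20"))
```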

In general I didn’t think there was too much depth to this workshop - or rather, if this is the cutting edge then we are all there. The use of skills/filesystem is great for the idea of progressive context management, but the story for distribution and sharing of skills was not discussed, nor were failure modes. And of course the big issue is codegen - e.g. if you want to run these things locally you have all the sharp-edge problems of giving a user bash access.

Building Durable, Production-Ready Agents with OpenAI SDK and Temporal#

  • Speaker: Cornelia Davis / Temporal
  • Location: Datadog 47th floor Puss in Boots/Cafe

Build a Real-Time AI Sales Agent#

  • Speaker: Sarah Chieng / Cerebras
  • Location: Datadog 46th floor Cafe

Building Intelligent Research Agents with Manus#

  • Speaker: Ivan Leo / Manus
  • Location: AWS JFK27 B1.296

I missed this talk as there were concurrent tracks, but Ivan is an impressive guy and Manus has had absolutely massive growth. I will definitely try it - they are really going hard at the app for everything, from simple agentic office tasks to building and deploying full-stack apps à la Replit.

Build a Prompt Learning Loop#

  • Speaker: SallyAnn DeLucia / Arize
  • Location: AWS JFK27 B1.300

This workshop showed the use of evals to engineer improved prompts. The CEO had spoken earlier and I was curious. They showed that you can use evaluation frameworks to test and improve your prompts to be more effective, walking us through a toy problem where an LLM-as-a-judge would provide suggested prompts. This seems like powerful low-hanging fruit for the kinds of prompts an org faces when building bespoke agents. It does require collecting data, establishing evals, and continuously updating and measuring success, but you could imagine this being applicable to all aspects of running a business - e.g. KPI gathering; emails/Slack/Notion/docs; engineering configuration; safety trainings…
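
A minimal sketch of that loop, assuming a generic chat-completion call (`call_model`, `judge`, and `improve` are placeholders of mine, not Arize’s API): score the current prompt on a small eval set with an LLM-as-a-judge, ask for a rewrite, and keep the rewrite only if it measurably scores better.

```python
def call_model(prompt: str) -> str:
    # Placeholder for a real chat-completion call; swap in your provider.
    return "PLACEHOLDER RESPONSE"

def judge(task: str, answer: str) -> float:
    """LLM-as-a-judge: ask a model to grade an answer from 0 to 1."""
    reply = call_model(
        f"Grade this answer from 0 to 1.\nTask: {task}\nAnswer: {answer}\nGrade:"
    )
    try:
        return float(reply.strip().split()[0])
    except ValueError:
        return 0.0

def eval_prompt(prompt: str, evalset: list[str]) -> float:
    scores = [judge(task, call_model(f"{prompt}\n{task}")) for task in evalset]
    return sum(scores) / len(scores)

def improve(prompt: str, evalset: list[str], rounds: int = 3) -> str:
    best, best_score = prompt, eval_prompt(prompt, evalset)
    for _ in range(rounds):
        # Ask the model itself to propose a sharper prompt...
        candidate = call_model(f"Rewrite this prompt to score better on its tasks:\n{best}")
        score = eval_prompt(candidate, evalset)
        if score > best_score:  # ...but keep it only if the evals agree.
            best, best_score = candidate, score
    return best
```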

Building durable Agents with Workflow DevKit & AI SDK#

  • Speaker: Peter Wielander / Vercel
  • Location: AWS JFK27 B1.100

gemini fix --quality: A Story of CLI Bugs and Patches#

  • Speaker: Anjali Sridhar / Google
  • Location: Datadog 47th floor Cafe

Sharpening your AI toolbox for Spec-Driven Dev#

  • Speaker: Al Harris / Amazon Kiro
  • Location: Datadog 47th floor Puss in Boots

Started on this one but it was a bit slower than I wanted. Basically AWS has:

  • a proprietary coding agent fork on top of VS Code
  • an opinionated dev flow for writing software that encourages planning/thinking prior to execution
  • a sign-in experience via Google/OAuth - a strangely nice deviation from the usual IAM experience

Automating Large-Scale Refactors with Parallel Agents#

  • Speaker: Robert Brennan / AllHands
  • Location: AWS JFK27 B1.300

Caught a few minutes of this. AllHands can map the topology of the code, find small refactorable bits, and create reasonably sized PRs for review - i.e. it breaks the problem into bite-sized, addressable pieces.

Running Multi-Agent Systems with AgentOS#

  • Speaker: Ashpreet Bedi / Agno AI
  • Location: AWS JFK27 B1.296

Okta Identity for AI Agents#

  • Speakers: Patrick Riley / Auth0, Carlos Galan / Auth0
  • Location: Datadog 46th floor Cafe

Memory in LLMs: Weights and Activations#

  • Speaker: Jack Morris / Cornell
  • Location: AWS JFK27 B1.100

Context Engineering for Intelligent AI Code Reviews#

  • Speaker: Erik Thorelli / Coderabbit
  • Location: AWS JFK27 B1.300

DSPy is (really) All You Need#

  • Speaker: Kevin Madura / AlixPartners
  • Location: Datadog 47th floor Puss in Boots/Cafe

Your MCP Server is Bad and You Should Feel Bad#

  • Speaker: Jeremiah Lowin / Prefect
  • Location: AWS JFK27 B1.296

Long Tasks and Experienced Open Source Dev Productivity#

  • Speaker: Joel Becker / METR
  • Location: AWS JFK27 B1.100

How Claude Code Works#

  • Speaker: Jared Zoneraich / PromptLayer
  • Location: Datadog 46th floor Cafe

20251121 Engineering#

The War on Slop#

SWYX introduces the theme of the engineering talks.

Don’t Build Agents, Build Skills Instead#

Barry Zhang / Anthropic, Mahesh Murag / Anthropic

Lays out the vision of Skills:

  • codegen is where Claude excels
  • skills provide specialist code tools that carry domain expertise
  • the Anthropic solution to context window saturation via progressive discovery
  • skills.md is small and lists capabilities rather than tools; scripts can be discovered later (see the sketch after this list)
  • the CLI is already really good at file systems and Unix tools (grep etc.); Skills build on that
  • they had some insight into where it’s going (e.g. versioning) but didn’t say much about ideas for distribution
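
A minimal sketch of progressive discovery under an assumed layout (`skills/<name>/skills.md`; the directory structure and helper names are hypothetical): only one line per skill enters the system prompt, and a skill’s full instructions are read from disk only when the agent selects it.

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # assumed layout: skills/<name>/skills.md

def skill_index() -> str:
    """Cheap index for the system prompt: one line per skill, taken
    from the first line of each skills.md; nothing else is loaded yet."""
    lines = []
    for md in sorted(SKILLS_DIR.glob("*/skills.md")):
        first_line = (md.read_text().splitlines() or [""])[0]
        lines.append(f"- {md.parent.name}: {first_line}")
    return "\n".join(lines)

def load_skill(name: str) -> str:
    """Read a skill's full instructions only after the model selects it,
    so the bulk of the text never saturates the context window."""
    return (SKILLS_DIR / name / "skills.md").read_text()
```

Scripts referenced by a skill get the same treatment: discovered and run on demand rather than described up front.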

No Vibes Allowed: Solving Hard Problems in Complex Codebases#

Dex Horthy / HumanLayer

This guy clicked through like 500 slides. Main messages:

  • there is a “smart” zone and a “dumb” zone within the 200k context
  • you want to stay in the “smart” zone, so you need to be careful about saturating the context window with prompts/tool descriptions etc.
  • you ALSO need to think about compaction of the context window and delegating to subagents (see the sketch after this list)
  • he is working on a solution
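
A minimal sketch of the compaction idea (the token heuristic and crude summarizer are my assumptions, not Dex’s solution): once the history nears a budget, fold the oldest turns into a summary so the live window stays in the “smart” zone.

```python
def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def compact(history: list[str], budget: int = 150_000) -> list[str]:
    """Keep recent turns verbatim; fold the oldest half into a summary
    whenever the running total threatens to leave the 'smart' zone."""
    while sum(approx_tokens(t) for t in history) > budget and len(history) > 4:
        half = len(history) // 2
        old, history = history[:half], history[half:]
        # In a real harness this would be a cheap model call or a
        # subagent that condenses the old turns; here a crude stand-in:
        summary = "SUMMARY OF EARLIER TURNS: " + " | ".join(t[:80] for t in old)
        history = [summary] + history
    return history
```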

Building a fast frontier model with RL#

Lee Robinson / Cursor

Holy crap, do not sleep on Cursor.

They have insane-level detail on edits and really good RL infra:

  • they want their RL environments to be EXACTLY the same as the editing environment of their users so they can tune their tools to be responsive/performant/excellent on the searches of specific users
  • they have custom kernels that speed up training and make inference faster - they are getting several-fold speedups on inference
  • Cursor Blog
  • Shoutout to the Ray RL Controller for improving GPU access during training/inference
  • Composer becomes the power user for semantic search
  • Definitely look at the Cursor eng blog - these guys are really good.

From Code Snippets to Codebases: Then, Now, and What’s Next for Coding Evals#

Naman Jain / Cursor

Lots of tips on improving models: you need a good harness and a consistent environment.

Code World Model: Building World Models for Computation#

Jacob Kahn / Meta

Applied Compute Efficient Reinforcement Learning#

Rhythm Garg / Applied Compute, Linden Li

Lots of tips/tricks on making RL runs fast, which they framed as a fundamental requirement. Most notably, they described async RL tools to maximize GPU usage.
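
A minimal sketch of the async pattern (illustrative, not Applied Compute’s stack): rollout workers keep pushing episodes into a queue while the learner consumes fixed-size batches, so the training hardware never idles waiting on the slowest rollout.

```python
import queue
import threading

rollouts: queue.Queue = queue.Queue(maxsize=64)

def rollout_worker(worker_id: int) -> None:
    """Stands in for slow, variable-latency environment rollouts:
    each worker keeps pushing finished episodes into the queue."""
    for step in range(100):
        rollouts.put({"worker": worker_id, "step": step, "reward": 1.0})

def learner(batch_size: int = 8, updates: int = 25) -> None:
    """Consumes whatever episodes are ready, so the training GPU is
    never parked waiting for the single slowest rollout to finish."""
    for _ in range(updates):
        batch = [rollouts.get() for _ in range(batch_size)]
        # gradient update on `batch` would happen here

workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
learner()
for w in workers:
    w.join()
```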

RL Environments at Scale#

Will Brown / Prime Intellect

Very interesting. They are building a platform to define environments and run RL within them, aiming to expand the number of people doing AI research by standardizing environments and enabling reproducible runs.

Agent Reinforcement Fine Tuning#

Will Hang / OpenAI, Cathy Zhou / OpenAI

Agent RL to train agentic behavior to get really good at a task:

  • The environment in which you train is REALLY important and any drift between training env and actual env can lead to issues
  • A number of successes in training an agent to get better at code review or coding
  • Code and coding are the tasks that seem most likely to benefit from RL given their tight feedback cycles
  • RL rewards are hackable and need to be monitored very closely

From Vibe Coding To Vibe Engineering#

Kitze / Sizzy

This guy is hysterical. Go look up his videos.

Google Building in the Gemini Era with Google DeepMind#

Kat Kampf / Google, Ammaar Reshi

Great demo on the power of Google AI Studio:

  • they walked through a comic book app that created a comic book from a few pictures and a prompt
  • they also demoed a game
  • it looked like a really fun and lightweight way to use AI to make things

Making Codebases “Agent-Ready”#

Eno Reyes / Factory AI

Discussion of key aspects of making codebases ready for agent consumption:

  • structure / tests / documents

Amp Code: Next-Generation AI Coding#

Beyang Liu / Amp Code / Sourcegraph

  • Amp is developing an opinionated set of coding tools that are honed for classes of work
  • Take care to hone the skill/experience/execution of each of those specialized toolsets
  • Note: smart model in Amp is now Gemini 3 Pro
  • Some number of coding tools / co-coding tools

Using AI-Generated Kernels to Instantly Speed Up PyTorch#

Natalie Serrino / Gimlet Labs

Trying to solve cross-hardware kernel optimization for PyTorch:

  • set up the eval env, then patch and test/compare (see the sketch below)
  • lots of potential for the agent to cheat the benchmarks; a few nice wins with fused kernels
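
A minimal sketch of that patch-test-compare loop (illustrative; `candidate` stands in for an AI-generated kernel): the defense against benchmark cheating is a numerical-agreement gate against the reference before any timing counts.

```python
import time
import torch

def reference(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.relu(x @ w)            # unfused baseline

def candidate(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x @ w, min=0.0)  # stand-in for a generated kernel

def bench(fn, *args, iters: int = 20) -> float:
    for _ in range(3):                  # warmup runs
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x, w = torch.randn(512, 1024), torch.randn(1024, 1024)
# Correctness gate first: a fast-but-wrong kernel must never count as a win.
assert torch.allclose(reference(x, w), candidate(x, w), atol=1e-5)
print(f"baseline {bench(reference, x, w):.6f}s  candidate {bench(candidate, x, w):.6f}s")
```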

The Infinite Software Crisis#

Jake Nations / Netflix

You need to understand the code:

  • Use skill/taste to “decomplect” the software to avoid unmanageable tangled messes
  • Gave an example of deliberate use of AI to help understand a big, difficult codebase and begin to plan a refactor

AGI: The Path Forward#

Eiso Kant / Poolside

  • a proprietary and fast model - they did an Ada-to-Rust conversion and, assuming it wasn’t spoofed in some way, it was really fast
  • serving the military at the moment
  • private beta may open soon
  • weights on AWS Bedrock likely to open up as well

Continual System-Prompt Learning for Code Agents#

Aparna Dhinakaran / Arize

Very nice idea. Use evals on the prompt itself to hone the outputs of an agentic process. Potentially easy win across multiple categories, even those that don’t need the full RL treatment.

Hard-Won Lessons from Building Effective AI Coding Agents#

Nik Pash / Cline

Feisty talk that cut to the point and showed how to make effective coding agents:

  • Describes the simple recipe for success: benchmarks that are not cheatable
  • Introduces cline-bench as a way to get REAL WORLD data on problems
  • RL Environment Factory: Automate the creation of RL envs
  • We need to collect more high quality engineering tasks
  • Agents working on real-world data is how they get better… and all the providers are scooping up this data…

Benchmarks vs economics: the AI capability measurement gap#

Joel Becker / METR

Big gap between self-reported efficacy increases and actual increases.

Defying Gravity#

Kevin Hou / Google DeepMind

Product intro. Tight integration of Gemini and Nano Banana.


20251120 Leadership#

Evolving Claude APIs for Agents#

Katelyn Lesse / Anthropic

Overview of skills. Main idea is context window engineering to maintain focus of the core agent.

Autonomy Is All You Need#

Michele Catasta / Replit

Replit is interested in creating coding environments with complete end-to-end app creation for non-coders. They really want to get to a low error rate but still see lots of “painted doors” in their apps: elements that show up in the UI but do not do anything.

They are investing in long running agents that scan codebases and autocorrect all errors. The builder agent is being trained to build apps that are amenable to testing.

Significant effort is going into browser automated parallel testing.

Your Support Team Should Ship Code#

Lisa Orr / Zapier

Building tools that let the support team auto-create tickets, identify issues, and author patches to the upstream code that is causing the issue.

Vibe Coding#

Steve Yegge / Sourcegraph/Amp, Gene Kim / IT Revolution

Great talk. Steve is a forceful advocate of adapting to change and allowing our tools to accelerate us - more agents, more speed. He is working to shape the Amp experience and is also the creator of beads, which we should check out/use.

Gene talked about the organizational and industry wide changes we are in the midst of. He talked about how he lived through DevOps and agile and that this is just the tip of the iceberg. Lots of vignettes from individuals and teams who are using Vibe Coding/Vibe Engineering.

Future-Proof Coding Agents: Building Reliable Systems That Outlast Models#

Bill Chen / OpenAI, Brian Fioca / OpenAI

Talked about the combination of utilities you provide to a model to let it take action - the “Harness”.

Talked about how important it is to match a model with the Harness it was trained on.

Codex as an agent that delegates to other agents.

Subagent patterns:

  • Codex SDK
  • SDK + MCP
  • ACP

Moving away from Agile: What’s Next?#

Martin Harrysson / McKinsey, Natasha Maniar / McKinsey

Brainstormed the post-Agile world - e.g. if you don’t speed up the whole cycle you get pileups of unaddressable work, such as a growing number of open PRs.

Organizations are less productive than individuals.

“Agent Factories”

What do AI Native workflows and AI native roles look like?

AI Agents to automate all aspects of each process.

AI introduction into an org does not automatically lead to use. You need continuous improvement, training, and some successes.

Change management:

  • Centralized communication
  • Peer Slack channels for best practice sharing
  • Role-based training
  • Specific use cases
  • Rewards
  • Coaching / library

How to Quantify AI ROI in Software Engineering (Stanford Study)#

Yegor Denisov-Blanch / Stanford

Measuring the impact of AI on coding

Clean coding environments lead to better outcomes:

  • tests / types / docs / modularity

Tool access doesn’t lead to tool usage.

Engineering outcomes are easiest to measure.

Cursor gives per-line level data.

The State of AI Code Quality: Hype vs. Reality#

Itamar Friedman / Qodo

Rules can get ignored.

Improve code via:

  • code gen
  • agentic code gen
  • agentic QC
  • learning feedback that provides QC

Lots of quality concerns around code produced by models.

Iceberg metaphor:

  • Code Gen is only part of the story
  • Code integrity is the long tail: maintenance, review, standards
  • There are Code-Level Problems (insecure etc.)
  • There are Process-level problems: learning / verification / guardrails / standards
  • Bad code leads to large long-term costs for the team

Add AI testing to increase trust.

Make sure there is code review:

  • code review isn’t ONLY for code quality, it’s also for learning the code together

Context engine to help know/understand the parts of the code that are relevant:

  • context engine in the code review process

Software Dev Gen 3.5:

  • Triangle: Spec / Implementation / Tests
  • Agents for each part that also talk to one another

MiniMax M2#

Olive Song / MiniMax

The model excels in many modalities and at tool use.

Interleaved thinking improves its performance.

Proactive Agents#

Kath Korevec / Google Labs

Small Bets, Big Impact: Building GenBI at a Fortune 100#

Asaf Bord / Northwestern Mutual

GenBI tool. One simple win was to automate the retrieval of reports.

You need:

  • LLM-ready data fabric
  • Invisible co-pilots inside Slack / CRM / BI
  • Secure models

What We Learned Deploying AI within Bloomberg’s Engineering Organization#

Lei Zhang / Bloomberg

9000+ engineers!!

Ideas:

  • Uplift agents - take care of routine patching
  • Incident response agents
  • All internal data accessible via MCPs so AI agents can be written against it

Bloomberg Principles:

  • “Freedom and Responsibility” with guardrails
  • “Golden Paths” with Platform Enablement
  • Democratize infra via Inner Source
  • Service abstraction and strong contracts

The paved path:

  • Models via a gateway
  • MCP directory via a hub
  • Tool creation/deployment via PaaS
  • Demos are easy; production demands QC

From Arc to Dia: Lessons learned in building AI Browser#

Samir Mody / The Browser Company

Developer Experience in the Age of AI Coding Agents#

Max Kanat-Alexander / Capital One

Future-proofing tools even as AI changes things.

Standardize development environments:

  • frameworks - use them and use them the same way everyone else does
  • and do it the same way across the company

CLIs and APIs:

  • design CLI/API-first so agents can talk to these things

Validation Everywhere:

  • deterministic control
  • clear, actionable error messages (illustrated below)
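
A minimal illustration of that last point (the config fields are made up): an agent can only self-correct if the validator names the field, the problem, and the fix.

```python
def validate_config(cfg: dict) -> list[str]:
    """Deterministic validation whose messages an agent can act on:
    each error names the field, the problem, and the fix."""
    errors = []
    if "timeout_s" not in cfg:
        errors.append("missing 'timeout_s': add an integer number of seconds, e.g. 30")
    elif not isinstance(cfg["timeout_s"], int) or cfg["timeout_s"] <= 0:
        errors.append(f"'timeout_s' is {cfg['timeout_s']!r}: must be a positive integer")
    if cfg.get("retries", 0) > 10:
        errors.append(f"'retries' is {cfg['retries']}: maximum allowed is 10")
    return errors

# "error: bad config" leaves an agent guessing; these messages let it
# patch the config in one shot.
print(validate_config({"timeout_s": "30", "retries": 99}))
```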

Code Simplicity:

  • agents that can reason about code will produce better results

Written words:

  • everything needs to be written down

Code Review Velocity:

  • assign individual reviewers, not a Slack channel or a group
  • make it clear whose turn it is to take action

Vicious vs virtuous cycle

AI Consulting in Practice#

NLW / Super.ai

Paying Engineers like Salespeople: How Tenex Rebuilt the Incentive Stack for Modern Engineering#

Arman Hezarkhani / Tenex

Pay engineers for story points. Incentivize them with direct cash.

Some on the team will earn $1M.

Successes across a number of clients/industries.

Leadership in AI-Assisted Engineering#

Justin Reock / DX

How to build an AI-native company (even if your company is 50 years old)#

Dan Shipper / Every

No playbook for an AI company.

2-person teams.

A 15-person company with 6 apps and good ARR.

Big difference if everyone is “all-in” on AI vs only 90% - a 10x difference.

From a written culture to a demo culture!

Compounding engineering:

  • learning and sharing prompts as an organization

Steps:

  1. Plan
  2. Delegate
  3. Assess
  4. Codify ← this is the compounding

A single engineer should be able to build and maintain a complex, production product.


20251119 Opening Night#

Networking Notes:

  • Chatted with SWYX: He is planning on extending the Latent Space podcast with a science angle. Touch base with him next week re: guests
  • Met Valentin Bercovici, Chief AI Officer at Weka. They released some new tools. Lots of life science customers. Wants to demo.
  • Met the Jellyfish team: They are a startup doing observability for AI tools. Target engineering managers to view how well agentic tooling is delivering.
  • Met guys from PropertyPilot: Chatted a bit about agentic tooling in their business. Their agentic experience allows the tools to compose bits of SQL but not at the SQL level - a bit higher up in template space.
  • Met the Arize team: Fully loaded agent platform. Their goal is selling internal capabilities to enterprise customers. Lots of similarity to a node-graph platform, with a UI similar to n8n, though quite a bit of customizability at the node level.
  • Met the Modal team: They have been seeing an uptick in business that ranges from occasional users to larger companies doing big inference. Have seen 2 new categories of use:
    • LLM Inference - a bunch more people hosting their own models
    • Code Execution in sandboxes
    • I told him that I loved their user experience story but that Modal is not a fit for us because we have pre-existing GPU commits. I told him he could chat with more of the technical team at NeurIPS.
  • Met Corey from Rexmore: An AI-native holding company that buys and builds businesses within verticals in Technology, Investments and Education. They buy companies and then run them for less using AI.
  • Met Jesus from Studio 3 Marketing: They are using AI agents for HTML conversion, new client onboarding and other major internal tech projects.
  • Met Gene Kim and Steve Yegge: Got a copy of their book Vibe Coding. Got a taste of Steve’s work at Sourcegraph on Amp. Not sure if there’s an experiment there but he definitely has a much higher-level view of how agents will change and accelerate programming.

Steve Yegge blogs:

