Breaking Apps with AI: My BreakingAppsHackathon Journey with Passmark

When I started writing end-to-end tests for a demo e-commerce app, I thought it would be straightforward. Spoiler: it wasn't. But the lessons I picked up along the way completely changed how I think about AI-assisted testing. Here's everything I learned — mistakes included.

What is Passmark?

Passmark is an open-source AI testing library built on top of Playwright. Instead of writing brittle selectors and hardcoded waits, you describe what a user does in plain English, and the AI figures out how to execute it.

import { test, expect } from "@playwright/test";
import { runSteps, configure } from "passmark";

configure({
  ai: {
    gateway: "openrouter"
  }
});

// runSteps needs Playwright's page fixture, so it is called inside a test()
test("Login flow", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Login flow",
    steps: [
      { description: "Navigate to https://www.saucedemo.com/" },
      { description: "Enter username", data: { value: "standard_user" } },
      { description: "Enter password", data: { value: "secret_sauce" } },
      { description: "Click on Login" },
    ],
    assertions: [
      { assertion: "Products page is visible" },
    ],
    test,
    expect,
  });
});

Clean. Readable. No selectors. Sounds perfect, right? Well, almost.

Lesson 1: Not All Elements Are Equal in the AI's Eyes

Passmark uses ARIA accessibility snapshots by default to find elements on the page. This works brilliantly for buttons, inputs, and links that have proper labels. But it completely falls apart for unlabelled elements.

On saucedemo.com, the shopping cart icon looks like this in the ARIA snapshot:

- generic [ref=e184]: "1"

Just a generic element with a badge count. No accessible name, and no meaningful role. The AI has nothing to grab onto, so however you describe the cart in plain English, it cannot find it.

The fix: Drop down to raw Playwright for these elements and use them alongside runSteps:

// Passmark handles what it can
await runSteps({ ... });

// Raw Playwright handles what Passmark can't
await page.locator(".shopping_cart_link").click();

This is not a workaround — it is the correct pattern. Think of it as a relay race: Passmark runs its leg, then hands the baton to raw Playwright.
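One way to spot such elements before writing a step is to scan the snapshot text for entries that have a role but no quoted accessible name. The sketch below assumes every snapshot line follows the `- role "name" [ref=...]` shape shown above, which may not hold for all snapshot output. `findUnnamedElements` is a hypothetical helper, not part of Passmark or Playwright.

```javascript
// Hypothetical helper: list snapshot entries that carry no accessible name.
// Assumes each line looks like `- role "name" [ref=eN]: text`, with the
// quoted name simply absent for unlabelled elements.
function findUnnamedElements(snapshotText) {
  const entry = /^\s*-\s+(\w+)(?:\s+"([^"]*)")?\s*(?:\[ref=(\w+)\])?/;
  const unnamed = [];
  for (const line of snapshotText.split("\n")) {
    const match = entry.exec(line);
    if (!match) continue; // not an element line
    const [, role, name, ref] = match;
    if (name === undefined || name === "") {
      unnamed.push({ role, ref: ref ?? null });
    }
  }
  return unnamed;
}

// Sample input: the cart line is from this article, the other two are made up.
const snapshot = [
  '- button "Login" [ref=e12]',
  '- generic [ref=e184]: "1"',
  '- link "Sauce Labs Backpack" [ref=e40]',
].join("\n");

console.log(findUnnamedElements(snapshot));
```

On a real page this will also flag plain layout containers, so the output is a list of candidates to check by hand rather than a list of bugs. Anything clickable that shows up here is a likely candidate for a raw Playwright selector.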

Lesson 2: CUA Mode Exists, But Has a Catch

Passmark offers a cua (computer-use agent) mode that uses OpenAI's vision to literally see the screen instead of reading the ARIA tree. This sounds like the perfect solution for unlabelled elements.

configure({
  ai: {
    mode: "cua",
    gateway: "none", // CUA requires direct OpenAI access
  }
});

There are two important caveats:

1. It requires OPENAI_API_KEY to be set directly; it does not work with OpenRouter.
2. Even with vision, a tiny cart icon with no surrounding context is surprisingly hard for the model to locate reliably.

For saucedemo specifically, raw Playwright with .shopping_cart_link remains the most reliable approach.

Lesson 3: `waitUntil` Is Your Best Friend

One of the most powerful features in Passmark is waitUntil. It tells the AI to wait for a condition to be true before moving to the next step. Without it, steps can execute too quickly and fail silently.

{ description: "Click on Back to products", waitUntil: "Products is visible" },

This single addition fixed a whole class of flaky failures where the AI was trying to proceed before the page had finished transitioning.
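Conceptually, waiting for a condition is a poll-until-true loop with a deadline. The sketch below illustrates that idea only. It is not Passmark's implementation (there the condition is a plain-English string evaluated by the AI), and `pollUntil` with its `timeoutMs` and `intervalMs` options is a hypothetical name chosen for this example.

```javascript
// Conceptual sketch only: re-check a condition until it holds or time runs out.
async function pollUntil(condition, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (await condition()) return true;        // condition met, safe to move on
    if (Date.now() >= deadline) return false;  // give up once the deadline passes
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage with a stand-in condition that becomes true on the third check.
let checks = 0;
pollUntil(() => ++checks >= 3, { timeoutMs: 1000, intervalMs: 10 })
  .then((ok) => console.log(ok ? "condition met" : "timed out"));
```

The point of the pattern is that the next step starts only after the page has reached the expected state, instead of after a fixed sleep.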

Lesson 4: Never Nest Tests

This is a Playwright fundamental, but worth repeating. At one point, I accidentally nested one test() block inside another:

// ❌ Wrong — Playwright does not allow this
test("Add to cart", async ({ page }) => {
  // ...
  test("Checkout", async ({ page }) => {  // nested!
    // ...
  });
});

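Playwright will throw as soon as it reaches the nested call. Each test() must be declared at the top level of the file or directly inside a test.describe() block, never inside another test's body. If you need multiple steps that share browser state, keep them in one test and split them across multiple runSteps blocks.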

Lesson 5: Split `runSteps` Blocks by Page

This was the biggest architectural lesson. Passmark evaluates assertions at the point where steps end. If your steps navigate through multiple pages and you put all assertions at the end, some will run on the wrong page entirely.

Wrong approach — all assertions at the end:

await runSteps({
  steps: [
    // ... login, add to cart, checkout, logout (all in one block)
  ],
  assertions: [
    { assertion: "Cart badge shows 1" },          // products page
    { assertion: "Thank you message is visible" }, // confirmation page
    { assertion: "Login page is visible" },        // login page — but we're not there yet!
  ],
});

Correct approach — one block per page:

// Block ends on products page → assert cart state here
await runSteps({
  steps: [ /* login + add to cart */ ],
  assertions: [
    { assertion: "Cart badge shows 1 item" },
    { assertion: "Remove button is visible for Sauce Labs Backpack" },
  ],
});

// Block ends on checkout overview → assert order details here
await runSteps({
  steps: [ /* checkout form */ ],
  assertions: [
    { assertion: "Final total is exactly $32.39" },
    { assertion: "Payment shows SauceCard #31337" },
  ],
});

// Block ends on login page → assert logout here
await runSteps({
  steps: [ /* logout */ ],
  assertions: [
    { assertion: "User is redirected to login page" },
  ],
});

Lesson 6: Write Deep, Meaningful Assertions

One of the biggest advantages of Passmark's plain English assertions is that you are not limited to "element is visible." You can express real business logic:

assertions: [
  // Pricing accuracy
  { assertion: "Item total is exactly $29.99, not any other amount" },
  { assertion: "Tax is exactly $2.40" },
  { assertion: "Final total is exactly $32.39 including tax" },

  // State verification
  { assertion: "Shopping cart badge shows exactly 1 item, not 0 or 2" },
  { assertion: "Sauce Labs Backpack shows Remove button, Add to cart button is gone" },

  // Post-action verification
  { assertion: "Cart badge is gone after order is placed, cart is empty" },
  { assertion: "User is on login page, not products page, after logout" },
]

These catch real bugs: wrong prices, miscalculated tax, items not being removed from cart, sessions not being cleared on logout.
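The pricing figures in these assertions can be derived rather than typed by hand, which helps when a test adds more than one item. The sketch below assumes an 8% tax rate. That rate is inferred from the $29.99 / $2.40 / $32.39 figures in this article, not taken from any documentation, and `buildPricingAssertions` is a hypothetical helper.

```javascript
// Work in integer cents to avoid floating-point drift.
const TAX_RATE_PERCENT = 8; // assumption inferred from the figures in this article

function formatCents(cents) {
  const dollars = Math.floor(cents / 100);
  const remainder = String(cents % 100).padStart(2, "0");
  return `$${dollars}.${remainder}`;
}

// Hypothetical helper: turn a list of item prices into pricing assertions.
function buildPricingAssertions(itemPricesInCents) {
  const itemTotal = itemPricesInCents.reduce((sum, price) => sum + price, 0);
  const tax = Math.round((itemTotal * TAX_RATE_PERCENT) / 100);
  const total = itemTotal + tax;
  return [
    { assertion: `Item total is exactly ${formatCents(itemTotal)}` },
    { assertion: `Tax is exactly ${formatCents(tax)}` },
    { assertion: `Final total is exactly ${formatCents(total)}` },
  ];
}

// One Sauce Labs Backpack at $29.99
console.log(buildPricingAssertions([2999]));
```

For the single backpack, 2999 cents at 8% gives 239.92 cents of tax, which rounds to $2.40, and a final total of $32.39, matching the values used throughout this article.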

The Final Test Structure

After all these lessons, here is the clean, production-grade pattern I arrived at:

test("Add to cart and checkout", async ({ page }) => {
  test.setTimeout(300000);

  // Block 1 — Login + Add to cart (ends on products page)
  await runSteps({
    page,
    userFlow: "Login and add product to cart",
    steps: [
      { description: "Navigate to https://www.saucedemo.com/" },
      { description: "Enter username", data: { value: "standard_user" } },
      { description: "Enter password", data: { value: "secret_sauce" } },
      { description: "Click on Login" },
      { description: "Click Sauce Labs Backpack" },
      { description: "Click on Add to cart" },
      { description: "Click on Back to products", waitUntil: "Products is visible" },
    ],
    assertions: [
      { assertion: "Shopping cart badge shows exactly 1 item" },
      { assertion: "Sauce Labs Backpack shows Remove button" },
      { assertion: "Price of Sauce Labs Backpack is exactly $29.99" },
    ],
    test,
    expect,
  });

  // Raw Playwright — cart icon has no ARIA label
  await expect(page.locator(".shopping_cart_badge")).toHaveText("1");
  await page.locator(".shopping_cart_link").click();
  await expect(page.locator(".cart_item")).toHaveCount(1);
  await expect(page.locator(".cart_item_label")).toContainText("Sauce Labs Backpack");

  // Block 2 — Checkout form (ends on overview page)
  await runSteps({
    page,
    userFlow: "Fill checkout form",
    steps: [
      { description: "Click on Checkout" },
      { description: "Enter First Name", data: { value: "John" } },
      { description: "Enter Last Name", data: { value: "Doe" } },
      { description: "Enter Postal Code", data: { value: "482003" } },
      { description: "Click on Continue" },
    ],
    assertions: [
      { assertion: "Item total is exactly $29.99" },
      { assertion: "Tax is exactly $2.40" },
      { assertion: "Final total is exactly $32.39" },
    ],
    test,
    expect,
  });

  // Block 3 — Finish order (ends on confirmation page)
  await runSteps({
    page,
    userFlow: "Complete order",
    steps: [
      { description: "Click on Finish" },
    ],
    assertions: [
      { assertion: "Thank you for your order message is visible" },
      { assertion: "Cart badge is gone, cart is empty" },
    ],
    test,
    expect,
  });

  // Block 4 — Logout (ends on login page)
  await runSteps({
    page,
    userFlow: "Logout",
    steps: [
      { description: "Click on Back Home" },
      { description: "Click on Open Menu" },
      { description: "Click on Logout" },
    ],
    assertions: [
      { assertion: "User is redirected to login page" },
      { assertion: "Login button is visible" },
    ],
    test,
    expect,
  });
});

Bonus: Speed Up Tests with Redis Caching

One thing I noticed early on — every test run was slow because Passmark was calling the AI on every single step, every single time. This is where Redis caching changes everything.

Passmark caches successful step executions to Redis. On the first run, the AI figures out how to execute each step. On every run after that, it replays the cached Playwright actions at native speed, with zero LLM calls.

Setting it up is just one line in your .env:

REDIS_URL="redis://localhost:6379"

Your full .env should look like this:

# AI Keys — Passmark uses both Anthropic + Google for multi-model consensus
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=AIza...

# Redis — enables step caching (highly recommended)
REDIS_URL="redis://localhost:6379"

# OpenRouter — if routing through OpenRouter instead of direct keys
OPENROUTER_API_KEY=sk-or-v1-...

Without Redis, you will see this warning on every run:

WARN: Redis not configured. Step caching is disabled — all steps will use AI execution.

This means every run goes through the AI, which is slow and burns API tokens. With Redis enabled, only the first run is slow. Every subsequent run is as fast as regular Playwright.

What about UI changes? If a cached action fails, Passmark detects the failure, automatically re-runs the AI to discover the new selector, and updates the cache. You get speed AND resilience.
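The replay-then-fall-back behaviour described above can be pictured roughly as follows. This is a conceptual sketch with an in-memory Map standing in for Redis. The names `executeStep`, `resolveWithAI`, and `replay` are hypothetical, and Passmark's internals may look quite different.

```javascript
// Conceptual sketch: an in-memory Map stands in for Redis.
const cache = new Map();

// resolveWithAI(description) -> action  (slow, costs tokens)
// replay(action)                        (fast, throws if the action no longer works)
async function executeStep(description, { resolveWithAI, replay }) {
  const cached = cache.get(description);
  if (cached !== undefined) {
    try {
      await replay(cached); // fast path: no LLM call
      return "cache";
    } catch {
      cache.delete(description); // cached action is stale, e.g. the UI changed
    }
  }
  const action = await resolveWithAI(description); // slow path
  await replay(action);
  cache.set(description, action); // remember what worked for the next run
  return "ai";
}
```

In this model the first call for a step returns "ai", later calls return "cache", and a call made after the UI changes falls back to "ai" once and then refreshes the cached entry.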

Install Redis locally:

# Mac
brew install redis && brew services start redis

# Ubuntu/WSL
sudo apt install redis-server && sudo service redis start

For CI, add a Redis service to your pipeline and set REDIS_URL as an environment variable.

Going Further: Building a Full Test Suite

Once the core flow was working, I built out a complete test suite covering every major feature of the app. Here is what 27 tests across 5 files looks like.

Auth Suite — `auth.spec.js`

The most important edge cases to cover are the ones users actually hit. Beyond the happy path, saucedemo has several user types worth testing:

test("Login with locked out user", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Locked out user login",
    steps: [
      { description: "Navigate to https://www.saucedemo.com/" },
      { description: "Enter username", data: { value: "locked_out_user" } },
      { description: "Enter password", data: { value: "secret_sauce" } },
      { description: "Click on Login" },
    ],
    assertions: [
      { assertion: "Error message says user has been locked out" },
      { assertion: "User is not redirected to products page" },
    ],
    test,
    expect,
  });
});

The full auth suite covers: valid login, invalid password, invalid username, empty credentials, locked out user, and logout. Six tests, every auth scenario covered.
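Since these tests differ mainly in the username and the expected outcome, the shared login steps can be generated from a small table of cases. A possible sketch follows, where `loginSteps` and `authCases` are hypothetical names and only the two cases whose assertions appear in this article are listed. The `test()` loop is left as a comment so the snippet runs on its own without Playwright or Passmark installed.

```javascript
// Hypothetical helper: build the login steps shared by the auth tests.
function loginSteps(username, password) {
  return [
    { description: "Navigate to https://www.saucedemo.com/" },
    { description: "Enter username", data: { value: username } },
    { description: "Enter password", data: { value: password } },
    { description: "Click on Login" },
  ];
}

// One entry per scenario; assertions are copied from the examples in this article.
const authCases = [
  {
    name: "Login with standard user",
    username: "standard_user",
    assertions: [{ assertion: "Products page is visible" }],
  },
  {
    name: "Login with locked out user",
    username: "locked_out_user",
    assertions: [
      { assertion: "Error message says user has been locked out" },
      { assertion: "User is not redirected to products page" },
    ],
  },
];

// In a spec file, each case becomes its own top-level test():
// for (const c of authCases) {
//   test(c.name, async ({ page }) => {
//     await runSteps({
//       page,
//       userFlow: c.name,
//       steps: loginSteps(c.username, "secret_sauce"),
//       assertions: c.assertions,
//       test,
//       expect,
//     });
//   });
// }

console.log(loginSteps("locked_out_user", "secret_sauce"));
```

The same helper can also replace the "// ... login steps" placeholders in the other suites by spreading its result at the start of a steps array.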

Products Suite — `products.spec.js`

Sorting bugs are sneaky — they rarely get caught manually. These tests verify that every sort option actually works:

test("Sort products by price low to high", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Sort by price low to high",
    steps: [
      // ... login steps
      { description: "Select Price (low to high) from the sort dropdown" },
    ],
    assertions: [
      { assertion: "Products are sorted by price from lowest to highest" },
      { assertion: "Sauce Labs Onesie at $7.99 appears before Sauce Labs Backpack at $29.99" },
      { assertion: "Sort dropdown shows Price (low to high) as selected" },
    ],
    test,
    expect,
  });
});

Five tests covering: full product listing, all four sort options, and product detail page accuracy.

Cart Suite — `cart.spec.js`

Cart logic has a lot of state to verify — badge counts, button states, item persistence:

test("Add multiple items to cart", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Add multiple items",
    steps: [
      // ... login steps
      { description: "Click Add to cart for Sauce Labs Backpack" },
      { description: "Click Add to cart for Sauce Labs Bike Light" },
      { description: "Click Add to cart for Sauce Labs Bolt T-Shirt" },
    ],
    assertions: [
      { assertion: "Cart badge shows exactly 3" },
      { assertion: "All three products show Remove button" },
    ],
    test,
    expect,
  });

  await page.locator(".shopping_cart_link").click(); // raw Playwright

  await runSteps({
    page,
    userFlow: "Verify cart",
    steps: [],
    assertions: [
      { assertion: "Cart contains exactly 3 items" },
      { assertion: "Combined items are all correctly listed with prices" },
    ],
    test,
    expect,
  });
});

Four tests: single item, multiple items, remove item, and empty cart verification.

Checkout Suite — `checkout.spec.js`

Validation errors are where most QA misses happen. These tests make sure every required field actually blocks submission:

test("Checkout with missing first name", async ({ page }) => {
  // ... add item to cart
  await page.locator(".shopping_cart_link").click();

  await runSteps({
    page,
    userFlow: "Submit checkout without first name",
    steps: [
      { description: "Click on Checkout" },
      { description: "Enter Last Name", data: { value: "Doe" } },
      { description: "Enter Postal Code", data: { value: "482003" } },
      { description: "Click on Continue" },
    ],
    assertions: [
      { assertion: "Error message says First Name is required" },
      { assertion: "User is still on checkout information page" },
    ],
    test,
    expect,
  });
});

Five tests: missing first name, missing last name, missing postal code, multi-item checkout with correct totals, and cancel flow.

Visual/UI Suite — `visual.spec.js`

Saucedemo has a special visual_user account that has intentional UI bugs baked in — wrong product images, misaligned buttons, broken layouts. This suite is specifically designed to catch those bugs that a standard user would never see.

test("Visual user - product images are correct", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Check product images for visual_user",
    steps: [
      { description: "Navigate to https://www.saucedemo.com/" },
      { description: "Enter username", data: { value: "visual_user" } },
      { description: "Enter password", data: { value: "secret_sauce" } },
      { description: "Click on Login" },
    ],
    assertions: [
      { assertion: "All 6 product images are visible and not broken" },
      { assertion: "Each product has a different image, not all showing the same image" },
      { assertion: "Sauce Labs Backpack image shows a backpack" },
    ],
    test,
    expect,
  });
});

This is where Passmark's AI vision really shines — catching visual bugs that a CSS selector or DOM check would completely miss. A traditional assertion like expect(img).toBeVisible() passes even if every product is showing the wrong image. Passmark actually looks at what is on screen.

Seven tests covering: product images, name and price accuracy, button alignment, product detail page, cart badge display, checkout form layout, and sorting UI.

Complete Suite Overview

File	Tests	What It Covers
`auth.spec.js`	6	Valid/invalid login, locked user, logout
`products.spec.js`	5	Listing, all sort options, product detail
`cart.spec.js`	4	Add/remove single and multiple items
`checkout.spec.js`	5	Form validation, multi-item totals, cancel
`visual.spec.js`	7	UI bugs, wrong images, misaligned elements
Total	27	Full app coverage

Run the entire suite with:

npx playwright test

What Happens When Tests Fail — And How Passmark Explains It

Not every test passes on the first run. In fact, some of the most valuable moments in this journey came from reading Passmark's failure messages carefully.

Here is a real example. I wrote this assertion:

{ assertion: "All products show a price" }

Instead of a bare pass/fail result, Passmark responded with an explanation of what it had verified:

"Both the screenshot and the accessibility snapshot confirm that every product listed (Sauce Labs Backpack, Bike Light, Bolt T-Shirt, Fleece Jacket, Onesie, and Test.allTheThings() T-Shirt) has a corresponding price displayed next to the Add to cart button."

This is one of the biggest differences between Passmark and traditional Playwright assertions. A raw Playwright failure looks like this:

Error: expect(received).toBe(expected)
Expected: true
Received: false

Passmark's failure looks like this:

Error: The assertion failed because the current page is the Login screen,
not the Checkout Overview page. There are no products or checkout
details visible in the screenshot or accessibility snapshot.

You immediately know what failed, why it failed, and where you are in the flow. No guessing, no digging through screenshots manually.

How to Use Failures Productively

1. Read the full failure message before changing anything. Passmark often tells you exactly which page you are on and what it sees. This is how I discovered that my assertions were running on the wrong page — the error said "current page is Login screen, not Checkout Overview."

2. Check the page snapshot in the error output. Every failure includes an ARIA snapshot of the current page state. This tells you exactly what elements are visible and what their labels are — which is also how I discovered the cart icon was just generic [ref=e184]: "1" with no accessible name.

3. Use failures to write better assertions. When Passmark says "every product has a price displayed next to the Add to cart button", that is the AI describing what it actually sees. You can use that language directly in your next assertion — it will match perfectly because it is describing the real UI.

4. Split your runSteps blocks when assertions fail on the wrong page. If the failure message mentions a different page than expected, that is a sign your steps span too many page transitions. Break them into smaller blocks, one per page.

Failures Are Data

Traditional test failures tell you something broke. Passmark failures tell you what the AI saw, what it expected, and where the mismatch was. Treat every failure as a detailed bug report, not just a red check mark.

Key Takeaways

Situation	Solution
Element has no ARIA label	Use raw Playwright selector
Steps moving too fast	Add `waitUntil` to the step
Assertions running on wrong page	Split into multiple `runSteps` blocks
Test timing out	Increase `test.setTimeout()` or set globally in `playwright.config.js`
Nested `test()` blocks	Keep all steps in one `test()`, use multiple `runSteps`
Shallow assertions	Write business-level assertions in plain English

The Real Journey — What Actually Happened

No article about testing is complete without the honest version. Here is exactly how this session went, mistake by mistake.

It started with a cart icon. I wrote { description: "Open shopping cart" } and the test timed out. I tried every variation — "Click the shopping cart", "Click the basket icon", "Click the cart in the top right." Nothing worked. I switched to CUA mode thinking vision would solve it. Still nothing. It took reading the raw ARIA snapshot to finally understand why — the element had no label at all, just generic [ref=e184]: "1". The AI had nothing to work with. Raw Playwright with .shopping_cart_link fixed it in one line.

Then I accidentally nested a test inside a test. Classic mistake. Playwright threw immediately. The fix was obvious in hindsight — one test() block, multiple runSteps blocks inside it.

Then assertions started failing on the wrong page. I had put all my assertions at the end of a block that spanned login, add to cart, checkout, and logout. By the time Passmark ran the checkout assertions, the test had already logged out and was back on the login page. The fix was splitting into four separate runSteps blocks, one per page.

Then there was the typo. After all that debugging, the cart click was failing because I had written .shopping-cart_link with a hyphen instead of .shopping_cart_link with an underscore. One character. Thirty minutes of confusion.

Then I discovered waitUntil. Adding waitUntil: "Products is visible" to the Back to products step fixed an entire class of timing failures I had been fighting. The AI was moving too fast between steps.

And finally — the timeout. 60 seconds is not enough for 7+ AI-executed steps going through OpenRouter. Bumping to 300 seconds and splitting long flows into smaller runSteps blocks solved it permanently.

Every one of these failures taught something concrete. That is the real value of building tests iteratively — the errors are the curriculum.

Final Thoughts

Passmark does not replace Playwright — it sits on top of it. The real skill is knowing when to use natural language steps and when to drop down to raw Playwright. Unlabelled elements, icon buttons, and anything without a proper ARIA role will always need raw selectors. Everything else? Let the AI handle it.

The result is a test suite that reads like a user story, catches real business bugs, and does not break every time a CSS class changes.

Happy testing. 🎯

This article is my submission for the #BreakingAppsHackathon. Built and tested using Passmark + Playwright on saucedemo.com.

All test suites from this article are available on GitHub: github.com/buildwithrenuka/passmark
