Technical

Flaky tests and AI: automatic classification in a second

In the average Cypress suite, 8–20% of tests are flaky. When your build fails, you face the same question every time: is it a real bug, or just another flake? Manual triage takes 10–15 minutes per failure. AI does it in a second.

This article is a deep dive into building an automatic classifier — from defining flakiness through training data to CI integration.

What a flaky test is exactly

A test is flaky when it behaves non-deterministically given the same input. Sources of flakiness:

  1. Race conditions — DOM renders asynchronously, test ran faster than the UI.
  2. Network flakiness — a 3rd-party API timed out, the test failed.
  3. Shared state — the previous test left the DB in a different state.
  4. Environmental — CI runner was overloaded, the animation took longer.
  5. Non-deterministic data — the test depends on the current date or a random ID.
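
Source 5 is the easiest to reproduce in isolation. A minimal, framework-agnostic illustration in Python (the function and test names are invented for this example): a test that samples random IDs collides on unlucky runs, so it passes most of the time and fails occasionally with no code change.

import random

def new_order_id() -> str:
    # Application code: generates a short pseudo-random order ID
    return f"ORD-{random.randint(1000, 9999)}"

def test_order_ids_are_unique():
    # Flaky: with only 9,000 possible IDs, 100 draws collide on a sizeable
    # fraction of runs, so this assertion passes or fails non-deterministically.
    ids = {new_order_id() for _ in range(100)}
    assert len(ids) == 100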

Classifier: signals the AI uses

When a test fails, you have these data signals available:

  • Error message — "Timed out retrying after 4000ms" vs. "expected 'ACTIVE' to equal 'PENDING'". The former is probably a flake, the latter probably a real bug.
  • Stack trace — a failure in cy.wait() vs. a failure in cy.contains().
  • Historical stability — has this test failed 12× in the last 30 builds? Flake.
  • Does the commit touch the area under test? — if the diff changes checkout.tsx and the checkout test failed, probably a real bug.
  • Screenshot and video — if you have Visual AI, compare before/after. If the UI changed, probably a feature commit.
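
These signals map directly onto model features. Below is a minimal Python sketch of turning one raw failure record into such features — the field names (error_message, stack_trace, spec_area) are illustrative, not the actual Cypress Dashboard schema.

import re

# Hypothetical failure record fields; names are placeholders for the example.
TIMEOUT_PATTERN = re.compile(r"Timed out retrying after \d+ms")

def extract_signals(failure: dict, history: list[dict], changed_files: set[str]) -> dict:
    """Turn one raw failure into the signals listed above."""
    message = failure["error_message"]
    return {
        # Error message: crude split between timeout-style and assertion-style failures
        "is_timeout_error": bool(TIMEOUT_PATTERN.search(message)),
        "is_assertion_error": "expected" in message,
        # Stack trace: which Cypress command the failure originated in
        "failed_in_cy_wait": "cy.wait" in failure["stack_trace"],
        # Historical stability: failure rate over the builds we have history for
        "fail_rate_recent_builds": sum(1 for h in history if h["failed"]) / max(len(history), 1),
        # Commit relevance: does the diff touch files related to the failing spec?
        "diff_touches_spec_area": any(failure["spec_area"] in path for path in changed_files),
    }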

Classifier architecture

For our production classifier we used this stack:

Data sources:
  - Cypress Dashboard API (run history)
  - Git commit hash + diff
  - JUnit XML output
  - Screenshots + video

Feature extraction (Python):
  - error_message_category (ML classification from the error text)
  - test_stability_last_30d (historical pass/fail ratio)
  - files_changed_relevance (Jaccard similarity)
  - timing_anomaly (z-score of the run duration vs. the average)

Model:
  - Gradient Boosting (XGBoost)
  - Training set: ~8,000 manually labeled failures
  - Accuracy: 88% / Precision on flakes: 92%

Inference:
  - REST endpoint (AWS Lambda)
  - Latency: ~120 ms p50
  - Invoked from a Jenkins post-build step
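
Two of these features are simple enough to sketch in full. Here is a minimal Python sketch of files_changed_relevance (Jaccard similarity) and timing_anomaly (z-score), followed by a toy XGBoost fit; the hyperparameters and the two-row training matrix are placeholders, not the production configuration.

import numpy as np
import xgboost as xgb

def files_changed_relevance(changed_files: set[str], test_related_files: set[str]) -> float:
    # Jaccard similarity between the commit diff and the files the test exercises
    if not changed_files or not test_related_files:
        return 0.0
    return len(changed_files & test_related_files) / len(changed_files | test_related_files)

def timing_anomaly(duration_s: float, past_durations_s: list[float]) -> float:
    # z-score of this run's duration against the test's historical runs
    mean = float(np.mean(past_durations_s))
    std = float(np.std(past_durations_s)) or 1.0  # guard against zero variance
    return (duration_s - mean) / std

# Toy training call; the real training set is the ~8,000 manually labeled failures.
X = np.array([[0.95, 0.05, 2.4], [0.10, 0.80, 0.2]])  # stability, relevance, timing
y = np.array([1, 0])                                  # 1 = flake, 0 = real bug
model = xgb.XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X, y)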

How to label the data (the hardest part)

Building the training data is the most critical step. Two proven paths:

  1. Historical manual triage — if you have 2+ years of Cypress Dashboard records and your QA team has been tagging failed tests as 'flaky' or 'bug', exporting and cleaning those tags gives you a dataset.
  2. Forward labeling via re-run — every fail is automatically retried 3×. If 2/3 pass, label = flaky. If 3/3 fail, label = real bug. After 4–6 weeks you have ~500 labeled samples.
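
The second path is easy to automate. A minimal sketch of the re-run labeler, assuming Cypress is invoked via npx inside the CI workspace (the threshold is the 2-of-3 rule described above):

import subprocess

def label_failed_spec(spec_file: str, retries: int = 3) -> str:
    """Re-run a failed spec and label the original failure."""
    passes = 0
    for _ in range(retries):
        result = subprocess.run(
            ["npx", "cypress", "run", "--spec", spec_file],
            capture_output=True,
        )
        if result.returncode == 0:
            passes += 1
    # 2 of 3 re-runs passing => the original failure was almost certainly a flake
    return "flaky" if passes >= 2 else "real_bug"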

CI integration

Our Jenkinsfile hook:

post {
  failure {
    script {
      def failedTests = readJSON file: 'cypress/reports/junit.json'
      def response = httpRequest(
        url: 'https://classifier.internal/classify',
        httpMode: 'POST',
        contentType: 'APPLICATION_JSON',
        requestBody: groovy.json.JsonOutput.toJson([
          failed_tests: failedTests,
          commit_sha: env.GIT_COMMIT,
          build_id: env.BUILD_NUMBER,
        ])
      )
      def results = readJSON text: response.content

      results.each { r ->
        if (r.classification == 'flake' && r.confidence > 0.85) {
          // Auto-retry, no human needed
          sh "npx cypress run --spec '${r.test_file}'"
        } else {
          // Probable real bug (or low confidence) — alert QA for manual triage
          slackSend(channel: '#qa-alerts', message: "🐛 Real bug: ${r.test_name}")
        }
      }
    }
  }
}
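
For completeness, here is the shape of the classify endpoint this hook calls. This is a hypothetical Lambda handler, not the production code: the payload fields mirror the Jenkinsfile above, and the feature names and model path are placeholders.

import json
import xgboost as xgb

# Model artifact shipped with the Lambda deployment package; path is a placeholder.
MODEL = xgb.XGBClassifier()
MODEL.load_model("/opt/model/flake_classifier.json")

def lambda_handler(event, context):
    payload = json.loads(event["body"])
    results = []
    for test in payload["failed_tests"]:
        # Feature names and ordering are illustrative; a real handler reuses the
        # same extraction pipeline as training.
        features = [[
            test.get("test_stability_last_30d", 0.0),
            test.get("files_changed_relevance", 0.0),
            test.get("timing_anomaly", 0.0),
        ]]
        p_flake = float(MODEL.predict_proba(features)[0][1])
        results.append({
            "test_name": test.get("name"),
            "test_file": test.get("file"),
            "classification": "flake" if p_flake >= 0.5 else "bug",
            "confidence": p_flake if p_flake >= 0.5 else 1.0 - p_flake,
        })
    return {"statusCode": 200, "body": json.dumps(results)}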

Results at a client (6 months)

  • Manual triage time dropped from 12 min/fail to 1.5 min/fail (just reviewing the classifier).
  • False positive rate (classifier said bug, it was a flake): 4%.
  • False negative rate (classifier said flake, it was a bug): 7% — this is critical and should be minimised.
  • QA team time saved: ~35 hours per month.

When NOT to build your own classifier

If you have fewer than 200 Cypress tests and a flakiness rate below 10%, ROI is zero. We recommend:

  • Tackle flakiness at the test-code level (better waits, intercepts).
  • Use Cypress Cloud (if you pay for it) — their 'Flaky Test' detection has no ML, but it's enough for small teams.

Want the same approach at your company? Get in touch — we'll set up a 30-minute discovery call.