Flaky tests and AI: automatic classification in a second
In the average Cypress suite, 8–20% of tests are flaky. When your build fails, you face the question: is it a real bug, or just another flake? Manual triage takes 10–15 minutes per failure. AI does it in a second.
This article is a deep dive into building an automatic classifier — from defining flakiness through training data to CI integration.
What a flaky test is exactly
A test is flaky when it behaves non-deterministically given the same input. Sources of flakiness:
- Race conditions — DOM renders asynchronously, test ran faster than the UI.
- Network flakiness — a 3rd-party API timed out, the test failed.
- Shared state — the previous test left the DB in a different state.
- Environmental — CI runner was overloaded, the animation took longer.
- Non-deterministic data — the test depends on the current date or a random ID.
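The last source, non-deterministic data, is easy to demonstrate outside Cypress. A minimal Python sketch (the `invoice_number` function is a hypothetical example, not from the article) showing why a test tied to the wall clock is flaky and how injecting the clock fixes it:

```python
from datetime import date

def invoice_number(today=None):
    """Generate an invoice ID; accepting `today` as a parameter makes it testable."""
    today = today or date.today()  # non-deterministic default: depends on the wall clock
    return f"INV-{today.strftime('%Y%m%d')}"

# Flaky: the expected value drifts every day, so this passes only on 2024-01-01.
# assert invoice_number() == "INV-20240101"

# Deterministic: the clock is injected, so the test always sees the same input.
assert invoice_number(date(2024, 1, 1)) == "INV-20240101"
```

The same principle applies in Cypress: stub `Date`, random seeds, and IDs so every run sees identical inputs.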
Classifier: signals the AI uses
When a test fails, you have these data signals available:
- Error message — "Timed out retrying after 4000ms" vs. "expected 'ACTIVE' to equal 'PENDING'". The former is probably a flake, the latter probably a real bug.
- Stack trace — a failure in cy.wait() vs. a failure in cy.contains().
- Historical stability — has this test failed 12× in the last 30 builds? Flake.
- Commit relevance — does the commit touch the area under test? If the diff changes checkout.tsx and the checkout test failed, it's probably a real bug.
- Screenshot and video — if you have Visual AI, compare before/after. If the UI changed, it's probably a feature commit.
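The first signal, the error message, can be turned into a feature even before any ML is involved. A minimal Python sketch (the rule set is a hypothetical heuristic, not the article's trained model) that maps raw messages to coarse categories:

```python
import re

# Hypothetical heuristic rules: first match wins. A trained text classifier
# would replace this table, but the regex baseline is a useful starting feature.
RULES = [
    ("timeout",   re.compile(r"timed out|timeout", re.I)),
    ("network",   re.compile(r"ECONNREFUSED|ETIMEDOUT|fetch failed", re.I)),
    ("assertion", re.compile(r"expected .* to (equal|contain|be)", re.I)),
]

def error_message_category(message: str) -> str:
    """Return a coarse category for a Cypress failure message."""
    for label, pattern in RULES:
        if pattern.search(message):
            return label
    return "unknown"

assert error_message_category("Timed out retrying after 4000ms") == "timeout"
assert error_message_category("expected 'ACTIVE' to equal 'PENDING'") == "assertion"
```

Timeouts and network errors lean toward "flake"; assertion mismatches lean toward "real bug", which matches the intuition above.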
Classifier architecture
For our production classifier we used this stack:
Data sources:
- Cypress Dashboard API (run history)
- Git commit hash + diff
- JUnit XML output
- Screenshots + video

Feature extraction (Python):
- error_message_category (ML classification from the message text)
- test_stability_last_30d (historical pass/fail ratio)
- files_changed_relevance (Jaccard similarity)
- timing_anomaly (z-score of run time against the average)

Model:
- Gradient Boosting (XGBoost)
- Training set: ~8,000 manually labeled failures
- Accuracy: 88% / flake precision: 92%

Inference:
- REST endpoint (AWS Lambda)
- Latency: ~120ms p50
- Invoked from a Jenkins post-build step
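Two of the features above are simple enough to sketch directly. A minimal Python version (function signatures are illustrative, not the production code) of the Jaccard-similarity and z-score features:

```python
from statistics import mean, stdev

def files_changed_relevance(diff_files, test_related_files):
    """Jaccard similarity between the commit diff and the files a test covers."""
    a, b = set(diff_files), set(test_related_files)
    return len(a & b) / len(a | b) if a | b else 0.0

def timing_anomaly(duration, history):
    """z-score of this run's duration against the test's historical durations."""
    if len(history) < 2:
        return 0.0  # not enough history to estimate spread
    sd = stdev(history)
    return 0.0 if sd == 0 else (duration - mean(history)) / sd

# A commit touching checkout.tsx overlaps half of the checkout test's files.
assert files_changed_relevance(["checkout.tsx", "cart.ts"], ["checkout.tsx"]) == 0.5
```

A high relevance score pushes the classifier toward "real bug"; a large timing z-score (an unusually slow run) pushes it toward "flake" caused by an overloaded runner.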
How to label the data (the hardest part)
Training data is the most critical step. Two proven paths:
- Historical manual triage — if you have 2+ years of Cypress Dashboard records, chances are your QA team already tagged failed tests as 'flaky' or 'bug'. Export + clean = dataset.
- Forward labeling via re-run — every fail is automatically retried 3×. If 2/3 pass, label = flaky. If 3/3 fail, label = real bug. After 4–6 weeks you have ~500 labeled samples.
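The forward-labeling rule is mechanical enough to sketch. A minimal Python version (the `uncertain` bucket for a 1/3 pass rate is my addition, since the article only defines the 2/3 and 0/3 cases):

```python
def label_from_reruns(rerun_passed):
    """Forward labeling: rerun_passed holds booleans from 3 automatic retries.
    Majority pass => flaky; all retries fail => likely a real bug."""
    passes = sum(rerun_passed)
    if passes >= 2:
        return "flaky"      # 2/3 or 3/3 passed on retry
    if passes == 0:
        return "bug"        # 3/3 failed again: deterministic failure
    return "uncertain"      # 1/3 passed: ambiguous, route to manual triage

assert label_from_reruns([True, True, False]) == "flaky"
assert label_from_reruns([False, False, False]) == "bug"
```

Run this on every failure for 4–6 weeks and the labels accumulate into the training set with no extra manual work.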
CI integration
Our Jenkinsfile hook:
post {
  failure {
    script {
      def failedTests = readJSON file: 'cypress/reports/junit.json'
      def response = httpRequest(
        url: 'https://classifier.internal/classify',
        httpMode: 'POST',
        contentType: 'APPLICATION_JSON',
        requestBody: groovy.json.JsonOutput.toJson([
          failed_tests: failedTests,
          commit_sha: env.GIT_COMMIT,
          build_id: env.BUILD_NUMBER,
        ])
      )
      def results = readJSON text: response.content
      results.each { r ->
        if (r.classification == 'flake' && r.confidence > 0.85) {
          // High-confidence flake: auto-retry, no human needed
          sh "npx cypress run --spec '${r.test_file}'"
        } else {
          // Likely a real bug: escalate for manual triage
          slackSend(channel: '#qa-alerts', message: "🐛 Real bug: ${r.test_name}")
        }
      }
    }
  }
}
Results at a client (6 months)
- Manual triage time dropped from 12 min/fail to 1.5 min/fail (just reviewing the classifier).
- False positive rate (classifier said bug, it was a flake): 4%.
- False negative rate (classifier said flake, it was a bug): 7% — this is the critical metric and should be minimised, since a missed real bug costs far more than a wasted retry.
- QA team time saved: ~35 hours per month.
When NOT to build your own classifier
If you have fewer than 200 Cypress tests and a flakiness rate below 10%, ROI is zero. We recommend:
- Tackle flakiness at the test-code level (better waits, intercepts).
- Use Cypress Cloud (if you pay for it) — their 'Flaky Test' detection has no ML, but it's enough for small teams.
Want the same approach at your company? Get in touch and we'll set up a 30-minute discovery call.