Bot detection
Every event ClickStream ingests gets a bot verdict within milliseconds. The verdict is a composite of five independent signals, each with a distinct failure mode, so a bot that evades one signal typically trips the others; slipping past all five is the exception, not the rule.
Bot events are never dropped. They're marked with bot.isBot, bot.score, and bot.category, flow through the same ingestion path as human traffic, and show up in the dashboard's Traffic Quality view where operators can filter, export, or ignore them. You keep the data; we just label it.
Signals we use
| Signal | Source | What it catches |
|---|---|---|
| CF Bot Management score | Cloudflare on the incoming request | Generic ML score from Cloudflare's model. Weighted 80/100 when available. |
| Named-bot UA registry | In-memory pattern match | Googlebot, Bingbot, ChatGPT, Claude, Perplexity, SEO tools, monitoring, etc. |
| ASN + connection type | Cloudflare cf.asOrganization | Hosting ASN + no JS execution = scraper. |
| Behavioral human confidence | Session signal accumulator | Pageview rate, click diversity, form interactions, scroll depth, time on page. Low confidence = bot-like. |
| Stealth-bot score | Mouse entropy + cross-signal inconsistency + TLS JA4 mismatch | Camoufox, stealth-puppeteer, undetected-chromedriver, antidetect browsers. |
The composite bot.score runs 0–100; higher means more bot-like. The dashboard's behavioralClass rollup assigns each visitor one of four labels:
- `human` — `bot.score < 30` and `humanConfidence >= 50`.
- `suspicious` — `bot.score >= 30` or `humanConfidence < 50`.
- `likely_bot` — `bot.score >= 50` and `humanConfidence < 30`.
- `bot` — `bot.isBot === true` and `bot.score >= 70`.
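The four-bucket rollup can be sketched as a single function. The thresholds below come from the rules above; the function name and input shape are illustrative, not the dashboard's actual internals.

```typescript
type BehavioralClass = 'human' | 'suspicious' | 'likely_bot' | 'bot';

// Illustrative input shape: the three fields the rules above depend on.
interface VisitorSignals {
  botScore: number;        // composite bot.score, 0–100
  humanConfidence: number; // behavioral human confidence, 0–100
  isBot: boolean;          // bot.isBot flag
}

function behavioralClass(v: VisitorSignals): BehavioralClass {
  // Most specific rule first: hard bot verdict.
  if (v.isBot && v.botScore >= 70) return 'bot';
  if (v.botScore >= 50 && v.humanConfidence < 30) return 'likely_bot';
  if (v.botScore >= 30 || v.humanConfidence < 50) return 'suspicious';
  return 'human'; // botScore < 30 and humanConfidence >= 50
}
```

Note the rule ordering matters: a visitor matching the `bot` rule also matches `suspicious`, so evaluation runs strictest-first.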
Bot categories
Named bots get a category matched against a curated registry:
| Category | Examples |
|---|---|
| search_crawler | Googlebot, Bingbot, DuckDuckBot, YandexBot |
| ai_agent | ChatGPT, Claude, Perplexity, CCBot, Google-Extended, FirecrawlAgent |
| social_preview | facebookexternalhit, Twitterbot, LinkedInBot, Slackbot |
| seo_tool | AhrefsBot, SemrushBot, MJ12bot, DataForSeoBot |
| monitoring | UptimeRobot, Pingdom, StatusCake |
| scraper | Scrapy, python-requests, curl, wget (generic) |
| scanner | Nmap, ZAP, Nikto, security probes |
| automation | HeadlessChrome, Playwright, Puppeteer (naive) |
| stealth_bot | Camoufox, stealth-puppeteer, undetected-chromedriver, antidetect browsers (via the stealth detector) |
| unknown_bot | Bot-shaped behavior with no registry match |
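A registry lookup of this shape reduces to an ordered list of user-agent patterns. The patterns below are illustrative fragments of the curated registry, not the full list, and stealth_bot is intentionally absent — it is assigned by the stealth detector, not by UA match.

```typescript
// Ordered UA patterns mapped to categories; first match wins.
const BOT_REGISTRY: Array<[RegExp, string]> = [
  [/Googlebot|Bingbot|DuckDuckBot|YandexBot/i, 'search_crawler'],
  [/ChatGPT|Claude|Perplexity|CCBot|Google-Extended|FirecrawlAgent/i, 'ai_agent'],
  [/facebookexternalhit|Twitterbot|LinkedInBot|Slackbot/i, 'social_preview'],
  [/AhrefsBot|SemrushBot|MJ12bot|DataForSeoBot/i, 'seo_tool'],
  [/UptimeRobot|Pingdom|StatusCake/i, 'monitoring'],
  [/Scrapy|python-requests|curl|wget/i, 'scraper'],
  [/Nmap|ZAP|Nikto/i, 'scanner'],
  [/HeadlessChrome|Playwright|Puppeteer/i, 'automation'],
];

// Returns the registry category, or 'unknown_bot' when generic bot signals
// fired but no named pattern matched. Only call this once the visitor has
// already been flagged as bot-shaped.
function categorizeBot(userAgent: string): string {
  for (const [pattern, category] of BOT_REGISTRY) {
    if (pattern.test(userAgent)) return category;
  }
  return 'unknown_bot';
}
```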
A detailed threat-model atlas — the 10 stealth tools we track, the signals they mask, and the counter-signals we prioritize — is available to Scale+ customers on request. Contact support@clickstream.com.
stealth_bot — detecting fingerprint-spoofers
Stealth tools explicitly try to land in the 30–50 bot-score band so no detector can confidently say "bot". That band itself is a signal — our stealth scorer combines three extra inputs:
- Mouse entropy — from the SDK's mouse-dynamics tracker. Real humans sit in [0.55, 0.90]. Ghost-cursor-driven bots produce unnaturally smooth paths (>= 0.92); teleport clicks (CDP-driven, no mouse movement) produce unnaturally low entropy (<= 0.35). Either extreme fires.
- Fingerprint consistency — 7 cross-signal rules: mobile UA + desktop viewport, iPhone UA + non-iOS platform, macOS UA + Linux platform, timezone vs. geo mismatch, GPU vendor incoherent with claimed OS, unusual pixel ratios. Each violation knocks 0.15 off a 1.0 starting score.
- TLS JA4 mismatch — when Cloudflare exposes the JA4 fingerprint and it starts with a known non-browser prefix (Python httpx, Go net/http, aiohttp) but the UA claims Chrome, Firefox, Safari, or Edge, +35 points to the stealth score.
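The three inputs above can be composed roughly as follows. The entropy thresholds, the 0.15-per-violation rule, the +35 JA4 bump, and the 100 cap come from the docs; the point weights for the entropy and consistency components are assumptions made for the sketch.

```typescript
interface StealthInputs {
  mouseEntropy: number | null;   // null when no mouse data has arrived yet
  consistencyViolations: number; // 0–7 cross-signal rule hits
  ja4Mismatch: boolean;          // non-browser JA4 prefix + browser UA claim
}

function stealthScore(s: StealthInputs): number {
  let score = 0;
  // Either entropy extreme fires: ghost-cursor smoothness or teleport clicks.
  if (s.mouseEntropy !== null && (s.mouseEntropy >= 0.92 || s.mouseEntropy <= 0.35)) {
    score += 30; // assumed weight for this sketch
  }
  // Each violation knocks 0.15 off a 1.0 starting consistency score.
  const consistency = Math.max(0, 1.0 - 0.15 * s.consistencyViolations);
  score += Math.round((1 - consistency) * 50); // assumed scaling to points
  if (s.ja4Mismatch) score += 35; // per the docs
  return Math.min(100, score); // composite caps at 100
}
```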
The composite stealthScore caps at 100. The >= 60 threshold promotes the visitor to the stealth_bot category in the registry.
A cross-session aggregator also runs on a 5-minute sliding window:
- IP rotation — the same `visitor_id` crossing ≥ 5 distinct IP hashes.
- Multi-country — the same `visitor_id` seen from ≥ 3 distinct countries.
- Deep-link without history — a 1-event session landing on `/cart`, `/checkout`, or `/admin` with no referrer. Only fires alongside another signal.
Aggregated anomalies land in KV as `session-anomalies:{clientId}` and surface on the dashboard's Session Integrity tile.
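The IP-rotation check reduces to counting distinct IP hashes inside the window. This sketch shows that one check only; the real aggregator also tracks countries and deep-link sessions and persists results to KV. The event shape and function names are illustrative.

```typescript
interface SeenIp {
  ipHash: string;
  ts: number; // milliseconds since epoch
}

const WINDOW_MS = 5 * 60 * 1000; // 5-minute sliding window
const IP_ROTATION_THRESHOLD = 5; // ≥ 5 distinct IP hashes fires

function ipRotationAnomaly(history: SeenIp[], now: number): boolean {
  // Keep only observations inside the sliding window, then count
  // distinct IP hashes for this visitor_id.
  const recent = history.filter((e) => now - e.ts <= WINDOW_MS);
  const distinct = new Set(recent.map((e) => e.ipHash));
  return distinct.size >= IP_ROTATION_THRESHOLD;
}
```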
How to use it
In the dashboard
Intelligence → Traffic Quality groups every visitor into one of the four behavioralClass buckets with counts + per-category rollups. Click any bucket to drill into the specific visitors. Filter by country, ASN, referrer, landing page.
Via the Signals API
```tsx
import { getVisitor, isBot } from '@clickstream/signals';

const visitor = await getVisitor();
if (isBot(visitor)) {
  // AI crawler or scraper — serve structured JSON-LD, skip personalization
  return <CrawlerView />;
}
```
`visitor.bot.category` carries the specific label when known, and `'unknown_bot'` when generic bot signals fire without a registry match.
Via the Signals Feed (Scale+)
Subscribe to the WebSocket and filter by behavioralClass:
```js
ws.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== 'event') return;
  if (msg.behavioralClass === 'bot' || msg.behavioralClass === 'likely_bot') {
    forwardToSecurityTeam(msg);
  }
});
```
See Signals Feed for the full subscriber pattern.
Via raw events (batch export)
Bot fields appear in the Parquet export as `blob7` (device), `double18` (bot_score), and `double19` (is_bot), plus the composite `bot.category` on scored events in `clickstream_scores`. See Event schema for the blob layout.
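When consuming the export, it helps to map the positional columns back to named fields as early as possible. This sketch assumes rows have already been decoded into plain objects by a Parquet reader; the row and output shapes are illustrative, per the column assignments above.

```typescript
// Positional columns from the raw export, already decoded to JS values.
interface RawEventRow {
  blob7: string;    // device
  double18: number; // bot_score
  double19: number; // is_bot, stored as 0 or 1
}

// Named view of the bot fields for downstream code.
function botFields(row: RawEventRow): { device: string; botScore: number; isBot: boolean } {
  return {
    device: row.blob7,
    botScore: row.double18,
    isBot: row.double19 === 1,
  };
}
```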
Accuracy posture
- False-positive rate — the existing human corpus flips to `bot`/`likely_bot` at roughly 0.8%. The bar for promoting a `suspicious` visitor to `likely_bot` is deliberately high so operators don't mis-categorize real users.
- False-negative rate — naive bots (curl, plain Puppeteer, empty UA) are caught at >99%. The stealth-bot population is smaller and harder to quantify; see the atlas for the open research questions gating a published false-negative figure.
- Tuning — per-tenant overrides to the `stealth_score` threshold are on the roadmap. Until then, the threshold is global at 60.
Never blocked, always labeled
ClickStream does not drop bot traffic at the ingestion layer. Every event reaches Analytics Engine. Bot labels let operators decide how to treat the traffic downstream:
- Filter bots out of aggregate metrics in the dashboard.
- Route bot events to a separate queue for content-theft investigation.
- Whitelist known AI crawlers so their impressions show in a separate "AI Search" rollup.
- Pipe `likely_bot` events into your security SIEM via the Signals Feed.
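One way to express these downstream options is a single routing function over labeled events. The destination names and event shape here are illustrative, not a ClickStream API.

```typescript
interface LabeledEvent {
  behavioralClass: 'human' | 'suspicious' | 'likely_bot' | 'bot';
  botCategory?: string; // e.g. 'ai_agent', 'scraper'
}

type Destination = 'metrics' | 'ai_search' | 'siem' | 'review';

function route(e: LabeledEvent): Destination {
  // Humans flow into the normal aggregate metrics.
  if (e.behavioralClass === 'human') return 'metrics';
  // Known AI crawlers get their own "AI Search" rollup.
  if (e.botCategory === 'ai_agent') return 'ai_search';
  // Hard bot verdicts go to the SIEM.
  if (e.behavioralClass === 'likely_bot' || e.behavioralClass === 'bot') return 'siem';
  // Everything else (suspicious) is queued for investigation.
  return 'review';
}
```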
See also
- Signals API — read `visitor.bot.*` from page code
- Signals Feed — stream every labeled event
- Event schema — raw event shape