Bot detection
Every event ClickStream ingests gets a bot verdict within milliseconds. The verdict is a composite of five independent signals, each with a distinct failure mode, so a bot that evades one signal typically trips the others; slipping past all five is the exception, not the rule.
Bot events are never dropped. They're marked with bot.isBot, bot.score, and bot.category, flow through the same ingestion path as human traffic, and show up in the dashboard's Traffic Quality view where operators can filter, export, or ignore them. You keep the data; we just label it.
Signals we use
| Signal | Source | What it catches |
|---|---|---|
| CF Bot Management score | Cloudflare on the incoming request | Generic ML score from Cloudflare's model. Weighted 80/100 when available. |
| Named-bot UA registry | In-memory pattern match | Googlebot, Bingbot, ChatGPT, Claude, Perplexity, SEO tools, monitoring, etc. |
| ASN + connection type | Cloudflare cf.asOrganization | Hosting ASN + no JS execution = scraper. |
| Behavioral human confidence | Session signal accumulator | Pageview rate, click diversity, form interactions, scroll depth, time on page. Low confidence = bot-like. |
| Stealth-bot score | Mouse entropy + cross-signal inconsistency + TLS JA4 mismatch | Camoufox, stealth-puppeteer, undetected-chromedriver, antidetect browsers. |
The composite bot.score runs 0–100; higher means more bot-like. The dashboard's behavioralClass rollup assigns each visitor one of four labels:
- `human` — `bot.score < 30` and `humanConfidence >= 50`.
- `suspicious` — `bot.score >= 30` or `humanConfidence < 50`.
- `likely_bot` — `bot.score >= 50` and `humanConfidence < 30`.
- `bot` — `bot.isBot === true` and `bot.score >= 70`.
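The four-bucket rollup can be sketched as a single function. The thresholds below come from the rules above; the function name and input shape are illustrative, not the dashboard's actual internals.

```typescript
type BehavioralClass = 'human' | 'suspicious' | 'likely_bot' | 'bot';

// Illustrative input shape: the three fields the rules above depend on.
interface VisitorSignals {
  botScore: number;        // composite bot.score, 0–100
  humanConfidence: number; // behavioral human confidence, 0–100
  isBot: boolean;          // bot.isBot flag
}

function behavioralClass(v: VisitorSignals): BehavioralClass {
  // Most specific rule first: hard bot verdict.
  if (v.isBot && v.botScore >= 70) return 'bot';
  if (v.botScore >= 50 && v.humanConfidence < 30) return 'likely_bot';
  if (v.botScore >= 30 || v.humanConfidence < 50) return 'suspicious';
  return 'human'; // botScore < 30 and humanConfidence >= 50
}
```

Note the rule ordering matters: a visitor matching the `bot` rule also matches `suspicious`, so evaluation runs strictest-first.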
Bot categories
Named bots get a category matched against a curated registry:
| Category | Examples |
|---|---|
| search_crawler | Googlebot, Bingbot, DuckDuckBot, YandexBot |
| ai_agent | ChatGPT, Claude, Perplexity, CCBot, Google-Extended, FirecrawlAgent |
| social_preview | facebookexternalhit, Twitterbot, LinkedInBot, Slackbot |
| seo_tool | AhrefsBot, SemrushBot, MJ12bot, DataForSeoBot |
| monitoring | UptimeRobot, Pingdom, StatusCake |
| scraper | Scrapy, python-requests, curl, wget (generic) |
| scanner | Nmap, ZAP, Nikto, security probes |
| automation | HeadlessChrome, Playwright, Puppeteer (naive) |
| stealth_bot | Camoufox, stealth-puppeteer, undetected-chromedriver, antidetect browsers (via the stealth detector) |
| unknown_bot | Bot-shaped behavior with no registry match |
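A registry lookup of this shape reduces to an ordered list of user-agent patterns. The patterns below are illustrative fragments of the curated registry, not the full list, and stealth_bot is intentionally absent — it is assigned by the stealth detector, not by UA match.

```typescript
// Ordered UA patterns mapped to categories; first match wins.
const BOT_REGISTRY: Array<[RegExp, string]> = [
  [/Googlebot|Bingbot|DuckDuckBot|YandexBot/i, 'search_crawler'],
  [/ChatGPT|Claude|Perplexity|CCBot|Google-Extended|FirecrawlAgent/i, 'ai_agent'],
  [/facebookexternalhit|Twitterbot|LinkedInBot|Slackbot/i, 'social_preview'],
  [/AhrefsBot|SemrushBot|MJ12bot|DataForSeoBot/i, 'seo_tool'],
  [/UptimeRobot|Pingdom|StatusCake/i, 'monitoring'],
  [/Scrapy|python-requests|curl|wget/i, 'scraper'],
  [/Nmap|ZAP|Nikto/i, 'scanner'],
  [/HeadlessChrome|Playwright|Puppeteer/i, 'automation'],
];

// Returns the registry category, or 'unknown_bot' when generic bot signals
// fired but no named pattern matched. Only call this once the visitor has
// already been flagged as bot-shaped.
function categorizeBot(userAgent: string): string {
  for (const [pattern, category] of BOT_REGISTRY) {
    if (pattern.test(userAgent)) return category;
  }
  return 'unknown_bot';
}
```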
A detailed threat-model atlas — the 10 stealth tools we track, the signals they mask, and the counter-signals we prioritize — is available to Scale+ customers on request. Contact support@clickstream.com.
stealth_bot — detecting fingerprint-spoofers
Stealth tools explicitly try to land in the 30–50 bot-score band so no detector can confidently say "bot". That band itself is a signal — our stealth scorer combines three extra inputs:
- Mouse entropy — from the SDK's mouse-dynamics tracker. Real humans sit in [0.55, 0.90]. Ghost-cursor-driven bots produce unnaturally smooth paths (>= 0.92); teleport clicks (CDP-driven, no mouse movement) produce unnaturally low entropy (<= 0.35). Either extreme fires.
- Fingerprint consistency — 7 cross-signal rules: mobile UA + desktop viewport, iPhone UA + non-iOS platform, macOS UA + Linux platform, timezone vs. geo mismatch, GPU vendor incoherent with claimed OS, unusual pixel ratios. Each violation knocks 0.15 off a 1.0 starting score.
- TLS JA4 mismatch — when Cloudflare exposes the JA4 fingerprint and it starts with a known non-browser prefix (Python httpx, Go net/http, aiohttp) but the UA claims Chrome, Firefox, Safari, or Edge, +35 points to the stealth score.
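The three inputs above can be composed roughly as follows. The entropy thresholds, the 0.15-per-violation rule, the +35 JA4 bump, and the 100 cap come from the docs; the point weights for the entropy and consistency components are assumptions made for the sketch.

```typescript
interface StealthInputs {
  mouseEntropy: number | null;   // null when no mouse data has arrived yet
  consistencyViolations: number; // 0–7 cross-signal rule hits
  ja4Mismatch: boolean;          // non-browser JA4 prefix + browser UA claim
}

function stealthScore(s: StealthInputs): number {
  let score = 0;
  // Either entropy extreme fires: ghost-cursor smoothness or teleport clicks.
  if (s.mouseEntropy !== null && (s.mouseEntropy >= 0.92 || s.mouseEntropy <= 0.35)) {
    score += 30; // assumed weight for this sketch
  }
  // Each violation knocks 0.15 off a 1.0 starting consistency score.
  const consistency = Math.max(0, 1.0 - 0.15 * s.consistencyViolations);
  score += Math.round((1 - consistency) * 50); // assumed scaling to points
  if (s.ja4Mismatch) score += 35; // per the docs
  return Math.min(100, score); // composite caps at 100
}
```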
The composite stealthScore caps at 100. The >= 60 threshold promotes the visitor to the stealth_bot category in the registry.
A cross-session aggregator also runs on a 5-minute sliding window:
- IP rotation — the same `visitor_id` crossing ≥ 5 distinct IP hashes.
- Multi-country — the same `visitor_id` seen from ≥ 3 distinct countries.
- Deep-link without history — a 1-event session landing on `/cart`, `/checkout`, or `/admin` with no referrer. Only fires alongside another signal.
Aggregated anomalies land in KV as `session-anomalies:{clientId}` and surface on the dashboard's Session Integrity tile.
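The IP-rotation check reduces to counting distinct IP hashes inside the window. This sketch shows that one check only; the real aggregator also tracks countries and deep-link sessions and persists results to KV. The event shape and function names are illustrative.

```typescript
interface SeenIp {
  ipHash: string;
  ts: number; // milliseconds since epoch
}

const WINDOW_MS = 5 * 60 * 1000; // 5-minute sliding window
const IP_ROTATION_THRESHOLD = 5; // ≥ 5 distinct IP hashes fires

function ipRotationAnomaly(history: SeenIp[], now: number): boolean {
  // Keep only observations inside the sliding window, then count
  // distinct IP hashes for this visitor_id.
  const recent = history.filter((e) => now - e.ts <= WINDOW_MS);
  const distinct = new Set(recent.map((e) => e.ipHash));
  return distinct.size >= IP_ROTATION_THRESHOLD;
}
```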
How to use it
In the dashboard
Intelligence → Traffic Quality groups every visitor into one of the four behavioralClass buckets with counts + per-category rollups. Click any bucket to drill into the specific visitors. Filter by country, ASN, referrer, landing page.
Via the Signals API
```tsx
import { getVisitor, isBot } from '@clickstream/signals';

const visitor = await getVisitor();
if (isBot(visitor)) {
  // AI crawler or scraper — serve structured JSON-LD, skip personalization
  return <CrawlerView />;
}
```
`visitor.bot.category` carries the specific label when known, and `'unknown_bot'` when generic bot signals fire without a registry match.
Via the Signals Feed (Scale+)
Subscribe to the WebSocket and filter by behavioralClass:
```js
ws.addEventListener('message', (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== 'event') return;
  if (msg.behavioralClass === 'bot' || msg.behavioralClass === 'likely_bot') {
    forwardToSecurityTeam(msg);
  }
});
```
See Signals Feed for the full subscriber pattern.
Via raw events (batch export)
Bot fields appear in the Parquet export as `blob7` (device), `double18` (bot_score), and `double19` (is_bot), plus the composite `bot.category` on scored events in `clickstream_scores`. See Event schema for the blob layout.
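When consuming the export, it helps to map the positional columns back to named fields as early as possible. This sketch assumes rows have already been decoded into plain objects by a Parquet reader; the row and output shapes are illustrative, per the column assignments above.

```typescript
// Positional columns from the raw export, already decoded to JS values.
interface RawEventRow {
  blob7: string;    // device
  double18: number; // bot_score
  double19: number; // is_bot, stored as 0 or 1
}

// Named view of the bot fields for downstream code.
function botFields(row: RawEventRow): { device: string; botScore: number; isBot: boolean } {
  return {
    device: row.blob7,
    botScore: row.double18,
    isBot: row.double19 === 1,
  };
}
```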
Accuracy posture
- False-positive rate — the existing human corpus flips to `bot`/`likely_bot` at roughly 0.8%. The bar for promoting a `suspicious` visitor to `likely_bot` is deliberately high so operators don't mis-categorize real users.
- False-negative rate — naive bots (curl, plain Puppeteer, empty UA) are caught at >99%. The stealth-bot population is smaller and harder to quantify; see the atlas for the open research questions gating a published false-negative figure.
- Tuning — per-tenant overrides to the `stealth_score` threshold are on the roadmap. Until then, the threshold is global at 60.
Never blocked, always labeled
ClickStream does not drop bot traffic at the ingestion layer. Every event reaches Analytics Engine. Bot labels let operators decide how to treat the traffic downstream:
- Filter bots out of aggregate metrics in the dashboard.
- Route bot events to a separate queue for content-theft investigation.
- Whitelist known AI crawlers so their impressions show in a separate "AI Search" rollup.
- Pipe `likely_bot` events into your security SIEM via the Signals Feed.
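One way to express these downstream options is a single routing function over labeled events. The destination names and event shape here are illustrative, not a ClickStream API.

```typescript
interface LabeledEvent {
  behavioralClass: 'human' | 'suspicious' | 'likely_bot' | 'bot';
  botCategory?: string; // e.g. 'ai_agent', 'scraper'
}

type Destination = 'metrics' | 'ai_search' | 'siem' | 'review';

function route(e: LabeledEvent): Destination {
  // Humans flow into the normal aggregate metrics.
  if (e.behavioralClass === 'human') return 'metrics';
  // Known AI crawlers get their own "AI Search" rollup.
  if (e.botCategory === 'ai_agent') return 'ai_search';
  // Hard bot verdicts go to the SIEM.
  if (e.behavioralClass === 'likely_bot' || e.behavioralClass === 'bot') return 'siem';
  // Everything else (suspicious) is queued for investigation.
  return 'review';
}
```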
See also
- Signals API — read `visitor.bot.*` from page code
- Signals Feed — stream every labeled event
- Event schema — raw event shape