Fraud Detection: Identifying Bad Actors with Behavioural Signals

How do you detect fraudulent users at scale when they look like everyone else? A walkthrough of using search-to-communication ratios, geographic patterns, and temporal behaviour to surface bot activity.

Fraud detection is rarely about finding a single smoking gun. It’s about finding users whose pattern of behaviour is statistically inconsistent with legitimate usage. This post walks through a case study of detecting fraudulent users on an internet platform using four data sources and a handful of behavioural signals.


The Dataset

Four data sources were combined:

  • Signup records — when and where users registered
  • Call logs — communication activity
  • Message history — messaging activity
  • Search activity — search queries

The first step was data quality: timestamp conversion, ID validation across tables, and joining the four sources into a single user-level feature set.
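
As a rough illustration of that preparation step, here is a minimal pandas sketch. The column names (`user_id`, `signup_ts`, `search_ts`) and the function `build_user_features` are hypothetical, since the original schema is not shown; the sketch also assumes that activity rows with an unknown user ID should simply be dropped.

```python
import pandas as pd

def build_user_features(signups, calls, messages, searches):
    """Join four event tables into one row per user (assumed column names)."""
    # Timestamp conversion: unparseable values become NaT instead of raising.
    signups = signups.drop_duplicates("user_id").copy()
    signups["signup_ts"] = pd.to_datetime(signups["signup_ts"], errors="coerce")
    searches = searches.copy()
    searches["search_ts"] = pd.to_datetime(searches["search_ts"], errors="coerce")

    # ID validation: keep only activity rows whose user_id exists in signups.
    known_ids = set(signups["user_id"])
    calls = calls[calls["user_id"].isin(known_ids)]
    messages = messages[messages["user_id"].isin(known_ids)]
    searches = searches[searches["user_id"].isin(known_ids)]

    # Aggregate each activity table to per-user values and align on user_id.
    features = signups.set_index("user_id")
    features["n_calls"] = calls.groupby("user_id").size()
    features["n_messages"] = messages.groupby("user_id").size()
    features["n_searches"] = searches.groupby("user_id").size()
    features["first_search_ts"] = searches.groupby("user_id")["search_ts"].min()

    # Users with no rows in an activity table get a count of zero.
    count_cols = ["n_calls", "n_messages", "n_searches"]
    features[count_cols] = features[count_cols].fillna(0).astype(int)
    return features.reset_index()
```

The result has one row per signup, with activity counts and the timestamp of the first search (NaT for users who never searched).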


The Core Hypothesis

Legitimate users do things. They search, they call, they message — in roughly human proportions. Bots, scrapers, and fraudulent accounts tend to have one behaviour in abundance and others at zero.

The signal we build on this:

suspicion_score = search_normalized / (message_normalized + call_normalized)

Users whose score exceeds the 0.98 quantile (roughly the top 2% of scores) are flagged as suspects: their search volume is abnormally high relative to their communication activity.
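
The score and threshold could be computed along these lines. This is a sketch under assumptions: the post does not say how the counts were normalized, so min-max scaling is used here as a stand-in, and a small `eps` is added to the denominator so that users with no communication do not cause a division by zero. The column names and the function `flag_suspects` are hypothetical.

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else s * 0.0

def flag_suspects(features: pd.DataFrame, q: float = 0.98, eps: float = 1e-9) -> pd.DataFrame:
    out = features.copy()
    search_n = min_max(out["n_searches"])
    comm_n = min_max(out["n_messages"]) + min_max(out["n_calls"])
    # eps is an assumption: it keeps the ratio finite when comm_n is zero.
    out["suspicion_score"] = search_n / (comm_n + eps)
    # Flag users whose score is strictly above the q-th quantile.
    threshold = out["suspicion_score"].quantile(q)
    out["is_suspect"] = out["suspicion_score"] > threshold
    return out
```

One caveat with a strict "exceeds" comparison: if many users tie at the maximum score, the quantile can equal that maximum and nobody is flagged, so it is worth checking the score distribution before trusting the cutoff.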


Narrowing the Suspects

A high search-to-communication ratio alone isn’t enough. We refine further by excluding suspects who delayed their first search, because bots tend to act immediately after signup. A genuine user might take days to start searching; a scraper starts at registration.
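
A sketch of that temporal filter, assuming per-user `signup_ts` and `first_search_ts` columns. The one-hour `max_delay` is purely illustrative; the post does not state the cutoff that was actually used, and `keep_immediate_searchers` is a hypothetical name.

```python
import pandas as pd

def keep_immediate_searchers(suspects: pd.DataFrame,
                             max_delay: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.DataFrame:
    """Keep suspects whose first search came within max_delay of signup."""
    delay = suspects["first_search_ts"] - suspects["signup_ts"]
    # Comparisons against NaT are False, so users with no search are dropped.
    # Negative delays (search before signup) are treated as bad data and dropped.
    immediate = (delay >= pd.Timedelta(0)) & (delay <= max_delay)
    return suspects.loc[immediate].assign(time_to_first_search=delay[immediate])
```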

The final fraud detection workflow combines:

  1. Search-to-communication ratio — primary signal
  2. Zero communication users — users with searches but zero calls and messages
  3. Geographic patterns — Spain showed the highest fraud rate at 18.31% of signups in that cohort
  4. Temporal behaviour — time between signup and first search
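
Signals 2 and 3 in the list above are simple to express in pandas. The sketch below assumes a user-level table with hypothetical columns `n_searches`, `n_calls`, `n_messages`, `country`, and a boolean `is_fraud` label produced by the earlier steps; it is not the original analysis code.

```python
import pandas as pd

def zero_communication(features: pd.DataFrame) -> pd.Series:
    """True for users who searched at least once but never called or messaged."""
    return (
        (features["n_searches"] > 0)
        & (features["n_calls"] == 0)
        & (features["n_messages"] == 0)
    )

def fraud_rate_by_country(features: pd.DataFrame) -> pd.Series:
    """Percentage of signups flagged as fraud per country, highest first."""
    rate = features.groupby("country")["is_fraud"].mean().mul(100).round(2)
    return rate.sort_values(ascending=False)
```

Sorting by rate rather than by raw count matters here: a country with few signups can have a high fraud rate without contributing many fraudulent accounts in absolute terms.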

Results

  • 1,228 fraudulent users identified
  • 500 users showed elevated search activity with near-zero communication
  • Spain flagged as highest-risk geography requiring additional scrutiny

Key Takeaway

Fraud detection works best when you combine multiple weak signals into a composite score rather than relying on any single rule. The search-to-communication ratio is powerful precisely because it’s relative: it measures searches against each user’s own communication activity rather than applying a fixed cutoff to raw search counts.

The same principle applies in financial fraud: transaction anomaly detection, velocity checks, and device fingerprinting each tell part of the story. The model wins when it learns to weight them together.

Original analysis published on Medium.