Fraud Detection: Identifying Bad Actors with Behavioural Signals

Fraud detection is rarely about finding a single smoking gun. It’s about finding users whose pattern of behaviour is statistically inconsistent with legitimate usage. This post walks through a case study of detecting fraudulent users on an internet platform using four data sources and a handful of behavioural signals.

The Dataset

Four data sources were combined:

Signup records — when and where users registered
Call logs — communication activity
Message history — messaging activity
Search activity — search queries

The first step was data quality: timestamp conversion, ID validation across tables, and joining the four sources into a single user-level feature set.
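
The preparation step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the original pipeline: the field names (user_id, signup_ts, ts, country), the ISO timestamp format, and the sample rows are all assumptions, since the post does not describe the raw schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows; real column names and formats are not given in the post.
signups = [
    {"user_id": "u1", "signup_ts": "2023-01-01T10:00:00", "country": "ES"},
    {"user_id": "u2", "signup_ts": "2023-01-02T09:30:00", "country": "FR"},
]
calls = [{"user_id": "u2"}, {"user_id": "u2"}]
messages = [{"user_id": "u2"}, {"user_id": "u9"}]  # u9 has no signup record
searches = [
    {"user_id": "u1", "ts": "2023-01-01T10:00:05"},
    {"user_id": "u1", "ts": "2023-01-01T10:00:09"},
    {"user_id": "u2", "ts": "2023-01-05T12:00:00"},
]


def build_features(signups, calls, messages, searches):
    """Join the four sources into one feature dict per signed-up user."""
    known_ids = {row["user_id"] for row in signups}

    def count_valid(rows):
        # ID validation: ignore activity rows whose user_id has no signup record.
        return Counter(r["user_id"] for r in rows if r["user_id"] in known_ids)

    n_calls, n_msgs, n_searches = map(count_valid, (calls, messages, searches))

    # Timestamp conversion, keeping the earliest search per user.
    first_search = {}
    for r in searches:
        uid = r["user_id"]
        if uid in known_ids:
            ts = datetime.fromisoformat(r["ts"])
            first_search[uid] = min(ts, first_search.get(uid, ts))

    features = {}
    for row in signups:
        uid = row["user_id"]
        features[uid] = {
            "country": row["country"],
            "signup_ts": datetime.fromisoformat(row["signup_ts"]),
            "first_search_ts": first_search.get(uid),  # None if the user never searched
            "calls": n_calls[uid],
            "messages": n_msgs[uid],
            "searches": n_searches[uid],
        }
    return features


features = build_features(signups, calls, messages, searches)
```

Dropping activity rows with unknown IDs is one possible validation policy; logging them for review would be an equally reasonable choice.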

The Core Hypothesis

Legitimate users do things. They search, they call, they message — in roughly human proportions. Bots, scrapers, and fraudulent accounts tend to have one behaviour in abundance and others at zero.

The signal built on this hypothesis:

suspicion_score = search_normalized / (message_normalized + call_normalized)

Users whose score exceeds the 0.98 quantile (roughly the top 2% of scores) are flagged as suspects: they run an abnormally high number of searches relative to their calls and messages.
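
A rough sketch of the score and the quantile cut, in plain Python. Several details here are assumptions rather than the author's method: the post does not say how the counts are normalized (min-max scaling is used below), how a zero denominator is handled (a small eps is added), or which quantile interpolation was used (linear here). The toy user data is invented.

```python
def min_max(values):
    """Scale values to [0, 1]; an assumed normalization, not confirmed by the post."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    return s[lower] + (s[upper] - s[lower]) * (pos - lower)


def flag_suspects(users, q=0.98, eps=1e-9):
    """Return (scores, threshold, suspects) for a dict of per-user activity counts."""
    ids = list(users)
    s = min_max([users[u]["searches"] for u in ids])
    m = min_max([users[u]["messages"] for u in ids])
    c = min_max([users[u]["calls"] for u in ids])
    # eps avoids division by zero for users with no calls and no messages (assumption).
    scores = {u: s[i] / (m[i] + c[i] + eps) for i, u in enumerate(ids)}
    threshold = quantile(list(scores.values()), q)
    suspects = {u for u, score in scores.items() if score > threshold}
    return scores, threshold, suspects


# Invented toy data: 98 ordinary users plus two search-only accounts.
users = {
    f"u{i}": {"searches": 5 + i % 5, "calls": 2 + i % 3, "messages": 3 + i % 4}
    for i in range(98)
}
users["b0"] = {"searches": 200, "calls": 0, "messages": 0}
users["b1"] = {"searches": 300, "calls": 0, "messages": 0}

scores, threshold, suspects = flag_suspects(users)
print(sorted(suspects))
```

With this toy data only the two search-only accounts land above the cut. Because eps is tiny, any user with zero communication gets an extremely large score, which is the intended effect in this sketch.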

Narrowing the Suspects

A high search-to-communication ratio alone isn’t enough. We refine further by excluding suspects who delayed their first search, on the assumption that automated accounts act immediately after signup. A genuine user might take days to start searching; a scraper starts at registration.
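
The refinement can be expressed as a filter on the gap between signup and first search. The sketch below is illustrative only: the one-hour window (max_delay) is an assumed value, since the post does not state what counts as a "delayed" first search, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta


def keep_immediate_searchers(suspects, signup_ts, first_search_ts,
                             max_delay=timedelta(hours=1)):
    """Keep only suspects whose first search came within max_delay of signup."""
    kept = set()
    for uid in suspects:
        first = first_search_ts.get(uid)
        if first is None:
            continue  # never searched, so there is no delay to measure
        delay = first - signup_ts[uid]
        # Negative delays would indicate a data-quality problem, so drop those too.
        if timedelta(0) <= delay <= max_delay:
            kept.add(uid)
    return kept


# Invented example: "a" searches five seconds after signup, "b" three days later.
signup_ts = {
    "a": datetime(2023, 1, 1, 10, 0, 0),
    "b": datetime(2023, 1, 1, 10, 0, 0),
    "c": datetime(2023, 1, 1, 10, 0, 0),
}
first_search_ts = {
    "a": datetime(2023, 1, 1, 10, 0, 5),
    "b": datetime(2023, 1, 4, 10, 0, 0),
}

refined = keep_immediate_searchers({"a", "b", "c"}, signup_ts, first_search_ts)
print(refined)
```

The window is the main tuning knob here: a tighter value trades recall for precision, and the right setting would have to come from the data.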

The final fraud detection workflow combines:

Search-to-communication ratio — primary signal
Zero communication users — users with searches but zero calls and messages
Geographic patterns — Spain showed the highest fraud rate at 18.31% of signups in that cohort
Temporal behaviour — time between signup and first search
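
Two of the signals in the list above, the zero-communication flag and the per-country fraud rate, can be sketched as follows. The definition of fraud rate used here (flagged users divided by signups in the same country) is an assumption about how a figure like 18.31% would be computed, and the four sample users are invented.

```python
from collections import Counter


def zero_communication(users):
    """Users with at least one search but no calls and no messages."""
    return {
        uid for uid, f in users.items()
        if f["searches"] > 0 and f["calls"] == 0 and f["messages"] == 0
    }


def fraud_rate_by_country(users, flagged):
    """Share of each country's signups that were flagged (assumed definition)."""
    signups = Counter(f["country"] for f in users.values())
    fraud = Counter(users[uid]["country"] for uid in flagged if uid in users)
    return {country: fraud[country] / n for country, n in signups.items()}


# Invented sample users.
users = {
    "u1": {"country": "ES", "searches": 40, "calls": 0, "messages": 0},
    "u2": {"country": "ES", "searches": 3, "calls": 2, "messages": 5},
    "u3": {"country": "FR", "searches": 4, "calls": 1, "messages": 1},
    "u4": {"country": "FR", "searches": 0, "calls": 0, "messages": 0},
}

flagged = zero_communication(users)
rates = fraud_rate_by_country(users, flagged)
riskiest = max(rates, key=rates.get)
print(flagged, rates, riskiest)
```

Note that a user with no activity at all (u4 above) is not flagged: the signal targets accounts that search without communicating, not dormant ones.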

Results

1,228 fraudulent users identified
500 users showed elevated search activity with near-zero communication
Spain flagged as highest-risk geography requiring additional scrutiny

Key Takeaway

Fraud detection works best when you combine multiple weak signals into a composite score rather than relying on any single rule. The search-to-communication ratio is powerful precisely because it’s relative: it compares each user’s searches with their own communication activity rather than applying a fixed cutoff to raw search counts.

The same principle applies in financial fraud: transaction anomaly detection, velocity checks, and device fingerprinting each tell part of the story. The model wins when it learns to weight them together.

Original analysis published on Medium.