Fraud detection is rarely about finding a single smoking gun. It’s about finding users whose pattern of behaviour is statistically inconsistent with legitimate usage. This post walks through a case study of detecting fraudulent users on an internet platform using four data sources and a handful of behavioural signals.
The Dataset
Four data sources were combined:
- Signup records — when and where users registered
- Call logs — communication activity
- Message history — messaging activity
- Search activity — search queries
The first step was data quality: timestamp conversion, ID validation across tables, and joining the four sources into a single user-level feature set.
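The preparation step could look roughly like the sketch below. It is a minimal illustration, not the original pipeline: the column names (`user_id`, `signup_ts`, `search_ts`) and the helper `build_user_features` are assumptions, since the post does not give the actual schema.

```python
import pandas as pd

def build_user_features(signups, calls, messages, searches):
    """Return one row per user. All column names here are assumed."""
    # Timestamp conversion; unparseable values become NaT instead of raising
    signups = signups.drop_duplicates("user_id").copy()
    signups["signup_ts"] = pd.to_datetime(signups["signup_ts"], errors="coerce")
    searches = searches.copy()
    searches["search_ts"] = pd.to_datetime(searches["search_ts"], errors="coerce")

    features = signups.set_index("user_id")
    known_ids = features.index

    tables = (("n_calls", calls), ("n_messages", messages), ("n_searches", searches))
    for name, table in tables:
        # ID validation: drop activity rows whose user_id has no signup record
        valid = table[table["user_id"].isin(known_ids)]
        # Count events per user; users with no events get 0
        features[name] = valid.groupby("user_id").size()
        features[name] = features[name].fillna(0).astype(int)

    # Earliest search per user, kept for the temporal signal (NaT if none)
    features["first_search_ts"] = searches.groupby("user_id")["search_ts"].min()
    return features.reset_index()
```

Counting events per user before joining keeps the result at one row per user, which avoids the row multiplication that joining raw event tables would cause.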
The Core Hypothesis
Legitimate users do things. They search, they call, they message — in roughly human proportions. Bots, scrapers, and fraudulent accounts tend to have one behaviour in abundance and others at zero.
The signal we build on this:
suspicion_score = search_normalized / (message_normalized + call_normalized)
Users whose ratio exceeds the 0.98 quantile (the 98th percentile of all scores) are flagged as suspects: they run an abnormally high number of searches relative to their communication activity.
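As a rough sketch of this scoring step: the post does not say how the counts were normalized, so the version below assumes each count is divided by its column maximum. The count column names and the function `flag_by_ratio` are also hypothetical. Users with no calls and no messages would divide by zero, so their score is left undefined here and they are not flagged by this rule.

```python
import numpy as np
import pandas as pd

def flag_by_ratio(features, q=0.98):
    """Score users by search-to-communication ratio and flag the top tail."""
    out = features.copy()
    for col in ("n_searches", "n_messages", "n_calls"):
        peak = out[col].max()
        # Assumed normalization: scale each count by its column maximum
        out[col + "_norm"] = out[col] / peak if peak > 0 else 0.0

    # A zero denominator becomes NaN so the score is undefined, not infinite
    denom = (out["n_messages_norm"] + out["n_calls_norm"]).replace(0, np.nan)
    out["suspicion_score"] = out["n_searches_norm"] / denom

    # Series.quantile ignores NaN; comparisons against NaN evaluate to False
    threshold = out["suspicion_score"].quantile(q)
    out["ratio_suspect"] = out["suspicion_score"] > threshold
    return out, threshold
```

Because the cutoff is a quantile of the observed scores, roughly the top 2% of users with a defined score are flagged regardless of the absolute scale of the counts.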
Narrowing the Suspects
A high search-to-communication ratio alone isn’t enough. We refine further by excluding users who delayed their first search, on the premise that bots act immediately after signup. A genuine user might take days to start searching; a scraper starts at registration.
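One way to express this filter is sketched below. The one-hour window is purely an illustrative assumption, as the post does not state what counts as "immediately"; the column names `signup_ts` and `first_search_ts` and the function `searched_immediately` are likewise hypothetical.

```python
import pandas as pd

def searched_immediately(features, max_delay=pd.Timedelta(hours=1)):
    """Boolean mask: True where the first search closely followed signup."""
    delay = features["first_search_ts"] - features["signup_ts"]
    # Users who never searched have a NaT delay and are excluded.
    # Negative delays (search before signup) are treated as data errors.
    return delay.notna() & (delay >= pd.Timedelta(0)) & (delay <= max_delay)
```

The mask can then be combined with the ratio flag, for example `suspects[searched_immediately(suspects)]`, to keep only suspects who started searching right after registering.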
The final fraud detection workflow combines:
- Search-to-communication ratio — primary signal
- Zero communication users — users with searches but zero calls and messages
- Geographic patterns — Spain showed the highest fraud rate at 18.31% of signups in that cohort
- Temporal behaviour — time between signup and first search
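The zero-communication and geographic signals in the list above can be sketched as follows. This is only an illustration under assumed column names (`n_searches`, `n_calls`, `n_messages`, `country`); the post does not describe how the four signals were finally weighted or combined, so no combination rule is shown.

```python
import pandas as pd

def zero_communication(features):
    """Users who searched at least once but never called or messaged."""
    return (
        (features["n_searches"] > 0)
        & (features["n_calls"] == 0)
        & (features["n_messages"] == 0)
    )

def fraud_rate_by_country(features, is_fraud):
    """Percentage of flagged users per signup country, highest first."""
    # The mean of a boolean Series is the fraction of True values
    rate = is_fraud.groupby(features["country"]).mean() * 100
    return rate.sort_values(ascending=False)
```

A per-country rate like this is the kind of figure behind the Spain number quoted above: flagged users divided by all signups from that country.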
Results
- 1,228 fraudulent users identified
- 500 users showed elevated search activity with near-zero communication
- Spain flagged as highest-risk geography requiring additional scrutiny
Key Takeaway
Fraud detection works best when you combine multiple weak signals into a composite score rather than relying on any single rule. The search-to-communication ratio is powerful precisely because it’s relative — it adapts to each user’s own activity level rather than applying a fixed threshold.
The same principle applies in financial fraud: transaction anomaly detection, velocity checks, and device fingerprinting each tell part of the story. The model wins when it learns to weight them together.
Original analysis published on Medium.