Advanced Heuristic Detection
Further reading: How DetectRight detects User Agents
The best of both worlds
DetectRight combines the strengths of expression matching and user-agent matching to provide the best detection accuracy even on new User Agents, which are a perennial weakness of existing systems such as DeviceAtlas.
Pure Algorithm doesn't work so well...
Pure-algorithm engines include OpenDDR, Browsercaps, and any small-script detection such as that from DetectMobileBrowsers.mobi.
The examples mentioned above are systems which do not ship with lists of User Agents, and so rely exclusively on algorithms/heuristics to do detection.
Expression matching for detection is often used simply to determine mobile/non-mobile, or to establish whether a useragent can be treated in some way (perhaps to reduce it to an arbitrary ID that can be matched to a data record).
There are some major problems with a pure algorithm/heuristic approach:
- The chaos of User Agent design and the minor details that can impact on the order of information mean that you will inevitably either perform poorly, end up with unmaintainable spaghetti logic, or both.
- Often a successful detection will involve chains of two or three expressions in the correct order: detections relying on the appearance of one token in a string are doomed to a life of inaccuracy.
- Even worse, in many cases, the order of the expressions themselves will have an effect on accuracy. For instance, the appearance of "MSIE" in a string can only be assumed to be MS Internet Explorer when all other options have been exhausted: Pocket Internet Explorer, IEMobile, or any other browser that uses the Mozilla sequence and puts "MSIE" in to keep sniffers happy.
- Also, sometimes the jump between what's in a UserAgent and what it represents is too large to be patternised, and must be manually mapped as an exception.
- A hidden problem for a purely algorithmic system is this: the quality of the detection depends on the number of useragents run through it, but the producers of these systems either have no access to the number of User Agents required, or do not have the ability to process them cost-effectively (or at all).
- Lastly, the accuracy is too poor to be adequate except for the most undemanding use cases: for instance, mobile/non-mobile detection where the cost of failure is trivial.
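The ordering problem with "MSIE" can be illustrated with a minimal sketch. The rule list, patterns, and function name below are hypothetical and are not taken from any shipping detection engine; they only show why the generic "MSIE" test has to come last.

```python
import re

# Hypothetical, illustrative rules. Order matters: browsers that merely
# include "MSIE" for compatibility are tested before the generic MSIE rule.
BROWSER_RULES = [
    ("IEMobile", re.compile(r"IEMobile[/ ](\d+(?:\.\d+)*)")),
    ("Opera", re.compile(r"Opera[/ ](\d+(?:\.\d+)*)")),
    ("Pocket Internet Explorer", re.compile(r"MSIE (\d+(?:\.\d+)*).*Windows CE")),
    ("Internet Explorer", re.compile(r"MSIE (\d+(?:\.\d+)*)")),
]


def detect_browser(user_agent):
    """Return (browser name, version) from the first rule that matches, else None."""
    for name, pattern in BROWSER_RULES:
        match = pattern.search(user_agent)
        if match:
            return name, match.group(1)
    return None
```

If the "Internet Explorer" rule were moved to the top of the list, every example string containing "MSIE" would be reported as desktop Internet Explorer, which is exactly the failure described in the list above.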
DetectRight's algorithmic engine can use up to seven expressions in a single component detection, and runs over 800 of them on every useragent to detect the type, manufacturer, model/name, version number and build of every single component it can identify. It then prioritises what it sees, resolves it to a clean list of components, combines that intelligently with the components in the default shipped build (thus filling the gaps), and uses that collection of entities to generate the data. But that's not all it does. It can also pass headers through the same engine, and pass the results of parts of the detections through a validation engine.
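As a rough illustration of what a multi-expression component detection might look like, here is a minimal sketch. It is not DetectRight's engine: the `Component` type, the two rules, and `detect_components` are all assumptions made for the example, and a real rule set would be far larger.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    kind: str
    name: str
    version: str


# Hypothetical rules. Each rule is a chain of expressions: every expression
# must match, and the last one supplies the version in its first group.
RULES = [
    ("OS", "Android", [r"Linux", r"Android (\d+(?:\.\d+)*)"]),
    ("Browser", "Chrome", [r"AppleWebKit", r"Chrome/(\d+(?:\.\d+)*)"]),
]


def detect_components(user_agent):
    """Return every component whose whole expression chain matches."""
    found = []
    for kind, name, chain in RULES:
        match = None
        for expression in chain:
            match = re.search(expression, user_agent)
            if match is None:
                break  # one link failed, so the rule does not apply
        else:
            found.append(Component(kind, name, match.group(1)))
    return found
```

A single User Agent can yield several components at once, which is the point of running many rules over every string rather than stopping at the first hit.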
Pure User Agent detection is hit and miss too
Pure User Agent detection means having a list of previously seen User Agents mapped to a data record. If there's an exact match, then all is well with the world. If the useragent misses (even when it's been cleaned of language strings, for instance), then you are left with nothing.
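The exact-match approach, including the cleaning of language strings, can be sketched as follows. The lookup table, the device record, and the locale pattern are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical table of previously seen User Agents mapped to a data record.
KNOWN_USER_AGENTS = {
    "Mozilla/5.0 (Linux; U; Android 2.3.4; GT-I9100 Build/GINGERBREAD)": "Samsung GT-I9100",
}

# Matches locale tokens such as "; en-us" or "; de" that are followed by
# another ";" or a closing ")". An assumption about where locales appear.
LANGUAGE_TOKEN = re.compile(r";\s*[a-z]{2}(?:[-_][A-Za-z]{2})?(?=\s*[;)])")


def normalise(user_agent):
    """Strip language tokens so that locale variants collapse to one key."""
    return LANGUAGE_TOKEN.sub("", user_agent)


def lookup(user_agent):
    """Return the mapped record on an exact match, or None on a miss."""
    return KNOWN_USER_AGENTS.get(normalise(user_agent))
```

A string that differs only in locale still resolves, but one that differs in any other token, such as a minor OS version, returns `None`: there is nothing to fall back on.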
- It's not really appreciated that there are simply too many useragents being released for anything but a dedicated system to deal with them. DetectRight has seen over 1,000,000 user agents, 200,000 of which are mobile. Any pure User Agent detection system that didn't contain all of them would be incomplete. But any system which did contain them would be so large or slow that it wouldn't be deployable.
- Applications can themselves change User Agents in small but meaningful ways: Facebook has a format all its own, for instance.
- Third party browser manufacturers are always altering User Agent formats.
- New User Agents enter the market every day. Hundreds of them, possibly thousands.
Partial Matching and Recovery strategies
In reality, no one writes a system for pure User Agent matching, precisely because of the problems mentioned. For instance, DeviceAtlas, WURFL, Volantis and others have mechanisms to fall back to if a user agent match fails.
This leads to another problem: the fewer User Agents a system has, the leaner and quicker it becomes, but the more it relies on its internal strategies.
It doesn't matter how good the rest of the database is: if those strategies are not highly effective, device detection will plummet in accuracy, since it will not be able to retrieve the specific information already present about the device.
These strategies vary from system to system. DeviceAtlas's very infrastructure assumes that the meaningful information in a User Agent comes at the beginning of the string. It inherited this from the various systems it evolved from, and it was a functional strategy back when only desktops used Mozilla-format useragents. It rests, though, on the all-too-common assumption that seems to plague the mobile industry: "things will not change from this point" (despite being in an industry almost designed to change everything regularly).
This means that DeviceAtlas has to rely on knowing when to "jump over" things likely to break its detection: such as Android version numbers (a recent upgrade means that it keeps this previously discarded information, but that doesn't alter the profile). This method is an untidy hack, as is any method which is sensitive to character positioning in a User Agent, which is a collection of arbitrarily ordered tokens.
When an exact match is not forthcoming in WURFL, a "handler" is chosen: each handler runs some regular expressions to see what module should handle the string. Once that module is chosen, the system either uses the "Repeated Incrementing String" method (building up one character at a time until there's a partial match on another string), or Levenshtein Distance (LD), which compares strings for the "distance" between them in terms of character differences. In each case, the handler also has different start/end parameters to improve accuracy.
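The two fallback styles described above can be sketched in a few lines. This is not WURFL's code: a longest-shared-prefix comparison stands in for the incremental string method, the Levenshtein function is the textbook dynamic-programming version, and the function names and thresholds are assumptions.

```python
def shared_prefix_length(a, b):
    """Count how many leading characters two strings have in common."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def prefix_match(user_agent, known, min_length):
    """Return the known string sharing the longest prefix, if long enough."""
    best = max(known, key=lambda k: shared_prefix_length(user_agent, k), default=None)
    if best is None or shared_prefix_length(user_agent, best) < min_length:
        return None
    return best


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def distance_match(user_agent, known, max_distance):
    """Return the closest known string if it is within max_distance edits."""
    best = min(known, key=lambda k: levenshtein(user_agent, k), default=None)
    if best is None or levenshtein(user_agent, best) > max_distance:
        return None
    return best
```

Both functions treat the User Agent as an opaque run of characters. Moving one token ahead of another changes the prefix and inflates the edit distance even though the string describes the same device.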
If that fails, then the system has to fall back to "Recovery Matching": the use of simple hardcoded terms to attach to a particular data entry, which will usually be a browser or operating system.
These techniques have one thing in common: they are general-purpose string handling, poorly suited to User Agents
DetectRight incorporates the best of both technologies: a vast collection of User Agents, a huge collection of algorithms, and a database which only includes UserAgents where the device it's attached to is different from the one the algorithms would choose. It even allows a user's own User Agents to supersede DetectRight's own mapping.
DetectRight's engine is tested on millions of User Agents, yet the product ships with only a fraction of that total.
Before a DetectRight database is generated, every single one of its user-agents is passed through the detection engine: for any results where there is a disagreement between our manual data and the algorithm, the manual mapping takes priority, and the compiled database will contain it.
This means DetectRight comes with the equivalent of 2,000,000 detected useragents, without the downside.
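The compile step described above can be sketched as a simple filter. The function names and the toy `detect` callable are hypothetical; the sketch only shows the idea of storing a User Agent when the algorithm and the manual mapping disagree, and says nothing about how DetectRight actually builds its database.

```python
def compile_database(manual_mappings, detect):
    """Keep only User Agents where the algorithm disagrees with the manual mapping."""
    return {
        user_agent: device
        for user_agent, device in manual_mappings.items()
        if detect(user_agent) != device
    }


def resolve(user_agent, compiled, detect):
    """Manual exceptions take priority; everything else goes through the algorithm."""
    if user_agent in compiled:
        return compiled[user_agent]
    return detect(user_agent)


def toy_detect(user_agent):
    """Stand-in algorithm for the example: the first token before '/' is the device."""
    return user_agent.split("/", 1)[0]


manual = {
    "Nokia6300/2.0": "Nokia6300",       # algorithm agrees, so it is dropped
    "Vendor-X1/1.0": "Acme Phone One",  # algorithm disagrees, so it is kept
}
compiled = compile_database(manual, toy_detect)
```

Here `compiled` holds one entry instead of two, yet `resolve` still returns the manually mapped answer for both strings, and also produces an answer for strings that were never seen.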