Apache Drill

Introduction

This is UDF for Apache Drill. This function was originally created by Charles S. Givre and was imported into the main Yauaa project to ensure users would have a prebuilt and up-to-date version available.

This function is now also packaged as part of Apache Drill itself: documentation.

Notable changes

With Yauaa 7.0.0

  • the code for this UDF has been copied back from the Drill project to ensure it keeps working as expected.
  • the parse_user_agent_field has been removed and parse_user_agent supports the same input/output now.

Usage

I have copied/implemented the functions

parse_user_agent ( <useragent> )
parse_user_agent ( <useragent> , <desired fieldname> )

to support the analysis of the Client Hints it now also supports

parse_user_agent ( <useragent> , [<header name>,<value>]+ )

or the variant which requires the presense of a User-Agent header.

parse_user_agent ( [<header name>,<value>]+ )

Installation

By default, Apache Drill has a version of this udf built in (part of the drill-udfs-*.jar)

In order to update the version of Yauaa in your drill installation you’ll need to replace the yauaa related jars with the latest versions. This includes not only updating yauaa-7.28.1.jar and yauaa-logparser-7.28.1.jar but also the rellevant dependencies.

If you want to replace it completely you’ll also need to remove the drill-udfs-*.jar and copy the yauaa-drill-7.28.1.jar and all relevant dependencies to <drill-path>/jars/3rdparty.

Similar:

mvn -DgroupId=nl.basjes.parse.useragent -DartifactId=yauaa-drill -Dversion=7.28.1 dependency:get -DoutputDirectory="./deps/"

Make sure you replace <drill-path> with your actual path to your drill installation. NOTE: You do not need the yauaa-drill-7.28.1-sources.jar to run in drill.

Usage and examples

Working queries with a direct value

SELECT parse_user_agent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36', 'AgentNameVersion') AS ANV from (values(1));

+---------------------+
|         ANV         |
+---------------------+
| Chrome 48.0.2564.82 |
+---------------------+

Or for the entire set of all values as a Drill map

SELECT parse_user_agent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36') AS UA from (values(1));

which is shown as a JSon in the commandline client:

+----------------------------------------------------------------------------------+
|                                        UA                                        |
+----------------------------------------------------------------------------------+
| {"DeviceClass":"Desktop","DeviceName":"Linux Desktop","DeviceBrand":"Unknown","DeviceCpu":"Intel x86_64", "DeviceCpuBits":"64","DeviceFirmwareVersion":"??","DeviceVersion":"??","OperatingSystemClass":"Desktop","OperatingSystemName":"Linux","OperatingSystemVersion":"??","OperatingSystemVersionMajor":"??","OperatingSystemNameVersion":"Linux ??","OperatingSystemNameVersionMajor":"Linux ??","OperatingSystemVersionBuild":"??","LayoutEngineClass":"Browser","LayoutEngineName":"Blink","LayoutEngineVersion":"48.0","LayoutEngineVersionMajor":"48","LayoutEngineNameVersion":"Blink 48.0","LayoutEngineNameVersionMajor":"Blink 48","LayoutEngineBuild":"Unknown","AgentClass":"Browser","AgentName":"Chrome","AgentVersion":"48.0.2564.82","AgentVersionMajor":"48","AgentNameVersion":"Chrome 48.0.2564.82","AgentNameVersionMajor":"Chrome 48","AgentBuild":"Unknown","AgentLanguage":"Unknown","AgentLanguageCode":"Unknown","AgentInformationEmail":"Unknown","AgentInformationUrl":"Unknown","AgentSecurity":"Unknown","AgentUuid":"Unknown","WebviewAppName":"Unknown","WebviewAppVersion":"??","WebviewAppVersionMajor":"??","WebviewAppNameVersion":"Unknown ??","WebviewAppNameVersionMajor":"Unknown ??","FacebookCarrier":"Unknown","FacebookDeviceClass":"Unknown","FacebookDeviceName":"Unknown","FacebookDeviceVersion":"??","FacebookFBOP":"Unknown","FacebookFBSS":"Unknown","FacebookOperatingSystemName":"Unknown","FacebookOperatingSystemVersion":"??","Anonymized":"Unknown","HackerAttackVector":"Unknown","HackerToolkit":"Unknown","KoboAffiliate":"Unknown","KoboPlatformId":"Unknown","IECompatibilityVersion":"??","IECompatibilityVersionMajor":"??","IECompatibilityNameVersion":"Unknown ??","IECompatibilityNameVersionMajor":"Unknown ??","Carrier":"Unknown","GSAInstallationID":"Unknown","NetworkType":"Unknown","__SyntaxError__":"false"} |
+----------------------------------------------------------------------------------+

The function returns a Drill map, so you can access any of the fields using Drill’s table.map.key notation. For example, the query below illustrates how to extract a field from this map and summarize it:

SELECT uadata.ua.AgentNameVersion AS Browser,
COUNT( * ) AS BrowserCount
FROM (
   SELECT parse_user_agent( columns[0] ) AS ua
   FROM dfs.`/tmp/testcase.tsv`
) AS uadata
GROUP BY uadata.ua.AgentNameVersion
ORDER BY BrowserCount DESC;

+----------------------+---------------+
|       Browser        | BrowserCount  |
+----------------------+---------------+
| Chrome 48.0.2564.82  | 1             |
| Googlebot 2.1        | 1             |
+----------------------+---------------+

Full example analyzing with the User-Agent Client Hints

Assume an Apache Httpd webserver with the following LogFormat config:

LogFormat "%a %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Sec-CH-UA}i\" \"%{Sec-CH-UA-Arch}i\" \"%{Sec-CH-UA-Bitness}i\" \"%{Sec-CH-UA-Full-Version}i\" \"%{Sec-CH-UA-Full-Version-List}i\" \"%{Sec-CH-UA-Mobile}i\" \"%{Sec-CH-UA-Model}i\" \"%{Sec-CH-UA-Platform}i\" \"%{Sec-CH-UA-Platform-Version}i\" \"%{Sec-CH-UA-WoW64}i\" %V" combinedhintsvhost

Behind this Apache Httpd webserver is a website that returns the header

Accept-CH: Sec-CH-UA, Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Mobile, Sec-CH-UA-Model, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version, Sec-CH-UA-WoW64

With all of this in place: these are two of the lines that are found in the access log of this Apache Httpd webserver:

45.138.228.54 - - [02/May/2022:12:25:10 +0200] "GET / HTTP/1.1" 200 16141 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36" "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"100\", \"Google Chrome\";v=\"100\"" "\"x86\"" "\"64\"" "\"100.0.4896.127\"" "\" Not A;Brand\";v=\"99.0.0.0\", \"Chromium\";v=\"100.0.4896.127\", \"Google Chrome\";v=\"100.0.4896.127\"" "?0" "\"\"" "\"Linux\"" "\"5.13.0\"" "?0" try.yauaa.basjes.nl
45.138.228.54 - - [02/May/2022:12:25:34 +0200] "GET / HTTP/1.1" 200 15376 "-" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Mobile Safari/537.36" "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"101\", \"Google Chrome\";v=\"101\"" "\"\"" "-" "\"101.0.4951.41\"" "\" Not A;Brand\";v=\"99.0.0.0\", \"Chromium\";v=\"101.0.4951.41\", \"Google Chrome\";v=\"101.0.4951.41\"" "?1" "\"Nokia 7.2\"" "\"Android\"" "\"11.0.0\"" "?0" try.yauaa.basjes.nl

For this example the name of this file is access.hints

I start by doing a query on this data and ONLY use the User-Agent as the input:

SELECT  uadata.ua.DeviceClass                AS DeviceClass,
        uadata.ua.AgentNameVersionMajor      AS AgentNameVersionMajor,
        uadata.ua.OperatingSystemNameVersion AS OperatingSystemNameVersion
FROM (
    SELECT
            parse_user_agent(`request_user-agent`) AS ua
    FROM    table(
                dfs.`/tmp/access.hints` (
                    type => 'httpd',
                    logFormat => '%a %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" "%{Sec-CH-UA}i" "%{Sec-CH-UA-Arch}i" "%{Sec-CH-UA-Bitness}i" "%{Sec-CH-UA-Full-Version}i" "%{Sec-CH-UA-Full-Version-List}i" "%{Sec-CH-UA-Mobile}i" "%{Sec-CH-UA-Model}i" "%{Sec-CH-UA-Platform}i" "%{Sec-CH-UA-Platform-Version}i" "%{Sec-CH-UA-WoW64}i" %V',
                    flattenWildcards => true
                )
            )
) AS uadata;

which produces

+-------------+-----------------------+----------------------------+
| DeviceClass | AgentNameVersionMajor | OperatingSystemNameVersion |
+-------------+-----------------------+----------------------------+
| Desktop     | Chrome 100            | Linux ??                   |
| Phone       | Chrome 101            | Android ??                 |
+-------------+-----------------------+----------------------------+
2 rows selected (0.183 seconds)

The first example here does not have the exact version of the operating system as part of the User-Agent and this results in Linux ??.

The second example shows Android 10 but was recognized as being a reduced variant of the User-Agent, this means that the version 10 is an invalid standard value that is not true. So here you see Android ??.

Now let’s repeat the same and use the recorded User-Agent Client Hint header values:

SELECT  uadata.ua.DeviceClass                AS DeviceClass,
        uadata.ua.AgentNameVersionMajor      AS AgentNameVersionMajor,
        uadata.ua.OperatingSystemNameVersion AS OperatingSystemNameVersion
FROM (
    SELECT
            parse_user_agent(
                'User-Agent' ,                  `request_user-agent`,
                'sec-ch-ua',                    `request_header_sec-ch-ua`,
                'sec-ch-ua-arch',               `request_header_sec-ch-ua-arch`,
                'sec-ch-ua-bitness',            `request_header_sec-ch-ua-bitness`,
                'sec-ch-ua-full-version',       `request_header_sec-ch-ua-full-version`,
                'sec-ch-ua-full-version-list',  `request_header_sec-ch-ua-full-version-list`,
                'sec-ch-ua-mobile',             `request_header_sec-ch-ua-mobile`,
                'sec-ch-ua-model',              `request_header_sec-ch-ua-model`,
                'sec-ch-ua-platform',           `request_header_sec-ch-ua-platform`,
                'sec-ch-ua-platform-version',   `request_header_sec-ch-ua-platform-version`,
                'sec-ch-ua-wow64',              `request_header_sec-ch-ua-wow64`
            ) AS ua
    FROM    table(
                dfs.`/tmp/access.hints` (
                    type => 'httpd',
                    logFormat => '%a %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" "%{Sec-CH-UA}i" "%{Sec-CH-UA-Arch}i" "%{Sec-CH-UA-Bitness}i" "%{Sec-CH-UA-Full-Version}i" "%{Sec-CH-UA-Full-Version-List}i" "%{Sec-CH-UA-Mobile}i" "%{Sec-CH-UA-Model}i" "%{Sec-CH-UA-Platform}i" "%{Sec-CH-UA-Platform-Version}i" "%{Sec-CH-UA-WoW64}i" %V',
                    flattenWildcards => true
                )
            )
) AS uadata;

which produces

+-------------+-----------------------+----------------------------+
| DeviceClass | AgentNameVersionMajor | OperatingSystemNameVersion |
+-------------+-----------------------+----------------------------+
| Desktop     | Chrome 100            | Linux 5.13.0               |
| Phone       | Chrome 101            | Android 11.0.0             |
+-------------+-----------------------+----------------------------+
2 rows selected (0.275 seconds)

The improvement after adding the Client Hints is evident.