Apache Beam SQL
Introduction
This is a User Defined Function for Apache Beam SQL.
Getting the UDF
You can get the prebuilt UDF from maven central.
If you use a maven based project simply add this dependency to your project.
<dependency>
<groupId>nl.basjes.parse.useragent</groupId>
<artifactId>yauaa-beam-sql</artifactId>
<version>7.29.0</version>
</dependency>
Available functions
Getting a single value
To get a single value from the parse result use this one:
ParseUserAgentField(userAgent, 'DeviceClass') AS deviceClassField
to give
Phone
Getting several values as a Map (requires Apache Beam 2.30.0 or newer)
You can ask for all fields and return the full map with all of them in there.
ParseUserAgent(userAgent) AS allFields
If you want to make use of the support for the User-Agent Client Hints
you must call the function from your SQL with a list of header name
and value
. The header names must be the same as what a browser would send to the webserver (see: Specification).
Essentially two forms are now possible:
ParseUserAgent ( <useragent> , [<header name>,<value>]+ )
and the variant which requires the presense of a User-Agent
header.
ParseUserAgent ( [<header name>,<value>]+ )
For example:
ParseUserAgent(
'User-Agent', useragent,
'Sec-CH-UA-Platform', chPlatform,
'Sec-CH-UA-Platform-Version', chPlatformVersion
) AS parsedUseragent
If you need multiple fields but not all of them you can ask for only those specific fields:
ParseUserAgent(userAgent, 'DeviceClass', 'AgentNameVersion') AS someFields
With such a map you can then extract the field you really need with SQL syntax similar to this:
ParseUserAgent(userAgent, 'DeviceClass')['DeviceClass'] AS deviceClass
ParseUserAgent(userAgent)['AgentNameVersion'] AS agentNameVersion
Because a Beam SQL UDF cannot have a constructor there is no performance gain from limiting the required fields during the analysis phase. It does make the returned set smaller and thus in terms of transport and serialization a bit faster.
If you als want the client hints all of these parameters can be combined like this:
ParseUserAgent(
'user-Agent', userAgent,
'sec-CH-UA-Platform', chPlatform,
'sec-CH-UA-Platform-Version', chPlatformVersion
'DeviceClass',
'AgentNameVersionMajor',
'OperatingSystemNameVersion',
)
which is equivalent to (i.e. the ordering of the “wanted fields” and the “request headers with value” can be shuffled):
ParseUserAgent(
'DeviceClass',
'user-Agent', userAgent,
'AgentNameVersionMajor',
'sec-CH-UA-Platform', chPlatform,
'OperatingSystemNameVersion',
'sec-CH-UA-Platform-Version', chPlatformVersion
)
Getting several values as JSon
Assuming the input
Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36
To parse this input value into all possible values and returns the complete result as a single JSon string use this:
ParseUserAgentJson(userAgent)
to give (single line)
{"Useragent":"Mozilla\/5.0 (Linux; Android 7.0; Nexus 6 Build\/NBD90Z) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/53.0.2785.124 Mobile Safari\/537.36",
"DeviceClass":"Phone","DeviceName":"Google Nexus 6","DeviceBrand":"Google",
"OperatingSystemClass":"Mobile","OperatingSystemName":"Android","OperatingSystemVersion":"7.0","OperatingSystemVersionMajor":"7","OperatingSystemNameVersion":"Android 7.0","OperatingSystemNameVersionMajor":"Android 7","OperatingSystemVersionBuild":"NBD90Z",
"LayoutEngineClass":"Browser","LayoutEngineName":"Blink","LayoutEngineVersion":"53.0","LayoutEngineVersionMajor":"53","LayoutEngineNameVersion":"Blink 53.0","LayoutEngineNameVersionMajor":"Blink 53",
"AgentClass":"Browser","AgentName":"Chrome","AgentVersion":"53.0.2785.124","AgentVersionMajor":"53","AgentNameVersion":"Chrome 53.0.2785.124","AgentNameVersionMajor":"Chrome 53"}
To get a JSon with only specific fields you can do (up to 10 fields van be requested this way)
ParseUserAgentJson(userAgent, 'DeviceClass', 'AgentNameVersion')
to give
{"DeviceClass":"Phone","AgentNameVersion":"Chrome 53.0.2785.124"}
Note that the Client Hints model also applies here so this works too:
ParseUserAgentJson(
'DeviceClass',
'user-Agent', userAgent,
'AgentNameVersionMajor',
'sec-CH-UA-Platform', chPlatform,
'OperatingSystemNameVersion',
'sec-CH-UA-Platform-Version', chPlatformVersion
)
Example usage
Assume you have a PCollection with your records.
PCollection<Row> input = ...
You can then put that through an SQL statement to transform it. You have to name the input (in this example we call it InputStream), and you have to register the UDF classes you want to use with the name you want to have in your SQL statement.
PCollection<Row> result =
// This way we give a name to the input stream for use in the SQL
PCollectionTuple.of("InputStream", input)
// Apply the SQL with the UDFs we need.
.apply("Execute SQL", SqlTransform
.query(
"SELECT" +
" userAgent AS userAgent, " +
" ParseUserAgent(userAgent, 'DeviceClass', 'AgentNameVersion') AS parsedUserAgentMap, " +
" ParseUserAgentJson(userAgent) AS parsedUserAgentJson, " +
" ParseUserAgentField(userAgent, 'DeviceClass') AS deviceClass, " +
" ParseUserAgentField(userAgent, 'AgentNameVersion') AS agentNameVersion " +
"FROM InputStream")
.registerUdf("ParseUserAgent", ParseUserAgent.class)
.registerUdf("ParseUserAgentJson", ParseUserAgentJson.class)
.registerUdf("ParseUserAgentField", ParseUserAgentField.class)
);
Limitations / Future
The ParseUserAgent
and ParseUserAgentJson
have a limitation of at most 10 fieldnames because Calcite does not yet support variable arguments for UDFs. If you need more than 10 fields you currently need to get all
fields and then extract the fields you need from there.