User-Defined Function for Apache Pig

Getting the UDF

You can get the prebuilt UDF from Maven Central.

If you use a Maven-based project, simply add this dependency


Example usage

-- Import the UDF jar file so this script can use it
REGISTER ../target/*-udf.jar;

-- Define a more readable name for the UDF and pass optional parameters
-- First parameter is ALWAYS the cache size (as a text string!)
-- The parameters after that are the requested fields.
-- If you simply want 'everything'
-- DEFINE ParseUserAgent  nl.basjes.parse.useragent.pig.ParseUserAgent;
-- If you just want to set the cache
-- DEFINE ParseUserAgent  nl.basjes.parse.useragent.pig.ParseUserAgent('10000');
-- If you want to set the cache and only retrieve the specified fields
DEFINE ParseUserAgent  nl.basjes.parse.useragent.pig.ParseUserAgent('10000', 'DeviceClass', 'DeviceBrand' );

rawData =
    LOAD 'testcases.txt'
    USING PigStorage()
    AS  ( useragent: chararray );

UaData =
    FOREACH  rawData
    GENERATE useragent,
             -- Do NOT specify a type for this field as the UDF provides the definitions
             ParseUserAgent(useragent) AS parsedAgent;
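
Because the UDF supplies the schema for parsedAgent, the individual fields can be read from that tuple in a later statement. The following is a minimal sketch that continues the script above. It assumes the tuple's field names match the names requested in the DEFINE (DeviceClass and DeviceBrand); run DESCRIBE first to confirm the names the UDF actually declares. The alias Devices and the output column names are hypothetical.

```pig
-- Print the schema the UDF declared for the parsed tuple
DESCRIBE UaData;

-- Project individual fields out of the parsed tuple.
-- ASSUMPTION: the tuple fields are named exactly like the requested fields.
Devices =
    FOREACH  UaData
    GENERATE useragent,
             parsedAgent.DeviceClass AS deviceClass,
             parsedAgent.DeviceBrand AS deviceBrand;

DUMP Devices;
```

If DESCRIBE shows different field names, use those in the projection instead.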

