User Defined Function (ingest processor) for Elasticsearch
The Elasticsearch ingest plugin is very new.
And yes, it is similar to the builtin user agent processor: https://www.elastic.co/guide/en/elasticsearch/reference/master/user-agent-processor.html
You can get the prebuilt ingest plugin from Maven Central.
You only need to install it into your Elasticsearch once.
On Elasticsearch 7.x:
bin/elasticsearch-plugin install file:///path/to/yauaa-elasticsearch-7.28.1.zip
On Elasticsearch 8.x:
bin/elasticsearch-plugin install file:///path/to/yauaa-elasticsearch-8-7.28.1.zip
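After installing you can verify that the plugin was picked up by listing the installed plugins (standard Elasticsearch tooling; the exact name shown depends on the version you installed):
bin/elasticsearch-plugin list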
This plugin is intended to be used in an ingest pipeline.
You have to specify the name of the input field and the place where the parse results should be stored.
The possible configuration flags are:
Name | Mandatory/Optional | Description | Default | Example |
---|---|---|---|---|
field_to_header_mapping | M | The mapping from the input field name to the original request header name of this field | - | "field_to_header_mapping" : { "ua": "User-Agent" } |
- | "useragent" | |||
target_field | O | The name of the output structure that will be filled with the parse results | "user_agent" | "parsed_ua" |
fieldNames | O | A list of Yauaa fieldnames that are desired. When specified the system will limit processing to what is needed to get these, which is faster and uses less memory. | All possible fields | [ "DeviceClass", "DeviceBrand", "DeviceName", "AgentNameVersionMajor" ] |
cacheSize | O | The number of entries in the LRU cache of the parser | 10000 | 100 |
preheat | O | How many testcases are put through the parser at startup to warm up the JVM | 0 | 1000 |
extraRules | O | A YAML expression containing extra rules and testcases. | - | "config:\n- matcher:\n extract:\n - '"'"'FirstProductName : 1 :agent.(1)product.(1)name'"'"'\n" |
Create a pipeline that just extracts everything using the default settings:
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/_ingest/pipeline/yauaa-test-pipeline_basic' -d '
{
"description": "A pipeline to do whatever",
"processors": [
{
"yauaa" : {
"field_to_header_mapping" : {
"useragent": "User-Agent",
"uach_platform": "Sec-CH-UA-Platform",
"uach_platform_version": "Sec-CH-UA-Platform-Version"
},
"target_field" : "parsed"
}
}
]
}
'
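If you want to try a pipeline without indexing anything, the standard Elasticsearch _simulate API can be used (this is regular Elasticsearch functionality, not part of the plugin). A minimal sketch that runs the basic pipeline against a single document:
curl -H 'Content-Type: application/json' -X POST 'localhost:9200/_ingest/pipeline/yauaa-test-pipeline_basic/_simulate' -d '
{
"docs": [
{
"_source": {
"useragent": "Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36"
}
}
]
}
'
The response shows the documents as they would look after passing through the pipeline, without writing anything to an index.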
In this example a pipeline is created that only gets the fields that are actually desired.
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/_ingest/pipeline/yauaa-test-pipeline_some' -d '
{
"description": "A pipeline to do whatever",
"processors": [
{
"yauaa" : {
"field_to_header_mapping" : {
"useragent": "User-Agent",
"uach_platform": "Sec-CH-UA-Platform",
"uach_platform_version": "Sec-CH-UA-Platform-Version"
},
"target_field" : "parsed",
"fieldNames" : [ "DeviceClass", "DeviceBrand", "DeviceName", "AgentNameVersionMajor", "FirstProductName" ],
}
}
]
}
'
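To verify that the pipeline has been stored you can retrieve its definition again (standard Elasticsearch API):
curl -X GET 'localhost:9200/_ingest/pipeline/yauaa-test-pipeline_some'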
In this example a pipeline is created that includes an example of a custom rule. The hardest part is making the YAML (with quotes, newlines and the needed indentation) encode correctly inside a JSON structure; the decoded YAML is shown after this example for reference.
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/_ingest/pipeline/yauaa-test-pipeline_full' -d '
{
"description": "A pipeline to do whatever",
"processors": [
{
"yauaa" : {
"field_to_header_mapping" : {
"useragent": "User-Agent",
"uach_platform": "Sec-CH-UA-Platform",
"uach_platform_version": "Sec-CH-UA-Platform-Version"
},
"target_field" : "parsed",
"fieldNames" : [ "DeviceClass", "DeviceBrand", "DeviceName", "AgentNameVersionMajor", "FirstProductName" ],
"cacheSize" : 10,
"preheat" : 10,
"extraRules" : "config:\n- matcher:\n extract:\n - '"'"'FirstProductName : 1 :agent.(1)product.(1)name'"'"'\n"
}
}
]
}
'
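For reference: once the shell quoting and JSON escaping are stripped away, the extraRules value above is intended to represent YAML along these lines (a single matcher that extracts the name of the first product in the agent string):
config:
- matcher:
    extract:
    - 'FirstProductName : 1 :agent.(1)product.(1)name'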
I put a record in Elasticsearch using the full pipeline created above:
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/my-index/my-type/1?pipeline=yauaa-test-pipeline_full' -d '
{
"useragent" : "Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36"
}
'
which returns
{"_index":"my-index","_type":"my-type","_id":"1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}
Then I retrieve the record from Elasticsearch and the additional parse results are now part of the indexed record.
curl -s -H 'Content-Type: application/json' -X GET 'localhost:9200/my-index/my-type/1' | python -m json.tool
results in
{
"_id": "1",
"_index": "my-index",
"_primary_term": 1,
"_seq_no": 0,
"_source": {
"parsed": {
"AgentName": "Chrome",
"AgentNameVersionMajor": "Chrome 53",
"AgentVersion": "53.0.2785.124",
"AgentVersionMajor": "53",
"DeviceBrand": "Google",
"DeviceClass": "Phone",
"DeviceName": "Google Nexus 6",
"FirstProductName": "Mozilla"
},
"useragent": "Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36"
},
"_type": "my-type",
"_version": 1,
"found": true
}
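Note that the /my-index/my-type/1 style URLs rely on mapping types, which no longer exist in Elasticsearch 8.x; there you would index the same record through the _doc endpoint instead, for example:
curl -H 'Content-Type: application/json' -X PUT 'localhost:9200/my-index/_doc/1?pipeline=yauaa-test-pipeline_full' -d '
{
"useragent" : "Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36"
}
'
and retrieve it with GET /my-index/_doc/1.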
The Elasticsearch testing tools are quick to complain about jar classloading issues: “jar hell”.
To make it possible to run the tests in IntelliJ you’ll need to set a custom property (via Help → Edit Custom Properties…, which edits idea.properties):
idea.no.launcher=true