Hive Built-in Function and user defined Function (UDF)

Let us discuss Hive Function: Built-in Function and user defined Function (UDF),

Hive Functions are built for a specific purpose to perform various operations like Mathematical, arithmetic, logical and relational operations on the operands.

Built-in Function

Hive supports some of the built in functions like,

  1. Collection Functions
  2. Date Functions
  3. Mathematical Functions
  4. Conditional Functions
  5. String Functions
  6. Functions

Collection Functions

These functions are used to find the size of the complex types like array and map. The following table describes the Collection Functions available in Hive.

Function NameReturn TypeDescription
size(Map<K.V>)INTIt will fetch and give the components number in the map type.
size(Array<T>)INTIt will fetch and give the elements number in the array type.
Map_keys(Map<K.V>)Array<K>It will fetch and gives an array containing the keys of the input map. Here array is in unordered.
Map_values(Map<K.V>)Array<V>It will fetch and gives an array containing the values of the input map. Here array is in unordered.
Sort_array(Array<T>)Array<t>It will sorts the input array in ascending order of array and elements and returns it.

Example

hive>select sort_array(array(‘ram’,’appu’,’zia’,’roy’)) from emp;

Above query sort the input array in ascending order.

hive>select size(Array(‘appu’)) from emp;

above query gives the elements number in the array type.

Date Functions

These functions are used to perform operations on date data types like adding the number of days to the date, conversion of Date types from one type to another type etc.

Function NameReturn TypeDescription
Unix_Timestamp()BigIntIt will get current Unix timestamp in seconds.
To_date (string timestamp)StringIt will return the date part of a timestamp string.
year(string date)INTIt will fetch and give the year part of a date or a timestamp string.
quarter (date/timestamp/string)INTIt will return the quarter of the year for a date, timestamp, or string in the range 1 to 4.
month (string date)INTIt will give the month part of a date or a timestamp string.
hour (string date)INTIt will return the hour of the timestamp.
minute (string date)INTIt will fetch and return the minute of the timestamp.
Date_sub (string starting date, int days)StringIt will get and Subtract the number of days from starting date.
Current_dateDateIt will return the current date at the start of query evaluation.
LAST _day (string date)StringIt will fetch and gives the last day of the month which the date belongs to.
trunc (string date, string format)StringIt will get and return date truncated to the unit specified by the format.

Example

hive>select year(‘2005-01-01 10:20:30’) from emp;

output: 2005

hive>select month(‘2005-05-01 10:20:30’) from emp;

output: 5

Mathematical Functions

These functions mainly used to perform mathematical calculations.

Function NameReturn TypeDescription
round (DOUBLE X)DOUBLEIt will get and returns the rounded BIGINT value of X.
round (DOUBLE X, INT d)DOUBLEIt will get and returns X rounded to d decimal places.
bround (DOUBLE X)DOUBLEIt will get and returns the rounded BIGINT value of X using HALF_EVEN rounding mode.
Floor (DOUBLE Y)BIGINTIt will get and returns the maximum BIGINT value that is less then or equal to Y value.
ceil (DOUBLE a), ceiling (DOUBLE a)BIGINTIt will returns the minimum BIGINT value that is equal to or greater than X value.
rand(), rand(INT seed)DOUBLEIt will get and returns a random number that is distributed uniformly from 0 to 1.

Example

hive>select round(3.6) from emp;

output:4

hive>select floor(3.6) from emp;

output:3

hive>select ceil(3.6) from emp;

output:4

Conditional Functions

These functions are used to returns a value based on whether the test condition is true or false i.e test conditions.

Function NameReturn TypeDescription
if (Boolean testCondition, T valueTrue, T valueFalseOrNull)TIt will get and return value True when Test Condition is of true, gives value False Or Null otherwise.
ISNULL( X)BooleanIt will fetch and gives true if X is NULL and false otherwise.
ISNOTNULL(X )BooleanIt will fetch and gives true if X is not NULL and false otherwise.

Example

IF(1=1, ‘working’, ‘not working’) returns ‘working’

String Functions

These functions are used to perform operations on strings like finding the length of a string etc.

Function NameReturn TypeDescription
reverse(string X)StringIt will give the reversed string of X
rpad(string str, int length, string pad)StringIt will get and return str, which is right-padded with pad to a length of length (integer value).
rtrim(string X)StringIt will fetch and returns the string resulting from trimming spaces from the end (right hand side) of X For example, rtrim(‘ results ‘) results in ‘ results’
space(INT n)StringIt will fetch and gives a string of n spaces.
split(STRING str, STRING pat)StringSplits str around pat (pat is a regular expression).
Str_to_map(text[, delimiter1, delimiter2])map<String ,String>It will split text into key-value pairs using two delimiters.

Example

hive>select reverse(‘Beyondcorner)

output: renrocdnoyeb

hive>select rtrim(‘Hadoop ‘)

output: Hadoop

Misc. Functions

NameReturn TypeDescription
hash(a1[, a2…])IntReturns a hash value of the arguments

User defined Function (UDF)

Hive supports the user to define their own functions to meet the needs of client is called as User defined Function.

Example

suppose i need to perform sequence analysis based on client request on the below table called “demo” . I will create my own UDF called “rank”, it will return the number of times it has previously seen the same value.

data
P1
P1
P2
P2
P2
P3
P4

hive>select data,rank(data) from demo;

output:
P10
P11
P20
P21
P22
P30
P40

There are three types of user defined functions. They are,

1. Regular UDF

* It works on a single row in a table and produces a single row as output.

* It is one to one relationship between input and output of a function.

e.g Hive built in LENGTH() function.

hive>select LENGTH(col1) from emp;

2. UDAF (User defined aggregate functions)

* It works on more than one row and gives single row as output (many to one relationship)

e.g Hive built in MAX() or COUNT() functions.

hive>select count(*) from emp;

3. UDTF (User defined tabular function)

* It works on one row as input and returns multiple rows as output.

* It is one to many relationship.

*UDTF can be used to split a column into multiple column.

e.g Hive built in EXPLODE() function.

Note:EXPLODE() function is used to display the lateral view.

Consider the below base table named “pageads” contains two columns like pageid,adid_list

Column nameColumn type
pageidSTRING
adid_listArray<int>

An example table with two rows:

pageidadid_list
front_page[1, 2, 3]
contact_page[3, 4, 5]

and the user like to count the total number of times an ad appears across all pages.

A lateral view with EXPLODE() can be used to convert adid_list into separate rows using the query.

hive> select pageid,adid from from pageads LATERAL VIEW EXPLODE(adid_list) adtables AS adid;

output:

Pageid(string)adid_list(int)
“front_page”1
“front_page”2
“front_page”3
“contact_page”3
“contact_page”4
“contact_page”5

 

Features               

  • User defined function supports reusability of code in application frameworks. i.e developer develop the functions in Java and integrate with UDFs in Hive.
  • It will provide high performance in terms of coding and execution.
  • It will accept and produce different numbers of input and output values depending on the use cases of the UDF.

References

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF