Querying for reading level with a simple UDF
Today, just like in this post from
the previous week, I’d like to discuss creating a simple custom SQL function for Drill that maps strings to float
values. Except this week’s function is even more simple because it can fit within a single file and requires no
instructions in the setup()
method. In fact, this may be the simplest example of a Drill UDF I’ve ever seen, so if
you’ve been struggling with how to go about writing your own, the source code I’m presenting today maybe a good way to
get some traction.
The raison d’être of today ’s function is to calculate the reading level (or, ‘readability’) of a single sentence. Many solutions to the problem of readability utilize syllable counts, which are notoriously difficult to arrive at computationally. It’s possible that a lookup table for those counts would provide satisfactorily speedy results, but the algorithm that I’ve chosen to implement, called the automated readability index or ARI, avoids this problem by altogether using a character count instead. As per the Wikipedia article, the ARI is arrived at via:
$$ ARI = 4.71 \frac{characters}{words} + 0.5 \frac{words}{sentences} - 21.43 $$
However, as I indicated earlier I’m only interested in the readability of single sentences in this particular application (check out the next article!), so I’m going to implicitly set the number of sentences to 1 in the source code that comes later.
But before I talk about source you should probably first get some UDF-creation boilerplate out of the way. I’ve discussed how to do this a couple times, but if you’re still unsure of what to do go ahead and follow the instructions in the “Downloading Maven and starting a new project” section near the beginning of this article.
Once that’s out of the way, place this single file (Readability.java
) in your project’s
main/java/com/yourgroupidentifier/udf
directory:
package com.yourgroupidentifier.udf;
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
import org.apache.drill.exec.expr.holders.NullableVarCharHolder;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
@FunctionTemplate(
name = "readability",
scope = FunctionTemplate.FunctionScope.SIMPLE,
nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public class Readability implements DrillSimpleFunc {
@Param
NullableVarCharHolder input;
@Output
NullableFloat8Holder out;
public void setup() {
}
public void eval() {
// The length of 'pneumonoultramicroscopicsilicovolcanoconiosis'
final int longestWord = 45;
// Initialize output value
out.value = 0.0;
// Split input string up into words
String inputString = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(input.start,
input.end, input.buffer);
String[] inputStringWords = inputString.split("\\s+");
float numWords = inputStringWords.length;
float numCharacters = inputString.length() - (numWords-1); // Accounts for spaces
// Adjust for things in the text that aren't words
// i.e., They are longer than 'longestWord'
for(int i = 0; i < inputStringWords.length; i++) {
if( inputStringWords[i].length() > longestWord) {
numWords--;
numCharacters = numCharacters - inputStringWords[i].length();
}
}
// Output 'NULL' if the number of words is zero
if(numWords != 0) {
out.value = 4.71 * (numCharacters / numWords) + 0.5 * (numWords) - 21.43;
}
else {
out.isSet = 0;
}
}
}
This is pretty straight forward compared to the other
examples I’ve discussed
before, right? Just about the
only 'trick’ here is that I’ve made the number the function returns a 'Nullable’ type. This is to insure that it has a
more sane output than 'Infinity’ when it encounters a field with zero words—especially useful for when the
function is used in conjunction with AVG()
, which disregards NULL values but would propagate any 'Infinity’ to the
final result.
In the next post, we’ll try this function out in the 'field’ on one of my favorite data sets!
(And I swear I didn’t make that pun intentionally.)