Converting CSV file data to Kudu storage
Last time I showed you how to enable support for Apache’s new Kudu distributed storage software in the upcoming Drill 1.5 release. In this post I’ll dig a bit further into Drill’s Kudu interface by demonstrating how to load example data from a CSV file into Kudu storage.
Today’s data comes from the U.S. Census Bureau and has to do with the number and kind of businesses in U.S. counties (the first .zip file listed on the page). Once the storage plugin is enabled (as per the previous article), just type
USE kudu;
at the Drill prompt to ensure that any tables you construct will be written to Kudu.
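If you want to double-check that the plugin is active, Drill’s SHOW DATABASES command should list kudu among the available schemas (assuming the plugin was enabled as described in the previous post):
SHOW DATABASES;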
When using a CTAS command to write to the Kudu system, you need to specify a set of unique values in the first column to be used as a key. I ended up using Drill’s built-in RANDOM() function to accomplish this, so my command to construct a table within Kudu from the census data looks like:
CREATE TABLE censustest AS SELECT CAST(1000000000000000*RANDOM() AS BIGINT), * FROM dfs.`/path/to/cbp13co.csvh`;
(Note how I’ve renamed the original file to have a ’.csvh’ extension; this tells Drill to read the column names from the file’s header row!)
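One small refinement you might consider: because the key expression above isn’t given a name, Drill assigns it a generated column name (something like EXPR$0). If you expect to refer to the key later, you can alias it in the CTAS. This is just a minor variation on the command above, and the name rowkey is an arbitrary choice of mine:
-- "rowkey" is an arbitrary alias for the generated key column
CREATE TABLE censustest AS SELECT CAST(1000000000000000*RANDOM() AS BIGINT) AS rowkey, * FROM dfs.`/path/to/cbp13co.csvh`;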
Now using this description of the data set, we can perform a quick sample query for the state, county, and industry codes, plus the corresponding total number of business establishments.
> SELECT fipstate, fipscty, naics, est FROM censustest LIMIT 10;
+-----------+----------+---------+------+
| fipstate  | fipscty  | naics   | est  |
+-----------+----------+---------+------+
| 55        | 107      | 321113  | 1    |
| 48        | 185      | 51213/  | 1    |
| 39        | 135      | 42491/  | 6    |
| 04        | 027      | 4411//  | 30   |
| 08        | 085      | 541870  | 1    |
| 26        | 061      | 51----  | 17   |
| 28        | 049      | 519///  | 3    |
| 48        | 139      | 48849/  | 3    |
| 22        | 073      | 4853//  | 3    |
| 48        | 139      | 424910  | 10   |
+-----------+----------+---------+------+
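From here the Kudu-backed table behaves like any other Drill table, so ordinary SQL aggregations work against it. As a rough sketch (the CSV columns come through as VARCHAR, so the establishment count needs a cast; and, if I’m reading the CBP layout correctly, the naics code ’------’ marks the all-industries total for each county, which avoids double-counting the roll-up rows), you could total up establishments per state like this:
> SELECT fipstate, SUM(CAST(est AS INT)) AS total_est FROM censustest WHERE naics = '------' GROUP BY fipstate ORDER BY total_est DESC LIMIT 5;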
This is pretty cool! Support for Kudu means that Drill isn’t just keeping pace with current storage technologies; it’s also looking out for new ones on the horizon.