SQL and Parquet: A simple demo
For data wranglers wishing to store large amounts of text or numeric entries in an efficient fashion, the Apache Parquet file format makes for an obvious choice. As a storage medium, Parquet has two important benefits afforded by its columnar structure: 1.) Access to data columns can be made on an ‘as needed’ basis, increasing the overall speed of queries, and 2.) Since all values in a column are serialized and compressed together, Parquet files take up much less room than similar plain text or row-wise compressed files.
In this article I’ll run through a simple SQL manipulation enabled by Apache Drill which results in the creation of data stored in Parquet format. Drill’s support for Parquet runs deep, and it’s the default storage format for files created as a result of a CTAS (CREATE TABLE… AS) command.
We’ll start by downloading some parking data from the city of Aarhus, Denmark (available via this
site), renaming the file so that it has a ’.csvh’ extention (as per
this article). Then we’ll start the Drill
prompt with the usual incantation (drill-embedded
for single-machine or drill-conf
on a cluster) and do a simple
SELECT *
to look at the data:
> SELECT * FROM dfs.`/path/to/aarhus_parking.csvh` LIMIT 10;
+---------------+--------------------------+------+--------------+----------------+----------------------+
| vehiclecount | updatetime | _id | totalspaces | garagecode | streamtime |
+---------------+--------------------------+------+--------------+----------------+----------------------+
| 0 | 2014-05-22 09:09:04.145 | 1 | 65 | NORREPORT | 2014-11-03 16:18:44 |
| 0 | 2014-05-22 09:09:04.145 | 2 | 512 | SKOLEBAKKEN | 2014-11-03 16:18:44 |
| 869 | 2014-05-22 09:09:04.145 | 3 | 1240 | SCANDCENTER | 2014-11-03 16:18:44 |
| 22 | 2014-05-22 09:09:04.145 | 4 | 953 | BRUUNS | 2014-11-03 16:18:44 |
| 124 | 2014-05-22 09:09:04.145 | 5 | 130 | BUSGADEHUSET | 2014-11-03 16:18:44 |
| 106 | 2014-05-22 09:09:04.145 | 6 | 400 | MAGASIN | 2014-11-03 16:18:44 |
| 115 | 2014-05-22 09:09:04.145 | 7 | 210 | KALKVAERKSVEJ | 2014-11-03 16:18:44 |
| 233 | 2014-05-22 09:09:04.145 | 8 | 700 | SALLING | 2014-11-03 16:18:44 |
| 0 | 2014-05-22 09:39:01.803 | 9 | 65 | NORREPORT | 2014-11-03 16:18:44 |
| 0 | 2014-05-22 09:39:01.803 | 10 | 512 | SKOLEBAKKEN | 2014-11-03 16:18:44 |
+---------------+--------------------------+------+--------------+----------------+----------------------+
Before we save this table as a Parquet file, we need to switch to a writable workspace inside Drill. The command
USE dfs.tmp;
will do the trick—it tells Drill to write files to the system’s /tmp directory. With that done all we need to do is run a CTAS
CREATE TABLE parking AS SELECT * FROM dfs.`/path/to/aarhus_parking.csvh`;
and then rename the resulting temporary file:
mv /tmp/parking/0_0_0.parquet /path/to/parking.parquet
Now we can query the new Parquet file with:
SELECT * FROM dfs.`/path/to/parking.parquet` LIMIT 10;
which yields the same results as before. But what about file size? How does the compressed Parquet file compare to the original CSV? Pretty darn favorably, as it turns out:
$ ls -lrth
total 8488
-rw-r--r-- 1 user staff 594K Nov 30 15:06 parking.parquet
-rw-r-----@ 1 user staff 3.6M Nov 30 15:07 aarhus_parking.csvh
In round numbers, it’s about 1/6th the size!
So that’s an extremely brief tour of how easily Drill interfaces with the Parquet file format. Since the two are an extremely popular combination, expect to see some more in-depth articles about using Drill with Parquet on the Dremio Blog in the near future.