Redshift Spectrum Initial Impressions

Jeremy Winters

UPDATE: I was notified by AWS contacts that Spectrum does *not* use Athena. It shares the Athena catalog, but the nodes used for the S3 portion of Spectrum queries are a completely different technology developed by the Redshift team, not Presto as a service like Athena. My apologies to the Redshift team for getting it wrong! Fundamentally the message of this blog is the same, but I have modified some of the wording to better reflect my new understanding of the underlying technology.

The big news to come out of the AWS San Francisco 2017 Summit was the announcement of Redshift Spectrum. In this blog I bang on the tires!

What is it?

Redshift Spectrum is a new extension of Redshift that allows you to query data sets that reside in S3, by way of your database connection. In addition to querying the data in S3, you can join the data from S3 to tables residing in Redshift. This prevents you from having to load the data into Redshift, which in turn allows you to maintain a smaller physical cluster.

Storing a large dataset inside of Redshift is possible, but will require you to provision more/larger Redshift instances to provide the disk space and compute power required for querying. If your large tables are infrequently accessed, the larger Redshift investment loses value.

Redshift Spectrum allows you to query the data in S3 without having to worry about instances, disk storage, or computing power. AWS charges you $5 for every terabyte of data scanned from S3. If your dataset is infrequently accessed, it is likely that the occasional usage spike is still significantly cheaper than the ongoing price of a larger Redshift cluster.
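The pricing is simple to reason about. As a back-of-envelope illustration (the 100 GB figure is hypothetical, not from my tests):

```sql
-- $5 per TB scanned, so a single query scanning 100 GB costs 50 cents
select 100.0 / 1000 * 5.0 as cost_usd;
```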

I thought that I asked you what it is?

You did! Spectrum is a service that uses a dedicated fleet of servers to handle the S3 portion of your queries. The Athena catalog service is used to maintain the definition of these external tables.

Redshift loosely connects to S3 data by the following route:

Redshift → IAM → Athena/Hive Catalog → Spectrum → S3

External database, schema, and table definitions in Redshift use an IAM role to interact with the Athena catalog and Spectrum, which handles the S3 portion of the queries.

With Spectrum doing the heavy lifting of the S3 portion of the query, you can use a small Redshift cluster, though you will get better performance using a larger cluster. All of the testing performed for this blog took place on a single node, dc1.large Redshift instance, the smallest configuration possible.

Let’s do it!

In order to use Spectrum you may need to reboot your existing cluster. In addition to the reboot, you will need to create an IAM role that Redshift can use to interact with Athena catalog resources.

Here is the AWS documentation on creating an IAM role for use by a Redshift cluster. Once the role is created, you can add the existing policy AmazonAthenaFullAccess, which will give your cluster the ability to manipulate the Athena catalog, as well as access to your data in S3.

Creating Redshift Spectrum Objects

Spectrum requires that you create an “external” schema in which you can define tables backed by S3 data sources. You will need to substitute the ARN of the IAM role you created into this statement:

create external schema if not exists spectrum
from data catalog
database 'spectrum'
region 'us-west-2'
iam_role 'arn:aws:iam::111111111111:role/myRedshiftRole'
create external database if not exists;

Note that the database parameter refers to a database in your Athena catalog, but in our example create external database if not exists will create the database for us.
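To confirm the external schema was registered, you can query Redshift's svv_external_schemas system view, which lists each external schema and the Athena database it maps to:

```sql
-- list external schemas and their backing Athena databases
select schemaname, databasename
from svv_external_schemas;
```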

Now that the database and schema exist you can create the table. This is where the S3 data is mapped to columns in a Redshift table.

create external table spectrum.clickstream
(
    clickdate varchar(20)
    ,customer_name varchar(100)
    ,customer_email varchar(100)
    ,customer_age varchar(10)
    ,gender varchar(10)
    ,annual_income integer
    ,city varchar(50)
    ,state varchar(50)
    ,country varchar(50)
    ,page_name varchar(100)
    ,page_subdomain varchar(100)
    ,page_domain varchar(100)
    ,metric_val smallint
)
row format delimited fields terminated BY '|'
stored as textfile
location 's3://some-demo-data.full360.com/clickstream/';

My data set is a billion-row clickstream stored in pipe-delimited text format. There are approximately 1,000 compressed files totaling 15GB. This data is not partitioned, so any query against this table will scan the entire data set, costing me 7.5 cents per query.
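Spectrum does support partitioned external tables, which let queries prune the files they scan. Here is a hypothetical sketch, assuming the files were reorganized under clickdate=YYYY-MM-DD prefixes (the clickstream_part paths and trimmed column list are my invention, not part of the actual data set):

```sql
-- partition column lives in the S3 path, not the column list
create external table spectrum.clickstream_part
(
    customer_name varchar(100)
    ,page_name varchar(100)
    ,metric_val smallint
)
partitioned by (clickdate varchar(20))
row format delimited fields terminated by '|'
stored as textfile
location 's3://some-demo-data.full360.com/clickstream_part/';

-- each partition must be registered explicitly
alter table spectrum.clickstream_part
add partition (clickdate='2017-04-01')
location 's3://some-demo-data.full360.com/clickstream_part/clickdate=2017-04-01/';
```

A filter on clickdate would then scan only the matching prefixes instead of the whole data set.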

Note the stored as textfile option, which can be one of textfile, sequencefile (Hadoop), rcfile, or parquet. Parquet is the most interesting to me, as it is a well-supported columnar format in S3 that matches Redshift's own columnar storage. Parquet tables in S3 are easy to write from Hive, Presto, Spark, and other Hadoop-ecosystem tools.
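A Parquet-backed version of the table would be a small change to the DDL. A sketch, assuming the data had been rewritten as Parquet to a hypothetical clickstream_parquet prefix:

```sql
-- same mapping, but backed by Parquet files
create external table spectrum.clickstream_parquet
(
    clickdate varchar(20)
    ,page_name varchar(100)
    ,metric_val smallint
)
stored as parquet
location 's3://some-demo-data.full360.com/clickstream_parquet/';
```

Because Parquet is columnar, a query touching only a few columns should also scan, and bill for, fewer bytes than the same query against delimited text.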

There is a notable absence of JSON support in Spectrum. Even files containing flattened JSON structures in a format compatible with Redshift's copy command json 'auto' option will not currently work.

Performance

The following statement takes ~40 seconds to execute:

select count(*) from spectrum.clickstream; 
--40 seconds

This next query, an aggregate of record counts by state, takes around 50 seconds:

select 
  state,
  count(*) 
from 
  spectrum.clickstream 
group by 
  state; 
--50 seconds

Of course… the simple examples above use only S3 data and could be performed using Athena without Redshift… so let’s go a bit deeper to understand the benefit of Spectrum.

New explain plan operators

Let’s look at the following query which joins S3 data to a small table in Redshift.

select 
  c.state,
  count(distinct c.page_name)
from 
  spectrum.clickstream c
  join
  states s
    on 
       c.state = s.state
group by 
  c.state;

I saw this query run consistently in the 80 second range, when running from a single dc1.large instance. In the explain plan below we see several new operators (cost removed for readability). These operators tell you which actions are taking place inside and outside of Redshift.

XN HashAggregate 
  ->  XN Subquery Scan volt_dt_0
    ->  XN HashAggregate
      ->  XN Hash Join DS_DIST_ALL_NONE
          Hash Cond: (("outer".derived_col1)::text = ("inner".state)::text)
          ->  XN S3 Query Scan c
            ->  S3 HashAggregate
              ->  S3 Seq Scan spectrum.clickstream c location:"s3://some-demo-bucket.full360.com/clickstream" format:TEXT
          ->  XN Hash
            ->  XN Seq Scan on states_dist s

What is nice to see in the above plan is that the aggregation of the S3 data is taking place in the Spectrum nodes, not Redshift.

In this case we see DS_DIST_ALL_NONE is indicated by the hash join operator. This means that the join can be performed locally on each slice without the need for expensive network operations. The table we are joining to is diststyle all which means that it is replicated to every node in the cluster. This is a good distribution pattern for Spectrum queries that join to Redshift dimension tables.
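For reference, a minimal version of such a dimension table might be declared like this (my sketch; the actual states DDL isn't shown in this post):

```sql
-- diststyle all replicates the table to every node, so joins
-- against Spectrum scans need no data redistribution
create table states
(
    state varchar(50)
)
diststyle all;
```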

I did another test joining to the same states table, but this time with distkey(state). The explain plan with the distributed table indicated the join would use DS_BCAST_INNER to broadcast the states table to each node, ensuring that the join is possible. This is expected behavior from Redshift. If you’re joining to small or medium tables that are distributed you shouldn’t have too much of a problem, but if you are joining a large table from S3 to a large, distributed table in Redshift, you should expect some slowdown.
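Redshift also exposes system views for inspecting the S3 side of these queries. For example, svl_s3query_summary reports the rows and bytes scanned in S3 per query, which maps directly to the $5-per-terabyte charge:

```sql
-- recent Spectrum queries and how much S3 data they scanned
select query, elapsed, s3_scanned_rows, s3_scanned_bytes
from svl_s3query_summary
order by query desc
limit 5;
```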

What do I think?

I think it’s great! It is a natural extension of Redshift, filling a need that Athena only touches on: the ability to support infrequent, large queries efficiently. I’m sure there are limits and considerations that I haven’t encountered in my brief experiments, but the way it has been implemented seems like a solid foundation.

Seeing S3 operators in the explain plans makes me happy!

Expect more blog posts articulating new use cases that are made possible by this, especially in the area of SQL based data integrations!
