When it comes to storing data, serverless options are growing more and more popular among businesses every day. Going serverless reduces operational, developmental, and scaling costs, as well as eases management responsibility within your business. Google and Amazon have both done an outstanding job with their take on serverless operating, and have created two similar services in the process: Google BigQuery and Amazon Athena.
While both are great means of analyzing data, each has its own advantages and disadvantages. In this tutorial, we’ll explain more about BigQuery and Athena and do a comparison between the two.
BigQuery is a serverless data warehouse that supports super-fast SQL queries using the processing power of Google’s infrastructure. It combines a couple of other Google services, Dremel, Borg, Colossus, and Jupiter into a package that’s convenient for running ad hoc queries across very large databases. Simply load your data into BigQuery and let Google handle the hard work.
Athena is an interactive query service that allows you to conveniently analyze data stored in Amazon Simple Storage Service (S3) by using basic SQL. It’s completely serverless, meaning there’s no foundation that needs managing or set up, and it’s also fully portable. Athena is primarily used to analyze unstructured, semi-structured, and structured data stored in Amazon S3.
Now that you have a general understanding of both BigQuery and Athena, let’s talk about some key differences between the two. We’ll compare Google BigQuery and Amazon Athena on basics, performance, management, and cost.
For a detailed example of each product’s performance, check out this article from Logz.io.
BigQuery
Athena
Start Up
Start up can happen instantly
Can immediately begin queries on data already loaded in BigQuery
Start up can happen instantly
Requires no set-up
Can immediately begin queries on data in Amazon S3
Creating Tables
Uses Dremel, Google’s interactive query system to create tables
Supports the following table types:
Native- backed by native BigQuery storage
External- backed by external storage
Views- virtual tables defined by a SQL query
Wildcard- enable you to query multiple tables using concise SQL statements and represents a union of all the tables that match the wildcard expression
Uses Apache Hive to create tables
Supports the following table types:
External- backed by external storage
Views- virtual tables defined by a SQL query
Stores newly created table schema in a data catalog and uses it when you run queries
Uses schema-on-read approach
Query Speed
Uses Dremel to run queries
Built for running queries on massive datasets that are natively stored in BigQuery
Specific query performance:
Simple Select: Faster than Athena
Aggregated: Faster than Athena (Lightning speed due to use of native tables)
Uses Presto to run queries
Built for running queries on a smaller, single data source, regardless of data organization
Specific query performance:
Simple Select: Slower than BigQuery
Aggregated: Slower than BigQuery (Partitioning tables helps with faster queries and performance and converting data to columnar formats also helps with faster queries)
Storage costs are based on the amount of data stored:
Active- monthly charge for data stored in tables that have been modified in the last 90 days
Long-term- lower monthly charge for data stored in tables that have not been modified within the last 90 days
Query cost is per terabyte with two options:
On-Demand- based on usage; provides flexibility with minimum of 10 megabytes per query executed and no charge for failed queries or queries that use the cache
Flat-Rate- cost of all bytes processed is included in monthly rate (Beneficial for high-volume customers- stable monthly cost)
Storage costs are based on the amount of data stored in Amazon S3
Cost dependent on region
Four categories of storage:
Standard
Standard-Infrequent Access
One Zone-Infrequent Access
Amazon Glacier
Query cost is per terabyte of data scanned during a query execution
Minimum of 10 megabytes per query execution
No charge for failed queries
Most cost effective when data is compressed, partitioned, or converted to columnar format
After the above comparison, it’s clear that there’s no right or wrong answer when choosing between Google BigQuery and Amazon Athena; the choice ultimately depends on the needs of your business. Both products provide different functions and take a different approach to cloud-based services. BigQuery allows you to run SQL-like queries on multiple terabytes of data in a matter of seconds, and Athena allows you to quickly run queries on data from Amazon S3.
Google BigQuery excels when it comes to querying on petabyte-scale datasets. It’s created to query structured and semi-structured data using standard SQL, and its lightning fast speed can almost entirely be credited to its ability to work with native tables. The warehouse is cloud-based and fully managed, so operational overhead is nonexistent. Overall, BigQuery works best for running interactive and ad-hoc queries that involve sizeable datasets at a very fast pace.
Amazon Athena is noteworthy due to its simple yet efficient quality. No initial set up is required which makes ad hoc querying easy. It’s practical for simple and aggregated queries and is relatively cost effective. Generally, Athena works best for quickly and conveniently running queries at a low cost without needing to set up a complex infrastructure.