Metabase Presto



Data visualization dashboards (aka BI tools) are essential to the success of every data analytics project, whether it uses big data technologies or a traditional data warehousing approach. This space used to be populated primarily by paid BI (business intelligence) tools like Tableau and MicroStrategy, but lately a lot of open source alternatives have been emerging, the notable ones being Redash, Superset and Metabase. (Another notable tool is Kibana, but its backend support is limited to Elasticsearch, so it is not a general-purpose BI tool.)

It is still early days for these open source dashboards, but they provide a very attractive proposition for internal dashboards already. Here is a quick comparison of Superset vs Redash vs Metabase.

Please note that a lot of startups have already been using these 3 dashboards successfully :)

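The previous post covered how to install the Presto engine and integrate Kudu data (Presto installation with Kudu integration). Now the data needs to be presented so that other researchers can analyze it. A handy tool for this is Metabase, which can integrate with many kinds of databases. It is also very convenient to use; here we use d.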

We evaluate these 3 open source BI tools (dashboards) on 4 broad features: 1) data backend support, 2) authentication / authorization support, 3) support for scheduled reports by email and alerts, 4) extension support.

Super impressed with @metabase! We are using it internally for a dashboard and it really offers a great combination of ease of use, flexibility, and speed. Paavo Niskala (@Paavi), December 17, 2019.

@metabase is the most impressive piece of software I’ve used in a long time. If you have data you want to understand give it a try.

Difference Between MySQL and MSSQL. MySQL and MSSQL are relational database management systems (RDBMS). An RDBMS is a piece of software that stores information in a tabular format, i.e. rows and columns.

Superset vs Redash vs Metabase - Data Backend Support

All three tools now support all major SQL backends used for data analytics workloads, e.g. Amazon Redshift, Postgres, MySQL, SQL Server, MongoDB and Oracle. Only a few currently support big data processing backends like Presto, Hive, Spark SQL, Google BigQuery and Elasticsearch, but soon all three of them should support all of these popular backends.

Data Backend      Redash   Superset   Metabase
MySQL             Yes      Yes        Yes
PostgreSQL        Yes      Yes        Yes
Oracle            Yes      Yes        Yes
SQL Server        Yes      Yes        Yes
MongoDB           Yes      Yes        Yes
Amazon Redshift   Yes      Yes        Yes
Cassandra         Yes      Yes        ?
Presto            Yes      Yes        ?
Hive              Yes      Yes        ?
Impala            Yes      Yes        ?
Spark SQL         ?        Yes        ?
Google BigQuery   Yes      Yes        Yes
Graphite          Yes      ?          ?
Elasticsearch     Yes      ?          ?
Vertica           Yes      Yes        Yes
Druid             No       Yes        Yes

Superset vs Redash vs Metabase - Authentication support

Currently Superset supports a much richer set of authentication backends than Redash and Metabase, which only support Google OAuth for authentication and single sign-on.

So if you need to integrate with your in-house LDAP or database-based authentication backend, currently the only solution is Superset.

Authentication Backend   Redash   Superset   Metabase
Google OAuth             Yes      Yes        Yes
OAuth                    No       Yes        No
OpenID                   No       Yes        No
LDAP                     No       Yes        No
Database                 No       Yes        No

Superset vs Redash vs Metabase - Authorization / Access control support

All three tools support a decent permission (authorization) model that gives groups of users access to particular data and queries. This allows an organization to restrict data access based on different user roles.

Please note that in all three tools data access granularity is primarily at the database table level and cannot go finer than that, though this is typically sufficient for most practical use cases. If it is not sufficient, existing data needs to be split between tables to enforce different access levels.

Both Redash and Metabase support the concept of users and groups, and let you control what level of database and SQL access each group has. A user can be a member of multiple groups.
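The users-and-groups model described here can be illustrated with a small Python sketch. This is a generic illustration only, not the actual Redash or Metabase data model or API; the `Group`, `User` and `can_access` names, and the sample tables, are hypothetical. It assumes table-level granularity and that a user gets access if any one of their groups grants it.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A named group and the tables its members may query (hypothetical model)."""
    name: str
    tables: set = field(default_factory=set)


@dataclass
class User:
    """A user who may belong to several groups."""
    name: str
    groups: list = field(default_factory=list)


def can_access(user: User, table: str) -> bool:
    """Allow access if any of the user's groups grants the table."""
    return any(table in group.tables for group in user.groups)


analysts = Group("analysts", {"orders", "customers"})
finance = Group("finance", {"invoices"})

alice = User("alice", [analysts, finance])  # member of two groups
bob = User("bob", [analysts])

print(can_access(alice, "invoices"))  # True: granted through the finance group
print(can_access(bob, "invoices"))    # False: analysts cannot see invoices
```

Because the check works per table, restricting part of a table to some users means splitting that data into a separate table, which matches the limitation noted above.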

Superset supports the concept of Admin and Gamma users. Gamma users can be assigned multiple roles, each controlling access to particular data and queries. Roles can be made quite intricate, down to who can access which individual features and which datasets.

Superset vs Redash vs Metabase - Support for Scheduled Emails and Alerts

Scheduled emails with summary reports, and alerts, are another very useful feature of data dashboards.

Alerts                      Redash   Superset   Metabase
Summary Email (Scheduled)   No       No         Yes
Alerts Support              Yes      No         No
Slack Integration           Yes      No         Yes

Currently only Redash supports alerts that fire when a certain parameter crosses a particular threshold.
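The idea of a threshold alert can be sketched in a few lines of Python. This is a generic illustration, not Redash's implementation or API; the `should_alert` function and the sample values are hypothetical.

```python
import operator

# Supported comparisons between the latest query value and the threshold.
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def should_alert(value, threshold, comparison=">"):
    """Return True when the value crosses the threshold in the given direction."""
    return COMPARATORS[comparison](value, threshold)


print(should_alert(120, 100))      # True: 120 is above the threshold of 100
print(should_alert(80, 100))       # False: 80 is still below it
print(should_alert(80, 100, "<"))  # True when alerting on values below 100
```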

Superset vs Redash vs Metabase - Extending platform

Because these tools are open source, one can extend them if one needs to.

Tool       Tech
Redash     Python
Superset   Python
Metabase   Clojure

Redash and Superset are developed in Python while Metabase is developed in Clojure. If you have a particular technology talent in-house, then this can also be a plus point in deciding the right tool for your organization.

Summary

This article is still in progress and we will update it as these tools make progress (and add comparisons on more features).

All three tools provide a decent dashboard experience and we are extremely thankful to their developers for creating much-needed open source BI visualization tools.

I had a SQL query that failed on one of the Presto 0.208 clusters with the “Query exceeded per-node total memory” (com.facebook.presto.ExceededMemoryLimitException) error. How can you solve this problem? I will consider a few possible solutions, but firstly let’s review the memory allocation in Presto.

Memory Pools

Presto 0.208 has 3 main memory areas: General pool, Reserved pool and Headroom.

All queries start running and consuming memory from the General pool. When it is exhausted, Presto selects a query that currently consumes the largest amount of memory and moves it into the Reserved pool on each worker node.

To guarantee the execution of the largest possible query, the size of the Reserved pool is defined by the query.max-total-memory-per-node option.

But there is a limitation: only one query can run in the Reserved pool at a time! If there are multiple large queries that do not have enough memory to run in the General pool, they are queued, i.e. executed one by one in the Reserved pool. Small queries can still run in the General pool concurrently.
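The promotion rule described above can be illustrated with a toy Python model of a single worker node. This is only a sketch of the behavior as described in this article, not Presto's actual implementation; the function name and the sample query sizes are hypothetical.

```python
def promote_if_exhausted(general_usage, general_pool_gb, reserved_owner=None):
    """Toy model of one worker node.

    general_usage: dict mapping query id -> GB used in the General pool.
    reserved_owner: id of the query currently in the Reserved pool, or None.
    Returns (updated General pool usage, updated Reserved pool owner).
    """
    usage = dict(general_usage)
    # Nothing to do while the General pool still has free memory.
    if not usage or sum(usage.values()) < general_pool_gb:
        return usage, reserved_owner
    # Only one query may occupy the Reserved pool; the others have to wait.
    if reserved_owner is not None:
        return usage, reserved_owner
    # Move the query that consumes the most memory into the Reserved pool.
    largest = max(usage, key=usage.get)
    del usage[largest]
    return usage, largest


usage, owner = promote_if_exhausted({"q1": 20, "q2": 10, "q3": 5}, general_pool_gb=35)
print(usage, owner)  # {'q2': 10, 'q3': 5} q1
```

Calling the function again while `q1` still owns the Reserved pool leaves the General pool untouched, which mirrors the one-query-at-a-time limitation.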

Headroom is an additional reserved space in the JVM heap for I/O buffers and other memory areas that are not tracked by Presto. The size of Headroom is defined by the memory.heap-headroom-per-node option.

The size of the General pool is calculated as follows: General pool = JVM max heap size (-Xmx) - Reserved pool - Headroom.

There is an option to disable the reserved pool, and I will consider it as well.

Initial Attempt

In my case the query was:

and it failed after running for 2 minutes with the following error:

The initial cluster configuration:

With this configuration the size of the Reserved pool is 10 GB (it is equal to query.max-total-memory-per-node), the General pool is 35 GB (54 – 10 – 9):

The query was executed in the General pool, but it hit the 10 GB limit defined by the query.max-total-memory-per-node option and failed.

resource_overcommit = true

There is the resource_overcommit session property that allows a query to overcome the memory limit per node:

Now the query ran about 6 minutes and failed with:

It was the only query running on the cluster, so you can see that it was able to use the entire General pool of 35 GB per node, although that still was not enough for it to complete successfully.

Increasing query.max-total-memory-per-node

Since 35 GB was not enough to run the query, let’s try to increase query.max-total-memory-per-node and see if this helps the query complete successfully (a cluster restart is required):

With this configuration the size of the Reserved pool is 40 GB, the General pool is 5 GB (54 – 40 – 9):

Now the query ran 7 minutes and failed with:

It is a little bit funny, but with this configuration, setting resource_overcommit = true makes things worse:

Now the query quickly fails within 34 seconds with the following error:

resource_overcommit forces the query to use the General pool only, so it can no longer migrate to the Reserved pool, and the General pool is just 5 GB now.

But there is a more serious problem. A General pool of 5 GB and a Reserved pool of 40 GB allow you to run larger queries (up to 40 GB per node), but significantly reduce the concurrency of the cluster, because the Reserved pool can run only one query at a time and the General pool, which can run queries concurrently, is too small now. So you can see more queued queries waiting for the Reserved pool to become available.

experimental.reserved-pool-enabled = false

To solve the concurrency problem and allow running queries that need more memory, Presto 0.208 allows you to disable the Reserved pool (this is going to be the default in future Presto versions):

Now you can set an even larger value for the query.max-total-memory-per-node option:

With this configuration all memory except Headroom is used for the General pool:
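The pool arithmetic used throughout this walkthrough can be sketched in Python. This is a rough illustration, not Presto code: the `pool_sizes_gb` function is hypothetical, and reading 54 GB as the JVM heap size and 9 GB as the Headroom is inferred from the "54 – 10 – 9" arithmetic earlier in the article.

```python
def pool_sizes_gb(heap_gb, headroom_gb, reserved_gb=0):
    """Return the per-node General and Reserved pool sizes in GB.

    reserved_gb is the Reserved pool size (query.max-total-memory-per-node
    when the Reserved pool is enabled); leave it at 0 to model a cluster
    where the Reserved pool is disabled.
    """
    general = heap_gb - headroom_gb - reserved_gb
    if general < 0:
        raise ValueError("Reserved pool and Headroom do not fit in the heap")
    return {"general": general, "reserved": reserved_gb}


print(pool_sizes_gb(54, 9, reserved_gb=10))  # {'general': 35, 'reserved': 10}
print(pool_sizes_gb(54, 9, reserved_gb=40))  # {'general': 5, 'reserved': 40}
print(pool_sizes_gb(54, 9))                  # {'general': 45, 'reserved': 0}
```

With the Reserved pool disabled, everything except the Headroom (45 GB under these assumptions) is available to concurrently running queries.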

But the query still fails after running for 7 minutes 30 seconds:

So there is still not enough memory.

experimental.spill-enabled=true

Since we cannot allocate more memory on the nodes (we hit the physical memory limit), let’s see if the spill-to-disk feature can help us (a cluster restart is required):

Unfortunately this does not help; the query still fails with the same error:

The reason is that not all operations can be spilled to disk, and as of Presto 0.208 COUNT DISTINCT (MarkDistinctOperator) is one of them.

So it looks like the only way to make this query work is to use compute instances with more memory or to add more nodes to the cluster.

In my case I had to increase the cluster to 5 nodes to run the query successfully. It took 9 minutes 45 seconds and consumed 206 GB of peak memory.
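As a back-of-the-envelope check, the node count can be estimated from the peak memory figure. The sketch below is an approximation under two assumptions that do not come from Presto itself: memory use is spread evenly across the nodes, and each node offers about 45 GB to the query (a 54 GB heap minus 9 GB of Headroom, with the Reserved pool disabled). The `min_nodes` function is hypothetical.

```python
import math


def min_nodes(peak_total_gb, usable_per_node_gb):
    """Smallest number of nodes whose combined usable memory covers the peak."""
    return math.ceil(peak_total_gb / usable_per_node_gb)


print(min_nodes(206, 45))  # 5
```

Four such nodes would provide 180 GB, which is below the 206 GB peak, so the estimate is consistent with the 5 nodes that were needed here. Real queries rarely distribute memory perfectly evenly, so treat the result as a lower bound.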

Checking Presto Memory Settings

If you do not have access to the jvm.config and config.properties configuration files or to server.log, you can query the JMX connector to get details about the memory settings.

To get the JVM heap size information on nodes (max means -Xmx setting):

To get the Reserved and General pool for each node in the cluster:

When the Reserved pool is disabled the query returns information for the General pool only:

Conclusion

  • You can try to increase query.max-total-memory-per-node on your cluster, but to preserve concurrency make sure that the Reserved pool is disabled (this will soon be the default in Presto).
  • Enabling the disk-spill feature is helpful, but remember that not all Presto operators support it.
  • Sometimes only using instances with more memory or adding more nodes can help solve the memory issues.