talk-how-superset-and-druid-power-analytics-at-airbnb

https://www.youtube.com/watch?v=W_Sp4jo1ACg

Shishir comment: Superset seems to be the only visualization layer out there on top of Druid. If we are planning to invest in Druid, we should also aim for adoption of Superset.

Shishir comment: Airbnb has two Hive clusters for all ETL jobs + data analysis. The Gold cluster is the main cluster, used for business-critical ETL processes. The Silver cluster is used for all ad hoc analysis by data analysts + data scientists. We could have a similar bifurcation for the Redshift cluster here at Instacart.

Shishir comment: They use Thrift for schema management of messages in the queues. Avro seems to be the preferred choice now. The idea is that any message in the queue must conform to a schema, so that people can't stuff arbitrary data into a message. This also ensures that the messages are properly serialized and are much smaller in size. As we move to a more message-based infrastructure at Instacart, enforcing a schema on messages might be a good direction. Schema enforcement is used in data pipelines everywhere. Here is a good example from Yelp: https://engineeringblog.yelp.com/2016/08/more-than-just-a-schema-store.html
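To make the "no arbitrary data in a message" idea concrete, here is a minimal sketch of schema enforcement at the producer side. It uses only the Python standard library rather than Thrift or Avro, and the `ORDER_CREATED_SCHEMA` fields, `SchemaError`, and `validate` are hypothetical names made up for illustration, not anything from the talk.

```python
# Hypothetical schema for an illustrative "order_created" event:
# field name -> required Python type.
ORDER_CREATED_SCHEMA = {
    "order_id": int,
    "user_id": int,
    "total_cents": int,
    "currency": str,
}


class SchemaError(ValueError):
    """Raised when a message does not match its declared schema."""


def validate(message: dict, schema: dict) -> dict:
    """Return the message unchanged if it matches the schema, else raise."""
    missing = schema.keys() - message.keys()
    extra = message.keys() - schema.keys()
    if missing:
        raise SchemaError(f"missing fields: {sorted(missing)}")
    if extra:
        # This is the check that stops people stuffing arbitrary data in.
        raise SchemaError(f"unexpected fields: {sorted(extra)}")
    for name, expected in schema.items():
        value = message[name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            raise SchemaError(
                f"field {name!r} should be {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return message
```

A producer would call `validate(msg, ORDER_CREATED_SCHEMA)` before publishing to the queue. A real Thrift or Avro setup would additionally give a compact binary encoding and rules for evolving the schema, which this sketch does not attempt.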

Shishir comment: Druid gained more popularity at Airbnb than other specialized datastores like Cassandra because with Druid one can do ad hoc analysis. With specialized datastores like Cassandra/HBase one needs to think of all query patterns upfront and ensure that proper indices are created to serve the desired queries.

Shishir also mentioned blog-post-schema-evolution

At 3:12 he mentions he wrote a blog post called "The Rise of the Data Engineer"

Apache Airflow and Apache Superset were both created by the same person (Maxime Beauchemin, the speaker).

At 8:20 Druid has good support for sketches, which is probabilistic analytics.
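Sketches in this sense are small probabilistic summaries that answer questions like "how many distinct users?" approximately, using a fixed amount of memory. The following is a minimal sketch of one such technique, a k-minimum-values estimator, written with the standard library only. It is not Druid's implementation; the function names and the default `k` are assumptions chosen for illustration.

```python
import hashlib
import heapq


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(items, k: int = 256) -> float:
    """Estimate the number of distinct items, keeping only k hash values."""
    kept = set()  # the (at most k) smallest distinct hash values seen so far
    heap = []     # same values, negated, so heap[0] is the largest kept value
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue  # duplicate of something already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New value is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count is exact
    kth_smallest = -heap[0]
    # If n uniform values fall in [0, 1), the k-th smallest sits near k / n.
    return (k - 1) / kth_smallest
```

The trade-off is accuracy for space: memory stays bounded by `k` regardless of input size, and a larger `k` tightens the estimate.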

At 9:40 a slide showing Airbnb's data infrastructure

At 10:20 MySQL scrapes and event logs go to Hive, and they do have some customers.

At 10:40 the Gold cluster has all the raw data and the high-SLA jobs

At 11:20 we use HDFS, but we increasingly use S3 because it is cheaper and low maintenance

At 11:40 they are increasingly using Spark for batch processing

At 12:30 they are using Airpal, though they are also using Presto with Superset

At 13:30 "I wish we had used Avro, but we used Thrift"

At 14:53 they replay the MySQL binlog into Kafka, then use a thing called SpinalTap and Spark.

Somewhere around 16:00 he talks about how they used to take a snapshot of MySQL and then restore it, but as their data grew it became difficult for them to finish that by 9 AM. Also, having this data in a space allows Presto to query it alongside other data sources.

At 24:00 dashboards have short life cycles, so they should be easy to create

People

person-shishir-prasad person-maxime-beauchemin