Rethinking data pipeline dilemmas to boost speeds and decrease costs

This sponsored post is produced in association with Snowflake.

Image Credit: Shutterstock

Our modern, technology-enabled lifestyles generate roughly 2.5 quintillion bytes of digital information every day, and 90 percent of that data is unstructured.

It’s bad enough that ETL times for structured data are already a big data pipeline bottleneck. Now companies also have to contend with unstructured and semi-structured data streaming in from multiple sources, whether they are ready for it or not.

Real-time big data: those words, and the demands they carry, are forcing companies to change how they handle their information, structured or otherwise.


Case in point: how DoubleDown did it

DoubleDown Interactive, an Internet casino games company, faced this issue firsthand. The company needed to integrate their continuous game data feeds with other information to create a holistic picture of game activity. They generated 3.5 terabytes of data every day, arriving over multiple data paths, each with its own ETL transformations. The pipeline loaded unstructured data into a NoSQL database, ran collectors and aggregators to move everything into a staging area, and reshaped it all to conform to a star schema. Only then could the data reach the pre-existing EDW (enterprise data warehouse).
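To make that last step concrete, here is a minimal sketch of the kind of star schema game events typically have to conform to before they can land in an EDW. The table and column names are hypothetical, not DoubleDown’s actual schema.

```sql
-- Hypothetical star schema for casino game events (illustrative only).
-- A central fact table references small, descriptive dimension tables.
CREATE TABLE dim_player (
    player_key  INTEGER PRIMARY KEY,
    player_id   VARCHAR(64),
    country     VARCHAR(2)
);

CREATE TABLE dim_game (
    game_key    INTEGER PRIMARY KEY,
    game_name   VARCHAR(128),
    platform    VARCHAR(32)
);

CREATE TABLE fact_game_event (
    event_key     BIGINT PRIMARY KEY,
    player_key    INTEGER REFERENCES dim_player (player_key),
    game_key      INTEGER REFERENCES dim_game (game_key),
    event_time    TIMESTAMP,
    chips_wagered NUMERIC(18, 2),
    chips_won     NUMERIC(18, 2)
);
```

Every raw game event, whatever its original shape, had to be collected, aggregated, and mapped into rows like these before the EDW could use it.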

By rethinking the data pipeline and adopting a new, elastic, cloud-based system with an architecture designed for big data, DoubleDown managed to process data 50 times faster at 80 percent lower cost.


The old ways are already obsolete

The rise of non-relational NoSQL databases initially allowed more agile deployments and quicker integrations without reinventing legacy systems. For DoubleDown and many others dealing with big data in real time, however, mixing non-relational data into the pipeline bogged the entire process down even further, well before the resulting relational information ever reached the EDW.

DoubleDown used the information in their EDW for analytics, business intelligence, and reporting. Multiple departments and people needed access to the processed event log data, which often meant waiting 11 to 24 hours for the entire data pipeline to complete.

The company’s dilemma is not unique. Companies relying on big data still depend largely on inflexible traditional data warehouses designed for predictable volumes of easily categorized data. The limitations of these traditional data warehouses, even the ones with enough processing power to handle big data, are tied to their architecture.

A traditional data warehouse has to be set up knowing how much data to expect. Its processing capabilities are inseparably tied to the databases it handles, which means the processing power allocation is fixed. Companies can neither allot more processing power to specific sets of data nor scale on demand to handle larger volumes. They need to prepare everything months in advance.

In DoubleDown’s case, since their setup could not scale to handle more detailed game log data, they could not perform root cause analysis and complex ad hoc data explorations.

To scale to more data and gain more processing power, companies have traditionally upgraded to expensive big data systems such as Hadoop, which come with a caveat: they can’t easily scale back down, and they require additional investment in data scientists and infrastructure just to make sure data analysts and business intelligence officers get the right relational information into their SQL-based tools.

Essentially, the traditional data pipeline offers no agile scaling, and upgrading the EDW means bulking up on infrastructure and even staff, which prolongs the cycle from data capture to actionable analytics.

Dig deeper: Read Snowflake’s entire DoubleDown case study and find out how DoubleDown not only improved their data pipeline but also improved game design and user experience through the right data.


Reinventing data warehouse architecture via the cloud

DoubleDown needed an elastic solution. They needed a data warehouse that could not only handle both relational and non-relational data, but also transform that data more efficiently into structured, SQL-compatible information. Finally, they needed on-demand scaling of processing power and database capacity.

DoubleDown found that the Snowflake cloud data warehouse satisfied all of these requirements. They implemented the new cloud-based system and saw data processing performance improve across the board.

The new system can load and flatten a JSON structure of 2.5 million elements in a little under two minutes. It handles structured and unstructured data alike and has eliminated data loss. It has also helped reduce costs: the company can store data efficiently in the cloud and no longer needs to allocate resources to constantly monitor and fix NoSQL clusters, or to run the specialized MapReduce jobs that used to transform the game data.
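For a sense of what loading and flattening JSON looks like in this kind of cloud data warehouse, here is a minimal sketch in Snowflake SQL. The stage, table, and field names are hypothetical placeholders, not DoubleDown’s actual pipeline.

```sql
-- Land raw JSON game events in a single VARIANT column (hypothetical names).
CREATE TABLE game_events_raw (v VARIANT);

COPY INTO game_events_raw
  FROM @game_events_stage              -- assumed stage where event files arrive
  FILE_FORMAT = (TYPE = 'JSON');

-- Flatten nested event properties into relational rows for SQL-based tools.
SELECT
    r.v:event_id::STRING   AS event_id,
    r.v:user_id::NUMBER    AS user_id,
    r.v:ts::TIMESTAMP_NTZ  AS event_time,
    p.value:name::STRING   AS property_name,
    p.value:val::STRING    AS property_value
FROM game_events_raw r,
     LATERAL FLATTEN(input => r.v:properties) p;
```

The point is that semi-structured data can be queried and reshaped with ordinary SQL, without a separate NoSQL cluster or MapReduce step in between.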

DoubleDown put in place a system architecturally different from traditional data warehouses: it physically separates storage and compute while keeping them logically integrated. The system uses a persistent storage layer for shared data, on top of which sit compute resources that execute data processing tasks. A cloud services layer provides the services that manage infrastructure, metadata, and security, among other things.

DoubleDown can request additional compute resources when needed and shut those resources down when they are no longer necessary. Everything sits in the cloud, so there are no on-premises complications or physical hardware limitations.
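In Snowflake’s model, those on-demand compute resources are virtual warehouses that can be created, resized, and suspended independently of the shared storage layer. The statements below are a minimal sketch; the warehouse name and sizes are hypothetical, and the auto-suspend settings are an assumption about how a team might configure them.

```sql
-- Create an independent compute cluster for the analytics team (hypothetical name).
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE = 'LARGE'
  AUTO_SUSPEND   = 300       -- assumed: pause after 5 idle minutes
  AUTO_RESUME    = TRUE;

-- Scale up for a heavy ad hoc exploration, then back down when it is done.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'SMALL';

-- Release the compute entirely when it is no longer needed; the data stays put.
ALTER WAREHOUSE analytics_wh SUSPEND;
```

Because compute is decoupled from storage, none of these changes require moving or reloading data.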


Slashing 24 hours down to 15 minutes

The result? They delivered data to their analysts 50 times faster: instead of taking 11 to 24 hours, the new process takes 15 minutes. The new system also allowed the company to increase and decrease compute power for different user needs on demand, which meant analysts could finally access game log data at full granularity.

All in all, DoubleDown managed to improve the latency, throughput, and reliability of their data pipeline while reducing costs by 80 percent.

Sponsored posts are content that has been produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. The content of news stories produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact sales@venturebeat.com.