OLAP - Why, What and How?
Why OLAP Database?
Many of us are familiar with databases like Postgres or MySQL, which fall under the category of Online Transaction Processing (OLTP) databases. These databases are designed for high-concurrency, low-latency operations, typically handling simple queries that involve reading or updating a few records at a time. It’s important to note the emphasis on “few records,” as we’ll reference this throughout the article. NoSQL databases, like MongoDB, also belong to this class.
On the other hand, we have Online Analytical Processing (OLAP) databases, which are tailored for tasks such as reporting and analysis. They often deal with complex queries, including aggregates and joins, that require high-throughput scans over many records.
While OLTP databases can support basic analytical functionality (such as aggregations, GROUP BY, and window functions), they are not optimized for such tasks. Here are a few reasons why using OLTP databases for analytics may not be ideal:
- As your data grows, analytical queries that need to read many or all records can take hours to execute. The results may not be needed immediately, but the bigger concern is performance degradation: long-running queries tie up resources and slow down the entire application.
- OLAP workflows often need to scan a large number of records quickly. Since these records are mostly read and rarely updated, they can be stored in formats optimized for reading. Data stored in these formats is typically immutable: if changes are required, a new copy must be created (see the sketch after this list).
- Consistency models also play a role. OLTP databases are designed to support concurrent read/write operations, assuming that not all queries will try to read or write the same few records simultaneously. For scenarios where multiple queries do touch the same records, transactions are introduced. While transactions are powerful, they come at a cost in scalability: scaling ACID-compliant databases is challenging due to this inherent complexity (see the CAP theorem for details). If you're only reading data, transactions may not be necessary, which simplifies the internal implementation and lets databases scale more effectively.
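To make the read-optimized, immutable format point concrete, here is a minimal Parquet sketch using pyarrow (the file name and columns are made up for illustration, and pyarrow is assumed to be installed):

```python
# Minimal sketch: a read-optimized, columnar, immutable file format (Parquet).
# The file name and columns are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table once, in columnar layout.
events = pa.table({
    "user_id": [1, 2, 3, 4],
    "amount": [9.99, 24.50, 3.75, 18.00],
})
pq.write_table(events, "events.parquet")

# Analytical reads can scan only the columns they need.
amounts = pq.read_table("events.parquet", columns=["amount"])
print(amounts["amount"].to_pylist())

# The file is effectively immutable: "updating" a record means rewriting
# the file (or the affected partition) as a new copy.
```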
Since OLTP and OLAP databases serve different purposes, creating a single database that can effectively handle both types of workloads is not feasible. Instead, companies develop independent databases for each use case. Choosing between OLTP and OLAP is not an “either-or” problem; most companies utilize both types of databases. OLTP databases are commonly used as application databases, while data from various OLTP databases is aggregated in a single OLAP database. Although the same data exists in two places, it’s stored in different formats and in databases optimized for different workloads.
OLAP databases, being scalable, can easily store data from numerous OLTP databases without breaking a sweat. With this setup, your analytical workloads don’t hurt the performance of your OLTP databases. Another advantage is that both OLAP and OLTP can be optimized independently for their specific workloads.
Now that you understand “Why OLAP?”, let’s try to answer “What?”.
What is an OLAP Database?
The primary requirement for an OLAP database is scalability. To address the storage challenge, HDFS (Hadoop Distributed File System) was introduced. It functions like any other file system but is distributed across a cluster of machines. When executing a query, the OLAP engine fetches records from HDFS, processes them, and returns the results; results can also be written back to HDFS. Notably, even with billions of records stored across thousands of machines, a small machine can still execute simple operations; the size of the compute machine affects only how long the query takes.
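To give a rough sense of what "fetching records from HDFS" looks like from the compute side, here is a hypothetical sketch using pyarrow's HDFS filesystem (the host, port, and paths are placeholders, and a working Hadoop client configuration is assumed):

```python
# Hypothetical sketch: a compute node reading records stored on HDFS.
# Assumes a reachable HDFS cluster and a local Hadoop client configuration;
# the host, port, and paths are placeholders.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.internal", port=8020)

# The compute node pulls the records it needs over the network ...
orders = pq.read_table("/warehouse/orders/", filesystem=hdfs)

# ... processes them locally (here, a trivial row count) ...
print(orders.num_rows)

# ... and can write results back to HDFS as new files.
pq.write_table(orders, "/warehouse/orders_copy/part-0.parquet", filesystem=hdfs)
```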
This leads to an important distinction: unlike OLTP databases, OLAP databases decouple compute and storage. This allows each to scale independently as the workload requires. For instance, if no queries are running, compute can scale to zero, and you only pay for storage. Here is a day-to-day scenario that exemplifies the decoupling of storage and compute: imagine all your data is stored in an S3 bucket; to train an ML model, you acquire a GPU machine on Paperspace, train your model on the data in the S3 bucket, push the trained model back to the S3 bucket, and then terminate the Paperspace instance. Even after terminating the compute, your data persists, and you can access it from any other instance.
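A rough code sketch of that workflow might look as follows (the bucket name, object keys, and model are all hypothetical; boto3, pandas, and scikit-learn are assumed to be installed and AWS credentials configured):

```python
# Hypothetical sketch of the decoupled workflow described above:
# the data lives in S3, the compute is a throwaway instance.
import boto3
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

s3 = boto3.client("s3")
BUCKET = "example-training-data"  # hypothetical bucket name

# 1. On the (temporary) GPU instance, pull the training data down from S3.
s3.download_file(BUCKET, "datasets/train.csv", "/tmp/train.csv")
df = pd.read_csv("/tmp/train.csv")

# 2. Train the model on that instance.
model = LogisticRegression().fit(df.drop(columns=["label"]), df["label"])

# 3. Push the trained artifact back to S3, then terminate the instance.
joblib.dump(model, "/tmp/model.joblib")
s3.upload_file("/tmp/model.joblib", BUCKET, "models/model.joblib")
```

The data and the model outlive the compute: any future instance can fetch them from the bucket.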
🚨 Imagine you have a server running PostgreSQL on a remote machine, and you access it from another machine via a client. In this scenario, when the client sends a request, the computation happens on the same machine that stores the data. This means that even when there are no requests, you can’t scale the server down to zero without losing access to your data. Unlike in a decoupled compute and storage system, where the two are separate, here they are tightly intertwined.
Various organizations build their OLAP databases by independently choosing:
- Storage System: either running HDFS on a cluster of machines (commodity hardware) or using a cloud object store (e.g., AWS S3, Azure Data Lake Storage, Google Cloud Storage).
- Processing Engine: chosen based on the types of analytical workloads to be performed, whether a batch processing engine (e.g., Spark, Presto, Apache Hive), a stream processing engine (e.g., Spark, Apache Flink), or a machine learning library (e.g., Spark MLlib, scikit-learn, R).
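For instance, here is a minimal sketch that pairs one storage choice (an S3 bucket, with a hypothetical path) with one processing engine (Spark); it assumes a Spark installation with the S3A connector configured:

```python
# Minimal sketch: Spark as the processing engine, an object store as storage.
# Assumes PySpark is installed and the S3A connector is configured;
# the bucket and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("olap-sketch").getOrCreate()

# Read immutable, read-optimized Parquet files straight from the object store.
orders = spark.read.parquet("s3a://example-data-lake/orders/")

# A typical analytical query: aggregate revenue per day across many records.
daily_revenue = (
    orders.groupBy(F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("revenue"))
)

# Write the result back to storage as a new, immutable dataset.
daily_revenue.write.mode("overwrite").parquet("s3a://example-data-lake/marts/daily_revenue/")
```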
This flexibility—the ability to select the storage system and processing engine best suited to the workload—is a significant advantage of OLAP databases over OLTP databases. Managed solutions like Google’s BigQuery, Azure Synapse, AWS Redshift, Snowflake, and Databricks’ Data Intelligence Platform, among others, hide much of this complexity.
To see decoupled compute and storage in practice, check the pricing pages for Google’s BigQuery, Snowflake, or other data warehousing solutions: storage and compute are billed separately.
How to Get Started with OLAP Databases?
If you’re transitioning from traditional databases like PostgreSQL and considering delving into warehousing solutions, here’s a step-by-step guide in increasing order of complexity:
- Google’s BigQuery: This is the easiest starting point, as you can continue using SQL to query your data. Plus, the first 1 TiB of query data processed per month is free. It’s an ideal platform to grasp the basics of data warehousing, including different data architectures, schemas, dimensions, fact tables, views, and more (see the query sketch after this list).
- Snowflake: Snowflake introduces you to various new concepts like virtual warehouses, micro-partitioning, stages, time travel, zero-copy cloning, caching, query profiling, and others in a digestible manner. It exposes relatively few customization options, which keeps the complexity down. Once you’re comfortable with Snowflake, consider learning dbt, which integrates well with Snowflake.
- Apache Spark: Having understood the components of data warehousing, you can leverage Apache Spark and its ecosystem tools to build customized data warehousing solutions. Spark is a distributed data processing engine that provides extensive control over storage, data format, caching, partitions, and more. While this level of control brings complexity, it allows for tailor-made solutions. Despite the challenges of maintaining and running a Spark cluster, platforms like Databricks simplify the process to some extent.
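As a concrete starting point for the first item above, here is a minimal BigQuery sketch using the official Python client (it assumes the google-cloud-bigquery package is installed and GCP credentials are configured; the public dataset and query are just an example):

```python
# Minimal sketch: running an analytical SQL query on BigQuery from Python.
# Assumes google-cloud-bigquery is installed and default GCP credentials
# are configured; the public dataset and query are illustrative.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row["name"], row["total"])
```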
In my opinion, learning these tools in the specified order helps to avoid feeling overwhelmed by the multitude of tools and solutions in Data Engineering. There are two more definitions that one should know when starting with Data Engineering: Data Lake and Data Warehouse.
While Data Warehouses focus on structured data and predefined schemas for analytics, Data Lakes accommodate diverse data types in their raw form for exploratory analysis. A new type that is gaining popularity is called Data Lakehouse, which combines the strengths of both Data Warehouse and Data Lake approaches to provide a unified platform for modern multi-modal data analytics.