Apache Hadoop : OLTP vs. OLAP


Row vs. Column Store

Picture source: Column Vs Row Data Storage

A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data. In comparison, most relational DBMSs store data in rows. A column-oriented DBMS has advantages for data warehouses, customer relationship management (CRM) systems, library card catalogs, and other ad hoc inquiry systems where aggregates are computed over large numbers of similar data items (from Wikipedia).

A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on. In this way, each column closely resembles the structure of an index in a row-based system. This can lead to the misconception that a column-oriented store "is really just" a row store with an index on every column. However, the mapping of the data differs dramatically. In a row-oriented indexed system, the primary key is the rowid, which is mapped to the indexed data. In a column-oriented system, the primary key is the data, which maps back to rowids.
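The two serialization orders, and the "data maps back to rowids" idea, can be sketched in a few lines of Python. This is only an illustration: the three-row table, its column names, and the helper functions are all hypothetical, and real column stores add compression and on-disk encodings that are not shown here.

```python
# Hypothetical table used only for illustration: (rowid, name, salary).
columns = ["rowid", "name", "salary"]
rows = [
    (1, "Smith", 40000),
    (2, "Jones", 50000),
    (3, "Young", 40000),
]


def serialize_row_oriented(rows):
    """Store all values of the first row, then the second row, and so on."""
    return [value for row in rows for value in row]


def serialize_column_oriented(rows):
    """Store all values of the first column, then the second column, and so on."""
    return [value for column in zip(*rows) for value in column]


def column_value_to_rowids(rows, column_index):
    """Column-store view of one column: the data is the key, mapping back to rowids."""
    mapping = {}
    for row in rows:
        mapping.setdefault(row[column_index], []).append(row[0])
    return mapping


row_layout = serialize_row_oriented(rows)
column_layout = serialize_column_oriented(rows)
salary_to_rowids = column_value_to_rowids(rows, columns.index("salary"))

print(row_layout)
print(column_layout)
print(salary_to_rowids)
```

In the column layout the three salary values sit next to each other, which is what makes a scan over a single column cheap.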

For more on column stores, see the Wikipedia article Column-oriented DBMS.

OLTP vs. OLAP

This section compares OLTP and OLAP workloads and relates them to row-oriented and column-oriented storage.

OLTP (On-line Transaction Processing) - Processes a large number of short on-line transactions (INSERT, UPDATE, DELETE). The emphasis is on fast query processing while maintaining data integrity in multi-access environments. Effectiveness is measured by the number of transactions per second.
OLAP (On-line Analytical Processing) - Processes a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used in data mining. An OLAP database holds aggregated, historical data stored in multi-dimensional schemas such as the star schema.
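The difference in workload shape can be sketched with Python's built-in sqlite3 module. SQLite is a row-oriented engine and is used here only because it ships with Python; the `sales` table, its columns, and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)

insert_sql = "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)"

# OLTP-style work: many short transactions, each touching a single row.
# "with conn" commits on success and rolls back on an exception.
with conn:
    conn.execute(insert_sql, ("north", 2004, 100.0))
with conn:
    conn.execute(insert_sql, ("south", 2004, 250.0))
with conn:
    conn.execute(insert_sql, ("north", 2005, 300.0))
with conn:
    conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (120.0, 1))

# OLAP-style work: one query that aggregates over the accumulated history.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

print(totals)
```

The first part would be judged by how many such transactions complete per second, the second by how quickly the single aggregate query returns.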

Here are the comparisons between row-oriented vs column-oriented databases related to OLTP/OLAP:

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.
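The trade-offs in the list above can be made concrete with a toy cost model. The sketch below assumes a hypothetical table of 1,000 rows and 10 columns held in memory, and it simply counts how many values or separate reads each layout needs; real systems are affected by block sizes, caching, and compression, which this ignores.

```python
# Hypothetical synthetic table: 1,000 rows by 10 columns.
NUM_ROWS, NUM_COLS = 1000, 10
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

row_store = rows                                   # one list per row
column_store = [list(col) for col in zip(*rows)]   # one list per column


def sum_column_row_store(store, col):
    """Aggregate one column; each row is fetched in full to reach its value."""
    total, values_read = 0, 0
    for row in store:
        values_read += len(row)
        total += row[col]
    return total, values_read


def sum_column_column_store(store, col):
    """Aggregate one column; only that column's values are read."""
    column = store[col]
    return sum(column), len(column)


def fetch_row_row_store(store, r):
    """Fetch a whole row; it is one contiguous read."""
    return list(store[r]), 1


def fetch_row_column_store(store, r):
    """Fetch a whole row; one separate read is needed per column."""
    return [column[r] for column in store], len(store)


row_total, row_values_read = sum_column_row_store(row_store, 3)
col_total, col_values_read = sum_column_column_store(column_store, 3)
row_fetch, row_fetch_reads = fetch_row_row_store(row_store, 42)
col_fetch, col_fetch_reads = fetch_row_column_store(column_store, 42)

print(row_values_read, col_values_read)   # values touched by the aggregate
print(row_fetch_reads, col_fetch_reads)   # separate reads for one full row
```

Under these assumptions the aggregate touches ten times fewer values in the column layout, while fetching one complete row takes a single read in the row layout and one read per column in the column layout.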

In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).

Picture source: Data Warehouse and Business Intelligence: Systems Architecture and OLTP vs. OLAP

OLAP Operations - OLAP cube

Data source: OLAP cube

Slice: Slicing is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension. The picture shows a slicing operation: the sales figures of all sales regions and all product categories of the company in the year 2004 are "sliced" out of the data cube.
Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions. The picture shows a dicing operation: the new cube shows the sales figures of a limited number of product categories, while the time and region dimensions cover the same range as before.
Drill-up and drill-down: Drilling down or up allows the user to navigate among levels of data ranging from the most summarized (up) to the most detailed (down). The picture shows a drill-down operation: the analyst moves from the summary category "Outdoor-Schutzausrustung" (outdoor protective equipment) to see the sales figures for the individual products.
Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule might be computing totals along a hierarchy or applying a set of formulas such as "profit = sales - expenses".
Pivot: Pivoting allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could replace products with time periods to see data across time for a single product.
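The cube operations described above can be sketched on a tiny in-memory cube. The cube contents, the dimension names, and the helper functions below are hypothetical and use plain Python dictionaries rather than any particular OLAP engine. Drill-down is not modeled separately: in this sketch it amounts to going back from a rolled-up result to the more detailed cube.

```python
from collections import defaultdict

# Hypothetical sales cube keyed by (year, region, product).
DIMS = ("year", "region", "product")
cube = {
    (2004, "north", "tents"): 10,
    (2004, "north", "boots"): 20,
    (2004, "south", "tents"): 30,
    (2004, "south", "boots"): 40,
    (2005, "north", "tents"): 50,
    (2005, "north", "boots"): 60,
    (2005, "south", "tents"): 70,
    (2005, "south", "boots"): 80,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value; the result has one fewer dimension."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, allowed):
    """Keep only cells whose coordinates are in the allowed values per dimension."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def roll_up(cube, dim):
    """Summarize by totaling along one dimension, which removes it from the keys."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)


def pivot(table):
    """Swap the two axes of a two-dimensional table."""
    return {(b, a): v for (a, b), v in table.items()}


sales_2004 = slice_cube(cube, "year", 2004)          # keys: (region, product)
tents_only = dice_cube(cube, {"product": {"tents"}})  # still three-dimensional
by_year_region = roll_up(cube, "product")             # keys: (year, region)
product_by_region = pivot(sales_2004)                 # keys: (product, region)

print(sales_2004)
print(by_year_region)
```

A dictionary keyed by coordinate tuples keeps the sketch dependency-free; an array or DataFrame library would be the more usual choice for cubes of realistic size.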