Closing the Analysis Gap

Every organization has trouble analyzing high volumes of transactional data – sales data, spend data, and repair data are just a few examples. The analysis tools that can be brought to bear on data of this type are either size-limited and poorly constrained, or weak and highly constrained. We show how technology built by Spendata fills this analysis gap.

Background

Analyzing transactional data of non-trivial size in the business world’s analysis tool of choice – Microsoft Excel – is impossible if there are more than a million transactions, and impractical at a hundred thousand due to performance issues. We must turn to databases for help.

But traditional database systems are not ideal for transactional analysis, because the joins required execute too slowly. The answer is to use an OLAP database to pre-calculate joins so that queries are resolved quickly and responsively.

In addition to performant queries, users must be provided with an approachable and customizable interface for generating queries. In the past, this interface was provided by independent tools. These days, the user interface and OLAP database are delivered together by business intelligence tools such as Power BI and Tableau.

A Path of Compromise

From Spreadsheet to Database

Database experts are fond of decrying analysis models built in Excel, because there are no constraints on data entry, no audit trail of changes, and no controls on the relationships between blocks of data. This means that the probability of error grows rapidly as models are extended and changed, casting doubt on the results.

But moving from spreadsheet to database imposes drastic limitations on capabilities that Excel users take for granted. Spreadsheet users can create a web of data dependencies that are maintained automatically and seamlessly. Changing one value will cause a cascade of changes, choreographed perfectly by the spreadsheet. An Excel workbook is not just a static data representation, it is a complex model of great power and flexibility.

Building the same model in a database system requires down-and-dirty programming – stored procedures accomplishing table updates behind the scenes, for example. Unlike Excel, the database doesn’t automatically prevent cyclic dependencies; those are left to the user to both diagnose and correct. Building a spreadsheet model using a database is theoretically possible, but it’s a fool’s errand.

From Database to OLAP Database

Moving from a relational database to an OLAP database imposes still more limitations. Updating data isn’t simple any more; instead, whenever anything changes, even a single record, all the affected pre-calculated joins must be re-calculated as well.

That is, of course, impractical – and it means that OLAP databases are intrinsically “read-only.” They are ideally used in a mode where the data seldom changes (and certainly never changes in real time), the data schema is fixed, and the queries run by users are limited by the constraints of the pre-built joins.

A Discouraging Journey

The constraining journey from spreadsheet to BI tool can be summarized as follows:

To overcome spreadsheet limitations, we turn to databases
To overcome database limitations, we turn to OLAP databases
To provide an accessible user interface, we turn to BI tools

In other words:

Issue	Resolution	Loss of function
Spreadsheet data size; lack of controls/auditability	Database	No dependency maintenance, and thus no hope of building responsive models
Database performance on transactional data	OLAP database	Data changes become offline/tedious, making what-if analysis non-viable
Database interfaces such as SQL are inadequate	BI tool	Schema freezes to support existing dashboards; database is read-only

A New Approach

Data analysts must routinely provide answers to questions that are unanswerable unless the data can be re-organized, re-categorized, and re-processed – all of which are impossible to accomplish inside a read-only database like a BI tool. Those things are possible in a spreadsheet, but only at small scale and with no controls or auditability.

The solution is obvious: we need BI-level controls and functionality, together with spreadsheet-equivalent dependencies and the ability to change data in real time.

From Spend Analysis: The Clue Stick

Spend analysis gains its power from “mapping” – a rules-based function absent from BI tools and databases that derives new columns (“dimensions”) of data and new information from existing columns. The simplest case of mapping is commodity mapping, where the combination of Supplier and General Ledger code can identify the Commodity that was purchased.

When you map, you alter a column of data in the dataset. That, in turn, requires the underlying OLAP database to re-calculate joins, a potentially lengthy and usually offline process. But, if you can map in real time, a very desirable feature, then you must be able to change OLAP data in real time – one of the key solution requirements.

Mapping also introduces the concept of dependencies. If the Commodity column depends on a combination of Supplier and General Ledger code, then there is an explicit dependency of Commodity on both Supplier and General Ledger. To make mapping work consistently and correctly, a change to Supplier or to General Ledger must force an immediate re-derivation of Commodity. Dependencies are another key solution requirement.

Generalizing the Solution: Spendata

Spendata drives these solution requirements to their logical limits. If a column can depend on another, whether original or derived, then any column can depend on any other; for example, a mapped column can depend on other mapped or unmapped columns, and so on. If a mapping rule is changed, then the mapped column is re-derived, along with all dependent columns. If the derivation of a column depends on a user input, then changing the user input will cause the column to re-derive, and all of its dependent columns to re-derive. If datasets are linked, and columns are derived using data pulled over the link, then dependencies will span datasets and trigger derivations wherever needed.

Spendata takes this much farther. If a View (think “pivot table”) contains a calculated result, that result can be fed back into the dataset as a new column of data (a “view-based Measure”), which can then be depended on by other columns and Views. So even the simple act of filtering a View can drive a cascade of dependent column derivations.

Spendata also introduces the concept of “script-only columns” that create intermediate results that are in turn depended on by other column derivations. A Spendata derivation result for a particular transaction can therefore depend on multiple transactions, even on all transactions, as well as on all previously-derived columns. In the general case, derivations can depend on intermediate results of arbitrary complexity, which in turn can depend on other intermediate results.

Script-only columns can also create new transactions and insert them into datasets as new tranches of data – triggering, in turn, a rebuild of all the columns in that dataset and any columns in any other dataset that depend on said dataset.

All of this occurs in real time, just as in a spreadsheet – and Spendata prevents the creation of cyclic dependencies.

The Result

Spendata users build spreadsheet-like models at database scale, yet with all the auditability and controls of a database system. The Spendata user interface provides BI-like simplicity for filtering and exploring data, along with the ability to create new Views, dimensions, and what-if analyses on the fly, as in a spreadsheet.

At last: analysis, not just canned reports.