One hindrance to inter-enterprise integration is management reluctance to expose internal data of uncertain quality to the harsh gaze of the outside world. Further, many managers are pessimistic that their data quality will improve. Nevertheless, like it or not, imperatives ranging from regulatory requirements such as those imposed by Sarbanes-Oxley to the implementation of commercially-rewarding supply chain relationships require that your data "travel" and travel well. Therefore, to foster support for cross-boundary integration a breakout data improvement strategy is needed, and one centered on "bucket reduction" and a back to basics exploitation of raw source data is offered below.


Concerns about the fitness of corporate data for wider, more intensive use are better understood, and remedial action better prioritized, if one considers that transaction-related (and event-related) data can be divided into three categories: 1) source data, also known as "raw" data, 2) synthetic data and 3) system overhead data.

Source data is the informational "gold" supporting the enterprise,
and collecting and protecting that gold should be of high priority.  As
an example, the raw details of a transaction are "golden."

Synthetic data is created by applying "rules" and processes such as aggregation, categorization and arithmetic functions to populate data "buckets." Often synthetic data is the product of multiple rules. For example, to generate something as simple as a data bucket holding a single product's monthly sales figure for June, one must define "June" (fiscal or calendar?) and deal with returns and credits, multiple channels, and sometimes multiple versions of "the product," so definitional risks and boundary issues abound.

A company cannot live on a single synthetic data value such as  "sales of product X in June;" it needs many, many thousands. If  stored as canned "data buckets," the result is a corporate informational infrastructure that resembles the adjacent picture of real buckets. The infrastructure becomes  buckets stacked on buckets - some new, some old, some rusty and leaky, etc.

As an alternative to pre-stored "buckets" of synthetic data, what is proposed is a direct use of the informational "gold" that is the foundation of the corporate information structure. What is suggested is to stop creating stored "synthetic" data, but instead compute it "just in time" by applying rules to the raw data at time of use. The emergence of technologies such as "rules engines" has a natural fit to this proposal.

One payoff gained by generating synthetic data on a make-to-order basis directly from the source data is an improvement in fitness for intended use.  The make-to-order dynamic process - aimed at a known, proximate objective - is more likely to attain that objective than "canned" synthetic data. The person or process that generates the "make-to-order" synthetic data will control both the selection of the source data and the selection of the "rules," rather than being dependent on the "bucket brigade" that creates static data buckets.

At the somewhat unwelcome Sarbanes-Oxley signing ceremonies at which corporate officers attest to their institution's "truth in financial reporting," the designated scapegoats could be somewhat comforted if there is a disciplined, accessible audit trail of both source data and the associated rules applied to that source data. Of course there are many other such signing ceremonies - e.g., regarding tax returns, environmental reporting, and various human resources matters - at which the same benefits would apply.

The corporate "bucket brigade" has long been tasked to create predefined data buckets primarily because of computer performance constraints and limits on users' patience in waiting for results. It just took too long for computers to extract the source data and manipulate it appropriately at the time it was needed, so "canned" approaches were taken. Today we can employ emerging brute force computing resources to make it practical to generate synthetic data on demand.

A fortunate byproduct is the reduction in system overhead data and associated metadata, because synthetic data buckets themselves need to be indexed and stored, and doing so includes a heavy human cost in design, training and operation.

Some further description of the problem and the proposed approach is offered below.

Source Transaction Data

In the transaction data (or event) world, data is generated in the course of executing a given transaction or event - e.g., an order is created, a delivery is made, a vending machine "vends."

For example, if customer X buys a book named "Metadata for Morons" for $75 in Colts Neck, New Jersey on a certain date, that raw data plus some related transaction data fields such as credit card number, credit authorization number, store number, sales clerk number and a few other data fields constitutes the source data created by the execution of that transaction. Of course, to get the book into the store, it had to be purchased from the publisher and shipped to the store via somebody's distribution center, so from an end-to-end supply chain perspective there is a trail of source data originating upstream from that sales transaction. If the book is shipped to the customer or the book is for some reason returned by the customer, there are some downstream events as well. All of that source data is precious bedrock truth.

Synthetic Data and the "Bucket Brigade"

However, although source data is precious, until recently there was little choice for large entities other than to accumulate it into summarized data "buckets," because otherwise the data would not be readily accessible - there was too much source data for the available compute power.

However, the result is that the "synthetic" data in high level reports is often many layers of synthesis away from the source data, and at each layer new opportunities for errors, inconsistencies and conflict are injected. Also, what is counter-intuitive, but true, is that the greatest single cause of data proliferation is aggregation, the aggregation of aggregations, and the allocation of aggregations. Companies are likely to have to store and manage several times the amount of source data because of endless "bucketing."

If one looks at a typical corporate spreadsheet, it will perhaps contain some cells containing source transaction data, while the other cells will either contain synthetic data or the spreadsheet itself will create synthetic data through use of formulas. In the book sale example cited, a spreadsheet row might hold one cell with the book's selling price of $75 and another cell with the clerk's commission on that sale of, say, $3.75. Both cells appear to contain data. However, the $3.75 is the "synthetic" result of the application of a rule that, in this example, commission equals 5% of selling price. If the contents of a Microsoft Excel cell start with an "=", the cell almost certainly contains synthetic data.
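The distinction between a stored commission figure and a commission rule can be illustrated with a minimal Python sketch. The record layout, the field names and the function name are assumptions made for illustration only; the 5% rate and the $75 price come from the example above.

```python
from decimal import Decimal

# Hypothetical source record for the book sale; field names are illustrative.
sale = {"title": "Metadata for Morons", "selling_price": Decimal("75.00")}

# The rule from the example: commission equals 5% of selling price.
COMMISSION_RATE = Decimal("0.05")


def clerk_commission(transaction, rate=COMMISSION_RATE):
    """Derive the synthetic commission value from the raw selling price."""
    return (transaction["selling_price"] * rate).quantize(Decimal("0.01"))


# Only the selling price is stored; the commission is computed when needed.
print(clerk_commission(sale))
```

Because the rule is stated explicitly rather than frozen into a stored number, changing the rate or the rounding convention changes every derived figure consistently, and the source record itself is untouched.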

If one moves up the data analysis food chain, the regional sales manager's commission is likely to be the result of multiple layers of synthetic data being manipulated by further rules (e.g., perhaps based on average selling price as a percentage of target selling price minus something about corporate sales multiplied by yet something else). The piling of data bucket on data bucket sometimes creates mysteries such as "if sales are down, why are commissions up?" Meanwhile, there might be forty regional sales managers contemplating their individual incentive payments knowing that there is nobody with the time, energy and detailed drill-down information needed to explain why a certain amount is being paid and not some other amount.

Note that when such a commission is paid, the payment itself is a real transaction, so while source data manipulated by a succession of rules produces synthetic data, in turn synthetic data sometimes generates transaction-level source data.  If that synthetic data is flawed, it then pollutes downstream processes. For example, if it is exceedingly difficult for a sales manager to trace back the ingredients of his or her incentive payment, it also could be difficult to align that payment with one or more tax jurisdictions, attribute it to product lines, etc.

Spreadsheets are cited above because all readers will have experienced the creation of synthetic data in spreadsheets, but corporate mainstream systems are probably the largest source of pre-defined data "buckets."

Based on application logic and other configuration choices, the corporate application system often processes "real" data into synthetic data and then passes a pre-defined mix of real data and synthetic data from module to module. In turn, the synthetic data is used to generate reports - themselves loaded with synthetic data - or to create feeds into data analysis systems. Data analysis platforms often depend on predefined "cubes", and such "cubes" are sets of buckets - often based on synthetic data from upstream "buckets."

World Views in Collision

A particular problem with synthetic data buckets is that when the "canned" synthetic data travels to new settings, the rules that generated it may be erroneous, conflicting or inappropriate. Stay at home synthetic data gets gentler treatment than data that travels.

Inside a corporation or other organization, people may accept and even be quite devoted to their own synthetic data - despite its sometimes confusing and arbitrary nature. Local information consumers speak the same  "corporate"  language, perhaps personally have a hand in the synthesis rule-making and, not to be forgotten, are captive customers of one another. Although within the local business unit time may be wasted in intramural debates over "my data" versus "your data"  - and these debates are almost always about whose synthetic data bucket has a hole in it -  the waste may be accepted as a necessary cost of doing business.

However, these data problems become more serious as the data buckets travel.

As data buckets move to new settings and new uses, the rules embedded in the synthetic data they contain may be wrong or at least inappropriate. Additionally, because rules are embedded in the synthetic data held in these buckets rather than being explicitly stated, it is difficult to detect and remedy dysfunctional situations. If what to the recipient appears to be "low quality" data goes to a regulatory agency, a customer, another trading partner, or even another part of the originating company, the negative impact can be substantial.

If instead, as advocated here, the recipient gets an appropriate subset of the source data plus the explicit rules, the recipient (or the recipient's system) can figure out how to adjust the rules to fit the receiver's needs.
For example, if the analyst gets transaction details including dates, he or she (or it, if a process) can define "June" and "sales" in some appropriate, uniform way.

To go back to the book sale example, if the selling entity transmits to a large corporate buyer a total representing "sales in June" to that buying corporation on the assumption that the data in that supplier "bucket" satisfies the buyer's need for a "purchases in June" figure, there may be an apples and oranges conflict. The two entities may define "June" differently because one is on a fiscal calendar, and the buyer may "bucket" some transactions differently based on delivery date, purchase order relationships, etc. If instead the seller sends (and the buyer accepts and processes) source data, the recipient can then apply the recipient's rules.
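A small Python sketch may make the "June" conflict concrete. All transactions, dates and amounts below are invented for illustration, and the two rules (seller keys on order date, buyer keys on delivery date) are assumptions standing in for whatever each party's real definitions might be.

```python
from datetime import date
from decimal import Decimal

# Hypothetical source transactions sent by the seller; all values are invented.
transactions = [
    {"order_date": date(2024, 5, 30), "delivery_date": date(2024, 6, 3), "amount": Decimal("75.00")},
    {"order_date": date(2024, 6, 10), "delivery_date": date(2024, 6, 12), "amount": Decimal("150.00")},
    {"order_date": date(2024, 6, 28), "delivery_date": date(2024, 7, 2), "amount": Decimal("225.00")},
]


def total_for_period(rows, date_field, start, end):
    """Sum the amounts whose chosen date falls within [start, end], inclusive."""
    return sum((r["amount"] for r in rows if start <= r[date_field] <= end), Decimal("0"))


june_start, june_end = date(2024, 6, 1), date(2024, 6, 30)

# Assumed seller rule: "sales in June" is keyed on order date.
seller_june = total_for_period(transactions, "order_date", june_start, june_end)

# Assumed buyer rule: "purchases in June" is keyed on delivery date.
buyer_june = total_for_period(transactions, "delivery_date", june_start, june_end)

print(seller_june, buyer_june)
```

The same three source rows yield two different "June" totals, and neither is wrong. A party on a fiscal calendar would simply pass different start and end dates. Had only a single bucketed total been transmitted, the recipient would have no way to recompute it under its own rule.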

Applying "Brute Force" Computing

In the past, moving what might be hundreds or even many thousands of raw transactions in place of a single number was impractical because of computer and network resource limits. Like the real-world buckets shown in the picture, data "buckets" have a long history of good and useful service.

However, in a world with the informational analysis equivalent of running water, use of buckets can and should rapidly diminish.
The analog to running water is the use of brute force computing directly processing unbucketed original transaction data against a coherent, layered rule base.

A hypothetical company with revenue of $1 billion per year would typically generate about 75 to 150 million transactions of all kinds per year. Computing the store clerk's weekly commission dynamically would therefore involve picking the relevant 200-500 transactions out of those 75-150 million and applying the 5% commission rule to them. Computing the sales manager's commission might require picking out 5,000-50,000 relevant transactions and then applying layers of rules dynamically. In either case, an audit trail would be easy to retain to back up the end amount computed as commission - internally and externally.
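The clerk's case can be sketched in Python as a selection over source rows followed by a single rule, with the audit trail retained as a byproduct. The Sale record, its field names, the clerk identifiers and the sample rows are all hypothetical; a real system would select from a repository of millions of rows rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Sale:
    """Hypothetical source record; field names are illustrative."""
    txn_id: int
    clerk_id: str
    sale_date: date
    amount: Decimal


def weekly_commission(sales, clerk_id, week_start, week_end, rate=Decimal("0.05")):
    """Return (commission, audit_trail) computed directly from source rows."""
    selected = [
        s for s in sales
        if s.clerk_id == clerk_id and week_start <= s.sale_date <= week_end
    ]
    total = sum((s.amount for s in selected), Decimal("0"))
    audit_trail = {
        "rule": f"commission = {rate} * sum(amount)",
        "txn_ids": [s.txn_id for s in selected],  # which source rows were used
    }
    return (total * rate).quantize(Decimal("0.01")), audit_trail


# Invented sample data for illustration.
sales = [
    Sale(1, "C17", date(2024, 6, 3), Decimal("75.00")),
    Sale(2, "C17", date(2024, 6, 5), Decimal("20.00")),
    Sale(3, "C22", date(2024, 6, 4), Decimal("40.00")),
    Sale(4, "C17", date(2024, 6, 12), Decimal("60.00")),
]
commission, trail = weekly_commission(sales, "C17", date(2024, 6, 3), date(2024, 6, 9))
print(commission, trail["txn_ids"])
```

The audit trail here is nothing more than the rule text and the identifiers of the rows it was applied to, which is enough for someone else to reproduce the figure from the same source data.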

One benefit would be a huge reduction in the number of tables and columns within database structures. Every additional table and column creates more "metadata," and more metadata describing data buckets creates cost, complexity and risk. A large corporation spends money and effort "discovering" and "rediscovering" its own metadata and then debating its reliability and fit. Less is more.

The means to deal with multi-millions of transactions are provided by the ongoing huge increases in compute power, plus the maturation of "grid" and "on demand" computing, which enables multiple computers to address suitably structured problems. At present the predominant use of "computing on demand" and "grid computing" (somewhat different, but overlapping terms) is either to consolidate hardware platforms or to address huge, otherwise intractable compute-intensive problems. However, these capabilities can also be used to restructure applications so that they generate synthetic data "to order" shaped by rules relevant to that order rather than resorting to "canned" synthetic data created in advance of demand.

For more information on "means" see the brute force computing portion of this web site.

Given that the technological platforms exist, the question is one of design choice. Given suitable design choices, what becomes possible is to extract the "real" source data from wherever it is retained and put it into a coherent repository. Thereafter, new needs can be met by applying rules directly to the source data. The rules themselves would then need to be structured as "objects" so that a high level synthetic number - e.g., total corporate revenue for product line X - can be generated based on a sequence of object calls.
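One way to picture rules structured as "objects" is the following Python sketch, in which each rule is a named object and a high level number is produced by calling the rules in sequence against source rows. The Rule class, the run_rules helper, the row layout and the two sample rules are assumptions made for illustration; they are not a description of any particular rules engine.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A named, explicit rule that transforms a list of source rows."""
    name: str
    apply: Callable[[list], list]


def run_rules(rows, rules):
    """Apply rules in order, recording each rule's name and the rows remaining."""
    trail = []
    for rule in rules:
        rows = rule.apply(rows)
        trail.append((rule.name, len(rows)))
    return rows, trail


# Invented source rows for illustration.
source_rows = [
    {"product_line": "X", "kind": "sale", "amount": Decimal("75.00")},
    {"product_line": "X", "kind": "return", "amount": Decimal("75.00")},
    {"product_line": "Y", "kind": "sale", "amount": Decimal("30.00")},
    {"product_line": "X", "kind": "sale", "amount": Decimal("120.00")},
]

only_line_x = Rule(
    "product line X only",
    lambda rows: [r for r in rows if r["product_line"] == "X"],
)
negate_returns = Rule(
    "count returns as negative revenue",
    lambda rows: [
        {**r, "amount": -r["amount"]} if r["kind"] == "return" else r for r in rows
    ],
)

kept, trail = run_rules(source_rows, [only_line_x, negate_returns])
revenue_line_x = sum((r["amount"] for r in kept), Decimal("0"))
print(revenue_line_x, trail)
```

Because each rule is a separate, named object, a recipient who disagrees with one of them (say, the treatment of returns) can swap that rule out and rerun the sequence against the same source rows.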

Overhead Data Reduction

In storing source data, the file system or database management system has its own storage requirement, particularly for indexes. As a rough approximation, providing space for system overhead can double the storage requirement. Therefore, if database design is needlessly prolix and complex because it retains avoidable synthetic data, not only does the synthetic data itself waste storage, but the synthetic data will in turn cause overhead data to grow.

Payoff - External Data Portability and Credibility

The incentive is that bucket reduction will help streamline supply chains, ease the burden of compliance, and facilitate the opportunity to link technology across organizational boundaries.

If  you can share intact, relevant source data with external parties, that is often an important step ahead, because they then can apply their rules if in fact they are destined to be the information consumer. Their rules are probably no better than the sending company's, but they speak their own language and are their own "captive" information customer.

If instead the external party is verifying your reported data - with respect to, say, contract compliance or regulatory compliance - you can provide them with the subset of relevant source data plus, separately and carefully described, the rules you used in creating synthetic data.

Although the above material contains criticism of today's canned synthetic data, the rules behind synthetic data, if rationalized and expressed properly, themselves often contain important "know-how" and might even be marketable assets.


A large company or other entity possesses two huge assets: "raw" or source transaction data, and "know-how." Part of that know-how is the set of "rules" used to compute synthetic data and to stuff it into "buckets." By unbundling the raw data and the know-how, one or both assets become more "portable" and reusable.

Note that today's technology "silver bullet" is XML (eXtensible Markup Language), but even the silver bullet needs some help in the form of data rationalization. Reducing reliance on canned "buckets" lightens the burden of wrapping data in suitable XML descriptors and thereby facilitates the creation and transmission of self-describing data between entities. On the other hand, wrapping canned data buckets in XML metadata is often a fruitless exercise, because once the data arrives the definitional conflicts and confusion begin.

Quality improvement efforts should focus separately on source data quality and "rules" quality. The two require very different approaches and usually involve very different teams. In the end, overall quality depends on success in addressing both, but muddling the two is not likely to yield such success.

Implementing the suggested approach requires investment in hardware - whether bought conventionally or temporarily accessed "on demand" - to have sufficient power to provide quick response time for queries against huge sets of source data. More significantly, it requires a rethinking of application design to take advantage of brute force computing capabilities, probably starting with compliance-significant applications.

The prerequisite investment is becoming affordable and costs continue on a downward trend, so pessimism about data quality is not warranted. On the other hand, the cost of paying the people who make up the bucket brigade is increasing, as are the commercial and compliance risks, so conditions are right for doing what is suggested.

   Bucket Reduction
Focusing on Source Data