Data hubs are an important component in information architecture. However, they are rather diverse, and this diversity often means that the term “hub” means quite different things to different people. It also means that a definition of “data hub” is inevitably going to be rather generic. The following definition is used here:
A data hub is a database which is populated with data from one or more sources and from which data is taken to one or more destinations.
A database that is situated between one source and one destination is more appropriately termed a “staging area”. By contrast, a data hub must have more than one source populating it, or more than one destination to which data is moved, or both multiple sources and destinations.
Why Data Hubs?
The more that data is understood to be an enterprise resource that needs to be shared and exchanged, the more likely it is that data hubs will appear in enterprise information architectures. Data can indeed be shared and exchanged via point-to-point interfaces between pairs of applications, and this is a valid architectural alternative to data hubs. A point-to-point interface that moves data from one application to another is much simpler to implement than any hub, and it seems that all enterprises now have gigantic spiderwebs of such interfaces that have grown up over the years. However, data hubs provide – at least in theory – an attractive alternative to point-to-point interfaces. In my experience, point-to-points have the following issues (amongst others):
- They often needlessly replicate movement of the same data.
- They typically have poor controls around them, and minimal governance.
- They are typically difficult to modify.
- They are typically poorly documented and understood.
- They tend to promote coupling and fusion of applications into a giant monolithic enterprise silo that is very difficult to evolve in line with business changes.
- They rely on the application pair involved to do things like data integration and transaction integration.
With the exception of the last point, it could be argued that the other issues with point-to-points are due to poor implementation rather than inherent architectural limitations. There may be some truth in this, but, in practice, these issues can be found in the great majority of point-to-point interfaces.
Data hubs, therefore, may present a better alternative, although we need to be cautious. Just as it is possible to implement a good point-to-point interface, so it is possible to implement a data hub poorly. That said, what exactly do data hubs offer? The answer really depends on the type of hub being considered. In what follows, we briefly review six common styles of hub from the perspective of their architectures.
The Publish-Subscribe Data Hub
Figure 1 shows a representation of a simple publish-and-subscribe data hub.
Figure 1: Simple Publish and Subscribe Hub
One or more applications, called publishers, place data they produce in the hub. The hub can “pull” the data from the publisher, or the publisher can “push” the data to the hub. Other applications, called subscribers, take specific data sets from the hub. Again, the subscribers may “pull” the data out of the hub, or the hub may “push” the data to the subscribers.
Typically, there is no integration in a publish-and-subscribe data hub. Each publisher’s data set is just staged as is, and taken by a subscriber. What the hub can do is coordinate the pushing and pulling of data by recognizing when a publisher is ready to publish, and informing a subscriber when data is available. This can be tied into enforced service level agreements (SLAs).
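The staging-and-notification behaviour described above can be sketched in a few lines. This is a minimal illustration, not a real middleware product; all class, topic, and field names are invented for the sketch:

```python
# Minimal sketch of a publish-and-subscribe hub (all names are illustrative).
# Publishers push data sets into the hub; the hub stages each set as-is and
# notifies the subscribers registered for that topic. No integration occurs.

class PubSubHub:
    def __init__(self):
        self._staged = {}        # topic -> latest published data set
        self._subscribers = {}   # topic -> list of notification callbacks

    def subscribe(self, topic, callback):
        """A subscriber registers a callback to have new data 'pushed' to it."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, data_set):
        """A publisher 'pushes' a data set; the hub stages it and notifies."""
        self._staged[topic] = data_set
        for notify in self._subscribers.get(topic, []):
            notify(data_set)

    def pull(self, topic):
        """A subscriber 'pulls' the most recently staged data set on demand."""
        return self._staged.get(topic)


received = []
hub = PubSubHub()
hub.subscribe("orders", received.append)
hub.publish("orders", [{"order_id": 1, "amount": 99.5}])

print(received)            # the data set pushed to the subscriber
print(hub.pull("orders"))  # the same staged data, pulled on demand
```

Note that the hub only coordinates movement; enforcing SLAs would hang off the same publish/notify points.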
However, governance by the group controlling this kind of hub is often weak, and the hub runs the risk of becoming little more than a common area in which pairs of applications interact to transfer data from one to the other. The danger is that the hub may simply facilitate the growth of virtual point-to-point interfaces.
The Operational Data Store (ODS) for Integrated Reporting
Figure 2 shows this hub architecture.
Figure 2: ODS for Integrated Reporting
This style evolved from the need to shift reporting out of transactional applications because reporting typically degraded their performance. Initially, the databases of transaction applications were simply replicated and the reports run off the replicas. Then, it was realized that data from several applications could be integrated into a hub and integrated reporting run from the hub.
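The integration step that distinguishes this hub from simple replication can be sketched as follows. This is a toy illustration under assumed source shapes; the source systems, field names, and join rule are all invented:

```python
# Sketch of integrating data from two transactional applications into one
# ODS table for reporting. Source names and field mappings are invented.

crm_customers = [
    {"cust_no": "C-001", "full_name": "Ann Lee", "region": "EMEA"},
]
billing_customers = [
    {"id": 1, "name": "Ann Lee", "balance": 120.0},
]

def integrate(crm_rows, billing_rows):
    """Join the two sources (here, naively on customer name) into one
    integrated ODS record per customer."""
    billing_by_name = {r["name"]: r for r in billing_rows}
    ods = []
    for c in crm_rows:
        b = billing_by_name.get(c["full_name"], {})
        ods.append({
            "name": c["full_name"],
            "region": c["region"],
            "balance": b.get("balance"),  # None if billing has no match
        })
    return ods

ods_table = integrate(crm_customers, billing_customers)
print(ods_table)  # one integrated row spanning both applications
```

An integrated report can then be run against `ods_table` without touching either transactional application.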
The term “operational data store” (ODS) is applied to this kind of hub, but the term is no longer precise, as its meaning has shifted over the years.
What is important about the ODS for integrated reporting is that it does implement integration. However, this architecture is today at odds with real-time data warehouses and marts, which are probably better served by the methodologies, toolsets, and products available for building them than the ODS for integrated reporting is. Furthermore, the question of how much and what kind of history to keep in this kind of ODS seems to be perennial, with no theoretical answer.
If the current trends continue, it does seem that real-time warehouses and marts will be the location of integrated operational reports (i.e., reporting on current activities) as well as historical analyses. If so, the ODS for integrated reporting may become legacy architecture.
The ODS for Data Warehouses
Figure 3 illustrates this kind of hub.
Figure 3: ODS for Data Warehouses
Again, data from many transactional applications is integrated in the hub. It is then sent by multiple feeds to the data warehouse layer where there may be more than one warehouse. Of course, further integration of data may occur in the warehouse layer, with additional data coming from applications whose data may not be integrated in the ODS.
This architecture is challenged by a couple of trends. The first is that warehouse architectures themselves typically contain integration areas, so it is questionable why an ODS would be needed upstream of the warehouse. Secondly, the move to real-time warehouses and marts means that transactional applications can now send data in messages to the warehouse layer in real time (or near-real time) and have integration happen there.
The Master Data Management (MDM) Hub
Figure 4 shows the MDM hub.
Figure 4: MDM Hub
Integration still must happen with MDM. However, a major point of variation is that some of the applications producing data are likely to be external data providers with whom the enterprise has subscriptions. To some extent, subscription management may need to be undertaken in the hub.
Perhaps the biggest difference with the MDM hub is that data content management is necessary. This means there has to be functionality to permit human operators to analyze and update the data. Most master data domains are simply too complex for automated management, and human intervention is required. Modern MDM tools have features like workflow automation to meet these requirements.
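One common form this human-in-the-loop content management takes is match triage: confident matches against master data are merged automatically, while uncertain ones are routed to a work queue for an operator. The sketch below assumes a simple string-similarity rule and an invented threshold; real MDM tools use far richer matching:

```python
# Sketch of MDM-style data content management: candidate records are matched
# against master data automatically, and uncertain matches are routed to a
# human review queue rather than merged blindly. Threshold and names invented.

from difflib import SequenceMatcher

AUTO_MERGE_THRESHOLD = 0.9  # assumed cut-off for automatic merging

def similarity(a, b):
    """Crude string similarity in [0, 1] as a stand-in for real match rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage(incoming_name, master_names):
    """Return ('merge', match) for confident matches,
    ('review', best_candidate) for ones needing a human operator."""
    best = max(master_names, key=lambda m: similarity(incoming_name, m))
    if similarity(incoming_name, best) >= AUTO_MERGE_THRESHOLD:
        return ("merge", best)
    return ("review", best)   # lands in the human work queue

masters = ["Acme Corporation", "Globex Inc"]
print(triage("Globex Inc", masters))   # exact match: merged automatically
print(triage("Acme Corp.", masters))   # ambiguous: sent for human review
```

The "review" branch is where workflow automation in modern MDM tools takes over, assigning the item to a data steward.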
Message Hub
Figure 5 shows the message hub.
Figure 5: Message Hub
A message hub manages the integration of data contained in real-time (or near-real-time) messages flowing through some kind of middleware, such as an enterprise service bus (ESB).
This kind of hub typically has very explicit “command and control” functionality that orchestrates the messages by making them conform to message models that hopefully correspond to real business events. This may require switching messages from one queue to another, or waiting until a set of messages arrive before processing them as a whole logical unit of work. There may also be complex failure recovery scenarios that the hub manages.
In reality, transaction integration needs often exist in other hub architectures too, where batch data is moved by ETL jobs. They’re just not nearly as visible as they are in the “hub and spoke” architecture of a message hub.
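One of the “command and control” behaviours described above, holding correlated messages until the full set for a business event has arrived and then releasing them as one logical unit of work, can be sketched as follows. The message types and correlation scheme are invented for illustration:

```python
# Sketch of a message hub that waits for a complete set of correlated
# messages before processing them as one logical unit of work.
# Message types, correlation IDs, and payloads are illustrative.

class MessageHub:
    def __init__(self, expected_parts):
        self.expected_parts = expected_parts  # message types forming one event
        self._pending = {}                    # correlation_id -> {type: payload}
        self.completed_units = []             # fully assembled units of work

    def receive(self, correlation_id, msg_type, payload):
        parts = self._pending.setdefault(correlation_id, {})
        parts[msg_type] = payload
        if set(parts) == self.expected_parts:
            # The whole business event has arrived: release it as one unit.
            self.completed_units.append(self._pending.pop(correlation_id))

hub = MessageHub(expected_parts={"order", "payment"})
hub.receive("evt-1", "order", {"sku": "A1"})
print(hub.completed_units)   # [] -- still waiting for the payment message
hub.receive("evt-1", "payment", {"amount": 10})
print(hub.completed_units)   # one complete unit of work, ready to process
```

Failure recovery (e.g., timing out an event whose messages never complete) would hang off the same `_pending` state.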
Integration Hub
Figure 6 shows the integration hub.
Figure 6: Integration Hub
This kind of hub serves to integrate data flowing via batch movement and/or messaging. It also supplies the warehouse layer, which does not repeat the integration that has already been performed. The principle underlying this hub is that integration is done only once, in one place. Quite often, transactional applications themselves perform integration, and much of it is redundant: each application integrates in its own way, and subtle differences make the data inconsistent. Further downstream integration (e.g., in a warehouse) then surfaces these inconsistencies as data quality issues.
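The integrate-once principle can be illustrated with a single canonical transformation applied in the hub, so that a batch feed and a real-time message both emerge in the same shape. The field names and normalisation rules here are invented:

```python
# Sketch of "integrate once, in one place": one shared canonical rule is
# applied in the hub, so neither the warehouse layer nor individual
# applications repeat (and subtly diverge on) the same integration.
# Field names and normalisation rules are invented for illustration.

def canonical_customer(raw):
    """The single shared integration rule: normalise names and country codes."""
    return {
        "name": raw["name"].strip().title(),
        "country": raw["country"].strip().upper()[:2],
    }

# A batch row and a real-time message pass through the same rule, so all
# downstream consumers see one consistent canonical form.
batch_row = {"name": "  ann lee ", "country": "gb "}
message_row = {"name": "ANN LEE", "country": "GBR"}

print(canonical_customer(batch_row))    # {'name': 'Ann Lee', 'country': 'GB'}
print(canonical_customer(message_row))  # identical canonical form
```

Had each application applied its own version of this rule, the subtle differences (trailing spaces, casing, three-letter country codes) would surface downstream as data quality issues.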
Conclusion
This brief overview shows how data hubs can be quite varied. There are undoubtedly other kinds of hubs, and variants and hybrids of what has been discussed. Also, many characteristics of hubs have been omitted in this overview. What is important is to think about what style of hub fits where in the enterprise architecture and how it relates to the totality of the enterprise architecture.
source: http://www.b-eye-network.com/view/8109