Data Modeling Is Hard vs. Data Modeling is Hard

Originally published by The Open Group as part of the March 2017 Technical Interchange Meeting of the Future Airborne Capability Environment (FACE™) Consortium.

This paper outlines and discusses several real and perceived aspects of data modeling when following a highly prescribed and connected data architecture. We explore where difficulty and effort are currently being expended and how that affects both the desire to build these syntactic and semantic models and the projected benefit of doing so.

The path to building, maintaining, and adopting these data models and the supporting data architecture is not to make it easier to build multiple models from scratch. How many times do we need to build a conceptual representation of an air vehicle entity? Each time we build an avionics-related component? No, we simply need one. The only real path to easing the burden on software providers is to get to the point where they can begin with the 90% solution, allowing the rest to be integrated into the model with the smallest increment of effort.

This topic is complex and mired in technical details and nuances. We would rather this paper not be used as a sleep aid, so please, join us for a conversation on data modeling approaches, processes, tooling, and the resulting ramifications of having a ‘good’ model.

1. Introduction

The BITS (BALSA (Basic Avionics Lightweight Source Archetype) Integration & Test Session) out-brief at the December 2016 meeting of the Future Airborne Capability Environment (FACE™) Consortium was a fascinating affair. The presenters lined up, showcasing their excellent work with the BALSA framework and how they were able to quickly and easily integrate their solutions using the FACE architecture. They demonstrated how they developed a modular solution, insulating their system against obsolescence concerns and providing forward compatibility. It was even shown how they could replace pieces of the BALSA framework with their own implementations. They extolled the clear boundaries and how easy it was to connect their modules to these interfaces. They lavished praise on the FACE architecture and utility of BALSA. And they produced exciting demonstrations of capability that any team using the FACE architecture would be proud to include in their portfolio of accomplishments.

Then the other shoe dropped. These comments were common throughout the presentations.

  • “But we spent so much time on the data model.”
  • “Data modeling is hard.”
  • “Why are we even bothering with this data model?”
  • “It took me too much time to add my content.”
  • “The tools were not useful.”

These comments came as a bit of a surprise. It is not surprising that someone found data modeling to be difficult. Almost anyone who has worked with large data models is painfully aware of how challenging it can be. Decomposing our view of the world is quite challenging, especially when there are several engineers working to develop a consistent product. This is further compounded because each member of the team is predisposed to thinking about the information in a particular way.

Furthermore, our thinking has been heavily influenced by decades of software development where the messages we use for communication are often considered to be the data model. We mistake the messages for the interfaces and when we make this assumption, we end up with single use data models that are useful only as conformance artifacts. Data models developed this way have little additional use.

While each one of these concerns is significant and deserves to be addressed at length, they fail to capture a far more fundamental challenge: one expressed using the exact same words, but with a completely different meaning. Data modeling is hard.

Consider, for a moment, the simplicity of the BALSA data model. It is not filled with hundreds of entities. It is not even filled with dozens of entities, nor are the entities in the data model cross-connected with complex relationships that need to be navigated.

How then, if the BALSA data model is so simple, can data modeling be so hard?

Therein lies the disconnect. Creating large data models is difficult for a multitude of reasons. It is difficult for a team to build a large data model with any degree of consistency. It is hard to decompose entities to a comparable level of granularity. It is hard to develop a rigorous naming convention.

But this is not a large model being worked on by a large team. What is the challenge?

The actual mechanics of data modeling.

Forget, for a moment, about the challenges that stem from developing large models. These users are not struggling with modeling at scale. They are having a hard time with modeling basics. How is the data entered? Why is the data entered three different times at three different levels of abstraction? How many ways are these different levels of abstraction connected? Why is it nearly impossible to define a new measurement? Why do I have to click twenty times to add a single property to my entity?
Without drilling down to a deeper understanding, we will continue to talk past each other about why data modeling is truly difficult and fail to move this aspect of the FACE Technical Standard forward.

2. Mechanical Difficulties

After hearing of the difficulties the BITS participants were having with data modeling, we set out to test this idea within our team. As longtime members of the Data Architecture Working Group (for this and other standards) with years of experience modeling with Enterprise Architect by Sparx Systems, we fully expected the data modeling process to be relatively trivial. Our goal was to build a small, valid data model with a limited set of measurements and entities for use by our test team.

To accomplish this, we needed to create five new entities each with an average of four properties. Some of these properties were observables and some were other entities. Some of the observables were represented by numeric values, some with text, and some with a list of values (enumerations).

Modeling at the conceptual level was relatively pain-free. It was trivial to create the entities and point them to the corresponding observables and entities. The amount of effort required was completely in line with expectations.

Surprisingly, it took around twenty mouse clicks to add the first property to an entity. This may not be the most useful metric since it does not take long to run up a tab of twenty mouse clicks, but it is indicative of the amount of attention given to adding an attribute. This number of mouse clicks captures the effort required to add a new attribute (2 clicks or Control-N), set its visibility property (3 clicks), connect it to its corresponding observable or entity (4 clicks + navigating through the model to find the observable, or 6 clicks if you use the search options), and set its stereotype (6 clicks). It may not have been necessary to expend all of those mouse clicks getting the settings exactly right, but these properties were set to have comparable documentation to those already in the model.

The next level of modeling – the logical level – was far more difficult than was expected. While reusing an existing measurement was relatively easy, creating a new measurement was difficult. Creating the individual pieces of the measurement and measurement system was not terribly challenging since the meta-model (the rules for how the FACE Technical Standard says things should be glued together) describes how these pieces are intended to fit together. The primary difficulty arose in fitting the pieces together such that they produced a valid model when exported from the tool.

This same pain was experienced while modeling at the platform level as well. There were many values that needed to be set properly to wire the model together. A clear understanding of the meta-model does not directly translate into how things are done in the tooling.

The tools from Vanderbilt University were used to export the data model, and the error messages generated by that tool were used to ensure a valid model was exported. This step turned out to be a challenge as well. While the error messages were very accurate, our team had a difficult time correlating the error messages with the offending entities in the model. We developed a couple of custom SQL queries to mine the model for additional information, but those queries failed to locate the offending elements.

Ultimately, we found that the best source of understanding came from inspecting the (partially) exported XMI file for missing connectors. This reinforces the difficulty in the mechanics of data modeling. The graphical tools are intended to make it easier to build models and to abstract the esoterica that is the XMI representation. In this case, our team found it more useful to reference the user-unfriendly text-based format for debugging.
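Part of this kind of inspection can be mechanized. The following is a minimal sketch in Python, not the exporter's own validation and not anything defined by the standard. It assumes that every model element carries an `xmi:id` attribute and that references to other elements live in attributes named `type` or `realizes`. Those attribute names, the namespace URI, the sample document, and the function `find_dangling_references` are all illustrative assumptions; a real export would need its actual reference attributes substituted.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for this sample; a real export may declare a different URI.
XMI_NS = "http://www.omg.org/XMI"
XMI_ID = f"{{{XMI_NS}}}id"

# Attribute names assumed (for illustration only) to hold references to other elements.
REFERENCE_ATTRIBUTES = ("type", "realizes")

# A tiny, made-up document: the "position" composition points at an element
# that was never exported, which is the kind of missing connector we want to find.
SAMPLE = """
<model xmlns:xmi="http://www.omg.org/XMI">
  <element xmi:id="obs_temperature" name="Temperature"/>
  <element xmi:id="ent_engine" name="Engine">
    <composition xmi:id="cmp_temp" rolename="temperature" type="obs_temperature"/>
    <composition xmi:id="cmp_pos" rolename="position" type="obs_position"/>
  </element>
</model>
"""


def find_dangling_references(xml_text):
    """Return (element id, attribute, target) for references to unknown ids."""
    root = ET.fromstring(xml_text)
    known_ids = {el.get(XMI_ID) for el in root.iter() if el.get(XMI_ID)}
    dangling = []
    for el in root.iter():
        for attr in REFERENCE_ATTRIBUTES:
            target = el.get(attr)
            if target is not None and target not in known_ids:
                dangling.append((el.get(XMI_ID), attr, target))
    return dangling


dangling = find_dangling_references(SAMPLE)
for element_id, attr, target in dangling:
    print(f"{element_id}: {attr} -> {target} (no such element)")
```

For the sample above, the only reference reported is the `type` of `cmp_pos`, which names an observable that does not exist in the file.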

It is worth mentioning that the XMI format is the only officially recognized data model representation. The comment about its lack of user-friendliness is not a criticism of the standard, but rather a statement of the difficulty required in parsing a verbose text file instead of viewing diagrams in a software tool. Our team frequently works directly from the XMI, but this came at the cost of many hours of studying data models in this format.

3. Other Difficulties

Data modeling mechanics are not the only challenges encountered when building a data model. While data modeling has been widely practiced in the world of databases for decades, the data models prescribed by the FACE Technical Standard are a slight twist on traditional practices. Adjusting to different modeling concepts may not necessarily be a challenge for seasoned modelers, but it does compound the learning curve for first-time modelers. Not only are there fewer experts to consult, there are few sources (beyond the standard itself) that explain the novelties of these modeling practices.

There is also a small perception that data modeling is merely busy work that we must do to achieve conformance. While a data model is necessary to achieve conformance, conformance does not represent the complete picture of the model’s utility. The Conformance Test Suite (CTS) analyzes the data model along with the corresponding software’s object code in order to ensure that the software is constructed as advertised. It does not verify the logical behavior of the software, but it does verify that the software interfaces interact with only precisely what is documented in the data model. Although the data model could be shelved after conformance is achieved, it has much more utility.

First and foremost, the data model can replace an interface control document (ICD). If there are behavioral aspects captured in your ICD, you may still need some traditional documentation, but the data model is an unambiguous and consistent documentation of your interfaces. Not only is the data model machine readable (most XML is), the data model is also machine understandable. It is possible to write software that can interpret the documentation contained within the data model.

So what is that useful for? I’m glad you asked. Since the data model captures both the syntax (this is what most ICDs represent) and the semantics (this is what the data actually means and is typically represented in the ICD prose) of the data used in interfaces, this data can be leveraged to facilitate (or even automate) integration between systems.

4. The Coup de Grâce

Up to this point, the mechanical difficulties have really been focused on the creation aspect of the data model, but there are many reasons we might consider changing our models. In some cases, there are actual defects (errors) in the model that need to be corrected. In others, we may come to a better understanding of what our data means allowing us to increase our semantic specificity. We may also need to change the relationships between certain entities to account for a new use case.

Consider the following, relatively simple, three element data model that shows two entities connected by an association. All levels of abstraction plus a unit of portability are depicted.

Now, imagine that we wish to take attribute ‘b’ out of entity A and move it into its own entity with a different measurement representation. The move in the conceptual space is as follows:

As you will see from the discussion below, this is a very typical data refactoring pattern that we frequently employ as we construct more and more accurate data models. Let’s take a look at the impact that simple change has on the overall model:

All of the nodes with a red dot show places where the modeler must make an update. Moving a single attribute at the conceptual level caused a ripple that affected 11 other points in the model. While this is just an example, it is very indicative of the challenge one simple change can make. This makes data modeling hard.
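The ripple can be thought of as a walk over a dependency graph. The sketch below is only an illustration in Python: the element names and the edges in `DEPENDENTS` are invented for this example and are not taken from the figure, so the count it produces is not the 11 cited above. It shows the technique a refactoring-aware tool could apply, namely following "realizes" and "references" links downward from the changed conceptual attribute.

```python
from collections import deque

# Hypothetical dependency edges: each key is a model element and each value
# lists the elements that realize or reference it at the next level down.
DEPENDENTS = {
    "conceptual.A.b": ["logical.A.b"],
    "logical.A.b": ["platform.A.b", "logical.measurement_b"],
    "platform.A.b": ["view.A_status.b"],
    "logical.measurement_b": ["platform.idl_type_b"],
    "view.A_status.b": ["uop.port_A_status"],
}


def impacted_by(changed, dependents):
    """Return every element reachable from `changed`, in breadth-first order."""
    seen = []
    queue = deque(dependents.get(changed, []))
    while queue:
        element = queue.popleft()
        if element in seen:
            continue
        seen.append(element)
        queue.extend(dependents.get(element, []))
    return seen


impacted = impacted_by("conceptual.A.b", DEPENDENTS)
print(f"{len(impacted)} elements to revisit:")
for element in impacted:
    print("  " + element)
```

Even in this toy graph, one conceptual change reaches the logical level, the platform level, a view, and a unit of portability port. Done by hand, each of those is a separate place to forget an update.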

Maslow’s Hammer

Let’s face it. We get used to certain tools. Master builders have a favorite chisel. Software engineers often prefer a particular coding environment. And data modelers are no exceptions. We have our tools and after fighting with them for many, many, many hours, have finally learned to bend them to our will…mostly. In many cases, our companies have institutionalized these tools and built their in-house tool chains around them. Enter Maslow’s Hammer, the overreliance on a familiar tool.

“I call it the law of the instrument, and it may be formulated as follows: Give a small boy a hammer, and he will find that everything he encounters needs pounding.” – Abraham Maslow

Have you ever tried driving a screw with a hammer or driving a nail with a screwdriver? It’s absurd, yet we do this with our tooling environments.

Because we can make it work, and because doing so spares us from reformulating our entire toolchain and investing in a new set of tools, we accept a huge hit in productivity (or produce poorer quality data models) from tools that are just not fit for purpose. We bend our UML modeling tools to work with our semantic data models using clever scripts and profiles, but we keep ourselves bound to these incompatible workflows.

Before we go any further, it is necessary to point out something extremely important.

Just because it’s hard to do, doesn’t make it wrong.

All of this difficulty does not make data modeling wrong. And the compounding of the difficulty does not mean that data modeling leads to diminishing returns. On the contrary. As mentioned above, the more accurately the semantics are documented in a data model, the more likely we are to be able to automate integration.

It is possible to address virtually all of these difficulties with tooling. Further, it is possible to develop a customized workflow that makes data modeling very easy. The following shows a tool capable of drag-and-drop entity building.

These kinds of tools can also simplify the complex data model refactoring previously discussed. When the tool is aware of the data modeling conventions, a change at one level of the data model can easily be percolated through the other levels of abstraction.

5. Addressing the Difficulties

There are several approaches to addressing the difficulties associated with data modeling. First and foremost, we can start with education.  It would be valuable to have a go-to source that explains:

  • How to build a data model from beginning to end
  • Why we build data models at three levels of abstraction
  • Examples of each data model level
  • How to construct a new measurement
  • How to construct a new measurement system
  • How to connect the different levels of the data model together
  • How to build measures with continuous values
  • How to build measures with discrete values (enumerations)

Education alone, however, will not simply make data modeling a pain-free exercise. Until better tools emerge that allow us to develop these canonical data (i.e. not message) models, we will continue using our hammers for everything.

Tooling aside, there are some basic practices we can embrace in order to build better data models. These principles apply regardless of the tooling you find yourself using (hammer or scalpel alike) and can start your organization on the path to better data models.

The goal is model reuse and extension rather than one-off development and abandonment.

6. Data Modeling Basics

Start with a Shared Data Model

Building data models from scratch is a daunting task. Imagine having to recreate the entire set of building blocks for each model. This only needs to be done the first time. Subsequent projects can (and should) benefit from this effort by reusing this groundwork.

This is precisely what the Data Architecture Working Group (DAWG) has done by creating the Shared Data Model (SDM). The Shared Data Model is the starting point for a new data model. It obviates the need for teams to start from scratch and provides them an approved (and managed) set of observables and measurements. For most users, this will be a wholly sufficient starting point, and users who find the model lacking can follow the FACE Problem Report/Change Request process for getting the necessary information added.

Further efficiencies can be realized when one also starts with a Domain Specific Data Model which details entities and relationships germane to your system. As stated earlier, how many times does the concept of a sensor need to be captured and documented in a model?

What Makes a Good Data Model?

Taken as a whole, model quality is a subjective measure. As such, this information will be subject to the biases and best practices of the authors. Although data science has shown that certain modeling practices are superior to others, the community has not determined the extent to which those practices extend to this domain. Additionally, model quality may be impacted by cost or schedule drivers and traditional metrics of “goodness” may not suffice.

While this standard effectively pushes data modeling into the spotlight, it is not a new concept. This is a well-studied and well-understood discipline. In the world of databases, model quality is usually discussed in terms of the “normal forms.” The normal forms are formal representations of how the data is represented and stored in order to reduce redundancy and promote data integrity.

While a complete presentation of normal forms is beyond the scope of this paper, we can still talk about some basic concepts. Consider the following example of how an email contact could be modeled:

The early personal organizers that only allowed a single email address used this type of representation.  In the early years of email, this was probably sufficient.  After all, who would ever have more than one email address?

While this may be a convenient implementation, it is not very flexible.  First, in order to allow a person to have two email addresses, you would have to add them twice (raise your hand if you remember doing that – now put your hand down, you’re reading a paper).  Second, what happens if you know two people with the same name but different email addresses?  Or, what is much worse, what happens if one person transfers their email address to another person?

An implementation that offers more flexibility follows:

In this example, the name is coupled to an email address using a third reference.  This type of relationship is called an association.  In this case, Email Assignment only has two properties (its so-called associated entities), but it could have additional attributes that further describe the relationship such as when the email address was assigned to the contact.
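To make the association concrete, here is a minimal sketch in Python. The class names `Contact`, `EmailAddress`, and `EmailAssignment` follow the description above; the `assigned_on` field, the helper `addresses_for`, and the sample data are assumptions added for illustration. Contacts are compared by identity rather than by name, so two people who share a name remain distinct.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(eq=False)  # identity, not the name, distinguishes two contacts
class Contact:
    name: str


@dataclass(frozen=True)
class EmailAddress:
    address: str


@dataclass
class EmailAssignment:
    """The association: couples one contact to one email address."""
    contact: Contact
    email: EmailAddress
    assigned_on: Optional[date] = None  # optional attribute describing the relationship


def addresses_for(contact: Contact, assignments: List[EmailAssignment]) -> List[str]:
    """Collect the addresses currently assigned to one specific contact."""
    return [a.email.address for a in assignments if a.contact is contact]


# Two different people with the same name.
pat_one = Contact("Pat Smith")
pat_two = Contact("Pat Smith")
shared = EmailAddress("pat@example.com")

# One person with two addresses: two assignments, no duplicated contact.
assignments = [
    EmailAssignment(pat_one, EmailAddress("pat.smith@example.org")),
    EmailAssignment(pat_one, shared),
]
before_transfer = addresses_for(pat_one, assignments)

# Transferring an address replaces one assignment; neither contact changes.
assignments[1] = EmailAssignment(pat_two, shared, assigned_on=date(2017, 3, 1))
after_transfer_one = addresses_for(pat_one, assignments)
after_transfer_two = addresses_for(pat_two, assignments)
```

Each of the three awkward cases (multiple addresses, shared names, transferred addresses) is handled by adding or replacing an assignment rather than by editing the contact or the address.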

7. Composition & Association…of Bodies and Kidneys

When data modeling, modelers are constantly assessing how two entities should be related. What is the nature of their relationship?  Is it an “always” relationship?  If so, that implies a relationship known as composition.  If it is a “sometimes” relationship, that implies something called association.

Although this is a useful rubric for discriminating between composition and association, it is not entirely sufficient because we may need different fidelity in data models in different domains.  Does that mean we are actually talking about different things?

Consider the following simplistic model of the human body.

Very much like the “single email” example from above, this model indicates that the human will always have a kidney and the kidney will always be a part of the human.  For many domains, this model is entirely sufficient and is a robust enough way to describe a human being.  However, what happens when we try to use this model for a transplant surgeon?

As you probably suspected, the model must change.  In the domain of a surgeon, horror movies, and other certain circumstances, a kidney may not be a part of the same human forever.
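The two modeling choices can be contrasted in a short Python sketch. All class and function names here (`HumanWithComposedKidney`, `KidneyPlacement`, `transplant`) are hypothetical and chosen only for illustration. In the composed form the kidney exists only as a part of its human; in the associated form a separate object records which human currently has which kidney.

```python
from dataclasses import dataclass, field


@dataclass
class Kidney:
    label: str


@dataclass
class HumanWithComposedKidney:
    """Composition: the kidney is created with, and belongs to, this human."""
    name: str
    kidney: Kidney = field(default_factory=lambda: Kidney("own"))


@dataclass(eq=False)
class Human:
    name: str


@dataclass
class KidneyPlacement:
    """Association: relates a kidney to whichever human currently has it."""
    kidney: Kidney
    human: Human


def transplant(placement: KidneyPlacement, recipient: Human) -> None:
    """Re-point the association; the kidney and both humans are untouched."""
    placement.human = recipient


donor = Human("donor")
recipient = Human("recipient")
kidney = Kidney("left")
placement = KidneyPlacement(kidney, donor)
transplant(placement, recipient)
```

The composed form has no natural way to express the transplant, while the associated form handles it by changing one reference. The price is an extra object to create and navigate.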

This relationship increases the complexity of the data modeling process, if only slightly. This additional relationship does not make the data model invalid for the simpler (non-surgical) application, which raises an interesting question. Should we start building models with additional complexity since it doesn’t seem to hurt? We often base this decision on the difficulty of changing models later, but is this right?

The answer is: it depends. There is always a cost, typically in time, to creating a more sophisticated model. Models with more complicated relationships can also be more difficult to navigate (e.g. it might be easier to start looking for a kidney entity in the human body entity). And it depends on your domain. If it is likely that you will need more sophistication in the future, then it might make sense to add it. If you are not likely to ever need it, then don’t bother. If it isn’t part of your domain and your systems (and the systems with which you integrate) do not talk about these concepts, there is no need to add any additional complexity.

What if requirements change? What if you find that you need this additional sophistication in the future? Don’t worry about it. It is possible to version control data models and update all of the associated documentation (i.e. modeled entities and documented views) as the data model changes. But this shouldn’t be managed manually because, as shown above, there is a lot of structure in the model.

As you start developing models, you will face many decisions about what goes where. Here are a few guiding principles that shape where and how we place model content. Again, strive not to let tooling and process-related ‘hardness’ shape the model’s content.

Selecting Attributes

How do you know if you have a misplaced attribute (e.g. Engine Temp)?  There are a couple of good guidelines to use for this one.  First, does the attribute make sense without adding information in the label?  Consider the following example:

If temperature is intended to refer to the temperature of the aircraft, then this construction follows recommended practices.  Why?  This style of modeling reduces ambiguity because it does not rely on information in the label (the name of the attribute) to give it any additional meaning.

On the other hand, what if the model looks like this:

This temperature is not intended to refer to the aircraft at all! This temperature is clearly referring to the temperature of the engine. This construction forces a relationship between an aircraft and an engine temperature and leads to several semantic questions.

Will an aircraft always have an engine temperature?  What if the aircraft is a glider?  What if the aircraft has to jettison its engines?  What if engines are removed for maintenance?  Does an engine have a temperature even if it is not part of an aircraft?

These questions hint toward the real source of the problem.  The temperature is a property of an engine and not an aircraft.  When we consider conceptual modeling, does an aircraft really have the concept of an engine temperature?

Perhaps the most common argument in support of this modeling construct is that the messages our system sends have the engine temperature in the same message as the vehicle status. Isn’t it a good idea to keep this information together?

Keeping the data together is an excellent practice especially if that representation is a reflection of your system’s implementation.  However, the data model is not intended to reflect a single implementation nor is it intended to mirror message packing structures.  The purpose of the data model is to provide a rigorous, consistent, and unambiguous structure for documenting our interfaces.

For this example, the recommended modeling approach would be as follows:
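A minimal sketch of such a structure, written in Python rather than in a modeling tool, is shown here. The class names and the Kelvin values are assumptions made for illustration. Whether an engine is composed into an aircraft or merely associated with it depends on the domain, as discussed earlier; a plain list is used only for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Temperature:
    kelvin: float


@dataclass
class Engine:
    # The label is simply "temperature"; the owning entity supplies the meaning.
    temperature: Temperature


@dataclass
class Aircraft:
    temperature: Temperature  # the temperature of the aircraft itself
    engines: List[Engine] = field(default_factory=list)


# A glider is an aircraft with no engines, so it has no engine temperature.
glider = Aircraft(temperature=Temperature(288.0))

# A twin-engine aircraft has one temperature per engine.
twin = Aircraft(
    Temperature(288.0),
    [Engine(Temperature(900.0)), Engine(Temperature(905.0))],
)
engine_temps = [engine.temperature.kelvin for engine in twin.engines]

# An engine removed for maintenance still has a temperature of its own.
bench_engine = Engine(Temperature(293.0))
```

With the temperature placed on the engine, the questions about gliders, jettisoned engines, and engines on a maintenance bench each have a direct answer in the structure, and no label needs to carry the word "engine".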

Do Not Put Semantics in the Labels (Continued)

As discussed in the previous section, when identifying information is put into the attribute’s label, we have effectively put the semantics into the label itself.  This then leads us down the path of requiring a human (data modeler or systems engineer) to be involved in the integration and, once again, we have ambiguity in the data model.  This is exactly the situation we are trying to avoid by introducing the data model in the first place.

This touches on a discussion regarding entity uniqueness in the model. While that remains a recommended practice, consider what happens when semantics are placed in the labels. Once again, when there is only a single violation (engine temperature), it isn’t too hard to determine what the modeler intended, but what happens when there are many such attributes?

At some point, it just becomes easier to add another attribute than to even try to figure out what a previous engineer intended.  This is the pinnacle of model erosion and the model loses the ability to be of much use beyond conformance (and there is even some argument as to how useful it is as an input to conformance).

Model for characteristic uniqueness.

Characteristics are intended to describe the properties of the entity to which they are added. What does it mean if a property is added to the entity more than once?

Can a person have more than one position?  It certainly makes sense to represent the position in multiple ways (this occurs at the logical level where uniqueness is not required), but the person has only a single position.  Try as I often do, I cannot be in two places at once.

Does a person have more than one kidney? Some do, some don’t, but this is a matter of multiplicity. A person only has one concept of a kidney. If there is something fundamentally different between the left and right kidney, the kidney entity should be further decomposed and modeled with the difference clearly documented.

When modelers start adding non-unique properties, the only way to differentiate them is to add meaning to their names.  This was covered in the previous section.
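A check for this rule is easy to automate. The following Python sketch assumes a deliberately simplified representation in which an entity is a dictionary from attribute label to the name of the observable (or entity) that the attribute refers to. That representation, the function `duplicate_observables`, and the sample entities are hypothetical and are not part of the FACE meta-model.

```python
from collections import Counter


def duplicate_observables(entity):
    """Return the observables that an entity's attributes refer to more than once."""
    counts = Counter(entity.values())
    return sorted(obs for obs, n in counts.items() if n > 1)


# Each characteristic refers to a different observable: unique.
person_ok = {"position": "Position", "name": "Identifier"}

# Two attributes refer to the same observable, so only their labels
# ("home", "work") tell them apart.
person_bad = {
    "homePosition": "Position",
    "workPosition": "Position",
    "name": "Identifier",
}

ok_result = duplicate_observables(person_ok)
bad_result = duplicate_observables(person_bad)
print(ok_result)
print(bad_result)
```

A non-empty result is a prompt to ask what the duplicated property is really a property of. In the second case, the positions arguably belong to a home and a workplace rather than to the person.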

Model for entity uniqueness.

An entity is characterized by its properties.  How then are two entities with the same set of properties differentiated from one another?  Once again, we look to the labels to discern a difference.

Per this sample model, a car and a truck are effectively the same entity.  Therefore, at the level captured by the model, there is no difference.

Clearly there are differences in cars and trucks, but this is not entirely captured by this set of properties.  Specifically, how would an El Camino be characterized?  This is why we have introduced the concept of a basis entity in Edition 3.0.  This allows modelers to declare a singular unique property that then makes the entity unique.

As with the other uniqueness constraint, the problem with this liberty is not when we have one or two duplicate entities or even when there are distinctly different entities that happen to have similar representations. The problem occurs as new entities intended to represent the same thing are added. Once this occurs, the ability to mechanize integration is reduced and it is necessary to reengage a human-in-the-loop to provide the discrimination between the similar entities.

By requiring uniqueness in the characteristics, these problems can largely be avoided.  It does not prevent bad modeling, but it limits the ability of entities to grow in an unconstrained manner.
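Entity uniqueness can be checked in the same spirit. In this Python sketch an entity is reduced to the set of observables its characteristics refer to, and entities with identical sets are reported as indistinguishable. The property names, the sample entities, and the function `indistinguishable_entities` are assumptions for illustration only.

```python
from collections import defaultdict


def indistinguishable_entities(entities):
    """Group entity names whose sets of characteristics are identical."""
    groups = defaultdict(list)
    for name, observables in entities.items():
        groups[frozenset(observables)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


entities = {
    "Car": {"Mass", "Speed", "Position"},
    "Truck": {"Position", "Mass", "Speed"},
    "Aircraft": {"Mass", "Speed", "Position", "Altitude"},
}

clashes = indistinguishable_entities(entities)
print(clashes)
```

Here `Car` and `Truck` are reported together because, at the level captured by the model, only their labels differ. Adding a distinguishing characteristic to one of them removes the clash.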

8. Conclusion

We simply have to look at past modeling efforts to see that actions taken to make modeling easy often lead to degradation of quality and eventual model abandonment. If we want to avoid the same fate, we must explicitly design the data model and the governing data architecture to take advantage of as many opportunities as possible for automated decision aids when constructing a model. These include robust Domain Models, tooling for maintenance and construction, and mechanisms to promote discovery and extension of existing models.

Finally, the modeling practices recommended herein are our opinions based on experience.  If you adopt any of these practices, you will produce better models.  Can you produce good models without these recommendations?  Absolutely, but if you do, please share your experience with others so we can continue to improve the quality of our data, our data models, and our ability to integrate complex systems-of-systems, at scale.

2 thoughts on “Data Modeling Is Hard vs. Data Modeling is Hard”

  • What are your thoughts on “refactoring” seeming to take center stage in the DevOps community and modeling distilled to DBA working with Liquibase/Maven/Jenkins etc… I see a limited context for making usual adjustments after a model is spit out by your favorite tool. What issues/concerns might you express in this regard? Thanks

    • This is an excellent question and points to one of the challenges with the traditional way we think about (and manage) data models. One of the ways we can measure our effectiveness is to look at the network effects of our data models. If we use a closed ecosystem, we can still achieve integration leverage with a data model, but it will be far less effective than if we are a part of an open ecosystem. The ability to trace a model to another is not enough. We need to track models as they evolve over time. This will allow us to keep our own models (and interfaces) interoperable for many versions to come. And, once a model achieves this level of traceability, it can be used with great leverage across the entire integration space.
