Data Standardization: The Crux of a Functioning Relational Database

Imagine you are at your computer early in the morning, coffee in hand, running a query for all the crosses made in the past five years with genotype AG12ca5. The query returns 20 crosses. You know that can’t be right; there should be upward of 50. So you run the query again, only to get the same result.

This is where the nuances of data standardization become critical. After some digging and a few additional queries on other characteristics, you find that AG12ca5 was entered into the database three different ways (AG12ca5, AG12cz5, AG12C5), whether intentionally or not. Your query matched only one of those spellings, so the crosses recorded under the other two never showed up.
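One way to surface this kind of problem is to compare the identifiers already in the database against each other and flag the ones that look suspiciously alike. The sketch below uses Python's standard difflib module for that. The function name near_duplicates, the 0.8 threshold and the extra identifier BX77ab1 are my own assumptions for illustration, not features of any particular breeding software.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(identifiers, threshold=0.8):
    """Return (a, b, ratio) for each pair of distinct identifiers
    whose case-insensitive similarity ratio meets the threshold.

    The threshold of 0.8 is an assumed starting point; tune it on real data.
    """
    unique = sorted(set(identifiers))
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


# Identifiers as they might appear in a crosses table (BX77ab1 is hypothetical).
recorded = ["AG12ca5", "AG12cz5", "AG12C5", "BX77ab1", "AG12ca5"]

for a, b, ratio in near_duplicates(recorded):
    print(f"Possible duplicate: {a} / {b} (similarity {ratio})")
```

A report like this does not decide anything on its own. Two similar names may be two legitimately different genotypes, so a person who knows the material still has to review each flagged pair.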

If you don’t standardize certain points of data, a functioning relational database is simply not possible. I like to say “garbage in equals garbage out” when we are talking about the integrity of databases, and a relational database is dependent on quality data.

It takes time, planning and consensus up front to work through the details. How should we notate trait names? Is it yield or grain yield? Is it recorded in kilograms per hectare or bushels per acre? It’s not uncommon for a team to take several days to settle on the naming convention they want to use; it really is an investment in the future.
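Once a team has agreed on its conventions, they can be written down as a controlled vocabulary that software can enforce. Here is a minimal sketch of the idea. The trait names, units and synonyms are invented for illustration; every team will settle on its own.

```python
# Hypothetical controlled vocabulary: one canonical name and one unit per trait.
CANONICAL_TRAITS = {
    "grain_yield": {"unit": "kg/ha", "synonyms": {"yield", "grain yield", "yld"}},
    "plant_height": {"unit": "cm", "synonyms": {"height", "plant height", "ht"}},
}


def _normalize(name):
    """Lower-case, treat underscores as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("_", " ").split())


# Map every accepted spelling (canonical or synonym) to its canonical name.
_LOOKUP = {}
for canonical, spec in CANONICAL_TRAITS.items():
    _LOOKUP[_normalize(canonical)] = canonical
    for synonym in spec["synonyms"]:
        _LOOKUP[_normalize(synonym)] = canonical


def standardize_trait(name):
    """Return the canonical trait name, or None if the name is not recognized."""
    return _LOOKUP.get(_normalize(name))


for raw in ["Yield", " Grain  Yield ", "HT", "protein"]:
    print(repr(raw), "->", standardize_trait(raw))
```

Returning None for an unrecognized name, rather than guessing, forces someone to decide whether it is a new trait or a new spelling of an old one.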

Once the naming conventions have been determined and there’s consensus among those working in the database, it’s important to have a system of checks and balances to maintain the integrity of the database. For instance, a breeder may be trying to enter AG12ca5 but accidentally hits the 4 key instead of the 5. The database should alert the breeder that this is a “new” genotype and ask whether that is really the intent. At that point, the breeder can correct the mistake.
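The check described above can be sketched in a few lines. This is only an illustration of the idea, assuming the known genotypes are available as a simple collection; the function name check_genotype_entry, the 0.8 cutoff and the identifier BX77ab1 are hypothetical, and real breeding software will handle this in its own way.

```python
from difflib import get_close_matches


def check_genotype_entry(entry, known_genotypes, cutoff=0.8):
    """Classify an entry as 'known' or 'new' and suggest similar known names.

    The cutoff of 0.8 is an assumed value; a lower cutoff suggests more names.
    """
    if entry in known_genotypes:
        return {"status": "known", "suggestions": []}
    suggestions = get_close_matches(entry, sorted(known_genotypes), n=3, cutoff=cutoff)
    return {"status": "new", "suggestions": suggestions}


known = {"AG12ca5", "BX77ab1"}  # BX77ab1 is a made-up second genotype

result = check_genotype_entry("AG12ca4", known)
if result["status"] == "new":
    print("'AG12ca4' is a new genotype. Is that what you intended?")
    if result["suggestions"]:
        print("Similar existing genotypes:", ", ".join(result["suggestions"]))
```

The function only reports; whether to create the new genotype stays with the person entering the data, which is the point of asking.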

It’s a way of validating the information, and this type of feature is essential to keeping a relational database functioning well. A rigorous detection system must be part of the software. It is simply not optional.

I hear stories all the time from breeders who are new to a plant breeding program and are sorting through 15 years of data spread across different spreadsheets, in different formats, with trait names that have changed over time. These are the types of situations that keep plant breeders up at night; they truly are nightmares to deal with and make sense of.

Consistency is key. Consistency from breeder to breeder, consistency from year to year and consistency within crop species. As companies do more work in more locations, the need for uniformity increases.

Every piece of data coming into that system must be scrutinized and interrogated up front, or you risk losing that uniformity and, with it, your functionality. Yes, it can be tedious, but skipping these measures will cost you.

No matter the system, be diligent and take the time to standardize your data, and then work to keep it that way. Your breeders will be thankful you did, and so will your bottom line, thanks to increased efficiency, less duplication and the improved analytical and decision-making power you give your breeders.