Recently a good friend of mine posted his thoughts on data subsetting. As you would expect from Kyle, it’s a well-reasoned discussion (well, till he went down the path of using Delphix 😉 ) but I wanted to add a few extra comments. It’ll probably take longer than a quick comment on his post, so let me add my 2 banana skins worth here then link to this discussion back on his post.
So Why Subset?
Well, there are a number of reasons for why you might want to subset. Let’s look at each of those in no particular order of importance:
- Insufficient storage volume to clone to another environment – this is the reason most people think of straight away. Your production database simply uses more storage space than you have available in the environment you want to copy it to. In today’s world, both data volumes and the number of applications continue to increase, and there’s no sign of that letting up. The end result is that there’s a lot more data that people want to copy to a non-Production environment (for whatever reason that may be – some of the reasons are covered in the rest of this list). If you don’t have the storage space to copy the data, then you simply have to subset – there is no other option. That, of course, leads to another discussion on HOW to subset. More on that later too.
- Even if you HAVE the storage to be able to copy your Production environment, you may still not WANT to. For example, copying a multi-terabyte database (which is not an uncommon requirement nowadays!) may simply take too long compared to copying a subset. And with many sites now taking MULTIPLE copies of their Production environment, this issue is only exacerbated. Just this morning, for example, I was talking to a customer who regularly copied their production environment to 7 or 8 non-Production environments. Without subsetting, running the copies and making sure they all worked was consuming most of a DBA’s time each week.
- Building a functional test environment – if you are building a functional test environment on a regular basis to test new releases of applications against, for the same storage and time reasons just discussed you may want to subset the data as that is done. At the same time, you would also want to mask any sensitive data as you don’t want that available outside your Production environment. More again on this point later. NOTE: There is one issue to be aware of here, and that’s the concept of edge cases.
- Building a training environment – when you are training staff on how to use an application, the training becomes much more useful if the database contains realistic data. However, you normally don’t need anywhere near Production data volumes in your training environment, and it will generally perform better if it only contains a fraction of the Production data. Of course, once again you need to ensure that you have masked any sensitive data from your Production environment.
- Proofs of concept – whether you are testing a new idea around an existing application, or trying to determine whether a new vendor’s application will suit your business requirements, it is again best to test that with meaningful data, such as a subset of your Production data.
- When you need to clone the same data multiple times – there are a variety of reasons why you would want to clone the same data multiple times. These might include:
- where you have multiple developers building distinct sections of an application, and you want them to each have their own data sets
- when you are using an agile development methodology and want to test incremental application changes quickly
and so on. If you do this sort of operation frequently enough, you simply may not have the time to clone a full copy of your Production data. As a general guideline, if you are cloning the same data set multiple times, it is wise to look at subsetting the data.
So Why NOT To Subset?
OK, so there are a bunch of reasons why you would want to look at subsetting your data when cloning to another environment. Isn’t it beneficial all the time then?
Well, no. There are at least two important scenarios where you would want to take a full clone if at all possible.
Load / Performance Testing
Call it what you will, the most accurate way of testing the performance of an application under load is to have a full-sized, realistic data set to test against. Those two terms (“full-sized” and “realistic”) are both important and need to be satisfied AT THE SAME TIME. Let me explain why.
Those of us who have been around databases for any length of time will almost certainly have run up against the scenario where an application functions perfectly well in Development, but then runs like a dog (and a three-legged one at that) in Production. The reason is almost always that Development data sets are generally much smaller than Production data sets, so there is little need to worry about performance tuning issues in Development. Data volumes tend to be small enough that the lack of important indexes – indexes that might be needed in Production – can be missed, simply because full table scans across small data sets perform sufficiently well. So you need to have a Production-size data set to work with.
Having said that, just having a Production size data set simply isn’t enough. There are loads of tools out there in the marketplace today that can be used to generate a data set of a specific size. But the nature of the data has to be similar between the two environments as well. If it is not, any particular query may result in two completely different execution plans, so you’ll be comparing apples with oranges while trying to validate performance.
When you are masking the data, particularly in the case where you are moving data to an environment that will be used for performance testing, you need to ensure the tool that you use to mask the data doesn’t change the basic characteristics of the data. Let’s take the example of a social security number for US data sets. Social security numbers have a specific format in the US – XXX-YY-ZZZZ – where each of those characters is a digit. If you replace a 9-digit number with a one- or two-digit number across the board, the performance characteristics of retrieving those numbers will be completely different. So the data you end up with in your performance testing environment has to be REALISTIC, as well as being an equivalent volume of data.
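To make the point concrete, here’s a minimal sketch of format-preserving masking in Python. The function name, the keyed-hash approach, and the secret are all my own illustrative assumptions – this is not how any particular masking tool works internally, just a demonstration that a masked value can keep the original 3-2-4 digit shape:

```python
import hashlib

def mask_ssn(ssn: str, secret: str = "masking-key") -> str:
    """Replace a US SSN with a fake value in the same XXX-YY-ZZZZ shape.

    Illustrative only: a keyed hash gives a deterministic mapping, so
    the same input always masks to the same output.
    """
    digits = ssn.replace("-", "")
    assert len(digits) == 9 and digits.isdigit(), "expected a 9-digit SSN"
    # Derive 9 pseudo-random digits from a keyed hash of the original.
    h = hashlib.sha256((secret + digits).encode()).hexdigest()
    masked = str(int(h, 16) % 10**9).zfill(9)
    return f"{masked[:3]}-{masked[3:5]}-{masked[5:]}"
```

Because the output is still a 9-digit value in the same format, index sizes and query behaviour against the masked column stay comparable to Production.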
The other scenario where it helps to have a full data set is when you need to test edge cases. Edge cases are also known as boundary conditions and a variety of other terms. If you’re not familiar with the term, here’s a simple example I’ve lifted straight out of the pages of a book called Software Testing, by Ron Patton. He uses a simple BASIC program to demonstrate a boundary condition:
Rem Create a 10 element integer array
Rem Initialize each element to -1
Dim data(10) As Integer
Dim i As Integer
For i = 1 To 10
    data(i) = -1
Next i
The idea of this code is to create a 10-element array and initialize each element of the array to -1. The problem with the code is that the first element of the array is at index 0, not 1. Since our For loop starts at 1, we never initialize data(0). The array values look like this:
data(0) = 0
data(1) = -1
data(2) = -1
data(3) = -1
data(4) = -1
data(5) = -1
data(6) = -1
data(7) = -1
data(8) = -1
data(9) = -1
data(10) = -1
As you can see, the value for data(0) is just not what we wanted. So why is subsetting a problem here? Why would you prefer to have a full data set? Well, even looking at that small data set above, you can see that the edge case only affects less than 10% of the data. With more realistic data volumes, you can imagine that edge cases often affect a very, very small proportion of the data. Subsetting the data means the chances of hitting those edge cases decreases dramatically.
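The effect of subsetting on edge cases is easy to demonstrate with a quick simulation (the row count, the number of edge-case rows, and the 1% sample size are all made up for illustration):

```python
import random

# 1,000,000 "rows", of which only 50 carry a rare edge-case flag.
random.seed(42)
rows = [{"id": i, "edge_case": i < 50} for i in range(1_000_000)]

# A 1% random subset, as a naive subsetting strategy might take.
subset = random.sample(rows, k=10_000)
hits = sum(r["edge_case"] for r in subset)
print(f"edge-case rows in subset: {hits} of 50")
```

On average a 1% sample catches about half of one edge-case row out of the 50, so most subsets built this way never exercise the boundary condition at all.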
Before we go onto how to subset your data, let’s look more at something that’s necessary whether you subset or not. Whenever you copy data from your Production environment, you need to ensure that data is secured. There are dangers in simply copying data from Production to a non-Production environment. By its very nature, you simply don’t want to have sensitive data being copied from one environment to another without some form of data obfuscation (a fancy word for masking). This could be simply for privacy reasons, or for more complex security and auditing concerns.
Before attempting to mask data, you first need to define what sensitive data is in your environment. Oracle’s Data Masking Pack provides comprehensive data discovery capabilities by allowing security administrators to define data search patterns, such as 15- or 16-digit credit card numbers, 9-digit formatted US social security numbers, or UK national insurance numbers, to automatically discover sensitive data. The search results are ranked based on how closely they match the search patterns, and security administrators can then designate a column as sensitive for inclusion in the data masking process.
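In the same spirit as that discovery step, a toy version of pattern-based scanning might look like this in Python. The patterns, the 50% scoring threshold, and the sample column data are all invented for the example; the Data Masking Pack’s actual pattern syntax and ranking are different:

```python
import re

# Hypothetical search patterns for two kinds of sensitive data.
PATTERNS = {
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "credit_card": re.compile(r"^\d{15,16}$"),
}

def score_column(values, pattern):
    """Fraction of non-null sampled values matching the pattern (0.0-1.0)."""
    values = [v for v in values if v]
    if not values:
        return 0.0
    return sum(bool(pattern.match(v)) for v in values) / len(values)

# Sampled column data (made up for the example).
columns = {
    "EMP_SSN": ["123-45-6789", "987-65-4321", None],
    "NOTES": ["call back Monday", "123-45-6789 on file"],
}

for name, sample in columns.items():
    for label, pat in PATTERNS.items():
        s = score_column(sample, pat)
        if s > 0.5:
            print(f"{name}: likely {label} (score {s:.0%})")
```

The scoring step is what lets a tool rank candidates rather than flag every column that happens to contain a stray digit string.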
However, defining and identifying sensitive data to mask is only the first part of the solution. The next and equally challenging task is to preserve referential integrity of the data after masking. The Data Masking Pack automatically detects data dependencies such as foreign key constraints and ensures that referential integrity is maintained during masking. For example, if a sensitive column such as employee number is a primary key in a table relationship, then all associated tables containing dependent columns will be automatically included in the masking process, so that the masked value is consistent across the related tables, thus enforcing referential integrity.
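One simple way to see how consistent masking preserves referential integrity is a deterministic key-mapping sketch. The table and column names here are invented, and this is not the Data Masking Pack’s algorithm – just the underlying idea that one original key must always map to one masked key:

```python
# A shared mapping ensures a key masked in one table gets the same
# masked value wherever it appears as a foreign key elsewhere.
mapping = {}
next_id = [1000]  # next surrogate key to hand out

def mask_key(value):
    """Map each original key to a stable surrogate value."""
    if value not in mapping:
        mapping[value] = next_id[0]
        next_id[0] += 1
    return mapping[value]

employees = [{"emp_no": 7369, "name": "SMITH"}, {"emp_no": 7499, "name": "ALLEN"}]
salaries = [{"emp_no": 7369, "amount": 800}, {"emp_no": 7369, "amount": 850}]

masked_emps = [{**r, "emp_no": mask_key(r["emp_no"])} for r in employees]
masked_sals = [{**r, "emp_no": mask_key(r["emp_no"])} for r in salaries]

# Referential integrity holds: every masked salary row still joins
# to exactly one masked employee row.
assert {r["emp_no"] for r in masked_sals} <= {r["emp_no"] for r in masked_emps}
```

If each table were masked independently with its own random values, the join between EMPLOYEES and SALARIES would silently break – which is exactly the failure mode the pack guards against.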
How to Subset
OK, so once you have made the decision that you need to subset your data, how do you actually go about doing it? Before Oracle released the Test Data Management Pack with Enterprise Manager, most people resorted to some form of manual process. One of the main problems with a manual process (other than the fact you have to build and maintain it yourself, of course!) is knowing just what data to select. Creating referentially intact subsets of production data for modern enterprise applications is a daunting task for any organization, even one with highly skilled DBAs. These enterprise applications are incredibly complex, spanning multiple schemas containing thousands of tables governed by a myriad of business rules. The difficulty lies in the large and often complex data models that govern the relationships between the columns of those tables.
Oracle’s Test Data Management Pack automatically discovers these relationships and can store them within an entity called an Application Data Model. The pack also ships with pre-defined drivers that capture the data relationships for Oracle Applications (such as Oracle Fusion Applications and Oracle eBusiness Suite Applications) directly from the application meta-data tables. As part of the discovery process, the Application Data Model also captures key table attributes, such as table types. Table type information can help guide the subsetting process by indicating which tables are candidates for subsetting. For example, transaction tables usually contain a large amount of data, making them ideal candidates for data subsetting. A lookup or reference data table is usually small and may therefore be included in its entirety without requiring subset operations.
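A manual version of what the Application Data Model automates can be sketched as a walk up the foreign-key chain. The schema, the FOREIGN_KEYS map, and the subset function below are all invented for illustration – real tools discover these relationships from constraints and application metadata rather than a hand-written dictionary:

```python
# child table -> (child column, parent table, parent column)
FOREIGN_KEYS = {
    "ORDER_LINES": ("order_id", "ORDERS", "order_id"),
    "ORDERS": ("customer_id", "CUSTOMERS", "customer_id"),
}

tables = {
    "CUSTOMERS": [{"customer_id": 1}, {"customer_id": 2}],
    "ORDERS": [{"order_id": 10, "customer_id": 1},
               {"order_id": 11, "customer_id": 2}],
    "ORDER_LINES": [{"line_id": 100, "order_id": 10},
                    {"line_id": 101, "order_id": 11}],
}

def subset(driving_table, keep_rows):
    """Collect keep_rows plus every parent row they reference,
    so the extracted data stays referentially intact."""
    result = {driving_table: list(keep_rows)}
    table = driving_table
    while table in FOREIGN_KEYS:
        child_col, parent, parent_col = FOREIGN_KEYS[table]
        needed = {r[child_col] for r in result[table]}
        result[parent] = [r for r in tables[parent] if r[parent_col] in needed]
        table = parent
    return result

# Keep one order line; its order and customer come along automatically.
sub = subset("ORDER_LINES", [tables["ORDER_LINES"][0]])
```

Even this toy version shows why hand-rolling the process gets daunting: with thousands of tables the dependency graph is far too large to maintain by hand.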
Once an Application Data Model is defined, an administrator can define different types of subset operations. For example, subsets may be time-based, e.g., extracting one fiscal year out of all the fiscal year data, or other dimensions, such as geographical region or product lines, or alternatively they may be derived by application, e.g. extract General Ledger data from an ERP application suite.
Given the limited storage available to application developers, DBAs face the problem of identifying, in advance, subset criteria that will result in a database that fits within the storage constraints of a developer’s system. Test Data Management allows the administrator to define the subset criteria as a parameter without having to specify the exact value. The administrator can then use the subset estimation function to determine the expected size of the test database for different values of the subset parameters before executing the subset process. In addition, the subset criteria also support percentage-based database creation, by which a defined percentage of the data is randomly sampled and extracted to create the test database.
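The idea of trying out parameter values before running the extract can be sketched like this. The row counts, byte sizes, and function names are all invented, and this is not the pack’s API – just the estimate-then-extract workflow in miniature:

```python
import random

random.seed(7)
# Fake transaction table: 500,000 rows spread over fiscal years 2018-2022,
# each row costed at a nominal 120 bytes.
rows = [{"fiscal_year": random.choice(range(2018, 2023)), "bytes": 120}
        for _ in range(500_000)]

def estimate_subset_size(rows, fiscal_year):
    """Estimated extracted bytes for one candidate parameter value."""
    return sum(r["bytes"] for r in rows if r["fiscal_year"] == fiscal_year)

def percentage_sample(rows, pct):
    """Randomly sample pct% of the rows."""
    k = int(len(rows) * pct / 100)
    return random.sample(rows, k)

# Try a few parameter values before committing to the extract.
for fy in (2020, 2021):
    print(f"FY{fy}: ~{estimate_subset_size(rows, fy) / 1_000_000:.1f} MB")
print(f"10% sample: {len(percentage_sample(rows, 10))} rows")
```

The point is simply that the expensive extract runs once, against a parameter value you have already sized.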
Test Data Management provides multiple options for creating test databases from production. In the subset-at-source option, Oracle Test Data Management uses an innovative approach to create a portable Oracle Data Pump file containing the reduced-size data set directly from production, without requiring a production copy. This file can then be transported to any test system and imported to create the test database, significantly reducing storage costs. In the clone-and-subset option, IT administrators can use the database cloning facility in Oracle Enterprise Manager to copy production data into a test database, either using RMAN-based live cloning or by recreating a test database from a production backup. Administrators can then execute the subset operation on this production copy to create a reduced-size database.
So there you are – some reasons why you would or would not subset, and some of the associated issues you need to consider!