Tableau Prep Tip – Getting Group And Replace Right At One Go For Enormous Data Sets
Group and Replace feature in Tableau Prep is one of my favorite as it can standardize fields using many of the built-in algorithms. But it has its caveats;
- Algorithms are applied only to the sampled data and not to the entire data set
- Prep flow stores a look up table of replacements based on sampled data and applies that look up table to the entire data set when running the flow
- This is justifiable as grouping applied to unseen combinations can be a bad thing. For example, we do not want Prep’s common character grouping algorithm to group words with same characters like ‘God’ and ‘Dog’ together.
- It is extremely difficult to get the sampling right for huge data sets.
- Default sampling limit in Prep is 1 million rows, but it may or may not be enough depending on the size and variety of the data set.
- There is a great post on Tableau’s blog about improving the sampling efficiency
- Multiple iterations are required to get the group and replace right due to above issues
But how can we do this effectively without sampling all data or performing multiple iterations? I will provide the steps below, but here is the data set to clean if you would like to attempt the challenge yourselves.
|Row ID||Order ID||Order Date||Product ID||Category||Sales|
|1||AB-123456||24/7/2016||SS -5489||Office Supplies||9134.00|
|4||AB-123459||30/7/2016||SS -3707||Furniture - Chairs||3123.00|
|5||AB-123460||7/7/2016||SS -2278||Office Supplies - Binders||4452.00|
|6||AB-123461||17/7/2016||SS -1868||Technology - Machine||3735.00|
|7||AB-123462||27/7/2016||SS -8368||Furniture-Book Cases||5325.00|
|10||AB-123465||28/6/2016||SS -1239||Office Supplies - Paper||2563.00|
- Final data set should group the existing categories in to 3 categories(Office Supplies, Furniture and Technology).
- You are allowed to sample only up to 50% the data- up to 5 rows in this case (This is to simulate real life scenarios where the data size does not allow to sample entire data set)
- You are not allowed to manually add a category that doesn’t exists in the data sample. (This wont be practical in large data sets with millions of rows)
Have you tried it? How did it go? Please share your solution in comments. I am sharing the steps of my solution below:
- Open the data file and set the sampling size to one. (Yes, we only need one row)
- Create an aggregate step and drag category to grouped fields. Write it to a csv by creating an output step. This will create a file with all unique categories in the original file
- Union the above category file with the flow and perform group and replace as normal for category field
- Create an output step after group and replace and delete all unnecessary steps
- Run the Flow
Please refer to below screen grab for detailed illustration. (You can click on the image to open it in a new window).
Please note that multiple columns can be added to the aggregation in step 2 if you are planning to apply group and replace to multiple fields.