Group and Replace feature in Tableau Prep is one of my favorite as it can standardize fields using many of the built-in algorithms. But it has its caveats;Algorithms are applied only to the sampled data and not to the entire data setPrep flow stores a look up table of replacements based on sampled data and applies that look up table to the entire data set when running the flowThis is justifiable as grouping applied to unseen combinations can be a bad thing. For example, we do not want Prep’s common character grouping algorithm to group words with same characters like ‘God’ and ‘Dog’ together.It is extremely difficult to get the sampling right for huge data sets.Default sampling limit in Prep is 1 million rows, but it may or may not be enough depending on the size and variety of the data set.There is a great post on Tableau’s blog about improving the sampling efficiencyMultiple iterations are required to get the group and replace right due to above issuesBut how can we do this effectively without sampling all data or performing multiple iterations? I will provide the steps below, but here is the data set to clean if you would like to attempt the challenge yourselves. Row IDOrder IDOrder DateProduct IDCategorySales1AB-12345624/7/2016SS -5489Office Supplies9134.002AB-12345721/7/2016SS -3305Furniture9500.003AB-1234588/7/2016SS -6994Technology9686.004AB-12345930/7/2016SS -3707Furniture - Chairs3123.005AB-1234607/7/2016SS -2278Office Supplies - Binders4452.006AB-12346117/7/2016SS -1868Technology - Machine3735.007AB-12346227/7/2016SS -8368Furniture-Book Cases5325.008AB-1234636/7/2016SS -9690Technology-Copiers5544.009AB-12346429/6/2016SS -1166Technology-Machines5812.0010AB-12346528/6/2016SS -1239Office Supplies - Paper2563.00 Rules:Final data set should group the existing categories in to 3 categories(Office Supplies, Furniture and Technology).You are allowed to sample only up to 50% the data- up to 5 rows in this case (This is to simulate real life scenarios where the data size does not allow to sample entire data set)You are not allowed to manually add a category that doesn’t exists in the data sample. (This wont be practical in large data sets with millions of rows)Have you tried it? How did it go? Please share your solution in comments. I am sharing the steps of my solution below: Open the data file and set the sampling size to one. (Yes, we only need one row)Create an aggregate step and drag category to grouped fields. Write it to a csv by creating an output step. This will create a file with all unique categories in the original fileUnion the above category file with the flow and perform group and replace as normal for category fieldCreate an output step after group and replace and delete all unnecessary stepsRun the FlowPlease refer to below screen grab for detailed illustration. (You can click on the image to open it in a new window).Group And Replace in One Go Please note that multiple columns can be added to the aggregation in step 2 if you are planning to apply group and replace to multiple fields. Post navigationTableau Prep – Connecting to REST API Tableau Prep – Future Proofing Group and Replace
[…] the last post, I shared a tip to to ensure guaranteed grouping on all existing data in a data set. But what if your flow is scheduled to run every week or every month? Will the new data break your […]Reply