Tableau Prep Tip – Getting Group And Replace Right At One Go For Enormous Data Sets

Group and Replace feature in Tableau Prep is  one of my favorite as it can standardize fields using many of the built-in algorithms.  But it has its caveats;

  1. Algorithms are applied only to the sampled data and not to the entire data set
    • Prep flow stores a look up table of replacements  based on sampled data and applies that look up table to the entire data set when running the flow
    • This is justifiable as  grouping  applied to unseen combinations can be a bad thing. For example,  we do not want  Prep’s common character grouping algorithm to group words with same characters like ‘God’ and ‘Dog’ together.
  2. It is extremely difficult to get the sampling right for huge data sets.
    • Default sampling limit in Prep is 1 million rows, but it may or may not be enough depending on the size and variety of  the  data set.
    • There is a great post on Tableau’s blog about improving the sampling efficiency
  3. Multiple iterations are required to get the group and replace right due to above issues

But how can we do this effectively without sampling all data or performing multiple iterations?  I will provide the steps  below, but here is the data set to clean if you would like to attempt the challenge yourselves.

 

Row IDOrder IDOrder DateProduct IDCategorySales
1AB-12345624/7/2016SS -5489Office Supplies9134.00
2AB-12345721/7/2016SS -3305Furniture9500.00
3AB-1234588/7/2016SS -6994Technology9686.00
4AB-12345930/7/2016SS -3707Furniture - Chairs3123.00
5AB-1234607/7/2016SS -2278Office Supplies - Binders4452.00
6AB-12346117/7/2016SS -1868Technology - Machine3735.00
7AB-12346227/7/2016SS -8368Furniture-Book Cases5325.00
8AB-1234636/7/2016SS -9690Technology-Copiers5544.00
9AB-12346429/6/2016SS -1166Technology-Machines5812.00
10AB-12346528/6/2016SS -1239Office Supplies - Paper2563.00

 

Rules:

  1. Final data set should group the existing categories in to 3 categories(Office Supplies, Furniture and Technology).
  2. You are allowed to sample only up to 50% the data- up to 5 rows in this case  (This is to simulate real life scenarios where the data size does not allow to sample entire data set)
  3. You are not allowed to manually add a category that doesn’t exists in the data sample. (This wont be practical in large data sets with millions of rows)

Have you tried it? How did it go? Please share your solution in comments. I am sharing  the steps of my solution below:

  1.  Open the data file and set the sampling size to one. (Yes, we only need one row)
  2. Create an aggregate step and drag category to grouped fields.  Write it to a csv by creating an output step. This will create a file with all unique categories in the original file
  3. Union the above category file with the flow and perform group and replace as normal for category field
  4. Create an output step after group and replace and delete all unnecessary steps
  5. Run the Flow

Please refer to below screen grab for detailed illustration. (You can click on the image to open it in a new window).

Group And Replace in One Go

Group And Replace in One Go

 

Please note that multiple columns can be added to the aggregation in step 2 if you are planning to apply group and replace to multiple fields.

You may also like...

1 Response

  1. April 25, 2019

    […] the last post, I shared a tip to to ensure  guaranteed grouping on all existing data in a data set. But what if your flow is scheduled to run every week or every month? Will the new data break your […]

Leave a Reply

Your email address will not be published.