Tableau Prep Tip - Getting Group And Replace Right At One Go For Enormous Data Sets

Group and Replace feature in Tableau Prep is one of my favorite as it can standardize fields using many of the built-in algorithms. But it has its caveats;

Algorithms are applied only to the sampled data and not to the entire data set
- Prep flow stores a look up table of replacements based on sampled data and applies that look up table to the entire data set when running the flow
- This is justifiable as grouping applied to unseen combinations can be a bad thing. For example, we do not want Prep’s common character grouping algorithm to group words with same characters like ‘God’ and ‘Dog’ together.
It is extremely difficult to get the sampling right for huge data sets.
- Default sampling limit in Prep is 1 million rows, but it may or may not be enough depending on the size and variety of the data set.
- There is a great post on Tableau’s blog about improving the sampling efficiency
Multiple iterations are required to get the group and replace right due to above issues

But how can we do this effectively without sampling all data or performing multiple iterations? I will provide the steps below, but here is the data set to clean if you would like to attempt the challenge yourselves.

Row ID	Order ID	Order Date	Product ID	Category	Sales
1	AB-123456	24/7/2016	SS -5489	Office Supplies	9134.00
2	AB-123457	21/7/2016	SS -3305	Furniture	9500.00
3	AB-123458	8/7/2016	SS -6994	Technology	9686.00
4	AB-123459	30/7/2016	SS -3707	Furniture - Chairs	3123.00
5	AB-123460	7/7/2016	SS -2278	Office Supplies - Binders	4452.00
6	AB-123461	17/7/2016	SS -1868	Technology - Machine	3735.00
7	AB-123462	27/7/2016	SS -8368	Furniture-Book Cases	5325.00
8	AB-123463	6/7/2016	SS -9690	Technology-Copiers	5544.00
9	AB-123464	29/6/2016	SS -1166	Technology-Machines	5812.00
10	AB-123465	28/6/2016	SS -1239	Office Supplies - Paper	2563.00

Rules:

Final data set should group the existing categories in to 3 categories(Office Supplies, Furniture and Technology).
You are allowed to sample only up to 50% the data- up to 5 rows in this case (This is to simulate real life scenarios where the data size does not allow to sample entire data set)
You are not allowed to manually add a category that doesn’t exists in the data sample. (This wont be practical in large data sets with millions of rows)

Have you tried it? How did it go? Please share your solution in comments. I am sharing the steps of my solution below:

Open the data file and set the sampling size to one. (Yes, we only need one row)
Create an aggregate step and drag category to grouped fields. Write it to a csv by creating an output step. This will create a file with all unique categories in the original file
Union the above category file with the flow and perform group and replace as normal for category field
Create an output step after group and replace and delete all unnecessary steps
Run the Flow

Please refer to below screen grab for detailed illustration. (You can click on the image to open it in a new window).

Please note that multiple columns can be added to the aggregation in step 2 if you are planning to apply group and replace to multiple fields.

One thought on “Tableau Prep Tip – Getting Group And Replace Right At One Go For Enormous Data Sets”

Tableau Prep – Future Proofing Group and Replace – Techno Whisp says:
April 25, 2019 at 5:25 am
[…] the last post, I shared a tip to to ensure guaranteed grouping on all existing data in a data set. But what if your flow is scheduled to run every week or every month? Will the new data break your […]

Tableau Prep Tip – Getting Group And Replace Right At One Go For Enormous Data Sets

ByJose Cherian

By Jose Cherian

Related Post

Tableau Prep Python Integration – Installing and Setting Up TabPy

Tableau Prep – Aggregating Strings

Tableau Prep – Iterating Group And Replace

One thought on “Tableau Prep Tip – Getting Group And Replace Right At One Go For Enormous Data Sets”

Leave a Reply Cancel reply

Techno Whisp