
I want to analyze how members are using our debit cards. Monthly, I receive LARGE files of transactional data indicating the amount spent at a particular merchant. The challenge is that a merchant like Kroger, for instance is identified as Kroger on Elm St. or Kroger on Maple St. How can I identify top 10 merchants that customers are spending money with if they all appear so differently? I can guess companies and then manually Group them if they Contain (Kroger), but 1. this is a TON of data, and 2. I could be missing some since I'm guessing companies to group.
Hi Kathleen,
So...again, not perfect but better than what you've got. Unfortunately this will take some maintenance on someone's part going forward - but see if the concept here help with that to some degree in terms of process.
1) Attached is a packaged flow, modified from your original flow (didn't use your groupings).
2) Used two separate flows within the master flow then unioned them together prior to Output.
3) Essentially, the two columns labeled Cleaned from each of the above flows are doing two separate things depending on what was found within the structure of the data. The upper portion addresses the 'business name' which may actually contain a numeric to the business name - i.e., I saw lots of 365 Market values which is the business name. Does it address each and every instance of that particular business and related permutations? No, but it will be easier now to identify them. The lower one extracts out those business names between a numeric value and what would appear to be an address. The calculations use Regular Expression functions to extract out as much of the 'business name' as possible.
What I did above was to take the Output of what was done in Prep, created a new workbook in Desktop, then counted the results of the Clean field output. As you can see this instance for 365 Market it still has too many permutations to 'group' them automatically even after having extracted out the 'business name'.
Still what I'd do here is use this effort from Desktop to then go back into Prep and find all related values and then group them. So for example with the above, the common relationship for 365 Market is their address.
Going back to Prep now, click on the final Union Step, then then using the column for Cleaned, click on the search icon, search on 365 and then obviously select the values found in Desktop and group:
And name:
4) However, don't be too disappointed with this result, as these are going to be some of the exceptions. Quite a bit of the results were well grouped - i.e., through the use of the calculations.
5 Guys Burgers
7-Eleven Stores
Audible
...and so on. Again, I'd use the initial output from Desktop to refer back to Prep as to what needs some attention and help. Know this, that once you do this initial cleaning/grouping, generally you will only have the really irregular ones to deal with. Lastly, as mentioned you could use Prep's method of grouping by Pronunciation, by Spelling, or Common Characters.
The concern here being that some values would be grouped in that shouldn't be, and again would require human intervention/maintenance to ensure...for example, by invoking by Pronunciation we see on the far left side the results that some 40,000 values were grouped down into just 5200+ which is great...but then each group would need to be reviewed to ensure it belong...if that makes sense. For example, Amazon Mktplace picked up all these values, including some that should not be in the list. Just a matter of unchecking those values, then clicking Done:
Attached is the packaged flow and separately will attached the Desktop version as well. Note: I did the pronunciation grouping at the very end so both the attached flow and desktop versions should contain the results of that step. So, technically you should have even less to review and group thereafter.
Hope this helps point the way!
Best, Don
(Please, don't forget to click Select as Best or Upvote !)