I have collected sports data from over 40 different sources. I now need to match the games across sources, so that all of the information for a given game ends up in one list. This turns out to be very difficult, as there are many spellings for team names, leagues and countries.
The data is a list of associations, each holding four values that are either a string or Missing[]. The structure of one game is as follows:
<|"awayTeamLookup" -> String || Missing[], "homeTeamLookup" -> String || Missing[], "countryLookup" -> String || Missing[],
"leaugeLookup" -> String || Missing[]|>
I cleaned and standardized the strings with this function:
teamName === Missing[],
A larger sample (2000+ lines) of the data can be found here in .wl format.
Here is a small sample:
I have tried several approaches, to no avail. There are thousands of games at a time, so matching models won't work.
- FindClusters[[[All,"awayTeamLookup"]]]: this only clusters on the away team, and the clusters are not that good.
- Find the source with the most games and then run
Nearest[dataToGroup[[All,"awayTeamLookup"]],#]&/@longestSource[[All,"awayTeamLookup"]]. This also only matches on the away team. I guess I should do the same on the other columns and then find an algorithm to combine the results? The algorithm must be smart enough that if the away team doesn't match but the home team does, the two games are still put together. Likewise with the league and the country.
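To make the "combine the columns" idea above concrete, here is a rough sketch of what I have in mind, not a working solution. It assumes a normalized EditDistance per field, skips any field that is Missing[] on either side, averages the rest, and links games whose combined distance falls below a cutoff. The sample records, the helper names fieldDistance and gameDistance, and the threshold value 0.35 are all made up for illustration.

```wolfram
(* Hypothetical sample records; only the key names come from the real data *)
games = {
   <|"awayTeamLookup" -> "man utd", "homeTeamLookup" -> "arsenal",
     "countryLookup" -> "england", "leaugeLookup" -> "premier league"|>,
   <|"awayTeamLookup" -> "manchester united", "homeTeamLookup" -> "arsenal",
     "countryLookup" -> "england", "leaugeLookup" -> Missing[]|>,
   <|"awayTeamLookup" -> "real madrid", "homeTeamLookup" -> "barcelona",
     "countryLookup" -> "spain", "leaugeLookup" -> "la liga"|>};

keys = {"awayTeamLookup", "homeTeamLookup", "countryLookup", "leaugeLookup"};

(* Edit distance scaled to [0, 1]; anything that is not a pair of strings is skipped *)
fieldDistance[a_String, b_String] :=
  N[EditDistance[a, b]/Max[StringLength[a], StringLength[b], 1]];
fieldDistance[_, _] := Missing[];

(* Average over the fields present in both games, so one poor field cannot veto a match *)
gameDistance[g1_, g2_] :=
  With[{d = DeleteMissing[fieldDistance[g1[#], g2[#]] & /@ keys]},
   If[d === {}, 1., Mean[d]]];

(* Assumed cutoff; it would need tuning against the real data *)
threshold = 0.35;

(* Link every pair of games below the cutoff, then read the groups off the graph *)
pairs = Select[Subsets[Range[Length[games]], {2}],
   gameDistance[games[[#[[1]]]], games[[#[[2]]]]] < threshold &];
groups = ConnectedComponents[
   Graph[Range[Length[games]], UndirectedEdge @@@ pairs]];
```

With these made-up records, the first two games should land in one group even though their away-team spellings differ a lot, because the home team and country agree. The obvious drawback is that comparing every pair is quadratic in the number of games, so for thousands of games it would probably need some pre-grouping first (for example by country) before the pairwise step.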