I am collecting a large number of URLs from websites that I do not control, and I want to remove tracking parameters that do not affect the content of the page. While the tracking parameters are present, I cannot tell when two URLs should be considered equal.
For example, if I have the following links:
http://example.com/blog/post1?utm_xyz=1234
http://example.com/blog/post1?utm_xyz=5678
http://example.net/viewblog?post_id=2&utm_xyz=9999
I want to convert them to the equivalent canonical-style URLs:
http://example.com/blog/post1
http://example.com/blog/post1
http://example.net/viewblog?post_id=2
The first two are for the same content, but have different tracking parameters. The last example illustrates why I can't just remove all query parameters.
The most common of these are the utm_ ones, but I have also found:
- Piwik: pk_campaign and pk_kwd
- WebTrends: WT.nav, WT.mc_id
- unknown, maybe Apple: campaign_id
- Wikimedia: wprov
- HootSuite: hootPostID
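For illustration, stripping just the parameters mentioned so far could be sketched in Python as follows. This is a minimal sketch assuming the standard-library `urllib.parse` module; the names `TRACKING_PREFIXES`, `TRACKING_PARAMS`, and `strip_tracking` are hypothetical, and the set covers only the parameters named in this question, not a complete or authoritative list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names. The entries are only the parameters named in this
# question; this is not an authoritative list of tracking parameters.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {
    "pk_campaign", "pk_kwd",   # Piwik
    "WT.nav", "WT.mc_id",      # WebTrends
    "campaign_id",             # unknown, maybe Apple
    "wprov",                   # Wikimedia
    "hootPostID",              # HootSuite
}


def strip_tracking(url):
    """Return url with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        # Exact, case-sensitive match on names; prefix match for utm_*.
        if key not in TRACKING_PARAMS and not key.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL; an empty query leaves no trailing "?".
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this sketch, the first two example URLs both reduce to http://example.com/blog/post1, and the third keeps post_id=2. Rebuilding the query with `urlencode` may change the percent-encoding of the remaining parameters, so the result is suitable for comparison but is not guaranteed to be byte-identical to the original.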
Is there a well-known list of these query parameters that I can safely remove?
(I am using the canonical URLs where they are supplied in the HTML metadata, but I want to use this approach when none is supplied.)
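The canonical-URL lookup mentioned here could look roughly like the following. This is only a sketch assuming Python's standard-library `html.parser` and a `<link rel="canonical" href="...">` element in the page; `CanonicalLinkParser` and `find_canonical` are hypothetical names, and the sketch does not resolve relative `href` values.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter; stop looking once one has been found.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel can hold several space-separated tokens, in any letter case.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html):
    """Return the canonical URL declared in html, or None if there is none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

When `find_canonical` returns `None`, the page supplies no canonical URL, which is the case where removing tracking parameters is needed instead.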