9

I am collecting a large number of URLs. I am not responsible for the websites in question, and I want to remove tracking parameters that do not affect the content of the website. With the tracking parameters, it's impossible to identify two URLs that should be considered equal.

For example, if I have the following links:

  1. http://example.com/blog/post1?utm_xyz=1234
  2. http://example.com/blog/post1?utm_xyz=5678
  3. http://example.net/viewblog?post_id=2&utm_xyz=9999

I want to convert them to the equivalent canonical URLs:

  1. http://example.com/blog/post1
  2. http://example.com/blog/post1
  3. http://example.net/viewblog?post_id=2

The first two are for the same content, but have different tracking parameters. The last example illustrates why I can't just remove all query parameters.

The most common of these are the utm_ ones, but I have also found:

  • Piwik: pk_campaign and pk_kwd
  • WebTrends: WT.nav, WT.mc_id
  • unknown, maybe Apple: campaign_id
  • Wikimedia: wprov
  • HootSuite: hootPostID

Is there a well-known list of these query parameters that I can safely remove?

(I am using the canonical URLs where they are supplied in the HTML metadata, but I want to use this approach when none is supplied.)

MrWhite
Joe

2 Answers

5

This is part of a RewriteCond I use to deduplicate URLs for more efficient caching: utm_(?:source|medium|campaign|term|content)|gclid|fbclid|msclkid|emci|emdi|ceid|sourceid|hootPostID|__s

You can find more parameters in the Brave source code and in the Firefox query stripping service.
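Outside of Apache, the same pattern can be applied to the query string of each collected URL. The following is only a minimal Python sketch: the function name strip_tracking is hypothetical, and matching the whole parameter name (rather than a substring) is my assumption about the intended behaviour.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameter names taken from the RewriteCond pattern above.
# Assumption: the whole name must match, so "my_gclid" is left alone.
TRACKING_NAME = re.compile(
    r"utm_(?:source|medium|campaign|term|content)"
    r"|gclid|fbclid|msclkid|emci|emdi|ceid|sourceid|hootPostID|__s"
)

def strip_tracking(url: str) -> str:
    """Return url with known tracking parameters removed from the query."""
    parts = urlsplit(url)
    # Keep every (name, value) pair whose name is not a tracking parameter.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not TRACKING_NAME.fullmatch(name)
    ]
    # Rebuild the URL; an empty query drops the "?" entirely.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Note that urlencode re-encodes the remaining values, so the output may differ cosmetically from the input (for example, a space is written as "+"). That is usually acceptable for deduplication, but it does mean the result is not always a byte-for-byte substring of the original URL.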

drzraf
1

I guess your intention is to clean the scraped URLs.

You can refer to articles on best practices for using UTM parameters. Commonly used values for utm_medium follow the naming conventions used in Google Analytics, such as social, referral, and email.

At the end of the day, there is no good way to do this with a fixed list of keywords, because the parameters can be anything.

You will have a better chance of sanitising your results by using regex to detect and remove any UTM parameters.

For a URL like https://example.com?utm_source=facebook&utm_medium=social&utm_campaign=book-launch-2014 you need to find the following parameters and replace them with nothing:

  • utm_source
  • utm_medium
  • utm_campaign
  • utm_term
  • utm_content
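As a rough illustration of the regex approach, the sketch below removes any query parameter whose name starts with utm_, not just the five listed. The name remove_utm and the exact pattern are my own assumptions, and it works on the raw URL string rather than parsing it, so treat it as a starting point only.

```python
import re

# Matches a "utm_*" parameter (with optional value) that directly follows
# "?" or "&", plus the "&" separator after it if there is one.
UTM_PARAM = re.compile(r"(?<=[?&])utm_[^=&#]*(?:=[^&#]*)?&?")

# Matches a "?" or "&" left dangling at the end of the query string.
DANGLING = re.compile(r"[?&](?=#|$)")

def remove_utm(url: str) -> str:
    """Strip every utm_* query parameter from url."""
    cleaned = UTM_PARAM.sub("", url)
    return DANGLING.sub("", cleaned)

print(remove_utm(
    "https://example.com?utm_source=facebook&utm_medium=social"
    "&utm_campaign=book-launch-2014"
))
# prints https://example.com
```

Other query parameters are preserved, so http://example.net/viewblog?post_id=2&utm_xyz=9999 becomes http://example.net/viewblog?post_id=2.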
Stephen Ostermiller
Scott