2

I am given data from an e-commerce website with features such as product_name, product_category, product_link, product_id, free_delivery (1 or 0), price, discount, avg_rating, number of reviews, search_rank, and date, where search_rank is the position of the product when a category webpage is opened.

I want to create a popularity_index based on the above-mentioned features.

My approach so far is to normalize the columns search_rank, number of reviews, and avg_rating, assign weights $a, b, c$ to them, and set popularity_index to the value $ax + by + cz$ within each category.
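A minimal sketch of that weighted-sum approach, assuming pandas, min-max normalization within each category, and placeholder column names and weights:

```python
import pandas as pd

def popularity_index(df, a=0.5, b=0.3, c=0.2):
    """Min-max normalize the three columns within each category and
    combine them with weights a, b, c. Column names and the default
    weights are illustrative assumptions."""
    def minmax(s):
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng else s * 0.0

    out = df.copy()
    grouped = out.groupby("product_category")
    # search_rank: lower is better, so invert after normalizing
    out["x"] = 1 - grouped["search_rank"].transform(minmax)
    out["y"] = grouped["number_of_reviews"].transform(minmax)
    out["z"] = grouped["avg_rating"].transform(minmax)
    out["popularity_index"] = a * out["x"] + b * out["y"] + c * out["z"]
    return out
```

With weights summing to 1, the resulting index stays in [0, 1] within each category, which makes cross-product comparisons easy to read.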

Is there a better way to do this? Are there common statistical techniques that I am missing?

Update from comments:

It is a single metric or an index which we can look at to compare two products based on those 3 variables. For example, a product with popularity_index 44.5 is way more popular than some product with popularity_index 1.5. Something on the lines of a socio-economic index or happiness index of countries based on various variables.

AmanArora
  • Can you give more details on what you would like the popularity index to achieve? Is it supposed to create a single metric which describes variation in those 3 variables? In that case I would look at using PCA on those variables. Or is it supposed to be used for predicting some other 4th variable? – Adam Kells Sep 15 '21 at 12:48
  • @AdamKells It is a single metric or an index which we can look at to compare two products based on those 3 variables. For example, a product with popularity_index 44.5 is way more popular than some product with popularity_index 1.5. Something on the lines of a socio-economic index or happiness index of countries based on various variables. – AmanArora Sep 16 '21 at 06:54
  • In this case, PCA is probably your best bet: it will find the linear combination of the features which has the maximal variance and is best able to distinguish high/low popularity. See [here](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues) for a classic answer explaining the method. – Adam Kells Sep 16 '21 at 08:19
  • Please do not give new information only in comments; edit your question to add it. We want posts to be self-contained, comments can be deleted, and in any case information in comments is not well organized. Also, many people do not read comments. – kjetil b halvorsen Sep 16 '21 at 14:38
  • That passive start, “I am given data”, is unclear. If the e-commerce website is giving you data, then an obvious popularity index is the number of times that the product has been purchased. Perhaps “I have scraped data” would be clearer. Meanwhile how have you chosen your default $a,b,c$? – Matt F. Sep 24 '21 at 13:19
  • I will just comment briefly that a PCA approach with the raw (or linearly transformed) data is probably NOT the appropriate approach here. This is because there is usually a non-trivial interaction term in these types of rating metrics. (This is elaborated a bit more in my answer below.) – Gregg H Sep 30 '21 at 18:02
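As an aside, the PCA approach suggested in the comments could be sketched as follows (using scikit-learn; the sign-orientation step, which flips the component so that larger scores mean more popular, is an assumption, not part of PCA itself):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_popularity(X):
    """Use the first principal component of the standardized features
    as a one-dimensional popularity score. Assumes avg_rating is the
    last column, used only to orient the sign of the component."""
    Z = StandardScaler().fit_transform(X)
    score = PCA(n_components=1).fit_transform(Z).ravel()
    # PCA components are sign-ambiguous; orient so the score correlates
    # positively with the (standardized) average rating
    if np.corrcoef(score, Z[:, -1])[0, 1] < 0:
        score = -score
    return score
```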

2 Answers

1

It really depends on how you define popular and what you want to do with the index. What is your optimization goal? In e-commerce, for example, a common definition of popularity is the revenue driven by the product, which incorporates both the number of times the item has been purchased and its price; revenue is then also the quantity you are trying to maximize.

Regarding your definition: it makes some sense, but it can have problems. For example, do you prefer a product with 100,000 ratings averaging 2 stars over a product with 49,999 ratings averaging 4 stars? This functional form may be too simple to capture a more complex preference, for example one that requires a minimum number of ratings and a minimum average rating, or one that uses other thresholds.
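A threshold rule like the one just described might look like this (the cutoff values are illustrative assumptions, not tuned):

```python
def passes_thresholds(num_ratings, avg_rating,
                      min_ratings=50, min_avg=3.5):
    """Gate a product before it can receive a high popularity score:
    it must clear both a minimum review count and a minimum average
    rating. The cutoffs here are arbitrary placeholders."""
    return num_ratings >= min_ratings and avg_rating >= min_avg
```

Under this rule the 100,000-ratings/2-star product fails the gate while the 49,999-ratings/4-star product passes, which matches the intuition in the example above.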

You can think about it as follows: suppose you manually label the popularity of each product and then train a model to learn the function; it is likely that nonlinear models will perform better, i.e. the learned function will be more complex than the one you suggested. Another problem with using the number of ratings is that old products can have an unfair advantage, which you may want to account for as well.

Finally, I think search rank can be problematic. It can cause the following feedback loop:

The algorithm puts items that it thinks are popular at high ranks -> your definition of popularity is based on the rank, which makes the algorithm think those items are even more popular.

Take, for example, a good product that the algorithm mistakenly judged to be bad and placed at a bad rank. The fact that this product succeeded even though it was in a bad position makes it a better, more popular product. So you want to understand a product's popularity regardless of its position. One way to address this is to normalize the popularity score by the relative effect of the rank.
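One simple way to implement this normalization is to deflate the score by an assumed position-bias curve; the logarithmic discount below is a common position-bias assumption (as in DCG), not something estimated from this data:

```python
import math

def rank_adjusted(popularity, search_rank):
    """Deflate a popularity score by an assumed position-bias factor so
    that items shown higher do not get credit for visibility alone.
    Uses a 1/log2(rank+1) discount, a standard but assumed bias curve."""
    bias = 1.0 / math.log2(search_rank + 1)
    return popularity / bias
```

A product that achieves the same raw popularity from rank 3 as another does from rank 1 ends up with a higher adjusted score, matching the intuition above.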

ofer-a
1

I am going to propose a very pragmatic protocol based on the following rationale:

  • a high number of high average ratings should be ranked very high
  • a low number of high average ratings probably should be ranked high
  • a low number of low average ratings probably should be ranked low
  • a high number of low average ratings probably should be ranked very low

Step 1: convert the average ratings to standardized values (z-scores): $z \in \mathbb{R}$
Step 2: convert the number of ratings to percentiles: $p \in (0,1)$.
Step 3: set the rating quality measure to $r = (e^z)^{1+p} \in (0,\infty)$
Step 4: if needed, convert $r$ to its relative rank $R$.
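A plain-Python sketch of Steps 1–3 (the percentile convention, (rank + 0.5)/n, is an assumption chosen to keep $p$ strictly inside $(0,1)$; Step 4 is just ranking the returned values):

```python
import math

def rating_quality(avg_ratings, num_ratings):
    """Step 1: z-score the average ratings; Step 2: convert review
    counts to percentiles in (0, 1); Step 3: combine as r = (e^z)^(1+p).
    Assumes at least two distinct average ratings, so sd > 0."""
    n = len(avg_ratings)
    mean = sum(avg_ratings) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in avg_ratings) / n)
    z = [(a - mean) / sd for a in avg_ratings]
    ordered = sorted(num_ratings)
    # (rank + 0.5) / n keeps every percentile strictly between 0 and 1
    p = [(ordered.index(c) + 0.5) / n for c in num_ratings]
    return [math.exp(zi) ** (1 + pi) for zi, pi in zip(z, p)]
```

Note how the exponent $1+p$ amplifies the z-score in the direction it already points: many ratings push a high average higher and a low average lower, which is exactly the four-way rationale in the bullets above.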

I have implemented similar rescaling strategies in other projects, but I do not have any references that I can easily provide (sorry about this).

Gregg H