Is machine learning viable for extracting product information from webpages?

Question

I have a task to extract product information from a certain set of websites for price analysis. The product group I'm trying to harvest data for is well defined; I could easily provide a set with all the product names and brands I want to obtain price data for. My current strategy is to write a scraper specific to each website I want to monitor (based on XPath/CSS queries/regex/Scrapy) and change it whenever it starts failing (because of any change to the interface).

I'd like a more robust algorithm, one that could find the prices of this selection of products on any site regardless of its structure. Is there any machine learning technique that fits my needs and that I could implement without having to write a doctoral thesis on it?

Please consider that I'd have a list of URLs of the product pages, so the problem is only extracting the same information from many different predefined pages (crawling the web in search of product pages and differentiating them from others is not part of the problem).

Thanks in advance!

I have a list of products I need to follow. For example, a list of dictionaries:

[{'Name': 'Samsung Galaxy S5', 'manufacturer': 'Samsung'},
 {'Name': 'Samsung Galaxy S4', 'manufacturer': 'Samsung'},
 {'Name': 'Samsung Galaxy S6', 'manufacturer': 'Samsung'},
 {'Name': 'Iphone 5s', 'manufacturer': 'Apple'},
 {'Name': 'Iphone 6s', 'manufacturer': 'Apple'}]

I'd like to be able to find their prices on different sites such as Amazon, TigerDirect, BestBuy, Frys, etc., using an adaptive algorithm or machine learning that could make sense of the text of each product page and find the price of the product.

input: product page url

output: product price

This seems too broad & only tangentially connected to ML to me. Can you make it more focused & concrete? I suspect this may end up getting closed. – gung - Reinstate Monica, May 13 '16 at 18:25
Maybe this is still too broad, but if that is the case I might need to go somewhere else to look for a hint on the best strategy to attack this problem. I'll wait for the judgement of the 'admins'... – Guarita, May 13 '16 at 18:49
@Guarita: If you want a "magic" extractor, check out: http://www.import.io – Alex R., May 13 '16 at 19:26
I tried one of the pages I'm targeting and it only returned images and their titles; it ignored the tables with data in them. Not really sure how to configure it. – Guarita, May 15 '16 at 12:08

score 6 · Accepted Answer · answered May 13 '16 at 18:25

Standard crawling and screen scraping is the way to go for such tasks (tools like Scrapy are extremely powerful if you learn their full capabilities). Very often, the information you are looking for is fairly easy to extract based on div classes and similar markup, even across different websites. Writing a sufficiently generic crawler that requires little tuning for new webpages should not take too long.
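To illustrate what extraction "based on div classes and so on" could look like without a per-site scraper, here is a minimal sketch using only the Python standard library. It assumes that many shops put the price inside an element whose class, id, or itemprop attribute contains the word "price"; that heuristic, the names PriceHintParser and extract_price, and the price regex are all hypothetical choices made for illustration, not something taken from Scrapy or from any particular site.

```python
import re
from html.parser import HTMLParser

# Illustrative pattern: "1,299.99" style amounts first, then plain "499.00" or "599".
PRICE_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

# Tags that never have a closing tag; they are skipped when tracking nesting depth.
VOID_TAGS = {"br", "img", "input", "meta", "hr", "link", "wbr", "source"}


class PriceHintParser(HTMLParser):
    """Collects numbers found inside elements whose attributes mention 'price'."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # > 0 while we are inside a price-like element
        self.candidates = []   # parsed prices, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1   # nested tag inside a price-like element
            return
        hints = " ".join(v or "" for k, v in attrs
                         if k in ("class", "id", "itemprop"))
        if "price" in hints.lower():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            match = PRICE_RE.search(data)
            if match:
                self.candidates.append(float(match.group().replace(",", "")))


def extract_price(html):
    """Return the first price-like number found, or None if there is none."""
    parser = PriceHintParser()
    parser.feed(html)
    return parser.candidates[0] if parser.candidates else None
```

Calling extract_price on the HTML of a product page returns a float or None. On real sites a heuristic like this would still need per-site checks (several prices on one page, prices rendered by JavaScript, different decimal conventions), which is the "little tuning" the answer refers to.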

Machine learning methods for such tasks are typically not fully automated either, or they make a lot of mistakes (which means you have a lot of work postprocessing anomalies manually). Overall, it would currently take you much longer to perform such tasks with machine learning. The first problem would be a lack of training sets, and if you want to construct those manually you end up writing screen scrapers anyway.

Marc Claesen

Hmmm, isn't an unsupervised learning strategy viable in this scenario? I already have some results from the scraper I wrote with Scrapy, but they're only applicable to a single website... – Guarita May 13 '16 at 18:35
It just takes way, way longer. You could use unsupervised learning to analyse / cluster the data that your scraper obtained, though. – FisherDisinformation May 13 '16 at 18:42
@ArtificialBreeze Hmmm, I think I might need to reduce this problem a little more by doing a first filtering pass with a web scraping technique, then... – Guarita May 13 '16 at 18:54
Although I still don't have a proper solution to my problem, I believe this is a valid answer to my question. – Guarita May 17 '16 at 15:04