I have a task to extract product information from a set of websites for price analysis. The product group I want to harvest data for is well defined: I could easily provide a list of all the product names and brands I want price data for. My current strategy is to write a scraper specific to each website I want to monitor (based on XPath, CSS queries, regex, or Scrapy) and to change it whenever it starts failing (for example, after a change to the site's interface).
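To illustrate the current approach, here is a minimal sketch of what one of these site-specific scrapers boils down to. The domains and patterns are made up for the example (the real ones use XPath/CSS selectors per site), but the weakness is the same: each rule is tied to one site's current markup.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules; each pattern only works for one site's
# current HTML and has to be rewritten when that site changes its layout.
SITE_PATTERNS = {
    "shop-a.example": re.compile(r'<span id="price">\s*\$([\d,]+\.\d{2})'),
    "shop-b.example": re.compile(r'<div class="cost"[^>]*>\s*USD\s*([\d,]+\.\d{2})'),
}

def extract_price_site_specific(url, html):
    """Return the price as a float, or None if the site's rule no longer matches."""
    host = urlparse(url).hostname
    pattern = SITE_PATTERNS.get(host)
    if pattern is None:
        raise KeyError(f"no scraper written for {host}")
    match = pattern.search(html)
    if match is None:
        return None  # the markup changed, so the rule silently stops working
    return float(match.group(1).replace(",", ""))
```

As soon as a site renames the `id` or class the rule depends on, the function returns None and the scraper has to be fixed by hand.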
I'd like an algorithm that is more robust than this and can find the prices of the selected products on any site, regardless of its structure. Is there a machine learning technique that fits my needs and that I could implement without having to write a doctoral thesis on it?
Please assume that I already have a list of URLs of the product pages, so the problem is only to extract the same information from many different predefined pages (crawling the web in search of product pages and differentiating them from other pages is not part of the problem).
Thanks in advance!
I have a list of products I need to follow. For example, a list of dictionaries:
[{'name': 'Samsung Galaxy S5', 'manufacturer': 'Samsung'},
 {'name': 'Samsung Galaxy S4', 'manufacturer': 'Samsung'},
 {'name': 'Samsung Galaxy S6', 'manufacturer': 'Samsung'},
 {'name': 'iPhone 5s', 'manufacturer': 'Apple'},
 {'name': 'iPhone 6s', 'manufacturer': 'Apple'}]
I'd like to be able to find their prices on different sites such as Amazon, TigerDirect, BestBuy, Frys, etc., using an adaptive algorithm or machine learning model that can make sense of the text of each product page and find the product's price.
input: product page url
output: product price
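To make the interface concrete, here is a rough sketch of the function I have in mind. The names `product_price` and `candidate_prices` are placeholders of mine, and the regex is only a naive structure-independent baseline (it assumes prices are written with a $, € or £ sign). Picking the first candidate is clearly not good enough, since a page usually shows several amounts (old price, shipping, accessories); choosing the right one is the part I'd like a smarter technique for.

```python
import re
from urllib.request import urlopen

# Naive assumption: a price is a currency sign followed by digits,
# optional thousands separators, and optional cents.
PRICE_RE = re.compile(r'[$€£]\s?(\d+(?:,\d{3})*(?:\.\d{2})?)')

def candidate_prices(html):
    """Return every price-looking amount in the page text, in document order."""
    text = re.sub(r'<[^>]+>', ' ', html)  # crude tag stripping, not a real HTML parser
    return [float(m.replace(',', '')) for m in PRICE_RE.findall(text)]

def product_price(url):
    """input: product page URL; output: product price (or None if nothing is found)."""
    with urlopen(url) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    candidates = candidate_prices(html)
    # Placeholder decision rule: take the first candidate.
    return candidates[0] if candidates else None
```

This works on any site without per-site rules, but it has no notion of which amount belongs to the product itself.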