I am struggling to find a good way to determine which table (if any) on a given web page is the summary table. Examples are the first table on the right-hand side of each of these pages:
- http://wikitravel.org/en/China#quickbar
- http://gameofthrones.wikia.com/wiki/Howland_Reed
- http://lionking.wikia.com/wiki/Simba
- https://en.wikipedia.org/wiki/James_Wong_Howe
- https://es.wikipedia.org/wiki/Markdown
- http://dreamworks.wikia.com/wiki/Trolls
As you might infer, these come from Wikia sites and similarly templated wikis, but that will not always be the case. Also, only the structure of the table, not its content, is likely to be shared across the datasets.
What would be the best approach to this problem?
- Is it classification? I would create features manually, such as is_table, is_aside, width, text_to_tag_ratio, etc., and treat this as binary classification. If classification sounds best, what is the best way to create features (manually, or with an unsupervised method such as clustering)?
- Is it a structure-finding problem, where I use a CRF (conditional random field) to identify the structure?
- Is it clustering? Is clustering on web page attributes the way to go?
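To make the classification option concrete, here is a minimal sketch of the kind of per-table features I have in mind. It uses only Python's standard html.parser module. The function name extract_table_features and the feature names (n_rows, n_links, n_images and so on) are illustrative assumptions of mine, not an established method, and text_to_tag_ratio is computed here simply as visible text length divided by the number of tags inside the table.

```python
from html.parser import HTMLParser


class TableFeatureExtractor(HTMLParser):
    """Collect simple structural features for each top-level <table>.

    Nested tables are folded into their outermost table. The feature
    set is a hypothetical starting point, not a validated one.
    """

    def __init__(self):
        super().__init__()
        self.tables = []      # finished feature dicts, in document order
        self._depth = 0       # current nesting depth of <table> tags
        self._current = None  # features of the table being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                attr_map = dict(attrs)
                self._current = {
                    "css_class": attr_map.get("class") or "",
                    "has_width_attr": "width" in attr_map,
                    "n_tags": 0,
                    "n_rows": 0,
                    "n_links": 0,
                    "n_images": 0,
                    "text_len": 0,
                }
        if self._current is not None:
            # Count every tag inside the table, including <table> itself.
            self._current["n_tags"] += 1
            if tag == "tr":
                self._current["n_rows"] += 1
            elif tag == "a":
                self._current["n_links"] += 1
            elif tag == "img":
                self._current["n_images"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                cur = self._current
                # n_tags is at least 1 because the <table> tag is counted.
                cur["text_to_tag_ratio"] = cur["text_len"] / cur["n_tags"]
                self.tables.append(cur)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text_len"] += len(data.strip())


def extract_table_features(html):
    """Return one feature dict per top-level table in the HTML string."""
    parser = TableFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    sample_html = (
        '<table class="infobox"><tr><th>Name</th><td>Simba</td></tr></table>'
        "<table><tr><td><a href='#'>x</a></td></tr></table>"
    )
    for features in extract_table_features(sample_html):
        print(features)
```

Each dict could then become one row of a feature matrix, with a hand-assigned label saying whether that table is the summary table, which is what the binary classifier would train on.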
I would really like at least a pointer toward one approach instead of pursuing all three (or more) of the above. Thanks in advance :)