I am interested in building a search capability over a large text corpus (for example, Australian Newspapers)
that can answer queries such as:
- which prime ministers have visited the Tumut district of NSW?
- who were the most prominent antagonists in the margarine quota discussions during the 1940s and 1950s?
- what poems by members of the Jindyworobak movement were published in newspapers?
Such an approach requires:
- fairly clean OCR
- a way to identify entities (such as people, organisations and places) and assign useful attributes to them (such as “Gough Whitlam is a Prime Minister”)
- an easy way for non-specialists to express such queries, or to iterate towards them
- ways to deal with ambiguity (For example, what are the boundaries of the “Tumut district” and have they changed? Is Harold Wilson a “Prime Minister” in this context? Does a poem written by a Jindyworobak member before they joined the movement count?)
I’m fairly confident about how the first two requirements can be met; I am most interested in how campers think the third and fourth requirements could be addressed.
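
As a starting point for discussion, here is a minimal Python sketch of one way the ambiguity in the “Prime Minister” example might be made explicit: attributes are stored with a jurisdiction and a date range, and the query exposes each ambiguous choice as a parameter that the searcher can toggle while iterating. Everything here is an assumption for illustration: the names `Role`, `Mention` and `people_with_role` are hypothetical, the article mentions are invented rather than taken from any real newspaper, and only the terms of office are real. Changing place boundaries are not handled; `place` is just a string.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Role:
    title: str          # e.g. "Prime Minister"
    jurisdiction: str   # e.g. "Australia"
    start: date         # first day in the role
    end: date           # last day in the role


@dataclass(frozen=True)
class Mention:
    article_id: str     # hypothetical identifier, not a real article
    person: str
    place: str
    published: date


# Terms of office are real; the structure itself is an assumption.
ROLES = {
    "Gough Whitlam": [
        Role("Prime Minister", "Australia", date(1972, 12, 5), date(1975, 11, 11)),
    ],
    "Harold Wilson": [
        Role("Prime Minister", "United Kingdom", date(1964, 10, 16), date(1970, 6, 19)),
        Role("Prime Minister", "United Kingdom", date(1974, 3, 4), date(1976, 4, 5)),
    ],
}

# Invented mentions, purely to exercise the query below.
MENTIONS = [
    Mention("demo-1", "Gough Whitlam", "Tumut", date(1973, 5, 1)),
    Mention("demo-2", "Gough Whitlam", "Tumut", date(1968, 3, 1)),
    Mention("demo-3", "Harold Wilson", "Tumut", date(1965, 7, 1)),
]


def people_with_role(mentions, title, place, jurisdiction=None, in_office_only=False):
    """Return people mentioned with `place` who have held the role `title`.

    jurisdiction:   if given, only roles in that jurisdiction count.
    in_office_only: if True, the person must have held the role on the
                    day the mentioning article was published.
    """
    found = set()
    for m in mentions:
        if m.place != place:
            continue
        for r in ROLES.get(m.person, []):
            if r.title != title:
                continue
            if jurisdiction is not None and r.jurisdiction != jurisdiction:
                continue
            if in_office_only and not (r.start <= m.published <= r.end):
                continue
            found.add(m.person)
    return sorted(found)


# Broad reading: anyone who was ever a Prime Minister anywhere.
broad = people_with_role(MENTIONS, "Prime Minister", "Tumut")
# Narrow reading: Australian Prime Ministers only.
narrow = people_with_role(MENTIONS, "Prime Minister", "Tumut", jurisdiction="Australia")
```

The design choice worth debating is whether an interface should surface switches like `jurisdiction` and `in_office_only` directly, or infer a default and let the searcher adjust it after seeing the results.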