I am interested in building a search capability on a large text corpus (such as Australian Newspapers)
to answer queries such as:

  • which prime ministers have visited the Tumut district of NSW?
  • who were the most prominent antagonists in the margarine quota discussions during the 1940s and 50s?
  • what poems by members of the Jindyworobak movement were published in newspapers?

Such an approach requires:

  1. fairly clean OCR
  2. a way to identify entities (such as people, organisations, places) and assign useful attributes to them (such as “Gough Whitlam is a Prime Minister”)
  3. an easy way for normal people to express such queries, or iterate towards them
  4. ways to deal with ambiguity (for example, what are the boundaries of the “Tumut district” and have they changed? Is Harold Wilson a “Prime Minister” in this context? Does a poem written by a Jindyworobak member before they joined the movement count?)
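
To make requirements 2 and 4 a little more concrete, here is a minimal sketch (not a proposed implementation) of entities whose attributes carry qualifiers such as country and years in office, so that a query can say which reading of “Prime Minister” it means. The names `Role`, `Entity` and `holders`, the field layout, and the three sample records are all illustrative assumptions; a real system would draw these from an entity-extraction step over the corpus.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    title: str    # e.g. "Prime Minister"
    country: str  # qualifier that resolves "Prime Minister of where?"
    start: int    # first year in office (kept for date-scoped queries)
    end: int      # last year in office


@dataclass
class Entity:
    name: str
    kind: str  # "person", "place" or "organisation"
    roles: List[Role] = field(default_factory=list)


def holders(entities, title, country=None):
    """Return names of people who held `title`, optionally limited to one country."""
    found = []
    for entity in entities:
        if entity.kind != "person":
            continue
        if any(
            role.title == title and (country is None or role.country == country)
            for role in entity.roles
        ):
            found.append(entity.name)
    return found


# Hypothetical hand-built records, standing in for extracted entities.
entities = [
    Entity("Gough Whitlam", "person", [Role("Prime Minister", "Australia", 1972, 1975)]),
    Entity("Harold Wilson", "person", [
        Role("Prime Minister", "United Kingdom", 1964, 1970),
        Role("Prime Minister", "United Kingdom", 1974, 1976),
    ]),
    Entity("Tumut", "place"),
]

print(holders(entities, "Prime Minister"))                       # the ambiguous reading
print(holders(entities, "Prime Minister", country="Australia"))  # the qualified reading
```

The unqualified call returns both men, while the qualified call returns only Whitlam. The point is that the ambiguity is left in the data and resolved at query time, which is where a user could be prompted to choose.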

I’m fairly confident about how the first two requirements can be met, so I am most interested in how campers think the third and fourth requirements could be addressed.

  1. I am interested in requirements 2 and 4. Entities are hard to identify when a person’s name is also a place name, so determining the type of entity automatically is difficult. The only way I can think of to deal with this problem is to go through the data manually.

