regex - Information Extraction from a document, with not much training set


  • What I want to do: extract basic biographic information from a text document (relation extraction, to be specific).
  • Explanation: I have n text documents containing the biographies of n different people. I want to extract information corresponding to names, age, qualifications, affiliations, and interests.
  • What I am able to do: I have used Stanford NER to extract name, age, and organization in some cases. However, there are many false positives and false negatives, especially for the "organization" tag.
  • Why it is difficult: a biographic document contains only text associated with the person concerned. I can't use the other documents for training a classifier, as things are totally different for a different person. Yes, I can surely write rules. However, that restricts the domain considerably. For example, I wrote rules to extract qualifications, a simple one being: if one of the degrees (in a pre-specified dictionary) is present in a sentence, I extract the entities from that sentence and try to find the relation.
  • My question: is there a way of making this task automatic? Since I am analyzing one document at a time, please don't suggest bootstrapping-based approaches. I tried learning patterns by collecting specific sentences from each document and applying bootstrap-based algorithms such as Snowball, but failed miserably. I am aware that parsing might help me here, and I am trying to learn patterns from the dependency parse of specific sentences; however, I am not sure how to proceed with it. I also thought of applying distant supervision learning, but it requires a large dataset.
  • My personal take (till now): such a problem can be solved by rule-based approaches augmented with parsing-based methods. However, I have not yet been able to incorporate a probabilistic or statistical model that generalizes to different types of biographies.

PS: I want to change the latter sentence of my "personal take". Hence, I am seeking help.

An example:
A document containing the following text:
Tim obtained his PhD from Stanford University in 2010. He did his Bachelor (Hons) at the Massachusetts Institute of Technology in 2004. Currently, he is working at ABC company.

It should extract facts in the form: [entity1, relation, entity2]
Ex: [Tim, affiliation-phd, Stanford University],
[He (resolved to Tim), affiliation-bachelor(hons), Massachusetts Institute of Technology] and
[He (resolved to Tim), affiliation-works, ABC]
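
To make the dictionary-based rule from the question concrete, here is a minimal Python sketch of it. Everything in it is an assumption for illustration: the cue dictionary `RELATION_CUES`, the `ORG` pattern (a capitalized phrase after "from" or "at"), the function name `extract_facts`, and the sample sentences, which are modeled on the example above. The person's name is passed in by hand, on the assumption that each document is about a single person, so every pronoun resolves to that person.

```python
import re

# Hypothetical cue dictionary: relation label -> pattern that signals it.
RELATION_CUES = {
    "affiliation-phd": re.compile(r"\bph\.?d\b", re.IGNORECASE),
    "affiliation-bachelor(hons)": re.compile(r"\bbachelor\s*\(hons\)", re.IGNORECASE),
    "affiliation-works": re.compile(r"\bwork(?:s|ing)\b", re.IGNORECASE),
}

# Assumed organization pattern: a run of capitalized words after "from" or "at",
# optionally preceded by "the" and allowing an inner "of".
ORG = re.compile(r"\b(?:from|at)\s+(?:the\s+)?([A-Z]\w*(?:\s+(?:of\s+)?[A-Z]\w*)*)")


def extract_facts(text, person):
    """Return (person, relation, organization) triples, one sentence at a time."""
    facts = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        org = ORG.search(sentence)
        if org is None:
            continue  # no candidate organization in this sentence
        for relation, cue in RELATION_CUES.items():
            if cue.search(sentence):
                facts.append((person, relation, org.group(1)))
    return facts


SAMPLE = (
    "Tim obtained his PhD from Stanford University in 2010. "
    "He did his Bachelor (Hons) at the Massachusetts Institute of Technology in 2004. "
    "Currently, he is working at ABC company."
)

for fact in extract_facts(SAMPLE, "Tim"):
    print(fact)
```

This shows exactly the limitation described above: the rules only cover the phrasings that were anticipated, and any sentence that words the relation differently is silently skipped.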

An example would help. For instance, if the biography is structured, you can use awk or grep in a bash script. If you haven't considered that option, post an example for us to chew on.
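
As a rough illustration of the structured case, here is a Python equivalent of what a grep/awk one-liner would do. It assumes a purely hypothetical "Field: value" layout, one field per line; the function name `parse_structured_bio` and the sample record are made up for the sketch, and real biographies may look nothing like this.

```python
import re

# Assumed layout: "Field: value", one per line. Lines without a colon are ignored.
FIELD = re.compile(r"^\s*(?P<key>[A-Za-z ]+?)\s*:\s*(?P<value>.+?)\s*$")


def parse_structured_bio(text):
    """Collect "Field: value" lines into a dict keyed by the lowercased field name."""
    record = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            record[match.group("key").lower()] = match.group("value")
    return record


# Made-up sample record, only to show the call.
sample = "Name: Tim\nPhD: Stanford University, 2010\nEmployer: ABC company"
print(parse_structured_bio(sample))
```

If the documents are free-running prose, as in the question's example, this approach does not apply.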

Another option is to use Amazon Mechanical Turk or another human microtask tool. For relatively little money, you can have humans extract the information for you. Tools such as CrowdFlower provide statistical analysis of the results that takes into account the past performance of the workers. You can use redundancy and voting to further refine the results. I've used CrowdFlower in the past and have gotten good results. They've changed their business model so that it focuses on large accounts, so it may no longer be an option. I would start with Turk.
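
The redundancy-and-voting step can be sketched in a few lines of Python. This is a generic majority vote, not the API of Mechanical Turk or CrowdFlower; the function name `majority_vote` and the `min_agreement` threshold are assumptions chosen for illustration, and it does not weight workers by past performance.

```python
from collections import Counter


def majority_vote(answers, min_agreement=0.5):
    """Return the most common answer if its share exceeds min_agreement, else None.

    Answers are compared after trimming whitespace and lowercasing, so the
    returned value is the normalized form.
    """
    if not answers:
        return None
    normalized = [answer.strip().lower() for answer in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner if count / len(normalized) > min_agreement else None


# Three hypothetical workers answer "Where did Tim get his PhD?"
print(majority_vote(["Stanford University", "stanford university ", "MIT"]))
```

Items where the vote returns None can be sent out again to more workers or reviewed by hand.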

