Announcements
  • Friday, April 23, 2004, 2:30pm, 4623 Wean Hall, Marti Hearst, UCB
  • Abstract: New methods and tools are needed to improve how bioscience researchers search for and synthesize information from the bioscience literature. Towards this end, we are building a flexible, efficient, platform-independent database system infrastructure specifically geared towards supporting the advanced and particular search needs of bioscience researchers. We are using this infrastructure to support the development and deployment of statistical approaches to natural language processing which identify entities and relations between them in the bioscience literature. The results of the text analysis will be accessed via an intuitive, appealing search user interface that will be developed using the appropriate human-centered design methods. The resulting system will support new ways of asking scientific questions, and new tools for assembling the pieces of bioscience puzzles.
  • Hardening Soft Databases (Cohen, Kautz, McAllester)
  • A familiar problem: clustering names
  • Given a group of strings that refer to some external entities, cluster them so that strings in the same cluster refer to the same entity.
  • Example name variants: “Cornell Univ.”; “Bart Selmann” / “Bart Selman” / “B. Selman”; “Critical behavior in satisfiability” / “Critical behavior for satisfiability”; “BLACKBOX theorem proving”
  • Assumption: identical strings from the same source are co-referent (one sense per discourse?)
  • A variation of this problem: clustering tuples from databases
  • Database S1 (extracted from paper 1’s title page); Database S2 (extracted from paper 2’s bibliography)
  • This gives some known matches, which might interact with proposed matches: e.g., here we deduce...
  • Example: end result of clustering: a “soft database” from IE becomes a “hard database” suitable for Oracle, MySQL, etc.
  • Overview of Hardening Paper
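As a toy illustration of the clustering task (not the paper's matching model), one could group such strings with a union-find over a surface-similarity score; the `difflib` similarity measure and the 0.85 threshold here are arbitrary assumptions:

```python
from difflib import SequenceMatcher

def cluster_names(names, threshold=0.85):
    """Cluster name strings whose surface similarity exceeds a threshold.

    Illustrative sketch only: the similarity measure and threshold are
    assumptions, not the paper's actual matching model.
    """
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sim = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if sim >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())

# Variants of the same name end up in one cluster; unrelated strings stay apart.
clusters = cluster_names(
    ["Bart Selman", "Bart Selmann", "B. Selman", "BLACKBOX theorem proving"])
```

Note that a pure surface-similarity pass misses abbreviations like “B. Selman”, which is part of why the paper treats matching evidence as weighted candidate arcs rather than hard decisions.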
  • Definition of “hardening”:
  • Find “interpretation” (maps variant->name) that produces a compact version of database S.
  • Probabilistic interpretation of hardening:
  • Original “soft” data S is version of latent “hard” data H.
  • Hardening finds max likelihood H.
  • Hardness result:
  • Optimal hardening is NP-hard.
  • Greedy algorithm:
  • naive implementation is quadratic in |S|
  • clever data structures make it O(n log n), where n = |S|·d
  • Definition of Hardening
  • Soft DB contains tuples of references of the form R(r1,...,rk). (The max arity of relations is k.)
  • A reference is a string naming an entity, tagged with the source from which it came.
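One way to sketch this representation in code (the class and relation names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """A reference: a string naming an entity, tagged with its source."""
    name: str    # the surface string, e.g. "B. Selman"
    source: str  # where it was extracted from, e.g. "paper2-bibliography"

# A soft database: relation name -> set of tuples of references (arity <= k).
soft_db = {
    "author_of": {
        (Ref("Bart Selman", "paper1"),
         Ref("Critical behavior in satisfiability", "paper1")),
        (Ref("B. Selman", "paper2"),
         Ref("Critical behavior for satisfiability", "paper2")),
    },
}
```

Tagging each reference with its source is what lets the one-sense-per-source assumption identify identical strings from the same document.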
  • Definition of Hardening
  • Possible matches are given as a set of weighted arcs Ipot between references (think –log P).
  • We’d like to assume there are not too many of these: e.g., that at most d touch any reference.
  • (Figure: candidate match arcs with example weights 200, 0.01, 0.1)
  • Definition of Hardening
  • An interpretation is an acyclic subset I of Ipot such that for any r, there is at most one arc r->r’ in I.
  • “r’ is the hard version of r”.
  • Interpretation arcs are chained, and I(r) is the final interpretation of r (an “evolutionary model”).
  • (Figure: two candidate interpretations over references r0...r4, with total weights w(I)=7 and w(I)=4)
  • Definition of Hardening
  • If S is a soft database, I(S) is the hard version of S
  • Apply I to each tuple, discard duplicates
  • w(I) is sum of weights of arcs in I.
  • Cost of I is linear combination of |I|, |I(S)|, w(I)
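The definitions above fit in a few lines of code. A minimal sketch, with references simplified to plain strings and arbitrary illustrative coefficient values:

```python
def resolve(I, r):
    """Follow interpretation arcs r -> r' to r's final (hard) version.

    Assumes I is well-formed: acyclic, at most one outgoing arc per reference.
    """
    while r in I:
        r = I[r]
    return r

def harden(S, I):
    """I(S): apply the interpretation to every tuple, then discard duplicates."""
    return {tuple(resolve(I, r) for r in t) for t in S}

def cost(S, I, w, a=1.0, b=1.0):
    """c(I) = a*|I(S)| + b*|I| + w(I): the linear combination from the slides.

    The coefficients a, b are illustrative choices, not values from the paper.
    """
    return a * len(harden(S, I)) + b * len(I) + sum(w[(r, rp)] for r, rp in I.items())

# Two soft tuples that collapse into one hard tuple under I.
S = {("Bart Selman", "Critical behavior in satisfiability"),
     ("B. Selman", "Critical behavior for satisfiability")}
I = {"B. Selman": "Bart Selman",
     "Critical behavior for satisfiability": "Critical behavior in satisfiability"}
w = {("B. Selman", "Bart Selman"): 0.5,
     ("Critical behavior for satisfiability",
      "Critical behavior in satisfiability"): 0.5}
```

Here the empty interpretation costs a·2 (two distinct hard tuples), while I pays for two arcs and their weights but shrinks I(S) to a single tuple.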
  • Definition of Hardening
  • I (well-formed) subset of Ipot
  • Interpretation arcs are chained
  • I(S) is the hardened version of S
  • Goal of hardening: given S and Ipot, find the interpretation I minimizing c(I) = a·|I(S)| + b·|I| + w(I)
  • Probabilistic motivation: hardening is ‘almost’ finding the ML (maximum-likelihood) hard DB
  • Assume joint Pr(H,I,S) over hard db H, “corruption process” I, resulting soft db S
  • Natural goal: pick H,I to max Pr(H,I|S)
  • Assume a particular generative model:
  • U is fixed set of (hard, real-world) entities => fixed number of possible hard tuples N
  • H is generated by picking possible hard tuples uniformly at random, stopping with fixed prob eH
  • Probabilistic motivation
  • Assume:
  • I is generated by picking possible arcs, with probability exponential in weight w, stopping with fixed prob eI
  • (Also assume we discard an ill-formed I at the end of the process, i.e., renormalize by the probability of well-formedness.)
  • Probabilistic motivation
  • Assume:
  • Pr(H,I,S) = Pr(S|I,H) * Pr(I) * Pr(H)
  • S is generated by picking tuples at random, until we decide to stop, from I⁻¹(H)
  • Probabilistic motivation
  • pick H,I to max Pr(H,I|S)
  • ok to pick H,I to max Pr(H,I,S), since Pr(H,I|S) ∝ Pr(H,I,S)
  • given S,I the best H is H=I(S), so ok to pick I to max Pr(I(S), I, S), or min c’(I) = –lg Pr(I(S),I,S)
  • (c’(I) contains a nonlinear term)
  • Why not use probabilistic inference?
  • Optimal hardening is NP-hard:
  • Proof sketch: by reduction from vertex cover (subset of vertices of G that covers all edges).
  • Each edge (x,y) => soft tuple R(e); Ipot contains e->x, e->y
  • A well-formed I is a vertex cover; the minimal one is NP-hard to find (so “well-formed” is significant!)
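The reduction can be spelled out concretely. A sketch with illustrative identifiers: each graph edge yields one soft tuple over a fresh edge-reference, a well-formed interpretation maps each edge-reference to one endpoint, and the distinct chosen names form a vertex cover:

```python
def hardening_instance(edges):
    """Build the soft DB and candidate arcs for a graph (sketch of the reduction)."""
    S, Ipot = [], []
    for i, (x, y) in enumerate(edges):
        e = f"e{i}"               # one fresh edge-reference per edge
        S.append(("R", e))        # soft tuple R(e)
        Ipot += [(e, x), (e, y)]  # e may be interpreted as either endpoint
    return S, Ipot

def cover_of(I):
    """The vertex set induced by a well-formed interpretation."""
    return set(I.values())

# Triangle graph: any well-formed I covers every edge; a minimum cover has size 2.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
S, Ipot = hardening_instance(edges)
I = {"e0": "b", "e1": "b", "e2": "a"}  # choose one candidate arc per edge-reference
```

Minimizing |I(S)| then amounts to minimizing the cover size, which is exactly the NP-hard part.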
  • Fast Greedy Hardening
  • Simple greedy algorithm:
  • start with empty I
  • for t=1,....,T
  • add to I the single edge in Ipot that reduces cost most
  • Intuitions:
  • similar to (fast) greedy agglomerative clustering
  • perform low-risk merges first
  • reduction in database size generally means similar names appear in many similar contexts
  • merge to balance similarity of merged items and reduction in database size
  • Problem: naive implementation is O(|S|²)
  • value of an arc r->r’ changes over the course of the algorithm
  • Updating priorities: main idea
  • Key idea: track tuples that will be collapsed together as the result of incorporating an arc.
  • Summary of Hardening Paper
  • Hardening:
  • given S, find low-cost, small, latent H and I
  • new inferences possible in H
  • evolutionary model of I
  • minimizing a|H|+b|I|+w(I) is “almost like” MAP H,I
  • optimal hardening is NP-hard.
  • Greedy algorithm:
  • naive implementation is quadratic in |S|
  • possible in O(n k³ log n), where n = |S|·k·d
  • Further work...
  • Experimental validation?
  • Other generative models?
  • Type constraints?
  • “Real” Bayesian MAP inference?
  • MCMC approach, similar to greedy search