Paper notes: "Learning to Link with Wikipedia"
Goal: The paper explains how the topics mentioned in unstructured text can be automatically recognized and linked to the appropriate Wikipedia articles that explain them.
Approach (illustrated with an example): The paper illustrates the task with a somewhat dated news story about Iranian prisoners of war left in Iraq after the first Gulf War, which has been automatically augmented using the authors' techniques with links to pertinent topics such as the International Committee of the Red Cross and Baghdad. This process is known as wikification. The approach differs from previous attempts in that Wikipedia is used not only as a source of information to point to, but also as training data for how best to create links. This gives large improvements in both recall and precision.
Two main steps: link disambiguation and link detection.
1. Link disambiguation
Commonness and Relatedness
1. Commonness: the commonness of a sense is defined by the number of times it is used as the destination of that term's links in Wikipedia.
2. Relatedness: the most common sense is not always the intended one, so the algorithm identifies these cases by comparing each possible sense with its surrounding context. This is a cyclic problem, because the context terms may themselves be ambiguous.
Relatedness between two articles is measured from the overlap of their incoming links:

relatedness(a, b) = ( log(max(|A|, |B|)) - log(|A ∩ B|) ) / ( log(|W|) - log(min(|A|, |B|)) )

where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.
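As a rough illustration, here is a minimal Python sketch of both statistics. It assumes anchor counts and incoming-link sets have already been extracted from a Wikipedia dump; the data structures and names are illustrative, not the authors' code, and the final 1 - x step maps the distance-like quotient onto a 0..1 similarity scale.

```python
import math

# Assumed pre-extracted statistics (illustrative, not the authors' code):
#   anchor_counts["tree"]["Tree (data structure)"] = number of times the anchor
#     text "tree" links to that article in Wikipedia
#   in_links[article] = set of articles that link to it

def commonness(anchor_counts, anchor, sense):
    """Fraction of the anchor's Wikipedia links that point to this sense."""
    counts = anchor_counts[anchor]
    return counts.get(sense, 0) / sum(counts.values())

def relatedness(in_links, total_articles, a, b):
    """Link-based relatedness of articles a and b (the formula above),
    mapped onto a 0..1 similarity scale for convenience."""
    A, B = in_links[a], in_links[b]
    overlap = len(A & B)
    if not A or not B or overlap == 0:
        return 0.0  # no shared in-links: treat as unrelated
    num = math.log(max(len(A), len(B))) - math.log(overlap)
    den = math.log(total_articles) - math.log(min(len(A), len(B)))
    return max(0.0, 1.0 - num / den) if den > 0 else 1.0
```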
Some context terms are better than others
We can determine how closely a context term relates to the document's central thread by calculating its average semantic relatedness to all other context terms, using the measure described previously.
These two variables, link probability and relatedness, are averaged to provide a weight for each context term. This weight is then used when calculating the weighted average relatedness of a candidate sense to the context articles.
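A sketch of this weighting scheme, reusing the relatedness function above; term_link_prob is assumed bookkeeping that maps each context article to the link probability of the term it came from.

```python
def context_weights(context_articles, term_link_prob, in_links, total_articles):
    """Weight each context article by averaging the link probability of its term
    with its mean relatedness to the other context articles."""
    weights = {}
    for c in context_articles:
        others = [o for o in context_articles if o != c]
        avg_rel = (sum(relatedness(in_links, total_articles, c, o) for o in others)
                   / len(others)) if others else 0.0
        weights[c] = (term_link_prob[c] + avg_rel) / 2.0
    return weights

def weighted_relatedness(sense, weights, in_links, total_articles):
    """Weighted average relatedness of a candidate sense to the context articles."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * relatedness(in_links, total_articles, sense, c)
               for c, w in weights.items()) / total
```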
Combining the features
To balance commonness and relatedness, we take into account how good the context is. If it is plentiful and homogeneous, then relatedness becomes very telling. In the paper's Figure 2, for example, the most common sense of tree is entirely irrelevant because the document is clearly about computer science. However, if tree is found in a general document with ambiguous or confused context, then the most common sense should be chosen; by definition, this will be correct in most cases. Thus the final feature, context quality, is given by the sum of the weights that were previously assigned to each context term. This takes into account the number of terms involved, the extent to which they relate to each other, and how often they are used as Wikipedia links.
The resulting disambiguation classifier considers each sense independently and produces a probability that it is valid.
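Putting the three disambiguation features together, a minimal sketch of how such a classifier could be wired up; the scikit-learn random forest is an illustrative stand-in for the paper's machine-learned classifier, and the helpers come from the sketches above.

```python
from sklearn.ensemble import RandomForestClassifier

def sense_feature_vector(anchor, sense, weights, anchor_counts, in_links, total_articles):
    """The three disambiguation features described in the text."""
    return [
        commonness(anchor_counts, anchor, sense),                        # prior probability
        weighted_relatedness(sense, weights, in_links, total_articles),  # fit with context
        sum(weights.values()),                                           # context quality
    ]

disambiguator = RandomForestClassifier()
# Training data would come from Wikipedia itself: for every linked anchor,
# the sense actually chosen by the article's author is a positive example
# and the other candidate senses are negatives.
# disambiguator.fit(X, y)
# disambiguator.predict_proba(X_new)[:, 1]  -> probability that a sense is valid
```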
2. Link detection
Much better results are gained by using link probability as only one feature among many. The link-detection classifier is trained on Wikipedia articles; features of the candidate topics, and of the places where they are mentioned, are used to inform it about which topics should and should not be linked:
Link Probability:
Mihalcea and Csomai's link probability: the link probability of a phrase is defined as the number of Wikipedia articles that use it as an anchor, divided by the number of articles that mention it at all.
The link probabilities of the phrases that refer to a candidate topic are combined into two separate features: the average and the maximum.
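A small sketch of these two features, assuming the phrase counts have been pre-extracted (names are illustrative).

```python
def link_probability(anchor_article_count, mention_count, phrase):
    """Articles that use the phrase as a link anchor, divided by the
    articles that mention it at all."""
    mentions = mention_count.get(phrase, 0)
    return anchor_article_count.get(phrase, 0) / mentions if mentions else 0.0

def link_prob_features(phrases_for_topic, anchor_article_count, mention_count):
    """Average and maximum link probability over the phrases referring to one topic."""
    probs = [link_probability(anchor_article_count, mention_count, p)
             for p in phrases_for_topic]
    return sum(probs) / len(probs), max(probs)
```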
Relatedness:
One would expect topics that relate to the central thread of the document to be more likely to be linked. This gives a second feature: the average relatedness between each topic and all of the other candidates.
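This feature can be sketched directly on top of the earlier relatedness function.

```python
def candidate_relatedness(topic, candidates, in_links, total_articles):
    """Average relatedness between one topic and all other candidate topics."""
    others = [c for c in candidates if c != topic]
    if not others:
        return 0.0
    return sum(relatedness(in_links, total_articles, topic, o)
               for o in others) / len(others)
```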
Disambiguation Confidence:
The disambiguation classifier described earlier does not just produce a yes/no judgment as to whether a topic is a valid sense of a term; it also gives a probability or confidence in this answer. We use this as a feature to give those topics that we are most sure of a greater chance of being linked.
Each candidate topic may receive multiple confidence values, because several different terms may be disambiguated to the same topic. These are again combined into average and maximum features.
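A sketch of the aggregation, assuming the disambiguator's output is a list of (topic, confidence) pairs, one per phrase in the document.

```python
from collections import defaultdict

def confidence_features(disambiguated_phrases):
    """Group confidences by topic and reduce them to (average, maximum)."""
    per_topic = defaultdict(list)
    for topic, conf in disambiguated_phrases:
        per_topic[topic].append(conf)
    return {t: (sum(cs) / len(cs), max(cs)) for t, cs in per_topic.items()}
```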
Generality:
We define the generality of a topic as the minimum depth at which it is located in Wikipedia’s category tree.
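Generality can be sketched as a breadth-first search from the root of the category tree down to the article; the parent-to-children mapping is an assumed representation of Wikipedia's category graph.

```python
from collections import deque

def generality(children, root, article):
    """Minimum depth of the article below the root of the category tree.
    children maps each category to its subcategories and member articles."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == article:
            return depth
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return None  # article not reachable from the root
```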
Location and Spread:
Three positional features are used: the frequency of the topic's mentions, its first occurrence, and its last occurrence. The distance between the first and last occurrences, or spread, indicates how consistently the document discusses the topic.
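Finally, a sketch of the location and spread features, with positions normalized by document length (a convention assumed here), plus a note on how the full detection feature vector fits together.

```python
def location_features(positions, doc_length):
    """positions: word offsets of a topic's mentions within the document."""
    first, last = min(positions), max(positions)
    return {
        "frequency": len(positions),
        "first_occurrence": first / doc_length,
        "last_occurrence": last / doc_length,
        "spread": (last - first) / doc_length,
    }

# For each candidate topic, the detection classifier then sees:
#   average/maximum link probability, average relatedness to the other candidates,
#   average/maximum disambiguation confidence, generality, and the location features.
# It is trained on Wikipedia articles, where the links chosen by the original
# authors indicate which topics should be linked.
```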