Deciphering Lost Languages with Machine Learning Algorithms
Posted 札记mintis
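The ML algorithm incorporates linguistic constraints (mainly concerning sound change).
The algorithm is largely unsupervised, in that:
a) it needs no background knowledge about the language, such as which languages it is related to
b) it segments words automatically, so it can handle languages written without spaces or punctuation
Based on this algorithm, the researchers propose a quantitative measure of the similarity between languages:
c) tests on the Romance, Germanic, and other language families confirm that the measure is very effective
d) it corroborates the recent view among linguists that Iberian is unrelated to Basque
The algorithm's current limitations are:
e) because it relies on phonetic properties, the current design targets alphabetically written languages
f) it assumes that the unknown language shares "cognates" with some known language, but exceptions are known to exist (Iberian)
The research team plans to develop new algorithms that do not rely on phonetics and do not need to assume that "cognates" exist.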
tag:
Translation practice - English
Machine learning
(?) Linguistics
Translating lost languages using machine learning
System developed at MIT CSAIL aims to help linguists decipher languages that have been lost to history.
Adam Conner-Simons | MIT CSAIL
Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead languages are also considered to be lost, or “undeciphered” — that is, we don’t know enough about their grammar, vocabulary, or syntax to be able to actually understand their texts.
>近来的研究表明,大部分曾经存在的语言已经不再被使用。大量死语言被认为已经消亡、不可解读:我们对它们的语法、词汇、句法了解不足,无法理解这些语言。
>近来的研究表明,大部分曾经存在的语言如今已不再被使用。因为缺乏对其语法、词汇、句法方面的了解,我们难以解读这些“死去”的语言,从这个意义上说,这些语言已经“消失”了。
Lost languages are more than a mere academic curiosity; without them, we miss an entire body of knowledge about the people who spoke them. Unfortunately, most of them have such minimal records that scientists can’t decipher them by using machine-translation algorithms like Google Translate. Some don't have a well-researched "relative" language to be compared to, and often lack traditional dividers like white space and punctuation.
(To illustrate, imaginetryingtodecipheraforeignlanguagewrittenlikethis.)
[countable] body of something: a large amount or collection of something
>消失的语言不仅是学术 curiosity;没有这些语言,我们失去了大量关于语言使用者的知识。不幸的是,这些语言大多数只留下很少的记录,科学家们无法使用机器翻译算法,比如谷歌翻译,去解读它们。一些死语言没有可与之比较的、研究得较好的“亲缘”语言,还常常缺乏空格、标点这些传统分隔符。
Hmm, how should "academic curiosity" be translated?
>消失的语言不仅仅是值得研究的学术对象;如果失去这些语言,我们同时也失去了大量它们使用者的知识。不幸的是,大多数消失的语言只留下了很少记录,科学家无法使用 Google 翻译之类的机器翻译算法解读它们。其中一些不但没有较为我们了解的“亲缘”语言供对比研究,往往还缺少分词的空格和断句的标点。
(Try reading this unpunctuated classical Chinese: 此韩退之所谓句读知不知惑之不解)
# meiyoukonggebiaodianyushengdiaobiaojidepinyinwenkenengshigengheshidelizi
However, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently made a major development in this area: a new system that has been shown to be able to automatically decipher a lost language, without needing advanced knowledge of its relation to other languages. They also showed that their system can itself determine relationships between languages, and they used it to corroborate recent scholarship suggesting that the language of Iberian is not actually related to Basque.
cor·rob·or·ate /kəˈrɒbəreɪt/ [transitive, intransitive, often passive] formal
corroborate (something) to provide evidence or information that supports a statement, theory, etc.
SYNONYM confirm
Iber·ian /aɪˈbɪəriən/
relating to Spain and Portugal
Basque /bæsk/, /bɑːsk/
connected with the people or language of the Basque country of France and Spain
>然而,最近 MIT 计算机科学与人工智能实验室(CSAIL)的研究人员取得了大的进展:一种新的系统被证实可以自动解读一门消失的语言,而不需要事先提供任何关于它与其它语言亲缘关系的知识。这个系统还可以确定语言之间的关系,研究人员用它证实了最近有的学者提出的 Iberian 和 Basque 没有关系的说法。
>最近,MIT 计算机科学与人工智能实验室(CSAIL)的研究人员在解读消失的语言方面有了重要进展:ta 们设计出一个可以自动解读消失语言的系统,它不需要预先知晓任何关于该语言与其它语言关系的知识。同时这个系统还能确定不同语言之间的关系,研究人员用它证实了一项近来的学说:该学说认为 Iberian 与 Basque 之间没有关联。
The team’s ultimate goal is for the system to be able to decipher lost languages that have eluded linguists for decades, using just a few thousand words.
elude /ɪˈluːd/
elude somebody if something eludes you, you are not able to achieve it, or not able to remember or understand it
>研究团队最终的目标是让这个系统只靠几千个单词就能解读消失的语言,那些语言学家几十年没有解读出的。
>研究团队希望最终这个系统能做到仅靠几千个词语便解读出语言学家花了几十年都不能解读的语言。
Spearheaded by MIT Professor Regina Barzilay, the system relies on several principles grounded in insights from historical linguistics, such as the fact that languages generally only evolve in certain predictable ways. For instance, while a given language rarely adds or deletes an entire sound, certain sound substitutions are likely to occur. A word with a “p” in the parent language may change into a “b” in the descendant language, but changing to a “k” is less likely due to the significant pronunciation gap.
gap /ɡæp/
a difference that separates people, or their opinions, situation, etc.
>由 MIT 教授 Regina Barzilay 带领,这个系统依赖的几个主要原理受到来自语言学历史学的启发,比如:语言一般只按照可预测的方式演化。举例说,一门语言很少新增或删减一个完整的发音,但特定的发音替换很容易发生。母系语言中带“p”的单词,在子系语言中可能会变成“b”,但是不太可能变成“k”,因为显著的发音差别。
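Much of the research on how Chinese evolved seems to be based on pronunciations in the various dialects and in Japanese.
# Pronunciation matters! Time to learn IPA! (Tongue-twisting, here we go ><)
ML/DL should also be able to help with tracing how character shapes evolved from oracle bone script and bronze inscriptions to later scripts.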
>这个系统由 MIT 的 Regina Barzilay 教授带领的团队开发,它的主要基础受启发于历史语言学的研究发现,比如:语言通常只遵循某些特定方式演化。举个具体点的例子,语言很少在演化中突然增加一个音或完全弃用一个音,相似发音之间的替换则很常见:[p] 与 [b] 发声相近,与 [k] 则很不同,因此,母语言中一个带 [p] 发音的词语,在子语言中对应词语的发音可能会变成 [b],但不太可能变成 [k]。
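This version may take too many liberties: normally "p" should be read as a letter, but the example in the text is based on pronunciation, so reading it as [p] is probably not wrong either. Another reason to read it as a sound is that similar sound changes also occur in languages that do not use Latin or Greek letters, such as Japanese 「やはり」→「やっぱり」. Of course, this is the translator's personal interpretation, and it may violate the "faithfulness" (信) in "faithfulness, expressiveness, elegance" (信达雅).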
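To make the [p] → [b] versus [p] → [k] point concrete, here is a minimal sketch of an edit distance whose substitution cost depends on how similar two sounds are. It is only an illustration under my own assumptions, not the algorithm from the paper: the `FEATURES` table, the cost values, and the function names are all made up for this note.

```python
# Toy articulatory features, hand-assigned for illustration (an assumption,
# not taken from the article): (place of articulation, voiced?).
FEATURES = {
    "p": ("labial", False),
    "b": ("labial", True),
    "t": ("alveolar", False),
    "d": ("alveolar", True),
    "k": ("velar", False),
    "g": ("velar", True),
}

# Adding or dropping a whole sound is treated as rare, so it is expensive.
INDEL_COST = 2.0


def substitution_cost(a: str, b: str) -> float:
    """Cheap for similar sounds, more expensive for dissimilar ones."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.5  # symbol missing from the toy table: flat fallback cost
    place_diff = 0.0 if fa[0] == fb[0] else 1.0
    voice_diff = 0.0 if fa[1] == fb[1] else 0.5
    return place_diff + voice_diff


def weighted_edit_distance(src: str, dst: str) -> float:
    """Standard dynamic-programming edit distance with the costs above."""
    rows, cols = len(src) + 1, len(dst) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * INDEL_COST
    for j in range(1, cols):
        dist[0][j] = j * INDEL_COST
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + INDEL_COST,  # delete a sound
                dist[i][j - 1] + INDEL_COST,  # insert a sound
                dist[i - 1][j - 1] + substitution_cost(src[i - 1], dst[j - 1]),
            )
    return dist[-1][-1]


print(weighted_edit_distance("pa", "ba"))  # [p] -> [b]: same place, voicing differs
print(weighted_edit_distance("pa", "ka"))  # [p] -> [k]: different place
```

With these made-up costs, "pa" ends up closer to "ba" than to "ka", which mirrors the intuition in the passage. The real system learns such preferences rather than reading them from a hand-written table.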
By incorporating these and other linguistic constraints, Barzilay and MIT PhD student Jiaming Luo developed a decipherment algorithm that can handle the vast space of possible transformations and the scarcity of a guiding signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors. This design enables them to capture pertinent patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.
scar·city /ˈskeəsəti/
[uncountable, countable] plural scar·cities
if there is a scarcity of something, there is not enough of it and it is difficult to obtain it
SYNONYM shortage
per·tin·ent /ˈpɜːtɪnənt/
formal
appropriate to a particular situation
SYNONYM relevant
>通过考虑上述及其它语言学方面的限制,Barzilay 和 MIT 的 PhD 学生 Luo Jiaming 发展了一种解读算法,处理输入中可能的变形的巨大空间和缺少指引信号。这个算法采用一个多维空间描述语言发音,音与音的区别反映在空间中对应矢量的距离。这个设计使得 ta 们可以捕捉合适的语言变化模式,并且将其描述成计算机的限制。这个算法可以对一门古老语言进行分词,并将之对应到相关语言中。
Paper: http://people.csail.mit.edu/j_luo/assets/publications/DecipherUnsegmented.pdf
This translation is terrible.
>借助上述语音演化及其它语言学方面的限制,Barzilay 和 MIT 的 PhD 学生 Luo Jiaming 提出的解读算法得以应对语言中巨量的变形可能性、处理缺少指引信号的输入。算法以多维空间中的矢量描述发音,以矢量之间的距离度量对应的音之间的区别。这个设计可以捕捉语言变化的相关模式,并将语言学的限制以程序语言表述出来。经过训练,这个算法可以给一门古代语言的文字分词,然后在相关联的另一种语言中找到与之对应的词。
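The revised version is still not much better.
How should "guiding signal in the input" be understood? It may be related to the "lack divider" remark earlier in the article.
Should "pertinent" be rendered as “合适” (appropriate) or “相关” (relevant)?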
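To help myself picture "embed sounds into a space, then segment and map words", here is a small sketch. It is not the model from the paper: the 2-D vectors in `SOUND_VECTORS` are hand-picked, hypothetical values (the real system learns its embeddings), words are compared position by position only when they have the same length, and all names are my own.

```python
import math

# Hand-made 2-D "embeddings" (hypothetical values). Similar sounds sit close
# together: [p] and [b] are near each other, [k] is further away.
SOUND_VECTORS = {
    "p": (0.0, 0.0),
    "b": (0.0, 0.5),
    "k": (2.0, 0.0),
    "a": (5.0, 5.0),
    "o": (5.0, 6.0),
}


def sound_distance(a, b):
    """Euclidean distance between the vectors of two sounds."""
    return math.dist(SOUND_VECTORS[a], SOUND_VECTORS[b])


def word_distance(x, y):
    """Position-by-position distance; only same-length words are comparable here."""
    if len(x) != len(y):
        return math.inf
    return sum(sound_distance(a, b) for a, b in zip(x, y))


def segment_and_map(text, known_words):
    """Split unspaced `text` and map each piece to its closest known word."""
    # best[i] holds (total cost, list of (piece, counterpart)) for text[:i].
    best = [(0.0, [])] + [(math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            cost, match = min((word_distance(piece, w), w) for w in known_words)
            total = best[start][0] + cost
            if total < best[end][0]:
                best[end] = (total, best[start][1] + [(piece, match)])
    return best[-1][1]


# "bakopa" is an invented unsegmented text; "pa" and "ko" are invented known words.
print(segment_and_map("bakopa", ["pa", "ko"]))
```

The dynamic program tries every split point and keeps the cheapest segmentation, so word boundaries and word mappings are chosen together. The sketch returns an empty list when no segmentation is possible and expects every symbol to appear in the toy table.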
The project builds on a paper Barzilay and Luo wrote last year that deciphered the dead languages of Ugaritic and Linear B, the latter of which had previously taken decades for humans to decode. However, a key difference with that project was that the team knew that these languages were related to early forms of Hebrew and Greek, respectively.
Ugaritic is an extinct North-West Semitic language, classified by some as a dialect of the Amorite language and so the only known Amorite dialect preserved in writing. It is known through the Ugaritic texts discovered by French archaeologists in 1929 at Ugarit.
(cont') It has been used by scholars of the Hebrew Bible to clarify Biblical Hebrew texts and has revealed ways in which the cultures of ancient Israel and Judah found parallels in the neighboring cultures.
Linear B is a syllabic script that was used for writing Mycenaean Greek, the earliest attested form of Greek...It is descended from the older Linear A, an undeciphered earlier script used for writing the Minoan language...Linear B, found mainly in the palace archives at Knossos, Cydonia, Pylos, Thebes and Mycenae,
(cont') disappeared with the fall of Mycenaean civilization during the Late Bronze Age collapse...It is also the only one of the Bronze Age Aegean scripts to have been deciphered, by English architect and self-taught linguist Michael Ventris.
# Linear A, B, ...is this what they call "linear algebra"?
>这个项目基于 Barzilay 和 Luo 去年一篇关于解读死语言 Ugaritic 和 Linear B 的文章,后一种花了人们数十年时间解读。但是,一个重要的区别是那个项目团队知道这两门语言分别和早期 Hebrew 及 Greek 相关。
Paper: https://arxiv.org/pdf/1906.06718.pdf
>这个项目基于 Barzilay 和 Luo 去年合作的一篇文章,那篇文章成功解读了 Ugaritic 和 Linear B 两门“死语言”,后者曾耗费人们几十年时间才得以解读。但是,去年的项目与今年的有一个关键不同:去年的项目中,研究团队事先就知道 Ugaritic 和 Linear B 分别与早期 Hebrew 及早期 Greek 有关联。
With the new system, the relationship between languages is inferred by the algorithm. This question is one of the biggest challenges in decipherment. In the case of Linear B, it took several decades to discover the correct known descendant. For Iberian, the scholars still cannot agree on the related language: Some argue for Basque, while others refute this hypothesis and claim that Iberian doesn’t relate to any known language.
re·fute /rɪˈfjuːt/
formal
1 refute something to prove that something is wrong
SYNONYM rebut
2 refute something to say that something is not true or fair
SYNONYM deny
>新的系统中,语言之间的关系由算法决定。这是解读中最大的挑战。以 Linear B 为例,人们花了数十年发现它真正的子语言。而 Iberian,学者还没有在其子语言上取得共识:有人认为是 Basque,其他人不认同这个假设并宣称 Iberian 和任何已知语言都没有关联。
>确定语言之间的关联性是解读语言的最大挑战之一,新系统把这一问题交由算法回答。以前研究 Linear B 时,人们花了几十年才在已知的语言中找到它的“后代”。至于 Iberian 和哪种语言有关联,目前学术界尚没有统一的意见:一些学者认为 Iberian 和 Basque 有关,其他人则认为 Iberian 和任何已知语言都没有关联。
The proposed algorithm can assess the proximity between two languages; in fact, when tested on known languages, it can even accurately identify language families. The team applied their algorithm to Iberian considering Basque, as well as less-likely candidates from Romance, Germanic, Turkic, and Uralic families. While Basque and Latin were closer to Iberian than other languages, they were still too different to be considered related.
prox·im·ity /prɒkˈsɪməti/
[uncountable] formal
the state of being near somebody/something in distance or time
>新提出的算法可以评估两门语言之间的相似度。实际上,在对已知的语言测试时,算法甚至能精确地判别语族。团队用算法比较了 Iberian 和 Basque,以及其它来自 Romance、Germanic、Turkic、Uralic 语族的不太可能的候选者。虽然 Basque 和 Latin 比其它语言更接近 Iberian,但它们之间差别仍然太大,不能认为存在关联。
>新算法可以评估两种语言之间的相似程度。对于已知的语言,新算法甚至可以准确地判别出不同的语族。研究团队用新算法对比了 Iberian 和 Basque,一同被比较的还有 Romance、Germanic、Turkic、Uralic 语族的一些语言。结果表明,Basque 和 Latin 比其它语言更接近 Iberian,但差别仍然大到难以认为它们和 Iberian 之间存在关联。
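As a rough picture of what "assessing the proximity between two languages" could look like, here is a toy stand-in. It is not the measure used in the paper: it simply averages, over the words of the unknown language, the best generic string similarity found in a candidate language's vocabulary. The function names and all word lists are invented for illustration.

```python
from difflib import SequenceMatcher


def best_match_score(word, vocabulary):
    """Similarity in [0, 1] between `word` and its closest vocabulary entry."""
    return max(SequenceMatcher(None, word, v).ratio() for v in vocabulary)


def proximity(lost_words, known_words):
    """Average best-match similarity; higher means the languages look closer."""
    scores = [best_match_score(w, known_words) for w in lost_words]
    return sum(scores) / len(scores)


# Invented word lists, used only to exercise the functions.
lost = ["pater", "mater", "noktis"]
candidate_a = ["padre", "madre", "noche"]
candidate_b = ["xyz", "qqq", "zzz"]

print(proximity(lost, candidate_a))
print(proximity(lost, candidate_b))
```

A candidate whose vocabulary resembles many of the unknown words gets a higher score. As the Iberian result in the article shows, the closest candidate can still be too far away to count as related, so a score like this only ranks candidates and does not prove kinship.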
In future work, the team hopes to expand their work beyond the act of connecting texts to related words in a known language — an approach referred to as “cognate-based decipherment.” This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The team’s new approach would involve identifying semantic meaning of the words, even if they don’t know how to read them.
cog·nate /ˈkɒɡneɪt/
linguistics
a word that has the same origin as another
se·man·tic /sɪˈmæntɪk/
[usually before noun] linguistics
connected with the meaning of words and sentences
>将来,团队希望对工作扩展,不只通过建立文字与已知语言中词语的关联——一种被称为“基于词源的解读”的方案。这种范式假定存在这样一门已知的语言,但是 Iberian 的例子表明情况并不总是如此。团队的新方案将包括识别单词的语义,即使 ta 们不知道如何读它。
>当前的解读方案是基于“同源词”的:建立待解读文字与已知语言词语上的关联。该种范式假定确实存在一种与被解读语言有关的已知语言,但是 Iberian 的例子表明情况并不总是如此。研究团队希望将来的方案不止局限于此范式,在新方案中,即使不知道一个词如何读,也能解读出它的语义。
“For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” says Barzilay. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
>“比如说,我们能识别出文档中关于人和地点的部分,借助已知的历史学事实进一步研究,”Barzilay 说。“这些‘识别孤立实体’的方法在今天的各种文字处理程序中很常用、很精确,但关键问题是,这项任务能否在没有古语言训练数据的情况下达成。”
>“比如,我们可以识别出关于人和地点的文字,然后借助历史学做进一步研究,”Barzilay 说。“这类‘识别孤立实体’的方法广泛用于当今各种文字处理程序,它们很准确。但是,解读未知古代语言时我们没有任何可用的训练数据,能否在此种情况下识别出代表人和地点的文字是研究关键。”
Photo of an ancient tablet showing the language of Ugaritic