Is there a way to convert number words to integers?
【Title】: Is there a way to convert number words to Integers? 【Posted】: 2010-10-04 08:06:35 【Question】: I need to convert one into 1, two into 2, and so on.
Is there a way to do this with a library, a class, or anything else?
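【Question comments】:
See also: ***.com/questions/70161/…
Maybe this will help: pastebin.com/WwFCjYtt
If anyone is still looking for an answer to this, I took inspiration from all the answers below and created a python package: github.com/careless25/text2digits
I used the examples below to develop and extend this process, but for Spanish, for future reference: github.com/elbaulp/text2digits_es
For anyone arriving here who is not looking for a Python solution, here is the parallel C# question: Convert words (string) to Int, and here is the Java one: Converting Words to Numbers in Java
【Answer 1】: Most of this code sets up the numwords dict, which is only done on the first call.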
def text2int(textnum, numwords={}):
    # The mutable default dict is deliberate here: it is filled on the
    # first call and then reused as a cache on later calls.
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        # Each word maps to a (scale, increment) pair.
        numwords["and"] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            # thousand and above close the current group
            result += current
            current = 0

    return result + current

print(text2int("seven billion one hundred million thirty one thousand three hundred thirty seven"))
# 7100031337
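To see how the (scale, increment) pairs drive the parse, here is a minimal tracing sketch of the same loop. The names build_numwords and trace_text2int are hypothetical helpers introduced only for illustration, and the table is rebuilt on every call instead of being cached, so this is not a drop-in replacement for the function above.

```python
def build_numwords():
    """Map each number word to a (scale, increment) pair, as in the answer above."""
    units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
    ]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    scales = ["hundred", "thousand", "million", "billion", "trillion"]
    numwords = {"and": (1, 0)}
    for idx, word in enumerate(units):
        numwords[word] = (1, idx)
    for idx, word in enumerate(tens):
        if word:  # assumption: skip the empty placeholders for 0 and 10
            numwords[word] = (1, idx * 10)
    for idx, word in enumerate(scales):
        numwords[word] = (10 ** (idx * 3 or 2), 0)
    return numwords


def trace_text2int(textnum):
    """Return (total, steps); steps holds (word, current, result) after each word."""
    numwords = build_numwords()
    current = result = 0
    steps = []
    for word in textnum.split():
        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:  # thousand and above flush the running group
            result += current
            current = 0
        steps.append((word, current, result))
    return result + current, steps


total, steps = trace_text2int("two thousand three hundred forty five")
for word, current, result in steps:
    print(f"{word:>10}  current={current:<6} result={result}")
print(total)  # 2345
```

The trace shows that "hundred" only multiplies the running group, while "thousand" and larger scales move the group into result. It also shows why "nineteen ninety six" comes out as 115: with no scale word, the three increments are simply added.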
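【Discussion】:
FYI, this will not work for dates. Try: print(text2int("nineteen ninety six")) # 115
The correct way to write 1996 in words is "one thousand nine hundred ninety six". If you want to support years, you'll need different code.
There's a ruby gem by Marc Burns that does this. I recently forked it to add support for years. You can call ruby code from python.
It breaks for "hundred and six". Try print(text2int("hundred and Six")) .. and also print(text2int("thousand"))
"Expected result". I guess different users have different expectations. Personally, mine is that it wouldn't be called with that input, since it's not a valid number. It's two. 【Answer 2】: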
I just released a Python module on PyPI called word2number for this exact purpose. https://github.com/akshaynagpal/w2n
Install it with:
pip install word2number
Make sure your pip is updated to the latest version.
Usage:
from word2number import w2n
print(w2n.word_to_num("two million three thousand nine hundred and eighty four"))
2003984
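【Discussion】:
Tried your package. I'd suggest handling strings like "1 million" or "1M". w2n.word_to_num("1 million") throws an error.
@Ray Thanks for trying it out. Could you please raise an issue at github.com/akshaynagpal/w2n/issues. You can also contribute if you want to. Otherwise, I will definitely look into this in the next release. Thanks again!
Robert, open source software is all about people improving it collaboratively. I wanted a library, and saw that other people wanted one too. So I made it. It may not be ready for production-level systems or conform to textbook buzzwords, but it works for the purpose. Also, it would be great if you could submit a PR so that it can be improved further for all users.
Does it do calculations? Say: nineteen % fifty-seven? Or any other operator, i.e. +, 6, * and /
Not as of now, @S.Jackson. 【Answer 3】: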
If anyone is interested, I hacked up a version that keeps the rest of the string intact (it may have bugs, though; I haven't tested it much).
def text2int(textnum, numwords={}):
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ""
    onnumber = False
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                if onnumber:
                    # a non-number word ends the number being built
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
            else:
                scale, increment = numwords[word]
                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0
                onnumber = True

    if onnumber:
        curstring += repr(result + current)

    return curstring
Example:
>>> text2int("I want fifty five hot dogs for two hundred dollars.")
I want 55 hot dogs for 200 dollars.
There could be issues if you have, say, "$200". But this was really rough.
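【Discussion】:
I took this and other code snippets from here and turned them into a python library: github.com/careless25/text2digits
【Answer 4】: I needed something a bit different, since my input comes from speech-to-text conversion and the solution is not always to sum the numbers. For example, "my zip code is one two three four five" should not convert to "my zip code is 15".
I took Andrew's answer and tweaked it to handle a few other cases that people highlighted as errors, and also added support for examples like the zip code one mentioned above. Some basic test cases are shown below, but I'm sure there is still room for improvement.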
def is_number(x):
    if type(x) == str:
        x = x.replace(',', '')
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

def text2int(textnum, numwords={}):
    units = [
        'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
        'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
        'sixteen', 'seventeen', 'eighteen', 'nineteen',
    ]
    tens = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']
    scales = ['hundred', 'thousand', 'million', 'billion', 'trillion']
    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    if not numwords:
        numwords['and'] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    textnum = textnum.replace('-', ' ')

    current = result = 0
    curstring = ''
    onnumber = False
    lastunit = False
    lastscale = False

    def is_numword(x):
        if is_number(x):
            return True
        # use the argument, not the enclosing loop variable
        if x in numwords:
            return True
        return False

    def from_numword(x):
        if is_number(x):
            scale = 0
            increment = int(x.replace(',', ''))
            return scale, increment
        return numwords[x]

    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
            current = current * scale + increment
            if scale > 100:
                result += current
                current = 0
            onnumber = True
            lastunit = False
            lastscale = False
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if (not is_numword(word)) or (word == 'and' and not lastscale):
                if onnumber:
                    # Flush the current number we are building
                    curstring += repr(result + current) + " "
                curstring += word + " "
                result = current = 0
                onnumber = False
                lastunit = False
                lastscale = False
            else:
                scale, increment = from_numword(word)
                onnumber = True

                if lastunit and (word not in scales):
                    # Assume this is part of a string of individual numbers to
                    # be flushed, such as a zipcode "one two three four five"
                    curstring += repr(result + current)
                    result = current = 0

                if scale > 1:
                    current = max(1, current)

                current = current * scale + increment
                if scale > 100:
                    result += current
                    current = 0

                lastscale = False
                lastunit = False
                if word in scales:
                    lastscale = True
                elif word in units:
                    lastunit = True

    if onnumber:
        curstring += repr(result + current)

    return curstring
Some tests...
one two three -> 123
three forty five -> 345
three and forty five -> 3 and 45
three hundred and forty five -> 345
three hundred -> 300
twenty five hundred -> 2500
three thousand and six -> 3006
three thousand six -> 3006
nineteenth -> 19
twentieth -> 20
first -> 1
my zip is one two three four five -> my zip is 12345
nineteen ninety six -> 1996
fifty-seventh -> 57
one million -> 1000000
first hundred -> 100
I will buy the first thousand -> I will buy the 1000 # probably should leave ordinal in the string
thousand -> 1000
hundred and six -> 106
1 million -> 1000000
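【Discussion】:
I took your answer and fixed some bugs. Added support for "twenty ten" -> 2010 and all tens in general. You can find it here: github.com/careless25/text2digits
Does it do calculations? Say: nineteen % fifty-seven? Or any other operator, i.e. +, 6, * and /
@S.Jackson It doesn't do calculations. If your text snippet is a valid equation in python, I suppose you could use this to convert to integers first and then eval the result (assuming you are familiar with, and comfortable with, the security concerns). So "ten + five" becomes "10 + 5", and then eval("10 + 5") gives you 15. This would only handle the simplest of cases, though. There is no support for floats, for parentheses to control order, or for saying plus/minus/etc. in speech to text. 【Answer 5】: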
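I needed to handle a couple of extra parsing cases, such as ordinal words ("first", "second"), hyphenated words ("one-hundred"), and hyphenated ordinal words (like "fifty-seventh"), so I added a couple of lines: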
def text2int(textnum, numwords={}):
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion"]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5, 'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    # treat hyphenated words as separate tokens
    textnum = textnum.replace('-', ' ')

    current = result = 0
    for word in textnum.split():
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
        else:
            # strip regular ordinal endings: "twentieth" -> "twenty", "sixth" -> "six"
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                raise Exception("Illegal word: " + word)

            scale, increment = numwords[word]

        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current
【Discussion】:
Note: hundredth, thousandth, etc. return zero. Use one hundredth to get 100!
Mutable default arguments are an anti-pattern. 【Answer 6】:
Here's the trivial case approach:
>>> number = {'one':1,
...           'two':2,
...           'three':3,}
>>>
>>> number['two']
2
Or are you looking for something that can handle "twelve thousand, one hundred seventy-two"?
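【Discussion】:
【Answer 7】: If you have a limited set of numbers you'd like to parse, you can easily hardcode it into a dictionary.
For slightly more complex cases, you'll probably want to generate this dictionary automatically, based on the relatively simple number grammar. Something along these lines (generalized, of course...)
for i in range(10):
    myDict[30 + i] = "thirty-" + singleDigitsDict[i]
If you need something more extensive, then it looks like you'll need natural language processing tools. This article might be a good starting point.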
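As a rough illustration of generating the dictionary from the number grammar, the sketch below builds a word-to-integer table for 0 to 99. The names UNITS, TENS and build_word_to_int are made up for this example, and the mapping runs from words to integers (the direction the question asks for) rather than from integers to words as in the loop above.

```python
UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
        60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}


def build_word_to_int():
    """Generate every spelling from 'zero' to 'ninety-nine'."""
    mapping = {word: value for value, word in enumerate(UNITS)}
    for base, tens_word in TENS.items():
        mapping[tens_word] = base
        # "thirty-one" .. "thirty-nine"; digit 0 is covered by the bare tens word
        for digit in range(1, 10):
            mapping[f"{tens_word}-{UNITS[digit]}"] = base + digit
    return mapping


WORD_TO_INT = build_word_to_int()
print(WORD_TO_INT["thirty-seven"])  # 37
print(len(WORD_TO_INT))  # 100
```

This only covers the hyphenated spelling. Accepting "thirty seven" as well would mean normalizing the input first or adding a second set of keys.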
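【Discussion】:
【Answer 8】: Use the Python package WordToDigits:
pip install wordtodigits
It can find numbers written in word form in a sentence and convert them to the proper numeric format. It also takes care of the decimal part, if present. The word representation of numbers can be anywhere in the passage.
【Discussion】: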
【Answer 9】:
def parse_int(string):
    ONES = {'zero': 0,
            'one': 1,
            'two': 2,
            'three': 3,
            'four': 4,
            'five': 5,
            'six': 6,
            'seven': 7,
            'eight': 8,
            'nine': 9,
            'ten': 10,
            'eleven': 11,
            'twelve': 12,
            'thirteen': 13,
            'fourteen': 14,
            'fifteen': 15,
            'sixteen': 16,
            'seventeen': 17,
            'eighteen': 18,
            'nineteen': 19,
            'twenty': 20,
            'thirty': 30,
            'forty': 40,
            'fifty': 50,
            'sixty': 60,
            'seventy': 70,
            'eighty': 80,
            'ninety': 90,
            }

    numbers = []
    for token in string.replace('-', ' ').split(' '):
        if token in ONES:
            numbers.append(ONES[token])
        elif token == 'hundred':
            # "hundred" scales only the most recent value
            numbers[-1] *= 100
        elif token == 'thousand':
            # "thousand" and "million" scale everything seen so far
            numbers = [x * 1000 for x in numbers]
        elif token == 'million':
            numbers = [x * 1000000 for x in numbers]
    return sum(numbers)
Tested with 700 random numbers in the range 1 to one million; works well.
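A round-trip check like the one described can be sketched as follows. The int_to_words and below_thousand helpers are hypothetical, written only to produce test inputs, and parse_int is restated with a generated lookup table so that the block runs on its own. The seed and the sampling are assumptions, not the author's actual test.

```python
import random

UNITS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

# Same word table as the answer's ONES dict, generated instead of typed out.
ONES = {word: value for value, word in enumerate(UNITS)}
ONES.update({word: idx * 10 for idx, word in enumerate(TENS) if word})


def parse_int(string):
    numbers = []
    for token in string.replace("-", " ").split(" "):
        if token in ONES:
            numbers.append(ONES[token])
        elif token == "hundred":
            numbers[-1] *= 100
        elif token == "thousand":
            numbers = [x * 1000 for x in numbers]
        elif token == "million":
            numbers = [x * 1000000 for x in numbers]
    return sum(numbers)


def below_thousand(n):
    """Words for 0 <= n < 1000 as a list (empty for 0)."""
    parts = []
    if n >= 100:
        parts += [UNITS[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + UNITS[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(UNITS[n])
    return parts


def int_to_words(n):
    """Spell out 1 <= n <= 1_000_000 (hypothetical helper for this test only)."""
    if n == 1_000_000:
        return "one million"
    parts = []
    if n >= 1000:
        parts += below_thousand(n // 1000) + ["thousand"]
    parts += below_thousand(n % 1000)
    return " ".join(parts)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.randint(1, 1_000_000) for _ in range(700)]
failures = [n for n in samples if parse_int(int_to_words(n)) != n]
print(len(failures))  # 0
```

The tested range matters here. Because "thousand" multiplies every value collected so far, an input above the range such as "one million two hundred thousand" comes back as 1000200000 rather than 1200000.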
【Discussion】:
【Answer 10】: Made a change so that text2int(scale) returns the correct conversion. For example, text2int("hundred") => 100.
import re

numwords = {}


def text2int(textnum):
    if not numwords:
        units = [ "zero", "one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine", "ten", "eleven", "twelve",
                  "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
                  "eighteen", "nineteen"]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
                "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion",
                  'quadrillion', 'quintillion', 'sextillion', 'septillion',
                  'octillion', 'nonillion', 'decillion' ]

        numwords["and"] = (1, 0)
        for idx, word in enumerate(units): numwords[word] = (1, idx)
        for idx, word in enumerate(tens): numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales): numwords[word] = (10 ** (idx * 3 or 2), 0)

    ordinal_words = {'first':1, 'second':2, 'third':3, 'fifth':5,
                     'eighth':8, 'ninth':9, 'twelfth':12}
    ordinal_endings = [('ieth', 'y'), ('th', '')]

    current = result = 0
    tokens = re.split(r"[\s-]+", textnum)
    for word in tokens:
        if word in ordinal_words:
            scale, increment = (1, ordinal_words[word])
        else:
            for ending, replacement in ordinal_endings:
                if word.endswith(ending):
                    word = "%s%s" % (word[:-len(ending)], replacement)

            if word not in numwords:
                raise Exception("Illegal word: " + word)

            scale, increment = numwords[word]

        if scale > 1:
            # a bare scale word such as "hundred" counts as one of that scale
            current = max(1, current)

        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current
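【Discussion】:
I think the correct English spelling of 100 is "one hundred".
@recursive you're absolutely right, but the advantage of this code is that it handles "hundredth" (perhaps that's what Dawa was trying to highlight). From the sound of the description, the other similar code needs "one hundredth", and that isn't always the commonly used term (e.g. "she picked out the hundredth item to discard").
【Answer 11】: There's a ruby gem by Marc Burns that does this. I recently forked it to add support for years. You can call ruby code from python.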
require 'numbers_in_words'
require 'numbers_in_words/duck_punch'

nums = ["fifteen sixteen", "eighty five sixteen", "nineteen ninety six",
        "one hundred and seventy nine", "thirteen hundred", "nine thousand two hundred and ninety seven"]
nums.each {|n| p n; p n.in_numbers}
Results:
"fifteen sixteen"
1516
"eighty five sixteen"
8516
"nineteen ninety six"
1996
"one hundred and seventy nine"
179
"thirteen hundred"
1300
"nine thousand two hundred and ninety seven"
9297
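【Discussion】:
Please don't call ruby code from python or python code from ruby. They're close enough that something like this should just be ported over.
Agreed, but until it's ported, calling ruby code is better than nothing.
It's not very complex; @recursive has provided logic below (a few lines of code) that can be used.
Actually, it seems to me that "fifteen sixteen" is wrong?
@yekta Right, I think recursive's answer is good within the scope of an SO answer. However, the gem provides a complete package with tests and other features. Anyhow, I think both have their place.
【Answer 12】: A quick solution is to use inflect.py to generate a dictionary for translation.
inflect.py has a number_to_words() function that turns a number (e.g. 2) into its word form (e.g. 'two'). Unfortunately, its reverse (which would let you avoid the translation dictionary route) isn't offered. All the same, you can use that function to build the translation dictionary: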
>>> import inflect
>>> p = inflect.engine()
>>> word_to_number_mapping = {}
>>>
>>> for i in range(1, 100):
...     word_form = p.number_to_words(i)  # 1 -> 'one'
...     word_to_number_mapping[word_form] = i
...
>>> print(word_to_number_mapping['one'])
1
>>> print(word_to_number_mapping['eleven'])
11
>>> print(word_to_number_mapping['forty-three'])
43
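If you're willing to commit some time, it might be possible to examine the inner workings of inflect.py's number_to_words() function and build your own code to do this dynamically (I haven't tried to do this).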
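【Discussion】:
【Answer 13】: I took @recursive's logic and converted it to Ruby. I've also hardcoded the lookup table, so it's not as cool, but it might help a newbie understand what is going on.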
WORDNUMS = {"zero"=> [1,0], "one"=> [1,1], "two"=> [1,2], "three"=> [1,3],
            "four"=> [1,4], "five"=> [1,5], "six"=> [1,6], "seven"=> [1,7],
            "eight"=> [1,8], "nine"=> [1,9], "ten"=> [1,10],
            "eleven"=> [1,11], "twelve"=> [1,12], "thirteen"=> [1,13],
            "fourteen"=> [1,14], "fifteen"=> [1,15], "sixteen"=> [1,16],
            "seventeen"=> [1,17], "eighteen"=> [1,18], "nineteen"=> [1,19],
            "twenty"=> [1,20], "thirty" => [1,30], "forty" => [1,40],
            "fifty" => [1,50], "sixty" => [1,60], "seventy" => [1,70],
            "eighty" => [1,80], "ninety" => [1,90],
            "hundred" => [100,0], "thousand" => [1000,0],
            "million" => [1000000, 0]}

def text_2_int(string)
  # split on hyphens and spaces, then drop the filler word "and"
  numberWords = string.gsub('-', ' ').split(/ /) - %w{and}
  current = result = 0
  numberWords.each do |word|
    scale, increment = WORDNUMS[word]
    current = current * scale + increment
    if scale > 100
      result += current
      current = 0
    end
  end
  return result + current
end
I was looking to handle strings like two thousand one hundred and forty-six
【Discussion】:
【Answer 14】: This handles numbers in words in the Indian style, some fractions, combinations of digits and words, and also addition.
def words_to_number(words):
    numbers = {"zero":0, "a":1, "half":0.5, "quarter":0.25, "one":1,"two":2,
               "three":3, "four":4,"five":5,"six":6,"seven":7,"eight":8,
               "nine":9, "ten":10,"eleven":11,"twelve":12, "thirteen":13,
               "fourteen":14, "fifteen":15,"sixteen":16,"seventeen":17,
               "eighteen":18,"nineteen":19, "twenty":20,"thirty":30, "forty":40,
               "fifty":50,"sixty":60,"seventy":70, "eighty":80,"ninety":90}

    groups = {"hundred":100, "thousand":1_000,
              "lac":1_00_000, "lakh":1_00_000,
              "million":1_000_000, "crore":10**7,
              "billion":10**9, "trillion":10**12}

    split_at = ["and", "plus"]

    n = 0
    skip = False
    words_array = words.split(" ")
    for i, word in enumerate(words_array):
        if not skip:
            if word in groups:
                n*= groups[word]
            elif word in numbers:
                n += numbers[word]
            elif word in split_at:
                # parse the remainder separately and add it on
                skip = True
                remaining = ' '.join(words_array[i+1:])
                n+=words_to_number(remaining)
            else:
                try:
                    n += float(word)
                except ValueError as e:
                    raise ValueError(f"Invalid word {word}") from e
    return n
Test:
print(words_to_number("a million and one"))
>> 1000001
print(words_to_number("one crore and one"))
>> 10000001
print(words_to_number("0.5 million one"))
>> 500001.0
print(words_to_number("half million and one hundred"))
>> 500100.0
print(words_to_number("quarter"))
>> 0.25
print(words_to_number("one hundred plus one"))
>> 101
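【Discussion】:
I did some more tests: "seventeen hundred" = 1700 and "one thousand and seven hundred" = 1700, BUT "one thousand seven hundred" = (one thousand seven) hundred = 1007 * 100 = 100700. Is it technically wrong to say "one thousand seven hundred" instead of "one thousand and seven hundred"?!
【Answer 15】: This code works for Series data: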
import pandas as pd
from word2number import w2n  # assumed: the word2number package from Answer 2

mylist = pd.Series(['one','two','three'])
mylist1 = []
for x in range(len(mylist)):
    mylist1.append(w2n.word_to_num(mylist[x]))
print(mylist1)
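【Discussion】:
What is w2n? It's not defined anywhere.
【Answer 16】: This code only works for numbers below 99. It does both word to int and int to word (for the rest you need to implement 10-20 lines of code and some simple logic; this is just simple code for beginners):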
num = input("Enter the number you want to convert : ")

mydict = {'0': 'Zero', '1': 'One', '2': 'Two', '3': 'Three', '4': 'Four', '5': 'Five','6': 'Six', '7': 'Seven', '8': 'Eight', '9': 'Nine', '10': 'Ten','11': 'Eleven', '12': 'Twelve', '13': 'Thirteen', '14': 'Fourteen', '15': 'Fifteen', '16': 'Sixteen', '17': 'Seventeen', '18': 'Eighteen', '19': 'Nineteen'}
mydict2 = ['', '', 'Twenty', 'Thirty', 'Forty', 'Fifty', 'Sixty', 'Seventy', 'Eighty', 'Ninety']

if num.isdigit():
    # digits -> words (only valid below 100)
    if(int(num) < 20):
        print(" :---> " + mydict[num])
    else:
        var1 = int(num) % 10
        var2 = int(num) // 10
        # leave out the units word for round tens such as 20 or 30
        print(" :---> " + mydict2[var2] + (mydict[str(var1)] if var1 else ''))
else:
    # words -> digits; compound words are expected unspaced, e.g. "twentyfive"
    num = num.lower()
    dict_w = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10, 'eleven': 11, 'twelve': 12, 'thirteen': 13, 'fourteen': 14, 'fifteen': 15, 'sixteen': 16, 'seventeen': 17, 'eighteen': 18, 'nineteen': 19}
    mydict2 = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']
    # whatever follows the "ty" suffix, e.g. "five" in "twentyfive"
    divide = num[num.find("ty")+2:]
    if num:
        if(num in dict_w.keys()):
            print(" :---> " + str(dict_w[num]))
        elif divide == '' :
            for i in range(0, len(mydict2)):
                if mydict2[i] == num:
                    print(" :---> " + str(i * 10))
        else :
            str3 = 0
            str1 = num[num.find("ty")+2:]
            str2 = num[:-len(str1)]
            for i in range(0, len(mydict2)):
                if mydict2[i] == str2:
                    str3 = i
            if str2 not in mydict2:
                print("----->Invalid Input<-----")
            else:
                try:
                    print(" :---> " + str((str3*10) + dict_w[str1]))
                except KeyError:
                    print("----->Invalid Input<-----")
    else:
        print("----->Please Enter Input<-----")
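【Discussion】:
Please explain what this code does and how it does it. That way, your answer is more valuable to those who don't understand coding that well yet.
If the user gives digits as input, the program returns the number in words, and vice versa, e.g. 5->five and five->5. The program works for numbers below 100, but it can be extended to any range by adding just a few lines of code.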