Python: how to sort word-frequency output alphabetically (NLTK)


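I'm new to programming. I have this Python program (it uses nltk):
import nltk

# Read the file contents; word_tokenize expects a string, not a file object
with open('a.txt', 'r') as file_b:
    tokens = nltk.word_tokenize(file_b.read())
fdist1 = nltk.FreqDist(tokens)
for key, val in sorted(fdist1.items())[:5]:
    # Print "relative frequency: word"
    print("{1}: {0}".format(key, round(val / len(tokens), 2)))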

The output is:
0.02: a
0.02: after
0.04: and
0.02: arms
0.02: away

I would like the results sorted by frequency first, highest to lowest, and then alphabetically. How should I change my code?

Many thanks!


Answer A: change the for line so that sorted() receives a key that orders by descending count and then by word:
for key, val in sorted(fdist1.items(), key=lambda x: (-x[1], x[0]))[:5]:
(This answer was accepted by the asker.)
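
The key function in the answer can be tried without a text file. The sketch below is only an illustration under stated assumptions: it uses collections.Counter (which nltk.FreqDist subclasses in NLTK 3) and str.split as a stand-in for nltk.word_tokenize, and the helper name top_words is hypothetical.

```python
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent words as (word, relative frequency) pairs.

    Words are ordered by descending count; ties are broken alphabetically.
    """
    # Assumption: whitespace splitting stands in for nltk.word_tokenize
    tokens = text.split()
    counts = Counter(tokens)
    # Negating the count sorts it descending while the word sorts ascending
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, round(count / len(tokens), 2)) for word, count in ranked[:n]]


for word, freq in top_words("b a c a b d", 3):
    print("{0}: {1}".format(freq, word))
```

Negating the count works because counts are numbers; for a descending sort on a non-numeric field, two stable sorts in sequence (secondary key first) would be an alternative.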

How do I sort an alphanumeric SQL Server NVARCHAR column in numerical order?

【Posted】2016-01-20 09:56:08 【Question】:

I have the following SQL:

SELECT fldTitle 
FROM tblTrafficAlerts 
ORDER BY fldTitle

It returns the results (from an NVARCHAR column) in this order:

A1M northbound within J17 Congestion
M1 J19 southbound exit Congestion
M1 southbound between J2 and J1 Congestion
M23 northbound between J8 and J7 Congestion
M25 anti-clockwise between J13 and J12 Congestion
M25 clockwise between J8 and J9 Broken down vehicle
M3 eastbound at the Fleet services between J5 and J4A Congestion
M4 J19 westbound exit Congestion

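As you can see, M23 and M25 are listed above the M3 and M4 rows. That doesn't look right, and when scanning a longer list of results you wouldn't expect to read them in this order.

So I would like the results sorted alphabetically first and then numerically, like this: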

A1M northbound within J17 Congestion
M1 J19 southbound exit Congestion
M1 southbound between J2 and J1 Congestion
M3 eastbound at the Fleet services between J5 and J4A Congestion
M4 J19 westbound exit Congestion
M23 northbound between J8 and J7 Congestion
M25 anti-clockwise between J13 and J12 Congestion
M25 clockwise between J8 and J9 Broken down vehicle

So M3 and M4 appear above M23 and M25.

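【Comments on the question】:

Tag the DBMS you are using. (The answer may depend on the product.)
Microsoft SQL Server 2008 - thanks for the tag edit suggestion!
@JaydipJ The OP didn't say it isn't working as expected. He is asking how to sort differently from the default.
@CarlSixsmith Yes. That's what I said too.
Does MS SQL Server have any number-aware collation?

【Answer 1】: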

This should handle it. I also added some odd data to make sure the sort works for those rows too:

SELECT x
FROM 
(values
('A1M northbound within J17 Congestion'),
('M1 J19 southbound exit Congestion'),
('M1 southbound between J2 and J1 Congestion'),
('M23 northbound between J8 and J7 Congestion'),
('M25 anti-clockwise between J13 and J12 Congestion'),
('M25 clockwise between J8 and J9 Broken down vehicle'),
('M3 eastbound at the Fleet services between J5 and J4A Congestion'),
('M4 J19 westbound exit Congestion'),('x'), ('2'), ('x2')) x(x)
ORDER BY
  LEFT(x, patindex('%_[0-9]%', x +'0')), 
  0 + STUFF(LEFT(x, 
  PATINDEX('%[0-9][^0-9]%', x + 'x1x')),1,
  PATINDEX('%_[0-9]%', x + '0'),'')

Result:

2
A1M northbound within J17 Congestion
M1 J19 southbound exit Congestion
M1 southbound between J2 and J1 Congestion
M3 eastbound at the Fleet services between J5 and J4A Congestion
M4 J19 westbound exit Congestion
M23 northbound between J8 and J7 Congestion
M25 anti-clockwise between J13 and J12 Congestion
M25 clockwise between J8 and J9 Broken down vehicle
x
x2

【Comments】:

This works great, thank you! For completeness, my resulting query is:

SELECT fldTitle
FROM tblTrafficAlerts
ORDER BY
  LEFT(fldTitle, PATINDEX('%_[0-9]%', fldTitle + '0')),
  0 + STUFF(LEFT(fldTitle,
  PATINDEX('%[0-9][^0-9]%', fldTitle + 'x1x')), 1,
  PATINDEX('%_[0-9]%', fldTitle + '0'), '')

【Answer 2】:

Maybe it isn't pretty, but it does work:

DECLARE @tblTrafficAlerts  TABLE
(
    fldTitle NVARCHAR(500)
);

INSERT INTO @tblTrafficAlerts  (fldTitle)
VALUES (N'A1M northbound within J17 Congestion')
    , (N'M1 J19 southbound exit Congestion')
    , (N'M1 southbound between J2 and J1 Congestion')
    , (N'M23 northbound between J8 and J7 Congestion')
    , (N'M25 anti-clockwise between J13 and J12 Congestion')
    , (N'M25 clockwise between J8 and J9 Broken down vehicle')
    , (N'M3 eastbound at the Fleet services between J5 and J4A Congestion')
    , (N'M4 J19 westbound exit Congestion');

SELECT *
FROM @tblTrafficAlerts AS T
CROSS APPLY (SELECT PATINDEX('%[0-9]%', T.fldTitle)) AS N(NumIndex)
CROSS APPLY (SELECT PATINDEX('%[0-9][^0-9]%', T.fldTitle)) AS NN(NextLetter)
ORDER BY SUBSTRING(T.fldTitle, 0, N.NumIndex), CONVERT(INT, SUBSTRING(T.fldTitle, N.NumIndex, NN.NextLetter - 1));

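This extracts everything before the first digit and sorts by that, then extracts the number and sorts by it as an integer.

Here is the output: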

╔══════════════════════════════════════════════════════════════════╗
║                             fldTitle                             ║
╠══════════════════════════════════════════════════════════════════╣
║ A1M northbound within J17 Congestion                             ║
║ M1 J19 southbound exit Congestion                                ║
║ M1 southbound between J2 and J1 Congestion                       ║
║ M3 eastbound at the Fleet services between J5 and J4A Congestion ║
║ M4 J19 westbound exit Congestion                                 ║
║ M23 northbound between J8 and J7 Congestion                      ║
║ M25 anti-clockwise between J13 and J12 Congestion                ║
║ M25 clockwise between J8 and J9 Broken down vehicle              ║
╚══════════════════════════════════════════════════════════════════╝
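
The idea described above (text before the first digit, then the first number compared as an integer) can also be sketched outside the database. The following Python is only an illustration and not part of any answer: the name road_sort_key and the fallback for titles without digits are assumptions, and Python compares strings by code point, whereas SQL Server uses the column's collation.

```python
import re


def road_sort_key(title):
    """Build a sort key: (text before first digit, first number as int, title)."""
    match = re.search(r"\d+", title)
    if match is None:
        # Assumption: a title without digits sorts by its text, with number 0
        return (title, 0, title)
    # The full title is the last element so equal road numbers stay ordered
    return (title[:match.start()], int(match.group()), title)


titles = [
    "A1M northbound within J17 Congestion",
    "M1 J19 southbound exit Congestion",
    "M1 southbound between J2 and J1 Congestion",
    "M23 northbound between J8 and J7 Congestion",
    "M25 anti-clockwise between J13 and J12 Congestion",
    "M25 clockwise between J8 and J9 Broken down vehicle",
    "M3 eastbound at the Fleet services between J5 and J4A Congestion",
    "M4 J19 westbound exit Congestion",
]

ordered = sorted(titles, key=road_sort_key)
for title in ordered:
    print(title)
```

Sorting in application code like this only makes sense when the full result set is fetched anyway; for paging or large tables the ORDER BY expressions in the answers keep the work in the database.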

【Comments】:

【Answer 3】:
SELECT fldTitle FROM tblTrafficAlerts order by LEFT(fldTitle , CHARINDEX(' ', fldTitle) - 1), fldTitle 

Or use PATINDEX:

ORDER BY LEFT(Col1,PATINDEX('%[^0-9]%',Col1)-1)

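【Comments】:

【Answer 4】:

I would do it like this:

Edit: I split it into two parts, the leading letter and the second part. This lets you, if needed, handle the second part numerically (but there is an annoying "M" in the first row...)

It would be easier to do only the second step: cut at the first blank, check the length, and if needed add a "0" when sorting.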

DECLARE @tblTrafficAlerts TABLE(fldTitle VARCHAR(500));

INSERT INTO @tblTrafficAlerts VALUES 
 ('A1M northbound within J17 Congestion')
,('M1 J19 southbound exit Congestion')
,('M1 southbound between J2 and J1 Congestion')
,('M23 northbound between J8 and J7 Congestion')
,('M25 anti-clockwise between J13 and J12 Congestion')
,('M25 clockwise between J8 and J9 Broken down vehicle')
,('M3 eastbound at the Fleet services between J5 and J4A Congestion')
,('M4 J19 westbound exit Congestion');

SELECT ta.fldTitle
      ,Leading.Letter
      ,Leading.SecondPart
FROM @tblTrafficAlerts AS ta
CROSS APPLY(SELECT SUBSTRING(ta.fldTitle,1,1) AS Letter
                  ,SUBSTRING(ta.fldTitle,2,CHARINDEX(' ',ta.fldTitle)-1) AS SecondPart) AS Leading
ORDER BY Leading.Letter,CASE WHEN LEN(Leading.SecondPart)=1 THEN Leading.SecondPart + '0' ELSE Leading.SecondPart END

Result:

fldTitle                                                           Letter   SecondPart
A1M northbound within J17 Congestion                               A        1M 
M1 J19 southbound exit Congestion                                  M        1 
M1 southbound between J2 and J1 Congestion                         M        1 
M23 northbound between J8 and J7 Congestion                        M        23 
M25 anti-clockwise between J13 and J12 Congestion                  M        25 
M25 clockwise between J8 and J9 Broken down vehicle                M        25 
M3 eastbound at the Fleet services between J5 and J4A Congestion   M        3 
M4 J19 westbound exit Congestion                                   M        4 

【Comments】:
