Speed up a loop filtering a string [duplicate]

Posted 2023-03-11

【Title】: Speed up a loop filtering a string [duplicate] 【Posted】: 2019-11-02 17:38:22 【Question】:

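I want to filter a column of tweets (3+ million rows) in a pandas DataFrame by dropping the tweets that do not contain a keyword. To do this, I am running the following loop (sorry, I'm new to Python):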

filter_word_indicators = []
for i in range(1, len(df)):
    if 'filter_word' in str(df.tweets[0:i]):
        indicator = 1 
    else:
        indicator = 0
    filter_word_indicators.append(indicator)

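The idea is to then drop a tweet if its indicator equals 0. The problem is that this loop takes forever to run. I'm sure there is a better way to drop the tweets that don't contain my 'filter_word', but I don't know how to code it. Any help would be great.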

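【Question comments】:

Is this Python 2 or 3? Also, do you know what percentage of the tweets contain the word?

Python 3. I expect only about 1% will have the keyword I intend to filter on.

Could you post some sample input and output? I suggest adding code that creates a DataFrame with 3 fake tweets of only a few words each, along with the expected result after filtering. Don't use actual long tweets.

【Answer 1】: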

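Have a look at pandas.Series.str.contains, which can be used as follows. Note that ~ negates the boolean mask, so the expression below keeps the rows that do not contain the keyword; drop the ~ to keep only the rows that contain it.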

df[~df.tweets.str.contains('filter_word')]

MWE

In [0]: df = pd.DataFrame(
            [[1, "abc"],
             [2, "bce"]],
            columns=["number", "string"]
        )    
In [1]: df
Out[1]: 
   number string
0       1    abc
1       2    bce

In [2]: df[~df.string.str.contains("ab")]
Out[2]: 
   number string
1       2    bce
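
Since the question asks to keep only the tweets that contain the keyword, the mask would be used without ~ in that case. A minimal sketch, assuming the column is called tweets as in the question. The sample rows are made up for illustration, and regex=False and na=False are optional arguments of str.contains that treat the keyword as a literal substring and count missing values as non-matches:

```python
import pandas as pd

# Hypothetical sample data; the real frame has 3+ million rows.
df = pd.DataFrame(
    {"tweets": ["has filter_word inside", "nothing here", None, "filter_word first"]}
)

# Boolean mask: True where the tweet contains the literal keyword.
# regex=False skips regex handling; na=False maps missing tweets to False.
mask = df["tweets"].str.contains("filter_word", regex=False, na=False)

kept = df[mask]       # tweets that contain the keyword
dropped = df[~mask]   # tweets that do not contain it
```

Whether regex=False makes a measurable difference depends on the data and the pandas version, so it is worth timing on a sample first.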

Timing

A small timing test on the following synthetic DataFrame, built from three million random strings of tweet length

import random
import string

import pandas as pd

df = pd.DataFrame(
    [
        "".join(random.choices(string.ascii_lowercase, k=280))
        for _ in range(3000000)
    ],
    columns=["strings"],
)

and the keyword abc, comparing the original solution, map + regex, and the solution proposed here (str.contains). The results are as follows.

original       99s
map + regex    21s
str.contains  2.8s
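
The exact commands behind these numbers are not shown. One way such a comparison could be set up is sketched below on a much smaller sample. The row count, the timed helper and the per-row loop are assumptions for illustration, not the code that produced the table above, so the printed times will not match it:

```python
import random
import re
import string
import time

import pandas as pd

random.seed(0)
n_rows = 2_000  # assumption: small sample so the sketch runs quickly
df = pd.DataFrame(
    ["".join(random.choices(string.ascii_lowercase, k=280)) for _ in range(n_rows)],
    columns=["strings"],
)
keyword = "abc"


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label:<14}{time.perf_counter() - start:.4f}s")
    return result


# Per-row Python loop (checks each row, unlike the cumulative slice in the question).
loop_mask = timed("loop", lambda: [keyword in s for s in df["strings"]])

# map + regex with a precompiled, escaped pattern.
pattern = re.compile(re.escape(keyword))
map_mask = timed(
    "map + regex", lambda: df["strings"].map(lambda s: bool(pattern.search(s)))
)

# Vectorised string method.
vec_mask = timed(
    "str.contains", lambda: df["strings"].str.contains(keyword, regex=False)
)
```

All three approaches should produce the same mask, so comparing them is a cheap sanity check before trusting any timing.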

【Comments】:

【Answer 2】:

I created the following example:

df = pd.DataFrame("""Suggested order for Amazon Prime Doctor Who series
Why did pressing the joystick button spit out keypresses?
Why tighten down in a criss-cross pattern?
What exactly is the 'online' in OLAP and OLTP?
How is hair tissue mineral analysis performed?
Understanding the reasoning of the woman who agreed with King Solomon to "cut the baby in half"
Can Ogre clerics use Purify Food and Drink on humanoid characters?
Heavily limited premature compiler translates text into excecutable python code
How many children?
Why are < or > required to use /dev/tcp
Hot coffee brewing solutions for deep woods camping
Minor traveling without parents from USA to Sweden
Non-flat partitions of a set
Are springs compressed by energy, or by momentum?
What is "industrial ethernet"?
What does the hyphen "-" mean in "tar xzf -"?
How long would it take to cross the Channel in 1890's?
Why do all the teams that I have worked with always finish a sprint without completion of all the stories?
Is it illegal to withhold someone's passport and green card in California?
When to remove insignificant variables?
Why does Linux list NVMe drives as /dev/nvme0 instead of /dev/sda?
Cut the gold chain
Why do some professors with PhDs leave their professorships to teach high school?
"How can you guarantee that you won't change/quit job after just couple of months?" How to respond?""".split('\n'), columns = ['Sentence'])

You can create a simple function using a regular expression (case-insensitive, so it is more flexible when capitalization varies):

import re

def tweetsFilter(s, keyword):
    return bool(re.match('(?i).*(' + keyword + ').*', s))

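Calling the function on each string gives a boolean Series marking the strings that contain a given keyword. map may speed up your script (you need to test!!!):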

keyword = 'Why'
sel = df.Sentence.map(lambda x: tweetsFilter(x, keyword))
df[sel]

We get:

    Sentence
1   Why did pressing the joystick button spit out ...
2   Why tighten down in a criss-cross pattern?
9   Why are < or > required to use /dev/tcp
17  Why do all the teams that I have worked with a...
20  Why does Linux list NVMe drives as /dev/nvme0 ...
22  Why do some professors with PhDs leave their p...
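
The same case-insensitive selection can probably also be expressed without a Python-level function. A sketch, assuming the keyword should be matched literally (hence re.escape, which the function above does not apply); the three sample sentences are taken from the example frame:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Sentence": [
            "Why tighten down in a criss-cross pattern?",
            "How many children?",
            "Cut the gold chain",
        ]
    }
)
keyword = "why"

# case=False makes the match case-insensitive, like the (?i) flag above.
# re.escape guards against regex metacharacters in the keyword.
sel = df["Sentence"].str.contains(re.escape(keyword), case=False)
print(df[sel])
```

Here only the first sentence is selected, even though the keyword is given in lowercase.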

【Comments】:
