Need a way to grab information out of a file and skip if information is in another file

Posted 2023-02-25

Title: Need a way to grab information out of a file and skip if information is in another file
Posted: 2016-04-09 05:43:14
Question:

I have a file named skip.txt that contains the following:

***.com 
github.com 
www.sa-k.net 
yoursearch.me 
search1.speedbit.com 
duckfm.net
search.clearch.org 
webcache.googleusercontent.com

I also have a file named information.txt that contains the following:

http://search.clearch.org/?a=web&q=Viewcat_h.php%3Fidcategory%3D%20%3Cstrong%3ESite%3C%2Fstrong%3E%20.pl%20
https://moodle.org/mod/forum/discuss.php?d=246409
http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ
http://www.aquariumist.com.ua/spr.php?id=7
http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php%253Fidcategory%253D%2520%253Cstrong%253ESite%253C%252Fstrong%253E%2520.pl%2520%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk
http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+
http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php%3Fidcategory%3D%20Site%20Clinsp=%3Fpvaid%3D97f2b2aa136c4af0936453a19d9ab1b2%26fcoid%3D302363
http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ
http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php%253Fidcategory%253D%2520Site%2520Cl%26insp%3D%253Fpvaid%253D97f2b2aa136c4af0936453a19d9ab1b2%2526fcoid%253D302363%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk

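I want a way to go through this information and move on to the next URL. Is there a way to read from skip.txt so that, if a URL in information.txt contains any entry from skip.txt, that URL is skipped and processing moves on to the next URL in the file?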

Expected output:

http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
https://moodle.org/mod/forum/discuss.php?d=246409
http://www.aquariumist.com.ua/spr.php?id=7

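I did some research and found grep, but that seems to require a complicated regular expression, and I am not very good with those. It would be great if you could help me find a way to skip the entries listed in skip.txt, or help me write a suitable regex. Thank you in advance.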

Comments:

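In future, please reduce your example to its essentials. To make your point, "skip.txt" could be three or four lines, and "information.txt" could have fewer and shorter lines. The lines in skip.txt all end with trailing spaces, which is evidently not intended. I could not understand why my code was not working until I traced the error to those spaces. Please edit to remove them.

@CarySwoveland Sorry about that, I am new here. I will edit the code to remove the trailing spaces.

Answer 1: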

Suppose you have read the skip file into a variable skip and the information file into a variable info. Then

skip_arr = skip.split("\n").map(&:strip)
  #=> ["***.com", "github.com", "www.sa-k.net", "yoursearch.me",
  #    "search1.speedbit.com", "duckfm.net", "search.clearch.org",
  #    "webcache.googleusercontent.com"]

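.map(&:strip) (which you can think of as .map { |s| s.strip }) uses String#strip to remove any whitespace surrounding the elements of the array produced by skip.split("\n"). This may not be necessary, but the precaution does no harm.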
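As a small illustration of the stripping step, the sketch below uses a shortened, hypothetical stand-in for the contents of skip.txt (the `short` and `long` names are likewise only for illustration). It shows that the `&:strip` shorthand and the explicit block give the same result, and that the trailing spaces mentioned in the comments are removed.

```ruby
# Hypothetical stand-in for the contents of skip.txt; note the trailing spaces.
# With the real file one would presumably use: skip = File.read("skip.txt")
skip = "github.com \nduckfm.net\nsearch.clearch.org \n"

# The shorthand and the explicit block produce the same array.
short = skip.split("\n").map(&:strip)
long  = skip.split("\n").map { |s| s.strip }

p short          # prints ["github.com", "duckfm.net", "search.clearch.org"]
p short == long  # prints true
```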

info_arr = info.split("\n")
  #=> ["http://search.clearch.org/?a=web&q=Viewcat_h...,
  #    "https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ",
  #    "http://www.aquariumist.com.ua/spr.php?id=7",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php...,
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html",
  #    "http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+",
  #    "http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php...,
  #    "http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php...]

Next we define a regular expression.

r = /
    (?<=\/\/)                  # match two forward slashes in a positive lookbehind
    #{Regexp.union(skip_arr)}  # match any element of skip_arr
    (?=\/)                     # match a forward slash in a positive lookahead
    /x                         # free-spacing regex definition mode
#=> /
    (?<=\/\/)  # match two forward slashes in a positive lookbehind
    (?-mix:***\.com|github\.com|www\.sa\-k\.net|yoursearch\.me|
      search1\.speedbit\.com|duckfm\.net|search\.clearch\.org|
      webcache\.googleusercontent\.com) # match any element of skip_arr
    (?=\/)     # match a forward slash in a positive lookahead
    /x
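To make the behaviour of this pattern concrete, here is a minimal sketch that builds the same kind of regex from a shortened, hypothetical skip list and tries it on three URLs. The variable names `blocked`, `allowed` and `no_slash` are only for illustration. The third case shows a limitation worth knowing: because of the `(?=\/)` lookahead, a URL that ends right after the host, with no slash, is not matched.

```ruby
skip_arr = ["github.com", "search.clearch.org"]  # shortened, hypothetical skip list

r = /
    (?<=\/\/)                  # preceded by two forward slashes
    #{Regexp.union(skip_arr)}  # any one of the skip hosts, escaped literally
    (?=\/)                     # followed by a forward slash
    /x

blocked  = "http://search.clearch.org/?a=images"
allowed  = "https://moodle.org/mod/forum/discuss.php?d=246409"
no_slash = "http://github.com"  # host with no path at all

p blocked.match?(r)   # prints true
p allowed.match?(r)   # prints false
p no_slash.match?(r)  # prints false, since no slash follows the host
```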

Finally, use Array#reject to remove the elements of info_arr that match this regular expression:

info_arr.reject { |s| s =~ r }
  #=> ["https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://www.aquariumist.com.ua/spr.php?id=7", 
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html"]
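For comparison, the same filtering can be sketched without a hand-written regex by parsing each URL and comparing its host against the skip list. This is a different technique from the answer above, not the answer's method. The helper name `filter_urls` and the shortened sample data are assumptions made for illustration.

```ruby
require "uri"
require "set"

# Hypothetical helper: returns the lines of info_text whose host
# does not appear in skip_text.
def filter_urls(skip_text, info_text)
  skip_hosts = skip_text.split("\n").map(&:strip).reject(&:empty?).to_set
  info_text.split("\n").map(&:strip).reject(&:empty?).reject do |url|
    host = begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil  # unparseable lines are kept rather than silently dropped
    end
    skip_hosts.include?(host)
  end
end

# Shortened sample data taken from the question's two files.
skip = "github.com \nsearch.clearch.org \nwebcache.googleusercontent.com\n"
info = <<~URLS
  http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+
  https://moodle.org/mod/forum/discuss.php?d=246409
  http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ
  http://www.aquariumist.com.ua/spr.php?id=7
URLS

result = filter_urls(skip, info)
puts result
```

With the real files, one would presumably pass File.read("skip.txt") and File.read("information.txt"), assuming both are in the working directory. Unlike the regex, this version compares whole hosts, so it also drops URLs that have no slash after the host.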
