Perl: parse a CSV file and iterate with curl

Posted 2023-03-24


Title: Perl parse csv file and iterate curl. Posted: 2021-01-03 00:52:36. Question:

I am trying to parse a CSV file and iterate over it with curl. Here is my data set:

Act No. 2,Sep/1900/28
Act No. 3,Sep/1900/28
Act No. 10,Oct/1900/28

I followed another question, "CSV into hash", to build a hash from my data set. Here is my code:

#!/usr/bin/perl
use strict;
use warnings;

use Text::CSV_XS;
use IO::File;

use WWW::Curl::Easy;

my $url = "https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/";
#my $filestoprocess = 'list_acts.csv';

# Usage example:
my $hash_ref = csv_file_hashref('toharvest_og_sourcing.csv');

foreach my $key (sort keys %$hash_ref) {

    my $urlcomplete = "$url"."@{$hash_ref->{$key}}";

    #start the curl
    my $user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";

    my $curl = WWW::Curl::Easy->new;

    $curl->setopt(CURLOPT_HEADER,1);
    $curl->setopt(CURLOPT_USERAGENT, $user_agent);
    $curl->setopt(CURLOPT_FOLLOWLOCATION, 1);
    #$curl->setopt(CURLOPT_SSL_VERIFYPEER, 1L);
    #$curl->curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
    $curl->setopt(CURLOPT_SSL_VERIFYPEER, 0);
    $curl->setopt(CURLOPT_URL, $urlcomplete);

    # A filehandle, reference to a scalar or reference to a typeglob can be used here.
    my $response_body;
    $curl->setopt(CURLOPT_WRITEDATA,\$response_body);

    # Starts the actual request
    my $retcode = $curl->perform;

    # Looking at the results...
    if ($retcode == 0) {
        my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
        my $curledurldate = $response_body;
        our ($issuancelink) = $curledurldate =~ /a href='(https.*?)'>.*?<STRONG>$key/s;
        #print "$issuancelink\n";

        if (defined $issuancelink) {

            my $user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";

            #my $curl = WWW::Curl::Easy->new;

            $curl->setopt(CURLOPT_HEADER,1);
            $curl->setopt(CURLOPT_USERAGENT, $user_agent);
            $curl->setopt(CURLOPT_FOLLOWLOCATION, 1);
            #$curl->setopt(CURLOPT_SSL_VERIFYPEER, 1L);
            #$curl->curl_easy_setopt(curl, CURLOPT_SSL_VERIFYPEER, 1L);
            $curl->setopt(CURLOPT_SSL_VERIFYPEER, 0);

            $curl->setopt(CURLOPT_URL, $issuancelink);

            # A filehandle, reference to a scalar or reference to a typeglob can be used here.
            my $response_body;
            $curl->setopt(CURLOPT_WRITEDATA,\$response_body);

            # Starts the actual request
            my $retcode = $curl->perform;

            # Looking at the results...
            if ($retcode == 0) {
                #print("Transfer went ok\n");
                my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
                my $curledsource = $response_body;
                our ($ogsourcing) = $curledsource =~ /<br>\s+(\w+.*?)\s+?<CENTER>.*?H2/s;

                my $filename = 'ogsourcingharvested.txt';
                open (FH, '>>', $filename) or die("Could not open file. $!");
                #print "Error processing ".$fh."$_\n";
                print FH $ogsourcing."|"."$key\n";
                close (FH);
            }
            else {
                # Error code, type of error, error message
                print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
            }
        }
    }
    else {
        # Error code, type of error, error message
        print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
    }
}


# Implementation:
sub csv_file_hashref {
   my ($filename) = @_;

   my $csv_fh = IO::File->new($filename, 'r');
   my $csv = Text::CSV_XS->new ();

   my %output_hash;

   # The first column becomes the hash key; the remaining columns stay in the array ref.
   while(my $colref = $csv->getline ($csv_fh))
   {
      $output_hash{shift @{$colref}} = $colref;
   }

   return \%output_hash;
}

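Basically, the code goes through the second column, appends it to the end of the URL, and fetches that URL with curl. The fetched content is then searched for a specific pattern: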

our ($issuancelink) = $curledurldate =~ /a href='(https.*?)'>.*?<STRONG>$key/s;

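When the search finds that link, it is stored in a variable ($issuancelink), and $issuancelink is then fetched with curl as well. The fetched page is searched for a particular piece of text, which is captured and saved to a text file. My code works fine as long as the second column (in this case Sep/1900/28, Oct/1900/28) does not repeat. When it does repeat, I run into a problem: it seems that the first iteration is the one that gets captured. In my case, Act No. 3 has the same originating URL (https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/Sep/1900/28) as Act No. 2 (https://elibrary.judiciary.gov.ph/thebookshelf/docmonth/Sep/1900/28), and the link for Act No. 2 is captured instead. Thanks in advance!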

Comments:

This is a lot of code. Please edit your question and format it properly; it is hard to read.

Answer 1:

"However, my code is fine if the second column (in this case Sep/1900/28, Oct/1900/28) is not repeating."

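When you store values in a hash, the hash keys are unique. This means that if you have identical key names, the later value overwrites the earlier one.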

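This part of the code:

   while(my $colref = $csv->getline ($csv_fh))
   {
      $output_hash{shift @{$colref}} = $colref;
   }

seems to be responsible. What you can do is store the values in an array instead of a scalar (in this case, in an array ref).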
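To see the overwriting in isolation, here is a minimal sketch that needs no CSV file or network access. It uses the sample rows from the question; keying on the date column and the name %by_date are assumptions made for illustration, chosen so that a key actually repeats.

```perl
use strict;
use warnings;

# Sample rows from the question. Keying on the date column is an
# assumption made here so that one key occurs twice.
my @rows = (
    [ 'Act No. 2',  'Sep/1900/28' ],
    [ 'Act No. 3',  'Sep/1900/28' ],
    [ 'Act No. 10', 'Oct/1900/28' ],
);

my %by_date;
for my $row (@rows) {
    my ($act, $date) = @$row;
    # Plain assignment: a later row with the same key replaces the earlier one.
    $by_date{$date} = $act;
}

# Three rows went in, but only two keys remain.
printf "%d keys\n", scalar keys %by_date;
print "$_ => $by_date{$_}\n" for sort keys %by_date;
```

In this sketch the entry for Act No. 2 is lost, because Act No. 3 is stored under the same key afterwards.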

I would do this:

   while(my $colref = $csv->getline ($csv_fh))
   {
      my ($key, $value) = @$colref;
      push @{ $output_hash{$key} }, $value;       # store values in array
   }

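This has the added benefit of copying the values. In your code, it is the array ref that gets copied. The limited scope of the variable my $colref saves you from problems here, but in general, copying the values is the safer habit.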

To access the array values, you will probably need a loop inside the loop over the hash keys. Something like:

for my $key (sort keys %$hash_ref) {
    for my $value (@{ $hash_ref->{$key} }) {
          # do stuff...
    }
}
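Putting the two pieces together, the following is a rough, self-contained sketch of the hash-of-arrays approach. It reads the question's sample rows from an in-memory string instead of a file and splits on the comma, purely to stay dependency-free; a real file should still go through Text::CSV_XS. Keying on the date column and the names %acts_for and @report are assumptions for illustration, not part of the original code.

```perl
use strict;
use warnings;

# In-memory stand-in for the CSV file, holding the question's sample rows.
my $csv_text = <<'END';
Act No. 2,Sep/1900/28
Act No. 3,Sep/1900/28
Act No. 10,Oct/1900/28
END

my %acts_for;    # date => [ act, act, ... ]
for my $line (split /\n/, $csv_text) {
    # A plain split is enough for these unquoted fields (simplification).
    my ($act, $date) = split /,/, $line, 2;
    # push autovivifies the array ref, so repeated dates accumulate.
    push @{ $acts_for{$date} }, $act;
}

# Outer loop over the keys, inner loop over every value stored for that key.
my @report;
for my $date (sort keys %acts_for) {
    for my $act (@{ $acts_for{$date} }) {
        push @report, "$date -> $act";
    }
}
print "$_\n" for @report;
```

With this layout, both Act No. 2 and Act No. 3 remain reachable under Sep/1900/28, so the inner loop is where a per-act lookup could go.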

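Discussion:

Hi TLP! Thanks for your answer. I am trying your code and I get this error: Global symbol "%hash_ref" requires explicit package name (did you forget to declare "my %hash_ref"?) at test2.pl line 17. I don't know what to replace %hash_ref with.

@schnydszch Well, that is simple enough. You are trying to use an undeclared hash variable. You probably declared my $hash_ref and then tried to use $hash_ref{foo}. The latter refers to a value in the hash %hash_ref. What you need to do is use $hash_ref->{foo}, which is the correct way to access a hash reference.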
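The distinction in that last comment can be shown in a few lines. This is only an illustrative sketch; the names %hash, $direct, $via_ref and @keys are made up for the example.

```perl
use strict;
use warnings;

my %hash     = (foo => 'bar');   # a hash variable
my $hash_ref = \%hash;           # a scalar holding a reference to that hash

my $direct  = $hash{foo};        # element of the hash %hash
my $via_ref = $hash_ref->{foo};  # same element, reached through the reference
my @keys    = keys %$hash_ref;   # dereference the whole hash

# Writing $hash_ref{foo} would instead mean "element foo of a hash named
# %hash_ref". No such hash is declared here, so "use strict" rejects it
# at compile time with the "Global symbol" error quoted above.
print "$direct $via_ref @keys\n";
```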
