Reading large files from the end

Posted 2023-02-18

【Title】: Reading large files from end 【Posted】: 2011-09-21 00:41:31 【Question】:

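Can I read a file in PHP from the end, for example if I only want the last 10-20 lines?

Also, from what I have read, errors start to appear once the file is larger than 10 MB.

How can I prevent this error?

To read an ordinary file, we use this code: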

$handle = @fopen($file, "r"); // $file: path to the input file (placeholder)
if ($handle) {
    while (($buffer = fgets($handle, 4096)) !== false) {
        $i1++;
        $content[$i1] = $buffer;
    }
    if (!feof($handle)) {
        echo "Error: unexpected fgets() fail\n";
    }
    fclose($handle);
}

My file may be larger than 10 MB, but I only need to read the last few lines. How can I do that?

Thanks

【Question comments】:

Possible duplicate: PHP - reading from the end of a text file

【Answer 1】:

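It depends on how you interpret "can".

If you are asking whether you can do this directly (with a PHP function) without reading all the preceding lines, the answer is: no, you cannot.

A line ending is an interpretation of the data, and you can only know where the line endings are once you have actually read the data.

For a really large file, I would not read everything. It is better to scan the file starting from its end, reading blocks progressively from the end towards the beginning.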
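To illustrate the point about line endings, here is a minimal sketch. The function name `count_newlines_in_tail` and the 4096-byte default window are assumptions made for this example, not part of the original answer. It reads only the final bytes of a file and counts the newline characters found there; nothing is learned about line boundaries outside the bytes that were actually read.

```php
<?php
// Hypothetical helper: count "\n" bytes in the last $window bytes of a file.
function count_newlines_in_tail(string $path, int $window = 4096): int
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // Find the file size by seeking to the end.
    fseek($fh, 0, SEEK_END);
    $size = ftell($fh);

    // Jump to the start of the window (or the start of the file).
    $start = max(0, $size - $window);
    fseek($fh, $start, SEEK_SET);

    // fread() requires a positive length, so guard against empty files.
    $chunk = ($size > $start) ? fread($fh, $size - $start) : '';
    fclose($fh);

    return substr_count($chunk, "\n");
}
```

If the count is lower than the number of lines wanted, the window has to be extended further towards the beginning of the file.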

Update

Here is a PHP-only way to read the last n lines of a file without reading through the whole file:

function last_lines($path, $line_count, $block_size = 512) {
    $lines = array();

    // we will always have a fragment of a non-complete line
    // keep this in here till we have our next entire line.
    $leftover = "";

    $fh = fopen($path, 'r');
    // go to the end of the file
    fseek($fh, 0, SEEK_END);
    do {
        // need to know whether we can actually go back
        // $block_size bytes
        $can_read = $block_size;
        if (ftell($fh) < $block_size) {
            $can_read = ftell($fh);
        }

        // go back as many bytes as we can
        // read them to $data and then move the file pointer
        // back to where we were.
        fseek($fh, -$can_read, SEEK_CUR);
        $data = fread($fh, $can_read);
        $data .= $leftover;
        fseek($fh, -$can_read, SEEK_CUR);

        // split lines by \n. Then reverse them,
        // now the last line is most likely not a complete
        // line which is why we do not directly add it, but
        // append it to the data read the next time.
        $split_data = array_reverse(explode("\n", $data));
        $new_lines = array_slice($split_data, 0, -1);
        $lines = array_merge($lines, $new_lines);
        $leftover = $split_data[count($split_data) - 1];
    } while (count($lines) < $line_count && ftell($fh) != 0);
    if (ftell($fh) == 0) {
        $lines[] = $leftover;
    }
    fclose($fh);
    // Usually, we will read too many lines, correct that here.
    // The result is ordered from the last line of the file backwards.
    return array_slice($lines, 0, $line_count);
}

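【Comments】:

You can absolutely do this without reading all the preceding lines, as you suggest yourself in your last sentence. :)

@awgy: By "directly" I meant with a PHP function or with help from the operating system ;) Maybe I worded that badly :)

@kritya, @awgy: I have added an implementation of what I described.

Would it be possible to have this snippet declared GPLv2+ compatible? :) I would like to use it in a WordPress plugin, and the official repository has that licensing requirement, which the CC-wiki license used here is not compatible with. :(

@Rarst: Sure, you can use it under that license. (I guess it is enough for me to say so?)

【Answer 2】: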

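It is not pure PHP, but a common solution is the tac command, which is the reverse of cat and outputs the file's lines in reverse order. Run it on the server with exec() or passthru(), then read the result. Example usage: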

<?php
$myfile = 'myfile.txt';
$command = "tac $myfile > /tmp/myfilereversed.txt";
exec($command);
$currentRow = 0;
$numRows = 20;  // stops after this number of rows
$handle = fopen("/tmp/myfilereversed.txt", "r");
while (!feof($handle) && $currentRow < $numRows) {
   $currentRow++;
   $buffer = fgets($handle, 4096);
   echo $buffer."<br>";
}
fclose($handle);
?>
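A variant of the same idea can be sketched as a function that quotes the file name for the shell and removes the temporary copy when it is done. The function name `last_rows_via_tac` is hypothetical, and the sketch assumes a Unix-like host where GNU `tac` is on the PATH. It still writes a reversed copy of the whole file to disk, so it shares the cost of the approach above.

```php
<?php
// Hypothetical helper: return the last $numRows lines (newest first) using
// the external `tac` command. Assumes a Unix-like host with GNU coreutils.
function last_rows_via_tac(string $path, int $numRows = 20): array
{
    $tmp = tempnam(sys_get_temp_dir(), 'tac');
    // escapeshellarg() guards against spaces and shell metacharacters.
    $command = 'tac ' . escapeshellarg($path) . ' > ' . escapeshellarg($tmp);
    exec($command, $unused, $status);

    $rows = [];
    if ($status === 0 && ($handle = fopen($tmp, 'r')) !== false) {
        while (count($rows) < $numRows && ($line = fgets($handle)) !== false) {
            $rows[] = rtrim($line, "\r\n");
        }
        fclose($handle);
    }
    unlink($tmp); // remove the temporary reversed copy
    return $rows;
}
```

Because the rows come out of `tac` already reversed, the first array element is the last line of the original file.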

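【Comments】:

But does it affect the real file, or does the command only work on it virtually?

It does not affect the real file, but it creates a new file, /tmp/myfilereversed.txt, so you need to delete it afterwards.

【Answer 3】:

You can use fopen and fseek to navigate backwards through the file from the end. For example: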

$fp = @fopen($file, "r");
$pos = -2;
while (fgetc($fp) != "\n") {
    fseek($fp, $pos, SEEK_END);
    $pos = $pos - 1;
}
$lastline = fgets($fp);
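A file that ends in one or more line breaks is a known weak spot of a byte-by-byte backward scan. The following is a minimal sketch of one way to handle it, not the answer author's code; the function name `last_nonempty_line` is hypothetical. It walks backwards from the end, skips any trailing line breaks, and stops at the first line break found after some text has been collected.

```php
<?php
// Hypothetical helper: return the last non-empty line of a file, scanning
// backwards one byte at a time from the end.
function last_nonempty_line(string $path): string
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    fseek($fp, 0, SEEK_END);
    $pos = ftell($fp);
    $line = '';
    while ($pos > 0) {
        fseek($fp, --$pos, SEEK_SET);
        $char = fgetc($fp);
        if ($char === "\n" || $char === "\r") {
            if ($line !== '') {
                break;          // reached the start of the last line
            }
            continue;           // still skipping trailing line breaks
        }
        $line = $char . $line;  // prepend, since we walk backwards
    }
    fclose($fp);
    return $line;
}
```

The sketch uses absolute offsets with SEEK_SET, so the loop ends cleanly at the beginning of a single-line file. It does nothing about integer size limits on 32-bit builds.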

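【Comments】:

By using fseek with a negative offset and SEEK_END, you set the position indicator $offset bytes before the end of the file, so you do not need to read from the beginning of the file.

If the file ends with a newline, this snippet will return only the newline. Also, I believe $pos should be initialized to -1 before the loop starts.

Agreed, fixed the snippet. I think an initial value of -2 covers the first case. Of course it will not cover a file that ends with several "\n", but I will leave that to the poster.

This is the best solution. +1

A small update on this: fseek seems to use integers internally, which prevents you from setting a position above 2147483647 on a 32-bit setup. This stopped me from using it on a ~4.8 GB log file.

【Answer 4】:

If your code does not work and reports an error, you should include the error in your post!

The reason you get an error is that you are trying to store the entire contents of the file in PHP's memory space.

The most efficient way to solve the problem is, as Greenisha suggests, to seek to the end of the file and then go back a bit. But Greenisha's mechanism for going back is not very efficient.

Consider instead the way to get the last few lines from a stream (i.e. where you cannot seek):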

while (($buffer = fgets($handle, 4096)) !== false) {
    $i1++;
    $content[$i1] = $buffer;
    unset($content[$i1 - $lines_to_keep]);
}

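So if you know that your maximum line length is 4096, you would first do:

if (4096 * $lines_to_keep < filesize($input_file)) {
   fseek($fp, -4096 * $lines_to_keep, SEEK_END);
}

Then apply the loop I described previously.

Since C has some more efficient ways of dealing with byte streams, the fastest solution (on POSIX/Unix/Linux/BSD systems) is simply: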

$last_lines = shell_exec("tail -n " . $lines_to_keep . " filename");
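The two pure-PHP steps described in this answer (seek close to the end, then keep a rolling window while reading forward) can be combined roughly as follows. This is a sketch under stated assumptions: the function name `tail_window` is hypothetical, and `$maxLineLength` is assumed to be a true upper bound on the length of any line, including its line break. The sketch seeks back one extra line's worth of bytes so that the partial line at the landing position can be discarded safely.

```php
<?php
// Hypothetical helper: seek close to the end, then keep a rolling window of
// at most $linesToKeep lines while reading forward.
// $maxLineLength is an assumed upper bound on the length of any line.
function tail_window(string $path, int $linesToKeep, int $maxLineLength = 4096): array
{
    $fp = fopen($path, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    fseek($fp, 0, SEEK_END);
    $size = ftell($fp);

    // One extra line of slack, because the first line read may be partial.
    $span = $maxLineLength * ($linesToKeep + 1);
    $skippedAhead = $span < $size;
    fseek($fp, $skippedAhead ? $size - $span : 0, SEEK_SET);
    if ($skippedAhead) {
        fgets($fp); // discard the (probably partial) line we landed in
    }

    $window = [];
    while (($buffer = fgets($fp)) !== false) {
        $window[] = rtrim($buffer, "\r\n");
        if (count($window) > $linesToKeep) {
            array_shift($window); // drop the oldest line
        }
    }
    fclose($fp);
    return $window;
}
```

Memory use stays bounded by the window size rather than the file size, which is the property the original question was missing.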

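【Comments】:

+1 for the idea with unset, but please add more explanation. Your solution also iterates through the entire file, only somewhat slower because of the overhead of fgets and fseek.

@stefgosselin: No, read it again. It only iterates through a block at the end of the file that is at least as large as the data to be extracted.

【Answer 5】:

Here is another solution. There is no line length control in fgets(), but you can add it.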

/* Read file from end line by line */
$fp = fopen( dirname(__FILE__) . '\\some_file.txt', 'r');
$lines_read = 0;
$lines_to_read = 1000;
fseek($fp, 0, SEEK_END); //goto EOF
$eol_size = 2; // for windows is 2, rest is 1
$eol_char = "\r\n"; // mac=\r, unix=\n
while ($lines_read < $lines_to_read) {
    if (ftell($fp)==0) break; //break on BOF (beginning...)
    do {
        fseek($fp, -1, SEEK_CUR); //seek 1 by 1 char from EOF
        $eol = fgetc($fp) . fgetc($fp); //search for EOL (remove 1 fgetc if needed)
        fseek($fp, -$eol_size, SEEK_CUR); //go back for EOL
    } while ($eol != $eol_char && ftell($fp)>0 ); //check EOL and BOF

    $position = ftell($fp); //save current position
    if ($position != 0) fseek($fp, $eol_size, SEEK_CUR); //move for EOL
    echo fgets($fp); //read LINE or do whatever is needed
    fseek($fp, $position, SEEK_SET); //set current position
    $lines_read++;
}
fclose($fp);

【Comments】:

【Answer 6】:

The following snippet worked for me.

$file = popen("tac $filename",'r');

while ($line = fgets($file)) {
   echo $line;
}
pclose($file);

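Reference: http://laughingmeme.org/2008/02/28/reading-a-file-backwards-in-php/

【Comments】:

@Lenin Yes, I tested it with a 1 GB file.

【Answer 7】:

As Einstein said, everything should be made as simple as possible, but not simpler. At this point you need a data structure: a LIFO data structure, or simply put, a stack.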
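The answer gives no code, so the following is only one possible reading of the stack idea, using PHP's built-in `SplStack`. The function name `last_lines_via_stack` is hypothetical. Note that this sketch pushes every line of the file onto the stack, so it reads the whole file and its memory use grows with the file size.

```php
<?php
// Hypothetical illustration of the LIFO idea: push each line onto a stack,
// then pop to get the lines back in reverse order.
function last_lines_via_stack(string $path, int $count): array
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $stack = new SplStack();
    while (($line = fgets($fh)) !== false) {
        $stack->push(rtrim($line, "\r\n"));
    }
    fclose($fh);

    // Popping returns the most recently pushed line first.
    $result = [];
    while ($count-- > 0 && !$stack->isEmpty()) {
        $result[] = $stack->pop();
    }
    return $result;
}
```

For small files this is the simplest option; for the multi-megabyte files in the question, the seek-based answers avoid holding everything in memory.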

【Comments】:

【Answer 8】:

On Linux you can do this:

$linesToRead = 10;
exec("tail -n$linesToRead $myFileName" , $content);

You will get an array of lines in the $content variable.

Pure PHP solution:

$f = fopen($myFileName, 'r');

$maxLineLength = 1000;  // Real maximum length of your records
$linesToRead = 10;
fseek($f, -$maxLineLength*$linesToRead, SEEK_END);  // Moves cursor back from the end of file
$res = array();
while (($buffer = fgets($f, $maxLineLength)) !== false) {
    $res[] = $buffer;
}

$content = array_slice($res, -$linesToRead);

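【Comments】:

【Answer 9】:

Well, while searching for the same thing I came across the following, and thought it might be useful to others too, so I am sharing it here:

/* Read file from end, line by line */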

function tail_custom($filepath, $lines = 1, $adaptive = true) {
    // Open file
    $f = @fopen($filepath, "rb");
    if ($f === false) return false;

    // Sets buffer size, according to the number of lines to retrieve.
    // This gives a performance boost when reading a few lines from the file.
    if (!$adaptive) $buffer = 4096;
    else $buffer = ($lines < 2 ? 64 : ($lines < 10 ? 512 : 4096));

    // Jump to last character
    fseek($f, -1, SEEK_END);

    // Read it and adjust line number if necessary
    // (Otherwise the result would be wrong if file doesn't end with a blank line)
    if (fread($f, 1) != "\n") $lines -= 1;

    // Start reading
    $output = '';
    $chunk = '';

    // While we would like more
    while (ftell($f) > 0 && $lines >= 0) {

        // Figure out how far back we should jump
        $seek = min(ftell($f), $buffer);

        // Do the jump (backwards, relative to where we are)
        fseek($f, -$seek, SEEK_CUR);

        // Read a chunk and prepend it to our output
        $output = ($chunk = fread($f, $seek)) . $output;

        // Jump back to where we started reading
        fseek($f, -mb_strlen($chunk, '8bit'), SEEK_CUR);

        // Decrease our line counter
        $lines -= substr_count($chunk, "\n");

    }

    // While we have too many lines
    // (Because of buffer size we might have read too many)
    while ($lines++ < 0) {
        // Find first newline and remove all text before that
        $output = substr($output, strpos($output, "\n") + 1);
    }

    // Close file and return
    fclose($f);
    return trim($output);
}

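【Comments】:

【Answer 10】:

If you know roughly how long the lines are, you can avoid a lot of the black magic and simply grab a chunk off the end of the file.

I needed the last 15 lines of a very large log file, about 3000 characters altogether. So I just grab the last 8000 or so bytes to be safe, then read the file as normal and take what I need from the end.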

$fh = fopen($file, "r");
fseek($fh, -8192, SEEK_END);
$lines = array();
while (($line = fgets($fh)) !== false) {
    $lines[] = $line;
}

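This is probably more efficient than the highest-rated answer, which reads the file character by character, compares each character, and splits on newline characters.

【Comments】:

【Answer 11】:

Here is a more complete example of the "tail" suggestion above. It seems to be a simple and efficient approach, thank you. Very large files should not be a problem, and no temporary file is needed.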

$out = array();
$ret = null;

// capture the last 30 lines of the log file into a buffer
exec('tail -30 ' . $weatherLog, $buf, $ret);

if ( $ret == 0 ) {

  // process the captured lines one at a time
  foreach ($buf as $line) {
    // sscanf() returns the number of values assigned; 2 means a full match
    $n = sscanf($line, "%s temperature %f", $dt, $t);
    if ( $n == 2 ) $temperature = $t;
    $n = sscanf($line, "%s humidity %f", $dt, $h);
    if ( $n == 2 ) $humidity = $h;
  }
  printf("<tr><th>Temperature</th><td>%0.1f</td></tr>\n", 
          $temperature);
  printf("<tr><th>Humidity</th><td>%0.1f</td></tr>\n", $humidity);
}
else { # something bad happened
}

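In the example above, the code reads 30 lines of text output and displays the last temperature and humidity readings in the file (which is why the printf calls are outside the loop, in case you were wondering). The file is filled by an ESP32, which appends to it every few minutes even if the sensor only reports nan. Thirty lines therefore contain plenty of readings, so the lookup should not come up empty. Each reading includes the date and time, so in the final version the output will include the time of the reading.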
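The parsing step can be isolated into a small function, which makes it easier to see what sscanf() reports for each line. This is a sketch only: the function name `latest_readings` and the sample line layout ("timestamp temperature value") are assumptions inferred from the format strings in the example, not a documented log format. sscanf() returns the number of values it assigned, so the sketch treats only a count of 2 as a full match.

```php
<?php
// Hypothetical helper: scan log lines and return the latest temperature and
// humidity readings. The line layout ("<timestamp> temperature <value>") is
// an assumption based on the sscanf() formats used in the example.
function latest_readings(array $lines): array
{
    $latest = ['temperature' => null, 'humidity' => null];
    foreach ($lines as $line) {
        // sscanf() returns how many values it assigned; 2 means both the
        // timestamp and the number matched.
        if (sscanf($line, "%s temperature %f", $dt, $t) === 2) {
            $latest['temperature'] = $t;
        }
        if (sscanf($line, "%s humidity %f", $dt, $h) === 2) {
            $latest['humidity'] = $h;
        }
    }
    return $latest;
}
```

Passing the `$buf` array filled by exec() to this function would give the same two values that the printf calls display, with null for any reading that never appeared in the captured lines.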

【Comments】:
