How to read a text file in reverse with an iterator in C#

I need to process a large file, around 400K lines and 200 MB. Sometimes, though, I have to process it from the bottom up. How can I use an iterator (yield return) here? Basically, I do not want to load everything into memory. I know it is more efficient to use iterators in .NET.

Reading text files backwards is really tricky unless you are using a fixed-size encoding (e.g. ASCII). With a variable-size encoding (such as UTF-8), every time you fetch data you have to check whether you have landed in the middle of a character.
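As a rough illustration of that check for UTF-8 only: continuation bytes always have the bit pattern 10xxxxxx, so any other byte starts a character. The class and method names below (Utf8Boundary, AlignToCharacterStart) are made up for this sketch and are not part of any library.

```csharp
using System;

public static class Utf8Boundary
{
    // In UTF-8, continuation bytes look like 10xxxxxx; every other byte starts a character.
    public static bool IsCharacterStart(byte b)
    {
        return (b & 0xC0) != 0x80;
    }

    // Hypothetical helper: steps backwards from an arbitrary index until it
    // reaches the first byte of the character that index falls inside.
    public static int AlignToCharacterStart(byte[] data, int index)
    {
        while (index > 0 && !IsCharacterStart(data[index]))
        {
            index--;
        }
        return index;
    }
}
```

A reader that seeks to an arbitrary offset could run the bytes it fetched through such a helper before decoding them. Other variable-width encodings need their own rule.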

There is nothing built into the framework for this, and I suspect you would have to hard-code the handling separately for each variable-width encoding.

EDIT: The code below has been tested to some extent, but that does not mean it has no subtle bugs left. It uses StreamUtil from MiscUtil, but I have included only the necessary (new) method from there, at the bottom. Oh, and it needs refactoring: there is one pretty hefty method, as you will see:

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    namespace MiscUtil.IO
    {
        /// <summary>
        /// Takes an encoding (defaulting to UTF-8) and a function which produces a seekable stream
        /// (or a filename for convenience) and yields lines from the end of the stream backwards.
        /// Only single byte encodings, and UTF-8 and Unicode, are supported. The stream
        /// returned by the function must be seekable.
        /// </summary>
        public sealed class ReverseLineReader : IEnumerable<string>
        {
            /// <summary>
            /// Buffer size to use by default. Classes with internal access can specify
            /// a different buffer size - this is useful for testing.
            /// </summary>
            private const int DefaultBufferSize = 4096;

            /// <summary>
            /// Means of creating a Stream to read from.
            /// </summary>
            private readonly Func<Stream> streamSource;

            /// <summary>
            /// Encoding to use when converting bytes to text
            /// </summary>
            private readonly Encoding encoding;

            /// <summary>
            /// Size of buffer (in bytes) to read each time we read from the
            /// stream. This must be at least as big as the maximum number of
            /// bytes for a single character.
            /// </summary>
            private readonly int bufferSize;

            /// <summary>
            /// Function which, when given a position within a file and a byte, states whether
            /// or not the byte represents the start of a character.
            /// </summary>
            private Func<long, byte, bool> characterStartDetector;

            /// <summary>
            /// Creates a LineReader from a stream source. The delegate is only
            /// called when the enumerator is fetched. UTF-8 is used to decode
            /// the stream into text.
            /// </summary>
            /// <param name="streamSource">Data source</param>
            public ReverseLineReader(Func<Stream> streamSource)
                : this(streamSource, Encoding.UTF8)
            {
            }

            /// <summary>
            /// Creates a LineReader from a filename. The file is only opened
            /// (or even checked for existence) when the enumerator is fetched.
            /// UTF8 is used to decode the file into text.
            /// </summary>
            /// <param name="filename">File to read from</param>
            public ReverseLineReader(string filename)
                : this(filename, Encoding.UTF8)
            {
            }

            /// <summary>
            /// Creates a LineReader from a filename. The file is only opened
            /// (or even checked for existence) when the enumerator is fetched.
            /// </summary>
            /// <param name="filename">File to read from</param>
            /// <param name="encoding">Encoding to use to decode the file into text</param>
            public ReverseLineReader(string filename, Encoding encoding)
                : this(() => File.OpenRead(filename), encoding)
            {
            }

            /// <summary>
            /// Creates a LineReader from a stream source. The delegate is only
            /// called when the enumerator is fetched.
            /// </summary>
            /// <param name="streamSource">Data source</param>
            /// <param name="encoding">Encoding to use to decode the stream into text</param>
            public ReverseLineReader(Func<Stream> streamSource, Encoding encoding)
                : this(streamSource, encoding, DefaultBufferSize)
            {
            }

            internal ReverseLineReader(Func<Stream> streamSource, Encoding encoding, int bufferSize)
            {
                this.streamSource = streamSource;
                this.encoding = encoding;
                this.bufferSize = bufferSize;
                if (encoding.IsSingleByte)
                {
                    // For a single byte encoding, every byte is the start (and end) of a character
                    characterStartDetector = (pos, data) => true;
                }
                else if (encoding is UnicodeEncoding)
                {
                    // For UTF-16, even-numbered positions are the start of a character.
                    // TODO: This assumes no surrogate pairs. More work required
                    // to handle that.
                    characterStartDetector = (pos, data) => (pos & 1) == 0;
                }
                else if (encoding is UTF8Encoding)
                {
                    // For UTF-8, bytes with the top bit clear or the second bit set are the start of a character
                    // See http://www.cl.cam.ac.uk/~mgk25/unicode.html
                    characterStartDetector = (pos, data) => (data & 0x80) == 0 || (data & 0x40) != 0;
                }
                else
                {
                    throw new ArgumentException("Only single byte, UTF-8 and Unicode encodings are permitted");
                }
            }

            /// <summary>
            /// Returns the enumerator reading strings backwards. If this method discovers that
            /// the returned stream is either unreadable or unseekable, a NotSupportedException is thrown.
            /// </summary>
            public IEnumerator<string> GetEnumerator()
            {
                Stream stream = streamSource();
                if (!stream.CanSeek)
                {
                    stream.Dispose();
                    throw new NotSupportedException("Unable to seek within stream");
                }
                if (!stream.CanRead)
                {
                    stream.Dispose();
                    throw new NotSupportedException("Unable to read within stream");
                }
                return GetEnumeratorImpl(stream);
            }

            private IEnumerator<string> GetEnumeratorImpl(Stream stream)
            {
                try
                {
                    long position = stream.Length;

                    if (encoding is UnicodeEncoding && (position & 1) != 0)
                    {
                        throw new InvalidDataException("UTF-16 encoding provided, but stream has odd length.");
                    }

                    // Allow up to two bytes for data from the start of the previous
                    // read which didn't quite make it as full characters
                    byte[] buffer = new byte[bufferSize + 2];
                    char[] charBuffer = new char[encoding.GetMaxCharCount(buffer.Length)];
                    int leftOverData = 0;
                    String previousEnd = null;
                    // TextReader doesn't return an empty string if there's line break at the end
                    // of the data. Therefore we don't return an empty string if it's our *first*
                    // return.
                    bool firstYield = true;

                    // A line-feed at the start of the previous buffer means we need to swallow
                    // the carriage-return at the end of this buffer - hence this needs declaring
                    // way up here!
                    bool swallowCarriageReturn = false;

                    while (position > 0)
                    {
                        int bytesToRead = Math.Min(position > int.MaxValue ? bufferSize : (int)position, bufferSize);

                        position -= bytesToRead;
                        stream.Position = position;
                        StreamUtil.ReadExactly(stream, buffer, bytesToRead);
                        // If we haven't read a full buffer, but we had bytes left
                        // over from before, copy them to the end of the buffer
                        if (leftOverData > 0 && bytesToRead != bufferSize)
                        {
                            // Buffer.BlockCopy doesn't document its behaviour with respect
                            // to overlapping data: we *might* just have read 7 bytes instead of
                            // 8, and have two bytes to copy...
                            Array.Copy(buffer, bufferSize, buffer, bytesToRead, leftOverData);
                        }
                        // We've now *effectively* read this much data.
                        bytesToRead += leftOverData;

                        int firstCharPosition = 0;
                        while (!characterStartDetector(position + firstCharPosition, buffer[firstCharPosition]))
                        {
                            firstCharPosition++;
                            // Bad UTF-8 sequences could trigger this. For UTF-8 we should always
                            // see a valid character start in every 3 bytes, and if this is the start of the file
                            // so we've done a short read, we should have the character start
                            // somewhere in the usable buffer.
                            if (firstCharPosition == 3 || firstCharPosition == bytesToRead)
                            {
                                throw new InvalidDataException("Invalid UTF-8 data");
                            }
                        }
                        leftOverData = firstCharPosition;

                        int charsRead = encoding.GetChars(buffer, firstCharPosition, bytesToRead - firstCharPosition, charBuffer, 0);
                        int endExclusive = charsRead;

                        for (int i = charsRead - 1; i >= 0; i--)
                        {
                            char lookingAt = charBuffer[i];
                            if (swallowCarriageReturn)
                            {
                                swallowCarriageReturn = false;
                                if (lookingAt == '\r')
                                {
                                    endExclusive--;
                                    continue;
                                }
                            }
                            // Anything non-line-breaking, just keep looking backwards
                            if (lookingAt != '\n' && lookingAt != '\r')
                            {
                                continue;
                            }
                            // End of CRLF? Swallow the preceding CR
                            if (lookingAt == '\n')
                            {
                                swallowCarriageReturn = true;
                            }
                            int start = i + 1;
                            string bufferContents = new string(charBuffer, start, endExclusive - start);
                            endExclusive = i;
                            string stringToYield = previousEnd == null ? bufferContents : bufferContents + previousEnd;
                            if (!firstYield || stringToYield.Length != 0)
                            {
                                yield return stringToYield;
                            }
                            firstYield = false;
                            previousEnd = null;
                        }

                        previousEnd = endExclusive == 0 ? null : (new string(charBuffer, 0, endExclusive) + previousEnd);

                        // If we didn't decode the start of the array, put it at the end for next time
                        if (leftOverData != 0)
                        {
                            Buffer.BlockCopy(buffer, 0, buffer, bufferSize, leftOverData);
                        }
                    }
                    if (leftOverData != 0)
                    {
                        // At the start of the final buffer, we had the end of another character.
                        throw new InvalidDataException("Invalid UTF-8 data at start of stream");
                    }
                    if (firstYield && string.IsNullOrEmpty(previousEnd))
                    {
                        yield break;
                    }
                    yield return previousEnd ?? "";
                }
                finally
                {
                    stream.Dispose();
                }
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
        }
    }

    // StreamUtil.cs:
    public static class StreamUtil
    {
        public static void ReadExactly(Stream input, byte[] buffer, int bytesToRead)
        {
            int index = 0;
            while (index < bytesToRead)
            {
                int read = input.Read(buffer, index, bytesToRead - index);
                if (read == 0)
                {
                    // Pluralise "byte" unless exactly one byte is missing
                    throw new EndOfStreamException
                        (String.Format("End of stream reached with {0} byte{1} left to read.",
                                       bytesToRead - index,
                                       bytesToRead - index == 1 ? "" : "s"));
                }
                index += read;
            }
        }
    }

Feedback very welcome. This was fun :)

You can use File.ReadLines to get a line iterator:

    foreach (var line in File.ReadLines(@"C:\temp\ReverseRead.txt").Reverse())
    {
        if (noNeedToReadFurther)
            break;

        // process line here
        Console.WriteLine(line);
    }

EDIT:

After reading applejacks01's comment, I ran some tests, and it does look like .Reverse() actually loads the whole file.

I used File.ReadLines() to print the first line of a 40 MB file: the console app's memory usage was 5 MB. Then I used File.ReadLines().Reverse() to print the last line of the same file: memory usage was 95 MB.

Conclusion

Whatever Reverse() is doing, it is not a good choice for reading the bottom of a big file.
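One way to see the effect without measuring memory is to count how many items are pulled from a lazy sequence before the first result comes out. In the sketch below, FakeLines is a made-up stand-in for File.ReadLines, and the class name is hypothetical too.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReverseBufferingDemo
{
    // Number of items pulled from the source sequence so far.
    public static int Pulled;

    // Stand-in for File.ReadLines: lazily yields `count` fake lines.
    public static IEnumerable<string> FakeLines(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Pulled++;
            yield return "line " + i;
        }
    }

    // Returns how many source items were consumed to obtain just the first result.
    public static int PulledBeforeFirstLine(bool reverse)
    {
        Pulled = 0;
        IEnumerable<string> lines = FakeLines(1000);
        if (reverse)
        {
            lines = lines.Reverse();
        }
        foreach (string line in lines)
        {
            break; // stop as soon as the first line is available
        }
        return Pulled;
    }
}
```

Without Reverse() a single item is pulled. With Reverse() all 1000 are pulled before the first one is returned, because the last element of the source has to be known first. That is consistent with the memory numbers above.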

I read the file into a list line by line, then used List.Reverse():

    StreamReader objReader = new StreamReader(filename);
    string sLine = "";
    ArrayList arrText = new ArrayList();

    // Read every line into memory
    while (sLine != null)
    {
        sLine = objReader.ReadLine();
        if (sLine != null)
            arrText.Add(sLine);
    }
    objReader.Close();

    arrText.Reverse();

    foreach (string sOutput in arrText)
    {
        // ...
    }

To create a file iterator you can do this:

EDIT:

This is my corrected version of a reverse file reader for fixed-width encodings:

    public static IEnumerable<string> readFile()
    {
        using (FileStream reader = new FileStream(@"c:\test.txt", FileMode.Open, FileAccess.Read))
        {
            int i = 0;
            StringBuilder lineBuffer = new StringBuilder();
            int byteRead;
            while (-i < reader.Length)
            {
                // Step one byte further back from the end of the file
                reader.Seek(--i, SeekOrigin.End);
                byteRead = reader.ReadByte();
                // 10 is the line feed character
                if (byteRead == 10 && lineBuffer.Length > 0)
                {
                    yield return Reverse(lineBuffer.ToString());
                    lineBuffer.Remove(0, lineBuffer.Length);
                }
                lineBuffer.Append((char)byteRead);
            }
            yield return Reverse(lineBuffer.ToString());
            reader.Close();
        }
    }

    public static string Reverse(string str)
    {
        char[] arr = new char[str.Length];
        for (int i = 0; i < str.Length; i++)
            arr[i] = str[str.Length - 1 - i];
        return new string(arr);
    }

You can read the file backwards one character at a time and cache all the characters until you reach a carriage return and/or a line feed.

Then you reverse the collected string and yield it as a line.
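A minimal sketch of that idea, assuming a seekable stream in a single-byte encoding (ASCII here) whose lines end in "\n" or "\r\n". The class name SingleByteReverseReader is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SingleByteReverseReader
{
    // Assumes one byte per character and "\n" or "\r\n" line endings.
    public static IEnumerable<string> ReadLinesBackwards(Stream stream)
    {
        var reversed = new StringBuilder(); // characters of the current line, last one first
        long position = stream.Length;
        bool atEnd = true;                  // true until the final byte has been handled

        while (position > 0)
        {
            stream.Position = --position;
            int b = stream.ReadByte();

            if (b == '\n')
            {
                // A line break at the very end of the file does not add an empty line.
                if (!atEnd)
                {
                    yield return Flush(reversed);
                }
            }
            else if (b != '\r')
            {
                // '\r' is dropped, which assumes it only appears as part of "\r\n".
                reversed.Append((char)b);
            }
            atEnd = false;
        }

        // Whatever remains is the first line of the file.
        if (stream.Length > 0)
        {
            yield return Flush(reversed);
        }
    }

    // Reverses the collected characters into a normal string and clears the buffer.
    private static string Flush(StringBuilder reversed)
    {
        char[] chars = new char[reversed.Length];
        reversed.CopyTo(0, chars, 0, chars.Length);
        Array.Reverse(chars);
        reversed.Clear();
        return new string(chars);
    }
}
```

Seeking before every single byte is slow on large files, so a more practical version would read the file in blocks and scan each block backwards.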

I know this post is very old, but since I could not make use of the most-voted solution, I finally found this: here is the best answer I found with a low memory cost, in VB and C#.

http://www.blakepell.com/2010-11-29-backward-file-reader-vb-csharp-source

I hope this helps others, because it took me hours to finally find this post!

[Edit]

Here is the C# code:

    //*****************************************************************************
    //
    // Class: BackwardReader
    // Initial Date: 11/29/2010
    // Last Modified: 11/29/2010
    // Programmer(s): Original C# Source - the_real_herminator
    // http://social.msdn.microsoft.com/forums/en-US/csharpgeneral/thread/9acdde1a-03cd-4018-9f87-6e201d8f5d09
    // VB Conversion - Blake Pell
    //
    //*****************************************************************************
    using System.Text;
    using System.IO;

    public class BackwardReader
    {
        private string path;
        private FileStream fs = null;

        public BackwardReader(string path)
        {
            this.path = path;
            fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            fs.Seek(0, SeekOrigin.End);
        }

        public string Readline()
        {
            byte[] line;
            byte[] text = new byte[1];
            long position = 0;
            int count;

            fs.Seek(0, SeekOrigin.Current);
            position = fs.Position;

            // do we have a trailing \r\n?
            if (fs.Length > 1)
            {
                byte[] vagnretur = new byte[2];
                fs.Seek(-2, SeekOrigin.Current);
                fs.Read(vagnretur, 0, 2);
                if (ASCIIEncoding.ASCII.GetString(vagnretur).Equals("\r\n"))
                {
                    // move it back
                    fs.Seek(-2, SeekOrigin.Current);
                    position = fs.Position;
                }
            }

            while (fs.Position > 0)
            {
                text.Initialize();

                // read one char
                fs.Read(text, 0, 1);
                string asciiText = ASCIIEncoding.ASCII.GetString(text);

                // move back to the character before
                fs.Seek(-2, SeekOrigin.Current);

                if (asciiText.Equals("\n"))
                {
                    fs.Read(text, 0, 1);
                    asciiText = ASCIIEncoding.ASCII.GetString(text);
                    if (asciiText.Equals("\r"))
                    {
                        fs.Seek(1, SeekOrigin.Current);
                        break;
                    }
                }
            }

            count = (int)(position - fs.Position);
            line = new byte[count];
            fs.Read(line, 0, count);
            fs.Seek(-count, SeekOrigin.Current);
            return ASCIIEncoding.ASCII.GetString(line);
        }

        public bool SOF
        {
            get { return fs.Position == 0; }
        }

        public void Close()
        {
            fs.Close();
        }
    }

I wanted to do something similar. Here is my code. This class creates temporary files containing chunks of the big file, which avoids memory bloat. The user can specify whether the file should be reversed, in which case the content is returned in reverse order.

This class can also be used to write large amounts of data to a single file without memory bloat.

Feedback is welcome.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace BigFileService
    {
        public class BigFileDumper
        {
            /// <summary>
            /// Buffer that will store the lines until it is full.
            /// Then it will dump it to temp files.
            /// </summary>
            public int CHUNK_SIZE = 1000;
            public bool ReverseIt { get; set; }
            public long TotalLineCount { get { return totalLineCount; } }
            private long totalLineCount;
            private int BufferCount = 0;
            private StreamWriter Writer;

            /// <summary>
            /// List of files that would store the chunks.
            /// </summary>
            private List<string> LstTempFiles;
            private string ParentDirectory;
            private char[] trimchars = { '/', '\\' };

            public BigFileDumper(string FolderPathToWrite)
            {
                this.LstTempFiles = new List<string>();
                this.ParentDirectory = FolderPathToWrite.TrimEnd(trimchars) + "\\" + "BIG_FILE_DUMP";
                this.totalLineCount = 0;
                this.BufferCount = 0;
                this.Initialize();
            }

            private void Initialize()
            {
                // Delete existing directory.
                if (Directory.Exists(this.ParentDirectory))
                {
                    Directory.Delete(this.ParentDirectory, true);
                }

                // Create a new directory.
                Directory.CreateDirectory(this.ParentDirectory);
            }

            public void WriteLine(string line)
            {
                if (this.BufferCount == 0)
                {
                    string newFile = "DumpFile_" + LstTempFiles.Count();
                    LstTempFiles.Add(newFile);
                    Writer = new StreamWriter(this.ParentDirectory + "\\" + newFile);
                }

                // Keep on adding in the buffer as long as size is okay.
                if (this.BufferCount < this.CHUNK_SIZE)
                {
                    this.totalLineCount++; // main count
                    this.BufferCount++;    // Chunk count.
                    Writer.WriteLine(line);
                }
                else
                {
                    // Buffer is full, time to create a new file.
                    // Close the existing file first.
                    Writer.Close();
                    // Make buffer count 0 again.
                    this.BufferCount = 0;
                    this.WriteLine(line);
                }
            }

            public void Close()
            {
                if (Writer != null)
                    Writer.Close();
            }

            public string GetFullFile()
            {
                if (LstTempFiles.Count <= 0)
                {
                    Debug.Assert(false, "There are no files created.");
                    return "";
                }
                string returnFilename = this.ParentDirectory + "\\" + "FullFile";
                if (File.Exists(returnFilename) == false)
                {
                    // Create a consolidated file from the existing small dump files.
                    // Now this is interesting. We will open the small dump files one by one.
                    // Depending on whether the user requires an inverted file, we will read them
                    // in descending order and reversed, or in ascending order in the normal way.
                    if (this.ReverseIt)
                        this.LstTempFiles.Reverse();

                    foreach (var fileName in LstTempFiles)
                    {
                        string fullFileName = this.ParentDirectory + "\\" + fileName;
                        // FileLines will use small memory depending on size of CHUNK. User has control.
                        var fileLines = File.ReadAllLines(fullFileName);

                        // Time to write in the writer.
                        if (this.ReverseIt)
                            fileLines = fileLines.Reverse().ToArray();

                        // Write the lines
                        File.AppendAllLines(returnFilename, fileLines);
                    }
                }
                return returnFilename;
            }
        }
    }

This service can be used as follows:

    void TestBigFileDump_File(string BIG_FILE, string FOLDER_PATH_FOR_CHUNK_FILES)
    {
        // Start processing the input Big file.
        StreamReader reader = new StreamReader(BIG_FILE);

        // Create a dump file class object to handle efficient memory management.
        var bigFileDumper = new BigFileDumper(FOLDER_PATH_FOR_CHUNK_FILES);

        // Set to reverse the output file.
        bigFileDumper.ReverseIt = true;
        bigFileDumper.CHUNK_SIZE = 100; // How much at a time to keep in RAM before dumping to local file.

        while (reader.EndOfStream == false)
        {
            string line = reader.ReadLine();
            bigFileDumper.WriteLine(line);
        }
        bigFileDumper.Close();
        reader.Close();

        // Get back full reversed file.
        var reversedFilename = bigFileDumper.GetFullFile();
        Console.WriteLine("Check output file - " + reversedFilename);
    }

There are already good answers here; here is another LINQ-compatible class that focuses on performance and support for large files. It assumes a "\r\n" line terminator.

Usage:

    var reader = new ReverseTextReader(@"C:\Temp\ReverseTest.txt");
    string line;
    // ReadLine returns null once the start of the file has been reached
    while ((line = reader.ReadLine()) != null)
        Console.WriteLine(line);

ReverseTextReader class:

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    /// <summary>
    /// Reads a text file backwards, line-by-line.
    /// </summary>
    /// <remarks>This class uses file seeking to read a text file of any size in reverse order. This
    /// is useful for needs such as reading a log file newest-entries first.</remarks>
    public sealed class ReverseTextReader : IEnumerable<string>
    {
        private const int BufferSize = 16384;   // The number of bytes read from the underlying stream.
        private readonly Stream _stream;        // Stores the stream feeding data into this reader
        private readonly Encoding _encoding;    // Stores the encoding used to process the file
        private byte[] _leftoverBuffer;         // Stores the leftover partial line after processing a buffer
        private readonly Queue<string> _lines;  // Stores the lines parsed from the buffer

        #region Constructors

        /// <summary>
        /// Creates a reader for the specified file.
        /// </summary>
        public ReverseTextReader(string filePath)
            : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), Encoding.Default)
        { }

        /// <summary>
        /// Creates a reader using the specified stream.
        /// </summary>
        public ReverseTextReader(Stream stream)
            : this(stream, Encoding.Default)
        { }

        /// <summary>
        /// Creates a reader using the specified path and encoding.
        /// </summary>
        public ReverseTextReader(string filePath, Encoding encoding)
            : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), encoding)
        { }

        /// <summary>
        /// Creates a reader using the specified stream and encoding.
        /// </summary>
        public ReverseTextReader(Stream stream, Encoding encoding)
        {
            _stream = stream;
            _encoding = encoding;
            _lines = new Queue<string>(128);

            // The stream needs to support seeking for this to work
            if (!_stream.CanSeek)
                throw new InvalidOperationException("The specified stream needs to support seeking to be read backwards.");
            if (!_stream.CanRead)
                throw new InvalidOperationException("The specified stream needs to support reading to be read backwards.");

            // Set the current position to the end of the file
            _stream.Position = _stream.Length;
            _leftoverBuffer = new byte[0];
        }

        #endregion

        #region Overrides

        /// <summary>
        /// Reads the next previous line from the underlying stream.
        /// </summary>
        public string ReadLine()
        {
            // Are there lines left to read? If so, return the next one
            if (_lines.Count != 0) return _lines.Dequeue();

            // Are we at the beginning of the stream? If so, we're done
            if (_stream.Position == 0) return null;

            #region Read and Process the Next Chunk

            // Remember the current position
            var currentPosition = _stream.Position;
            var newPosition = currentPosition - BufferSize;

            // Are we before the beginning of the stream?
            if (newPosition < 0) newPosition = 0;

            // Calculate the buffer size to read
            var count = (int)(currentPosition - newPosition);

            // Set the new position
            _stream.Position = newPosition;

            // Make a new buffer but append the previous leftovers
            var buffer = new byte[count + _leftoverBuffer.Length];

            // Read the next buffer
            _stream.Read(buffer, 0, count);

            // Move the position of the stream back
            _stream.Position = newPosition;

            // And copy in the leftovers from the last buffer
            if (_leftoverBuffer.Length != 0)
                Array.Copy(_leftoverBuffer, 0, buffer, count, _leftoverBuffer.Length);

            // Look for CrLf delimiters
            var end = buffer.Length - 1;
            var start = buffer.Length - 2;

            // Search backwards for a line feed
            while (start >= 0)
            {
                // Is it a line feed?
                if (buffer[start] == 10)
                {
                    // Yes. Extract a line and queue it (but exclude the \r\n)
                    _lines.Enqueue(_encoding.GetString(buffer, start + 1, end - start - 2));

                    // And reset the end
                    end = start;
                }

                // Move to the previous character
                start--;
            }

            // What's left over is a portion of a line. Save it for later.
            _leftoverBuffer = new byte[end + 1];
            Array.Copy(buffer, 0, _leftoverBuffer, 0, end + 1);

            // Are we at the beginning of the stream?
            if (_stream.Position == 0)
                // Yes. Add the last line.
                _lines.Enqueue(_encoding.GetString(_leftoverBuffer, 0, end - 1));

            #endregion

            // If we have something in the queue, return it
            return _lines.Count == 0 ? null : _lines.Dequeue();
        }

        #endregion

        #region IEnumerator Interface

        public IEnumerator<string> GetEnumerator()
        {
            string line;

            // So long as the next line isn't null...
            while ((line = ReadLine()) != null)
                // Read and return it.
                yield return line;
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            // Delegate to the generic enumerator
            return GetEnumerator();
        }

        #endregion
    }

In case anyone else comes across this, I solved it with the following PowerShell script, which can easily be converted to C# with a little effort.

    [System.IO.FileStream]$fileStream = [System.IO.File]::Open("C:\Name_of_very_large_file.log", [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
    [System.IO.BufferedStream]$bs = New-Object System.IO.BufferedStream $fileStream;
    [System.IO.StreamReader]$sr = New-Object System.IO.StreamReader $bs;

    $buff = New-Object char[] 20;
    $seek = $bs.Seek($fileStream.Length - 10000, [System.IO.SeekOrigin]::Begin);

    while(($line = $sr.ReadLine()) -ne $null)
    {
        $line;
    }

Basically, this starts reading 10,000 bytes before the end of the file and outputs each line from there.
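A possible C# version of the same approach is sketched below. The names TailReader and ReadTailLines are invented for this example. Unlike the script, it clamps the seek offset so files shorter than the requested tail are read from the start, and it relies on StreamReader's default UTF-8 decoding.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TailReader
{
    // Reads forward, line by line, starting about `tailBytes` before the end of the file.
    public static List<string> ReadTailLines(string path, long tailBytes)
    {
        var lines = new List<string>();
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            // Clamp to 0 so short files are simply read from the beginning.
            long start = Math.Max(0L, fs.Length - tailBytes);
            fs.Seek(start, SeekOrigin.Begin);

            using (var reader = new StreamReader(fs))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }
        }
        return lines;
    }
}
```

The lines come back in forward order, and the first one is usually a fragment because the seek can land in the middle of a line. With a variable-width encoding it can even land in the middle of a character.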