February 23, 2009 12:00 AM

Divide and Conquer Mega-Sized Text and Log Files

Have split files rather than a splitting headache
Windows IT Pro
InstantDoc ID #101218
Downloads
101218.zip

Several times a year, various department heads give me text files and ask me to perform data analyses and create summary reports. Often these files are massive directory dumps or application data dumps that can be as large as 700MB and contain more than 9.5 million lines of text. I'm also occasionally asked to extract contents from extremely large text-based cluster logs, Web logs, and event logs for technicians who need to send log samples to our security department or to analysts to diagnose problems. In addition, there are times when I need to look at the data in an enormous text file so that I know how to work with its contents in a script.

Sometimes these mega-sized text and log files are too large for Notepad to open. Other times, Notepad is sluggish when I try to scroll through their contents. Having smaller files not only makes it easier to do data analyses but also dramatically speeds up code development and testing.

After many months of working around this problem using a mixed bag of tactics, such as exporting data to Microsoft Access or trying to open the files with some other application, I finally decided to write the Log Splitter utility. This HTML Application (HTA) splits large text files into smaller files that I can easily open with Notepad and easily work with when writing a script.

The Log Splitter utility offers simple but adequate functionality. After you download the utility (click the Download the Code Here button at the top of the page) and copy it to your computer, double-click it. In the UI (see Figure 1), enter the pathname of the text or log file you want to split or use the Browse button to locate it.

You can split up the file by the number of lines or the number of files. To find out how many lines are in the file, click the GetLineCount button. Knowing the total number of lines can help you decide whether to split the file by line count or by number of files.
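To illustrate what a line count involves, here is a minimal Python sketch. It is not the utility's code (the utility is an HTA), and the function name count_lines and the 1MB chunk size are assumptions chosen for the example. It reads the file in fixed-size binary chunks, so even a file of several hundred megabytes is never loaded into memory all at once.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count the lines in a file by scanning binary chunks for newlines."""
    lines = 0
    last_byte = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            lines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line that lacks a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        lines += 1
    return lines
```

Calling count_lines on the large file returns a total that you can compare against the line count or number of files you plan to enter.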

If you want to split the large file into a specific number of smaller files, select the Split into number of files option, then specify that number in the Enter Line Count or Number of Files field. If you want to split the large file into smaller files that contain a certain number of lines, select the Split by Line count option, then specify the maximum number of lines you want in the smaller files. You must enter a value of 100,000 or higher. I found that lower values tend to produce too many files, particularly if you're splitting a file that's several hundred megabytes.
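The two options are related by simple arithmetic, which the following Python sketch works through. The function names are hypothetical, and rounding up so that no lines are left over is an assumption, because the article does not say how the utility distributes lines. The 100,000-line minimum comes from the text above.

```python
MIN_LINES_PER_FILE = 100_000  # minimum value the article says you must enter


def lines_per_file(total_lines, number_of_files):
    """Lines each piece needs to hold when splitting into a set number of files."""
    if number_of_files < 1:
        raise ValueError("number_of_files must be at least 1")
    # Integer ceiling division: round up so every line lands in some piece.
    return (total_lines + number_of_files - 1) // number_of_files


def files_needed(total_lines, max_lines):
    """Number of pieces produced when splitting by a maximum line count."""
    if max_lines < MIN_LINES_PER_FILE:
        raise ValueError("enter a line count of 100,000 or higher")
    return (total_lines + max_lines - 1) // max_lines


print(lines_per_file(9_500_000, 5))      # 1900000
print(files_needed(9_500_000, 100_000))  # 95
```

For a 9.5-million-line file, asking for five files gives pieces of 1.9 million lines each, whereas the minimum line count of 100,000 already produces 95 files. That shows why smaller line counts quickly lead to an unwieldy number of files.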

All that's left to do is to click the RunScript button and click OK. Before the utility starts splitting the large file, it checks for possible problems. If it finds a problem, it displays a message. For example, the utility checks to see whether the specified file is a text or log file. If you try to split another type of file, you'll receive the message "This script only works with '.log' and '.txt' files."
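A pre-flight check of this kind might look like the Python sketch below. Only the extension check and its message come from the article. The function name find_problem, the case-insensitive comparison, and the missing-file check are assumptions added for illustration.

```python
import os

ALLOWED_EXTENSIONS = (".log", ".txt")


def find_problem(path):
    """Return a message for the first problem found, or None if all is well."""
    # Assumption: extensions are compared without regard to case.
    extension = os.path.splitext(path)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "This script only works with '.log' and '.txt' files."
    # Assumption: the article does not list the other checks; a missing
    # file is one plausible example.
    if not os.path.isfile(path):
        return f"File not found: {path}"
    return None
```

Returning a message instead of raising an error lets the caller show every problem the same way, much as the utility displays a message before any splitting starts.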

After splitting the large file, the utility saves and names each smaller file. For example, if you're splitting C:\data\massive.log into three smaller files, the smaller files will be named C:\data\massive~1.log, C:\data\massive~2.log, and C:\data\massive~3.log. If these smaller files already exist, they'll be overwritten.
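The naming scheme and the overwrite behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the utility's implementation: the names part_path and split_by_line_count are hypothetical, and the sketch covers only the split-by-line-count case.

```python
from itertools import islice
from pathlib import Path, PureWindowsPath


def part_path(source, index):
    """Name the index-th piece, e.g. massive.log becomes massive~1.log."""
    return source.with_name(f"{source.stem}~{index}{source.suffix}")


def split_by_line_count(source, max_lines):
    """Write pieces of at most max_lines lines each next to the source file."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    source = Path(source)
    written = []
    with open(source, "rb") as reader:
        while True:
            first = reader.readline()
            if not first:
                break  # nothing left to copy
            target = part_path(source, len(written) + 1)
            # Mode "wb" replaces an existing piece of the same name.
            with open(target, "wb") as writer:
                writer.write(first)
                writer.writelines(islice(reader, max_lines - 1))
            written.append(target)
    return written


print(part_path(PureWindowsPath(r"C:\data\massive.log"), 2))  # C:\data\massive~2.log
```

Lines are copied one at a time in binary mode, so the pieces keep the original line endings and the whole file never has to fit in memory.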

Depending on the size of the file you're splitting, the process could take a long time to finish (e.g., about five minutes to split a 100MB file into five files), so the utility's UI is hidden while the process runs. The UI reappears when the process completes.

When running the Log Splitter utility, you might get the following Microsoft Internet Explorer (IE) message: "A script on this page is causing Internet Explorer to run slowly. If it continues to run, your computer may become unresponsive. Do you want to abort the script?" If so, abort the script and see the Microsoft article "How to set time-out period for script." This article tells you how to add a new registry entry named MaxScriptStatements to alleviate the problem. It's a relatively simple modification, but as with all registry changes, you need to use extreme caution. After I received this message, I set the MaxScriptStatements value to 100000000 (100 million). That value works well for me, but you could try a smaller value and see how it works on your computer.

 




Comments
  • Karen
    Mar 16, 2009

    Hi x16wda,

    The information about the free text editors is helpful. Thanks! If you (or anyone reading this) would like to share information about the free tools you like to use, you can send me a short description of what the tool does, where to download it, and how to use its main features. We feature a "Tool Time" column in the "Reader to Reader" section of Windows IT Pro for such recommendations. If your write-up is selected for publication, it would get printed in the "Tool Time" column and you'd get $100.

    For an example of a "Tool Time" write-up, see "Tool Time: Test Connectivity to Remote Email Servers with TestMX" (http://windowsitpro.com/Windows/article/articleid/100732/100732.html) or "Tool Time: Copy Many Pathnames at Once With Path Copy" (http://windowsitpro.com/Windows/article/articleid/100962/100962.html).

    You can email the "Tool Time" write-up to kbemowski@windowsitpro.com or r2r@windowsitpro.com.

    Sincerely,

    Karen Bemowski, senior editor
    Windows IT Pro, SQL Server Magazine

  • Bill
    Mar 10, 2009

    Notepad is an awful application to use once the file size gets up to several megabytes. There are many free Notepad replacements out there, but most of them also fail miserably when confronted with a hundred-megabyte file.

    Fortunately, there are several free text editors that aren't fazed by big files. When I started needing to handle these beasts, I did extensive research and testing on the available freebies. I primarily wanted fast large-file handling capability, but I also looked for several extras: line numbers, multiple document handling, column mode, etc.

    My current favorites are SciTE (http://www.scintilla.org/SciTE.html), ConTEXT (http://www.contexteditor.org/), and Crimson Editor (http://www.crimsoneditor.com/). They're all capable of handling hundred-megabyte files without choking, and they have lots of features built in. Crimson Editor also includes a column mode, which is a lifesaver in some situations. I would also be remiss if I didn't mention UltraEdit (http://www.ultraedit.com/), a commercial offering that has almost every feature you'd want.

    Coming from a Big Iron/ISPF background, though, I have to say that almost none of today's text editors can do what IBM's ISPF Edit could. If you are familiar with ISPF Edit and Rexx edit macros, you'll understand what I'm saying - the power of that interface can make short work of complicated editing tasks. There have been several DOS-type clones of this environment, but they've been limited to an 80 by 24 screen and aren't available any more. Now here's the good news - Mizumaki-machi (sakachin2@yahoo.co.jp) has produced an ISPF Edit clone that takes advantage of Windows features like resizable screens! The program is called Hybrid Editor XE and it is FREE. There are two web sites for this program, http://hp.vector.co.jp/authors/VA010562 and http://www.geocities.jp/sakachin2/index.htm. You've got to try this out, just use the help since some commands have changed slightly from ISPF use.
