
Re: [PHP] Counting File Lines in XMLReader with a Large File





> Date: Wednesday, August 30, 2017 11:18:51 -0600
> From: Alan Feuerbacher <alanf00@xxxxxxxxxxx>
>
> Hi,
> 
> As a new PHP user, I've recently completed a PHP program that
> extracts a bunch of data from a relatively unstructured XML file.
> The file has roughly 500,000 lines and I have no control over its
> generation.
> 
> The file generally has one XML tag like <foo> per line, but
> sometimes lines are more complicated.
> 
> After a lot of reading and experimenting, I found that XMLReader
> was the tool for getting the data.
> 
> As part of my debugging process, I used the call $LineNumber =
> $reader->expand()->getLineNo(); (after doing $reader->open(
> "InputFileName" ); ) to get the file line number that the XMLReader
> cursor was pointing to. Eventually I found that files larger than
> about 65535 lines returned wrong line numbers. Again after some
> online searching, I found a discussion from about 2006 between a
> PHP user and a developer that pretty much explained what was going
> on: the XMLReader program uses a 16-bit integer to count file line
> numbers, which of course is limited to 65535. The developer said he
> would not fix this, for various reasons.
> 
> I ended up splitting the original XML file into smaller pieces
> under 65535 lines each, and concatenating the results.
> 
> It appears that this line numbering issue remains today. Are there
> any plans to make file line numbering work with larger files?
> 
> One of the PHP developer's points was that XML does not necessarily
> include Newlines that would result in file lines, but that all
> content could be in one giant string. True in principle, but not in
> practice where human readers are involved. I know that I would have
> been hard put to debug my PHP code without being able to correlate
> file lines with XMLReader cursor positions.
> 
> Comments?
> 
> Alan

I don't think it is ever a particularly good idea to try to read in
the whole of some arbitrarily sized file over which you have no
control. If this specific issue doesn't get you, something else
will, e.g., machine memory constraints.

So instead, write your program to read in a specific number of
bytes/characters (or if appropriate, lines) that you know are safe
for your environment and tools, process those, and go on to the next
chunk.
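
As a rough illustration of that approach, here is a minimal sketch,
not a tested solution for your file. The function name
readLineChunks and the default chunk size of 50000 lines are my own
assumptions, chosen only to stay under the 65535-line limit you
described. It keeps the running line count in an ordinary PHP
integer, so the count does not wrap. Note that a chunk of lines is
generally not a well-formed XML document on its own, so this suits
line-oriented processing, or deciding where to split the file,
rather than feeding each chunk straight to XMLReader.

```php
<?php
// Hypothetical helper (not part of PHP or XMLReader): yields batches of
// lines from a file together with the absolute number of the first line
// in each batch.
function readLineChunks(string $path, int $linesPerChunk = 50000): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        $lineNo = 0;     // lines read so far, kept in a normal PHP int
        $firstLine = 1;  // absolute line number of the first line in $chunk
        $chunk = [];
        while (($line = fgets($handle)) !== false) {
            $lineNo++;
            $chunk[] = $line;
            if (count($chunk) === $linesPerChunk) {
                yield [$firstLine, $chunk];
                $firstLine = $lineNo + 1;
                $chunk = [];
            }
        }
        if ($chunk !== []) {
            yield [$firstLine, $chunk];  // final, possibly shorter, batch
        }
    } finally {
        fclose($handle);
    }
}

// Usage sketch: the absolute line number of $lines[$i] is $firstLine + $i.
// foreach (readLineChunks('InputFileName') as [$firstLine, $lines]) {
//     // process $lines here
// }
```

Only one chunk is held in memory at a time, so the memory needed
depends on the chunk size you pick rather than on the size of the
whole file.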



-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php