Hello all,
First of all, I want to thank you for developing such a nice and useful library, which I have been using for fun and profit recently.
I am not yet familiar with the OpenXML specification or the code base, but I noticed a significant performance issue when parsing large documents. I am talking about a typical document.xml of about 2 MB containing around 5850 tags in total.
Even after stripping the tag handling down to a bare traversal:
def parse(self, el):
    print "Visiting ", el.tag
    ## what nodes we have visited
    ## if it was visited already, terminate the recursion!
    if el in self.visited:
        # we should remove it from the tree as well
        print "Already visited"
        return ''
    ## otherwise append the element
    self.visited.append(el)
    self.number_of_tags += 1
    ## flush the parsed content
    parsed = ''
    ## navigate through every child down to the bottom
    for child in el:
        print "Opened child ", child.tag
        parsed += self.parse(child)
        print "Parsed child %s and result %s" % (child.tag, parsed)
    return ''
the task still takes about 2-3 minutes on my AMD A6 machine with 8 GB of RAM, even after switching to lxml's faster implementation of the ElementTree API.
By contrast, lxml's iterparse visits all of the tags in about 2 seconds.
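For reference, this is roughly the iterparse approach I timed; a minimal sketch, assuming the stream-counting use case (the stdlib `xml.etree.ElementTree` exposes the same `iterparse` API, with lxml as a faster drop-in replacement):

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same API, faster

def count_tags(source):
    """Count every element in an XML stream without building the whole tree."""
    count = 0
    # iterparse yields (event, element) pairs as the file is read;
    # "end" fires exactly once per element, after its subtree is parsed.
    for _event, el in ET.iterparse(source, events=("end",)):
        count += 1
        el.clear()  # drop already-processed children to keep memory flat
    return count
```

Calling `count_tags("document.xml")` on the stress document is what takes roughly 2 seconds on my machine.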
Is anybody working on an iterative version of the parse method, which currently relies on a not-very-efficient recursive approach?
I wrote some BFS and DFS code, but I would like to know whether I am wasting my time.
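The DFS variant I have in mind looks roughly like this (a sketch with illustrative names, not the library's API). Note also that `el in self.visited` on a plain list is a linear scan, so the recursive version above does O(n²) comparisons for n tags; a set of element ids keeps that check O(1):

```python
def parse_iterative(root):
    """Depth-first traversal with an explicit stack instead of recursion."""
    visited = set()      # ids of elements already handled; set gives O(1) lookup
    parsed_parts = []
    stack = [root]
    while stack:
        el = stack.pop()
        if id(el) in visited:
            continue
        visited.add(id(el))
        parsed_parts.append(el.tag)  # placeholder for the real per-tag handling
        # push children in reverse so they pop in document order
        stack.extend(reversed(list(el)))
    return parsed_parts
```

This visits the elements in the same order as the recursive version but without the recursion-depth limit and without the quadratic visited-list scan.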
I can also share the stress-test document and the Python profiler output if required.
I also have some other approaches to propose for the DocxParser class, but I will write those up in a separate issue!