Hello all,
First of all, I want to thank you for developing such a nice and useful library, which I have been using for fun and profit recently.
I am not yet familiar with the OpenXML specification or the code base, but I noticed a significant performance issue when parsing large documents. I am talking about a typical document.xml of about 2 MB containing around 5850 tags in total.
Even after stripping the tag handling down to a bare traversal:
def parse(self, el):
    print "Visiting ", el.tag
    ## what nodes we have visited
    ## if it was visited already, terminate the recursion!
    if el in self.visited:
        # we should remove it from the tree as well
        print "Already visited"
        return ''
    ## otherwise append the element
    self.visited.append(el)
    self.number_of_tags += 1
    ## flush the parsed content
    parsed = ''
    ## navigate through every child down to the bottom
    for child in el:
        print "Opened child ", child.tag
        parsed += self.parse(child)
        print "Parsed child %s and result %s" % (child.tag, parsed)
    return ''
the task still takes about 2-3 minutes on my AMD A6 machine with 8 GB of RAM, even after switching to lxml's faster implementation of the ElementTree API.
By contrast, lxml's iterparse visits all of the tags in about 2 seconds.
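For reference, this is roughly the iterparse approach I timed; a minimal sketch, assuming the stream-counting use case (the stdlib `xml.etree.ElementTree` exposes the same `iterparse` API, with lxml as a faster drop-in replacement):

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same API, faster

def count_tags(source):
    """Count every element in an XML stream without building the whole tree."""
    count = 0
    # iterparse yields (event, element) pairs as the file is read;
    # "end" fires exactly once per element, after its subtree is parsed.
    for _event, el in ET.iterparse(source, events=("end",)):
        count += 1
        el.clear()  # drop already-processed children to keep memory flat
    return count
```

Calling `count_tags("document.xml")` on the stress document is what takes roughly 2 seconds on my machine.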
Is anybody working on an iterative version of the parse method, which currently relies on a not-very-efficient recursive approach?
I wrote some BFS and DFS code, but I would like to know whether I am wasting my time.
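The DFS variant I have in mind looks roughly like this (a sketch with illustrative names, not the library's API). Note also that `el in self.visited` on a plain list is a linear scan, so the recursive version above does O(n²) comparisons for n tags; a set of element ids keeps that check O(1):

```python
def parse_iterative(root):
    """Depth-first traversal with an explicit stack instead of recursion."""
    visited = set()      # ids of elements already handled; set gives O(1) lookup
    parsed_parts = []
    stack = [root]
    while stack:
        el = stack.pop()
        if id(el) in visited:
            continue
        visited.add(id(el))
        parsed_parts.append(el.tag)  # placeholder for the real per-tag handling
        # push children in reverse so they pop in document order
        stack.extend(reversed(list(el)))
    return parsed_parts
```

This visits the elements in the same order as the recursive version but without the recursion-depth limit and without the quadratic visited-list scan.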
I can also share the stress-test document and the Python profiler output if required.
I also have some other approaches to propose for the DocxParser class, but I will write those up in a separate issue!