Performance Issue on recursive method parse(self, el) #60

@robomotic

Hello all,
first of all, I want to thank you for developing such a nice and useful library, which I have been using for fun and profit recently.
I am not yet familiar with the OpenXML specification or the code, but I noticed a massive performance issue when parsing large documents: I am talking about a typical document.xml of about 2 MB containing around 5850 tags in total.
Even after cutting the tag detection down to:

def parse(self, el):

    print "Visiting ",el.tag
    ## keep track of the nodes we have visited;
    ## if this one was visited already, terminate the recursion!
    if el in self.visited:
        # we should remove it from the tree as well
        print "Already visited"
        return ''
    ## otherwise record the element
    self.visited.append(el)
    self.number_of_tags+=1
    ## reset the parsed content
    parsed = ''

    ## walk through every child, down to the bottom
    for child in el:
        print "Opened child ",child.tag
        parsed += self.parse(child)
        print "Parsed child %s and result %s" % (child.tag,parsed)
    return ''

The task takes about 2-3 minutes on my AMD A6 machine with 8 GB of RAM.
This is even after switching to lxml's faster implementation of the ElementTree API.
By comparison, the faster iterparse fetches all of the tags in about 2 seconds.
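
For reference, a minimal sketch of the kind of iterparse pass meant here, written for Python 3. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; `lxml.etree` offers an `iterparse` with a similar interface. The `count_tags` helper and the tiny sample document are illustrative assumptions, not code from the library.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree offers a similar iterparse


def count_tags(source):
    """Count elements in an XML stream (hypothetical helper)."""
    count = 0
    # An "end" event fires once per element, after its children were seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        count += 1
        elem.clear()  # release children that were already processed
    return count


# Tiny made-up document standing in for a real document.xml.
sample = io.BytesIO(b"<doc><p><r><t>Hello</t></r></p><p/></doc>")
print(count_tags(sample))  # prints 5
```

Timing such a pass against the recursive `parse` on the same file would show how much of the 2-3 minutes is traversal overhead rather than XML decoding.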

Is anybody working on an iterative version of the parse method, which currently relies on a not very efficient recursive approach?
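
To make the question concrete, here is a rough sketch of what an iterative, depth-first replacement could look like, written for Python 3. The function name `iter_parse_text` is hypothetical and it only collects element text; the real `parse` method does more per tag. A tree traversal from the root reaches each element exactly once, so the sketch keeps no visited list, and it joins the text once at the end instead of concatenating strings inside the loop.

```python
import xml.etree.ElementTree as ET


def iter_parse_text(root):
    """Hypothetical iterative stand-in for a recursive parse(el)."""
    pieces = []     # collected text, joined once at the end
    stack = [root]  # explicit stack replaces the call stack
    visited = 0
    while stack:
        el = stack.pop()
        visited += 1
        if el.text:
            pieces.append(el.text)
        # Push children reversed so they are popped in document order.
        stack.extend(reversed(list(el)))
    return "".join(pieces), visited


# Made-up sample standing in for a real document.xml.
root = ET.fromstring("<doc><p><t>a</t><t>b</t></p><p><t>c</t></p></doc>")
text, visited = iter_parse_text(root)
print(text, visited)  # prints: abc 6
```

Swapping the stack for a `collections.deque` and popping from the left would give the breadth-first variant, although that no longer yields text in document order.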

I wrote some BFS and DFS code, but I would like to know if I am wasting my time.
I can also share the stress document used and the Python profiler output if required.
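
For anyone who wants to reproduce the measurement, a profiler run of this kind can be captured roughly as follows (Python 3 sketch). `recursive_count` is a hypothetical stand-in for the recursive `parse` method, `profile_call` is an illustrative helper, and the generated document is made up; only the `cProfile` and `pstats` calls are standard library APIs.

```python
import cProfile
import io
import pstats
import xml.etree.ElementTree as ET


def recursive_count(el):
    """Hypothetical stand-in for the recursive parse(el) being measured."""
    return 1 + sum(recursive_count(child) for child in el)


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


# Made-up document: 200 paragraphs of three nested elements each.
root = ET.fromstring("<doc>" + "<p><r><t>x</t></r></p>" * 200 + "</doc>")
total, report = profile_call(recursive_count, root)
print(total)   # 601 elements: the root plus 200 * 3
print(report)
```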

I also have some other approaches to propose for the DocxParser class, but I will write about them in a separate topic!
