Processing text with tds.Hash is very slow #26

@tastyminerals

Description

I have created a few scripts to preprocess a text corpus (~6 MB). In order to preserve the text formatting I need to iterate over each line and do some text manipulation on it. This in turn produces PANIC: unprotected error in call to Lua API (not enough memory), so I decided to try tds.Hash to hold my corpus table.

Here is the code I am using:

  text_arr = tokenize(text)
  text_arr = tds.Hash(text_arr)
  -- replace rare tokens with <unk>
  -- text_arr is a {idx: {tokens arr}}
  for l=1,#text_arr do -- iterate over lines
    for t=1,#text_arr[l] do -- iterate over tokens
      -- rare is an array of rare words
      for r=1,#rare do
        if text_arr[l][t] == rare[r] then text_arr[l][t] = "<unk>" end
      end
    end
  end

text_arr is a table of size 2900, and this triple-loop operation becomes really slow when using tds.Hash.
I am by no means a Lua expert, but am I doing something wrong?
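
For reference, this is the kind of rewrite I was thinking about: build a set from rare once, so each token is checked with a single hash lookup instead of a scan over the whole rare array, and pull text_arr[l] into a local so the tds hash is only indexed once per line. That would turn the per-token cost from O(#rare) into O(1). This is just a sketch; it assumes rare is a plain Lua array of strings and that the nested line entries are still mutable by reference after the tds.Hash conversion.

  local rare_set = {}
  for r=1,#rare do
    rare_set[rare[r]] = true -- mark each rare word for O(1) membership tests
  end

  for l=1,#text_arr do
    local line = text_arr[l] -- index the tds.Hash once per line
    for t=1,#line do
      if rare_set[line[t]] then line[t] = "<unk>" end
    end
  end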
