Removed fuzzywuzzy dependency #497
There is a typo in line 618.
@Agent-Pandit Could you add a few basic test cases (checking that the algorithm gives a valid output without errors) for GA? I'm not sure my implementation does what fuzz.ratio() does.
I took a look at the code, and I think I'm missing something. The fitness function in the genetic algorithm measures how good an individual is. The fuzzywuzzy function does not do that. From my understanding, fuzzywuzzy (what a beautiful name) calculates some sort of Edit/Levenshtein distance. It is a similarity function and does not indicate which individual is better, since we don't know the solution to compare it to. Also, I have to say that the implementation of the genetic algorithm we have right now does not follow the pseudocode. It doesn't follow the same structure, nor does it work the same way. For one, the fitness function is given as input and is not a set method, like in the implementation. Personally I feel the implementation needs to be rewritten, or even have #480 mostly reverted, since it veers far away from the pseudocode in the book. @Agent-Pandit, is there something I'm missing? Do you have examples of usage or tests we can take a look at to see the implementation in use?
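To make the point about the fitness function being an input concrete, here is a minimal sketch. It is not the repository's code and not the book's pseudocode verbatim. The names `genetic_algorithm`, `fitness_fn`, `gene_pool`, `generations`, `mutation_rate` and `rng` are assumptions chosen for illustration, and the sketch assumes individuals are lists of at least two genes with non-negative fitness.

```python
import random


def genetic_algorithm(population, fitness_fn, gene_pool, generations=200,
                      mutation_rate=0.1, rng=None):
    """Hypothetical sketch: evolve `population` (a list of lists) and return
    the fittest individual of the final generation.

    `fitness_fn` is supplied by the caller, so the loop itself knows nothing
    about the problem being solved.
    """
    rng = rng or random.Random()
    for _ in range(generations):
        # Small offset so the selection weights are never all zero.
        weights = [fitness_fn(ind) + 1e-9 for ind in population]
        next_gen = []
        for _ in range(len(population)):
            # Fitness-proportional selection of two parents.
            x, y = rng.choices(population, weights=weights, k=2)
            # Single-point crossover; slicing builds a new list.
            cut = rng.randrange(1, len(x))
            child = x[:cut] + y[cut:]
            # Occasionally replace one gene with a random one.
            if rng.random() < mutation_rate:
                child[rng.randrange(len(child))] = rng.choice(gene_pool)
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness_fn)
```

With this shape, switching problems means passing a different `fitness_fn` and `gene_pool`; the loop stays untouched.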
Errors and GA are very closely related. Throughout the process (in GA) we try to minimise the error (or maximise the fitness). There may be cases in which one does not get the exact solution but only a nearly correct one (which might also differ each time). Thus, I believe adding a test case may not be very useful. As far as removing the dependency on fuzzywuzzy is concerned, I believe it is a good idea. But I have some concerns… After correcting the typo this works well, but it is not a very general approach and might fail in some cases. The particular approach (used in this PR) works fine as far as the implementation of GA in search.py is concerned, because there the lengths of the strings are the same. This might seem to produce the correct result (as expected from GA), but it is not the correct way to do it. (This is again a personal point of view.)
@Chipe1 If you address the issue it will be great. Till then I believe that using an external module is not that bad an idea. It makes the approach more general and easier to understand.
@Agent-Pandit: I'm not saying errors and genetic algorithms aren't connected, I'm just saying your implementation only works for this one case, and in general GA does not work the way your implementation does. In GAs, we don't know the solution and we want to approximate it to an adequate score (or get it spot on). Your implementation takes the solution as input ( Take for example graph coloring. We don't know the solution beforehand. We do know though how a solution should look. Namely, edges should not connect nodes of the same color. So, for a fitness function we can count how many acceptable edges an individual has. The more the merrier. We keep working this way until we find the solution, or a very good approximation. Your implementation, unfortunately, cannot solve problems other than string matching (and for those, you already know the solution/target). It is not bad as an introduction to how the algorithm works, but it is just a toy example. Later tonight I will try and get a complete implementation with example running, so I can better showcase what I'm talking about.
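The graph-coloring fitness described above could be sketched roughly as follows. The function name `coloring_fitness`, the dict-based coloring and the example graph are assumptions for illustration only, not code from this PR or from #501.

```python
def coloring_fitness(coloring, edges):
    """Count the edges whose endpoints have different colors.

    `coloring` maps each node to a color; `edges` is a list of node pairs.
    No target solution is needed: a higher count means a better individual.
    """
    return sum(1 for u, v in edges if coloring[u] != coloring[v])


# Hypothetical four-node graph.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
good = {"A": "red", "B": "green", "C": "blue", "D": "red"}
bad = {"A": "red", "B": "red", "C": "blue", "D": "blue"}

print(coloring_fitness(good, edges))  # 4, every edge is acceptable
print(coloring_fitness(bad, edges))   # 2, A-B and C-D clash
```

An individual is a full solution once its fitness equals `len(edges)`, which gives a stopping condition that never requires knowing the answer in advance.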
I have made the PR in #501.
#501 fixes the error and algorithm
I replaced the function fuzz.ratio() with an implementation similar to mean_boolean_error().
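For readers without the diff at hand, a position-by-position comparison of this kind might look like the sketch below. It is not the code from this PR: `match_ratio` is a hypothetical name, and the explicit length check is an assumption added to surface the equal-length limitation raised earlier in the thread.

```python
def match_ratio(candidate, target):
    """Return the fraction of positions at which two sequences agree.

    Hypothetical helper: it compares element by element, so it only makes
    sense for sequences of the same length.
    """
    if len(candidate) != len(target):
        raise ValueError("match_ratio assumes equal-length sequences")
    if not target:
        return 1.0  # two empty sequences agree trivially
    matches = sum(1 for a, b in zip(candidate, target) if a == b)
    return matches / len(target)


print(match_ratio("abcd", "abcf"))  # 0.75
```

Unlike an edit-distance ratio, this treats a one-character shift as a mismatch at nearly every position, which is why it suits only the fixed-length string case.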