Removed fuzzywuzzy dependency #497

Conversation

Chipe1
Contributor

@Chipe1 Chipe1 commented Apr 14, 2017

I replaced the function fuzz.ratio() with an implementation similar to mean_boolean_error().

@BesanHalwa
Contributor

There is a typo on line 618: def fitness_ration(str_1, str_2): should be def fitness_ratio(str_1, str_2):

@Chipe1
Contributor Author

Chipe1 commented Apr 14, 2017

@Agent-Pandit Could you add a few basic test cases for the GA (checking that the algorithm gives a valid output without errors)? I'm not sure my implementation does what fuzz.ratio() does.

@antmarakis antmarakis mentioned this pull request Apr 14, 2017
@antmarakis
Collaborator

I took a look at the code, and I think I'm missing something. The fitness function in the genetic algorithm measures how good an individual is. The fuzzywuzzy function does not do that. From my understanding, fuzzywuzzy (what a beautiful name) calculates some sort of Edit/Levenshtein distance. It is a similarity function and does not indicate which individual is better, since we don't know the solution to compare it to.

Also, I have to say that the implementation of the genetic algorithm we have right now does not follow the pseudocode. It doesn't follow the same structure, nor does it work the same way. For one, the fitness function is given as input and is not a set method, like in the implementation.
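To illustrate the structure I mean, here is a minimal sketch in which the fitness function is passed in as an argument rather than fixed inside the algorithm. It is not the code from this repository or from #480; the names genetic_algorithm, reproduce, mutate and all parameters are hypothetical, and it assumes individuals are lists of at least two genes with non-negative fitness values.

```python
import random


def reproduce(x, y, rng):
    """Single-point crossover: prefix of x joined with suffix of y."""
    cut = rng.randrange(1, len(x))  # assumes len(x) >= 2
    return x[:cut] + y[cut:]


def mutate(individual, gene_pool, rng):
    """Replace one randomly chosen gene with a random gene from the pool."""
    i = rng.randrange(len(individual))
    return individual[:i] + [rng.choice(gene_pool)] + individual[i + 1:]


def genetic_algorithm(population, fitness_fn, gene_pool, max_generations=200,
                      target_fitness=None, mutation_rate=0.1, rng=random):
    """Evolve the population and return the fittest individual found.

    fitness_fn is supplied by the caller, so the same loop can be reused
    for any problem. Fitness values are assumed to be non-negative.
    """
    for _ in range(max_generations):
        weights = [fitness_fn(ind) for ind in population]
        if sum(weights) <= 0:
            weights = None  # every individual scored 0: select uniformly
        next_population = []
        for _ in range(len(population)):
            # Fitness-proportional selection of two parents.
            x, y = rng.choices(population, weights=weights, k=2)
            child = reproduce(x, y, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, gene_pool, rng)
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness_fn)
        if target_fitness is not None and fitness_fn(best) >= target_fitness:
            return best  # good enough, stop early
    return max(population, key=fitness_fn)
```

With this shape, a caller could pass sum as fitness_fn to maximise the number of ones in a bit list, or any other scoring function, without touching the loop itself.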

Personally I feel the implementation needs to be rewritten, or even have #480 mostly reverted, since it veers far away from the pseudocode in the book.

@Agent-Pandit, is there something I'm missing? Do you have examples of usage or tests we can take a look at to see the implementation in use?

@BesanHalwa
Contributor

Errors and GAs are very closely related. Throughout the process we try to minimise the error (or maximise the fitness). In some cases one may not get the exact solution but only a nearly correct one, which might also differ on each run. Thus, I believe adding a test case may not be very useful.

As far as removing the dependency on fuzzywuzzy is concerned, I believe it is a good idea. But I have some concerns…

After correcting the typo this works well, but it is not a very general approach and might fail in some cases.
I compared fuzz.ratio() and fitness_ratio(); here is the result:
| (str1, str2) | Fitness value from fuzz.ratio(str1, str2) | Fitness value from fitness_ratio(str_1, str_2) | Result of fitness_ratio |
|---|---|---|---|
| (Eshan, Eshan) | 100 | 100 | acceptable |
| (Eshak, Eshan) | 80 | 80 | acceptable |
| (Eshak1, Eshan) | 73 | 66.66666666666667 | acceptable* |
| (Eshak11, Eshan) | 67 | 57.142857142857146 | acceptable* |
| (Eshan, EshanABCD) | 71 | 100 | not acceptable |
In the last case we see that the fitness_ratio function fails.
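For reference, the fitness_ratio column is consistent with a position-by-position comparison divided by the length of the first string. The sketch below is a reconstruction inferred from those numbers, not the actual code in this PR; the function names are hypothetical. It also shows one possible adjustment (dividing by the longer length) that would stop the last case from scoring 100.

```python
def positional_match_ratio(str_1, str_2):
    """Percentage of positions in str_1 whose character equals str_2's.

    zip() stops at the shorter string, so extra characters in str_2 are
    silently ignored. Assumes str_1 is non-empty.
    """
    matches = sum(a == b for a, b in zip(str_1, str_2))
    return 100 * matches / len(str_1)


def length_aware_ratio(str_1, str_2):
    """Same comparison, but normalised by the longer of the two strings,
    so any length mismatch lowers the score."""
    longest = max(len(str_1), len(str_2))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    matches = sum(a == b for a, b in zip(str_1, str_2))
    return 100 * matches / longest


for pair in [("Eshan", "Eshan"), ("Eshak", "Eshan"), ("Eshak1", "Eshan"),
             ("Eshak11", "Eshan"), ("Eshan", "EshanABCD")]:
    print(pair, positional_match_ratio(*pair), length_aware_ratio(*pair))
```

Under this reading, (Eshan, EshanABCD) scores 100 because all five characters of the first string match and the trailing ABCD is never looked at.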

The approach used in this PR works fine for the GA implementation in search.py, because there the strings have the same length. It might seem to produce the correct result (as expected from the GA), but it is not the correct way to do it. (This is again a personal point of view.)

  • The difference in fitness values is another issue, and I don’t understand the reason for it. However, it is not a major concern: as long as the extreme values (0 and 100) are accurate, the mid-range does not matter much because we make a relative comparison.
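A possible explanation for the differing mid-range values: as far as I understand, fuzz.ratio() is built on the same formula as the standard library's difflib.SequenceMatcher.ratio(), namely 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below uses difflib as a stand-in rather than fuzzywuzzy itself (that equivalence is an assumption), and sequence_ratio is a hypothetical name.

```python
from difflib import SequenceMatcher


def sequence_ratio(str_1, str_2):
    """Similarity in percent, computed as 2*M/T and rounded to an integer.

    M is the total size of the matching blocks found by SequenceMatcher,
    T is len(str_1) + len(str_2).
    """
    return round(100 * SequenceMatcher(None, str_1, str_2).ratio())


# Worked example: "Eshak1" vs "Eshan" share the block "Esha" (M = 4),
# and T = 6 + 5 = 11, so the score is 2 * 4 / 11, about 72.7.
print(sequence_ratio("Eshak1", "Eshan"))
```

Because both lengths enter the denominator, extra characters in either string lower the score, which is why this measure and a positional comparison divided by len(str_1) disagree whenever the lengths differ.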

@Chipe1 It would be great if you could address this issue. Until then, I believe that using an external module is not such a bad idea. It makes the approach more general and easier to understand.

@antmarakis
Collaborator

@Agent-Pandit: I'm not saying errors and genetic algorithms aren't connected, I'm just saying your implementation only works for this one case, and in general GA does not work the way your implementation does.

In GAs, we don't know the solution and we want to approximate it to an adequate score (or get it spot on). Your implementation takes the solution as input (in_str), and approximates that. It is not the same thing. For fitness you are comparing an individual with the given solution.

Take for example graph coloring. We don't know the solution beforehand. We do know though how a solution should look. Namely, edges should not connect nodes of the same color. So, for a fitness function we can count how many acceptable edges an individual has. The more the merrier. We keep working this way until we find the solution, or a very good approximation.
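As a rough sketch of such a fitness function (the name coloring_fitness and the data layout are only assumptions for illustration): an individual assigns one color per node, and the score is the number of edges whose endpoints have different colors.

```python
def coloring_fitness(coloring, edges):
    """Count the edges whose two endpoints have different colors.

    coloring: sequence where coloring[node] is that node's color.
    edges: iterable of (u, v) node-index pairs.
    A score equal to the number of edges means a valid coloring.
    """
    return sum(1 for u, v in edges if coloring[u] != coloring[v])


# A triangle needs three colors to be colored properly.
triangle = [(0, 1), (1, 2), (0, 2)]
print(coloring_fitness(["R", "G", "B"], triangle))  # all three edges acceptable
print(coloring_fitness(["R", "R", "B"], triangle))  # edge (0, 1) is in conflict
```

Note that no target solution appears anywhere: the function only encodes what a good solution looks like, which is what lets the GA search for one.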

Your implementation, unfortunately, cannot solve problems other than string matching (and for those, you already know the solution/target). It is not bad as an introduction to how the algorithm works, but it is just a toy example.

Later tonight I will try to get a complete implementation with a running example, so I can better showcase what I'm talking about.

@antmarakis
Collaborator

I have made the PR in #501.

@Chipe1
Contributor Author

Chipe1 commented Apr 15, 2017

#501 fixes the error and algorithm

@Chipe1 Chipe1 closed this Apr 15, 2017
@Chipe1 Chipe1 deleted the fuzz branch April 15, 2017 04:51