

Conversation

@k0pernicus

This PR solves issue #189 in order to match the content in our doctests.

I updated all the sources in the texthero folder. The main remaining issue is in the scatterplot function in visualization.py, where the 3D representation does not show anything in the browser (WIP).

I also updated the file CONTRIBUTING.md to ask project contributors to match the doctests as closely as possible in their examples / tests.

Finally, I also updated some doctests to add a blank line between the documentation and the source code where it improves clarity, to remove extra whitespace, etc.
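As an illustration of the doctest convention mentioned above, here is a minimal hypothetical function (not from texthero). The points are the blank line between the description and the example, and the fact that doctest compares the expected output character by character, so stray trailing whitespace breaks the example:

```python
import doctest

def lowercase(text):
    """
    Lowercase the given text.

    >>> lowercase("TextHero")
    'texthero'
    """
    return text.lower()

# Prints nothing on success; doctest is whitespace-sensitive, so any
# extra trailing whitespace in the expected output would be reported here.
doctest.run_docstring_examples(lowercase, {"lowercase": lowercase})
```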

@k0pernicus changed the title from "WiP Matching content in our doctests" to "[WIP] Matching content in our doctests" on Dec 4, 2020
@k0pernicus
Author

k0pernicus commented Dec 5, 2020

Hi @jbesomi,
I have an issue about the return of replace_stopwords for Python 3.6.

Based on the runners, the doctests for Python 3.6 are not valid because replace_stopwords filters out punctuation: https://travis-ci.com/github/jbesomi/texthero/jobs/454997146.

However, the doctests for Python 3.X (with X >= 7) are valid because replace_stopwords does not filter out punctuation, which, judging directly from the regex, seems to be the intended behaviour.
As an example, this is the output of the runner for Python 3.8: https://travis-ci.com/github/jbesomi/texthero/jobs/454997148.

Do you have any idea about this issue?
Looking at the code, it does not seem that there is a separate code path for Python 3.6 and another one for the other Python versions...
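For illustration only, a minimal sketch of the 3.7+ behaviour described above (this is not texthero's actual implementation, just a word-boundary regex that replaces whole stopword tokens while leaving punctuation intact):

```python
import re

def replace_stopwords(text, stopwords, symbol=""):
    # \b anchors the pattern on word boundaries, so only whole stopword
    # tokens are replaced and punctuation such as "!" survives untouched.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, sorted(stopwords))) + r")\b")
    return pattern.sub(symbol, text)

print(replace_stopwords("the book is on the table!", {"the", "is", "on"}))
```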

@jbesomi
Owner

jbesomi commented Dec 8, 2020

Hi @k0pernicus, thank you for your PR! Amazing πŸŽ‰

Regarding the issue with replace_stopwords, this is probably due to the regex pattern. I will investigate and get back to you. For your information, we were thinking about adding a tokenization function and requiring all preprocessing functions to receive an already tokenized Series, to avoid this kind of problem (for this function, for instance, we would just go through the list of tokens and remove the stopwords).
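The tokenized-input idea could look roughly like the following sketch (hypothetical names, not texthero's actual API): with a Series of token lists, stopword removal becomes a plain list filter, so punctuation tokens pass through untouched and no regex is involved.

```python
import pandas as pd

def remove_stopwords_tokenized(s: pd.Series, stopwords: set) -> pd.Series:
    # Each cell is a list of tokens; filtering is a simple comprehension,
    # so behaviour cannot depend on regex details or the Python version.
    return s.apply(lambda tokens: [t for t in tokens if t.lower() not in stopwords])

s = pd.Series([["The", "book", "is", "on", "the", "table", "!"]])
print(remove_stopwords_tokenized(s, {"the", "is", "on"}))
```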

@k0pernicus
Author

Hi @jbesomi, thank you for the update :)

For your information, we were thinking about adding a tokenization function and requiring all preprocessing functions to receive an already tokenized Series, to avoid this kind of problem [...].

Great!
Do not hesitate to reach out if you want to test this feature and integrate it into this PR; I would be glad to help!

@k0pernicus
Author

k0pernicus commented Jan 9, 2021

Hi @jbesomi ,
Do you have any news about the regex issue, please? :)

@jbesomi
Owner

jbesomi commented Jan 12, 2021

Hi @k0pernicus,

I've thought about this at length and come to the conclusion that it's better, both for us developers and for all Texthero users, to have all preprocessing functions accept an already tokenized Series. See #145 for a complete discussion on the subject. Would you like to help with #145 too? Once implemented, writing replace_stopwords will be much easier, so we will be able to integrate this PR as well.

@k0pernicus
Author

Hi @jbesomi,
I would be glad to help with this issue; I can start taking a look at #145 tomorrow.

@jbesomi
Owner

jbesomi commented Jan 12, 2021

Thank you @k0pernicus !

@k0pernicus
Author

Hi,
It's been a while and, unsurprisingly, I still don't have a solution for this PR.
I think we can close it, as my modifications are no longer relevant.

