ENH: Replace np.genfromtxt with pandas.read_csv for CSV loading #9671
| cur_dir = os.path.dirname(os.path.abspath(__file__)) | ||
| data = np.genfromtxt(os.path.join(cur_dir, "results", fname), | ||
| delimiter=" ") | ||
| df = pd.read_csv(os.path.join(cur_dir, "results", fname), delimiter=" ") |
AFAICS, this is missing header=None.
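To illustrate the point about header=None, here is a minimal sketch using a made-up two-row, space-delimited text (the contents are an assumption, not the real results file). Without header=None, pandas treats the first data row as column labels and that row is lost from the values.

```python
import io

import pandas as pd

# Hypothetical stand-in for a space-delimited results file with no header line.
text = "1.0 2.0\n3.0 4.0\n"

# Default (header=0): the first line is consumed as column labels.
with_default = pd.read_csv(io.StringIO(text), delimiter=" ")

# header=None: every line is treated as data, as np.genfromtxt would do.
with_none = pd.read_csv(io.StringIO(text), delimiter=" ", header=None)

print(with_default.shape)
print(with_none.shape)
```

The first frame has one row fewer than the second, which is the kind of silent shape change that shows up later as a test failure.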
| cur_dir = os.path.abspath(os.path.dirname(__file__)) | ||
| fpath = os.path.join(cur_dir, "test_data.txt") | ||
| pet = np.genfromtxt(fpath) | ||
| pet = pd.read_csv(fpath, header=None).values |
It looks like delimiter=" " is missing; the file is space delimited.
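The underlying difference is that np.genfromtxt splits on whitespace by default, while pd.read_csv splits on commas by default. A small sketch with made-up data (the text below is an assumption, not the contents of test_data.txt):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical space-delimited data without a header line.
text = "1.0 2.0 3.0\n4.0 5.0 6.0\n"

# genfromtxt: whitespace is the default delimiter.
expected = np.genfromtxt(io.StringIO(text))

# read_csv without a delimiter: no commas found, so each line is one string field.
wrong = pd.read_csv(io.StringIO(text), header=None).values

# read_csv with an explicit space delimiter reproduces the genfromtxt result.
right = pd.read_csv(io.StringIO(text), header=None, delimiter=" ").values

print(wrong.shape, right.shape)
```

If a file uses runs of several spaces or tabs between columns, sep=r"\s+" is the more forgiving choice than a single-space delimiter.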
|
Thank you for the PR. There are still several test failures. |
|
Thanks! I’m working on fixing these test failures and will update soon. |
|
@josef-pkt There are |
This reverts commit 69bab5c.
|
@josef-pkt I’m currently getting three test failures. Could you please review my changes and guide me on how to fix them? I’m stuck at this point and would appreciate your help so I can proceed. |
| cur_dir = os.path.abspath(os.path.dirname(__file__)) | ||
| fpath = os.path.join(cur_dir, "results/leverage_influence_ols_nostars.txt") | ||
| lev = np.genfromtxt(fpath, skip_header=3, skip_footer=1, | ||
| converters={0: lambda s: s}) |
It looks like this needs either 3 or 4 skip_footer (4 if we count empty line)
lev = pd.read_csv(fpath, skiprows=3, skipfooter=3, engine="python", sep=r"\s+",
header=None, names=names)
then the rest here is not needed anymore
i.e. no pd.to_numeric; the dtypes are already float
And the np.isnan part was numpy version compat, which is also not needed anymore
(I'm just checking the read_csv part without running the unit tests)
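A minimal sketch of the skiprows/skipfooter pattern suggested above, run on a made-up text with three leading and three trailing non-data lines. The file layout and the column names in `names` are assumptions for illustration; the real leverage_influence_ols_nostars.txt may differ. skipfooter is not supported by the C parser, which is why engine="python" is passed explicitly.

```python
import io

import pandas as pd

# Hypothetical layout: 3 header lines, 2 data rows, 3 footer lines.
text = (
    "title line\n"
    "column header line\n"
    "-----\n"
    "1 0.5 0.25\n"
    "2 1.5 0.75\n"
    "-----\n"
    "footer note\n"
    "end\n"
)

# Hypothetical column names, supplied because header=None is used.
names = ["obs", "a", "b"]

lev = pd.read_csv(
    io.StringIO(text),
    skiprows=3,        # drop the leading non-data lines
    skipfooter=3,      # drop the trailing non-data lines
    engine="python",   # required for skipfooter
    sep=r"\s+",        # any run of whitespace separates columns
    header=None,
    names=names,
)

print(lev.dtypes)
```

Because only numeric rows remain after skipping, the columns come back with numeric dtypes and no later conversion step is needed.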
| cls.res1 = res1 | ||
| fname = os.path.join(curdir, "results", "results_kde_weights.csv") | ||
| cls.res_density = np.genfromtxt(open(fname, "rb"), skip_header=1) | ||
| cls.res_density = pd.read_csv(fname, header=None, dtype=float).to_numpy().ravel() |
in my older pandas version and at least one failing test machine, this breaks if header=None.
header=0 works for me
Otherwise, I was not yet able to figure out where the shape mismatch comes from.
(there are too many ravel to read this quickly)
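For reference, the original call used skip_header=1, so the pandas equivalent has to discard the first line as well. The sketch below uses a made-up single-column text with one header line (an assumption about the layout of results_kde_weights.csv) to show that header=0 matches the genfromtxt result, while header=None keeps the header line as an extra row.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical single-column file with one header line.
text = "density\n0.1\n0.2\n0.3\n"

# Original approach: skip the first line.
expected = np.genfromtxt(io.StringIO(text), skip_header=1)

# header=None keeps the header line as a data row, giving one row too many.
too_long = pd.read_csv(io.StringIO(text), header=None)

# header=0 uses the first line as the column label, so only numbers remain.
res_density = pd.read_csv(io.StringIO(text), header=0, dtype=float).to_numpy().ravel()

print(too_long.shape, res_density.shape)
```

An off-by-one row count like this is one possible source of a shape mismatch in the comparison.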
Thanks for correcting me! One test case is fixed now. I’ll work on the remaining two that are still failing.
|
in general, |
@josef-pkt Should I revert all the changes where I replaced squeeze with ravel? |
|
yes, switch back to |
| dta_path = os.path.join(current_path, "Matlab_results", "test_coint.csv") | ||
| with open(dta_path, "rb") as fd: | ||
| dta = np.genfromtxt(fd) | ||
| dta = pd.read_csv(dta_path, header=None).values |
this csv file is space delimited
this works for me:
dta = pd.read_csv(dta_path, header=None, delimiter=r"\s+").values
| def test_density(self): | ||
| npt.assert_almost_equal(self.res1.density, self.res_density, | ||
| npt.assert_almost_equal(self.res1.density, | ||
| np.asarray(self.res_density).ravel(), |
AFAICS, in these cases self.res_density should already have the correct type and shape.
Check that the attribute is correctly set in setup_class.
Then, we can avoid having to do asarray and squeeze/ravel each time res_density is used.
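One way to read this suggestion, as a sketch: convert the reference data to a 1-D float array once in setup_class, so the test methods can compare against it directly. The class name, the inline data, and the `density` stand-in for self.res1.density are all hypothetical.

```python
import io

import numpy as np
import numpy.testing as npt
import pandas as pd


class CheckDensity:  # hypothetical test class, not the one in the PR
    @classmethod
    def setup_class(cls):
        # Hypothetical reference file contents with one header line.
        text = "density\n0.1\n0.2\n"
        # Fix type and shape here, once: 1-D float ndarray.
        cls.res_density = (
            pd.read_csv(io.StringIO(text), header=0).to_numpy(dtype=float).squeeze()
        )
        # Stand-in for the fitted model's self.res1.density.
        cls.density = np.array([0.1, 0.2])

    def test_density(self):
        # No asarray/ravel needed at the point of use.
        npt.assert_almost_equal(self.density, self.res_density, 6)


CheckDensity.setup_class()
CheckDensity().test_density()
```

With the attribute normalized at setup time, every test that uses res_density sees the same shape, which also makes a real shape mismatch easier to locate.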
|
|
||
| # read data set and drop rows with missing data | ||
| dta = np.genfromtxt("dftest3.data", dt_b, missing=".", usemask=True) | ||
| dta = pd.read_csv("dftest3.data", header=None, na_values=".").values |
this will not work. I guess the code will raise an exception.
This is old sandbox code, I guess, without unit tests.
We could leave this for after this PR.
My guess is that the code up to and including line 337 can be replaced by dta.dropna()
But it is likely not worth the effort to rescue this module.
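A rough sketch of what the dropna() idea could look like, assuming a whitespace-delimited file in which "." marks a missing value. The data and the column names are made up; the actual layout of dftest3.data is not shown in this thread.

```python
import io

import pandas as pd

# Hypothetical data: "." marks a missing value, columns separated by spaces.
text = "1 2.5 0.3\n2 . 0.7\n3 4.0 .\n4 1.0 0.9\n"

dta = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    header=None,
    names=["id", "x", "y"],  # hypothetical column names
    na_values=".",           # treat "." as NaN instead of using a masked array
)

# Drop every row that contains at least one missing value.
complete = dta.dropna()

print(complete)
```

This replaces the masked-array bookkeeping with NaN handling, so any downstream code that expects a mask would still need to be adapted.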
| ("y", float), | ||
| ] | ||
| ) | ||
| dta = np.genfromtxt("dftest3.data", dt_b, missing=".", usemask=True) |
similar case of using masked array as in ols_anova_original.py
| # from stata | ||
| # forecast = genfromtxt(open(cur_dir+"/arima111_forecasts.csv"), | ||
| # delimiter=",", skip_header=1, usecols=[1,2,3,4,5]) | ||
| # forecast = pd.read_csv(open(cur_dir+"/arima111_forecasts.csv")) |
this is also likely wrong, but it's commented out code
I think this was added to show how the reference results can be loaded.
for the next part to be correct it needs additionally
forecast = forecast.iloc[:, 1:].to_numpy()
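The extra iloc step stands in for usecols=[1,2,3,4,5] in the original genfromtxt call. A small check on made-up CSV text (six columns with one header line; the column names and values are assumptions, not the Stata output):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV: one header line, an index-like first column, five value columns.
text = "idx,a,b,c,d,e\n0,1,2,3,4,5\n1,6,7,8,9,10\n"

# Original approach: skip the header and keep columns 1..5.
expected = np.genfromtxt(
    io.StringIO(text), delimiter=",", skip_header=1, usecols=[1, 2, 3, 4, 5]
)

# pandas approach: the header line becomes column labels by default,
# then the first column is dropped by position.
forecast = pd.read_csv(io.StringIO(text))
forecast = forecast.iloc[:, 1:].to_numpy()

print(forecast.shape)
```

Passing usecols=[1, 2, 3, 4, 5] to read_csv directly would be another way to get the same columns.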
|
to sandbox
So, we can ignore any changes there. aside: |
|
@josef-pkt Thank you! I have made the suggested updates and reverted the changes in |
|
do you know how to use interactive rebase to squash some commits together? It looks like it's almost ready to merge. Thanks for all this work. |
|
test run complains about 3 style violations statsmodels/nonparametric/tests/test_lowess.py:146:45: E127 continuation line over-indented for visual indent |
|
I haven’t used interactive rebase for squashing commits before, but I can give it a try.
|
Make a copy of the branch to experiment with squashing. It's not too difficult but rebase is always a bit dangerous. I will briefly skim the changes again, but I don't expect that there is anything to change once CI is green |
I’ve fixed these issues. Should I go ahead and squash the commits now, or wait until all checks pass first? |
|
wait until the checks pass, just in case there is something left to change. |
|
It's already 3 AM here, and I'm feeling a bit sleepy. I'll squash the commits in the morning. Hope that's okay. |
|
I don't know where that comes from. update The code in fit should be: statsmodels.nonparametric.kde.KDEUnivariate Can you add this as a |
no problem at all. All green except for the bug in |
Done with this fix. Now I will squash the commits together.
Hi @josef-pkt |
|
No problem. Thanks for the PR and for going through this.
|
Hi @josef-pkt just checking in on this PR. Absolutely no hurry, just wanted to make sure it's still on your radar. |
NumPy's guide.
Please let me know if my approach or fix needs any improvements. I’m open to feedback and happy to make changes based on suggestions.
Thank you!