ENH: Replace np.genfromtxt with pandas.read_csv for CSV loading #9671

Aniketsy · 2025-10-16T11:25:38Z

closes Maint: replace np.genfromtxt with pandas read_csv #9566
tests added / passed.
code/documentation is well formatted.
properly formatted commit message. See NumPy's guide.

Please let me know if my approach or fix needs any improvements. I’m open to feedback and happy to make changes based on suggestions.
Thank you!

josef-pkt · 2025-10-22T16:51:57Z

statsmodels/duration/tests/test_phreg.py

        cur_dir = os.path.dirname(os.path.abspath(__file__))
-        data = np.genfromtxt(os.path.join(cur_dir, "results", fname),
-                             delimiter=" ")
+        df = pd.read_csv(os.path.join(cur_dir, "results", fname), delimiter=" ")


AFAICS, this is missing header=None.
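
A minimal sketch of why header=None matters here, assuming a small space-delimited file without a header line (the sample text and variable names are made up for illustration, they are not taken from the test data):

```python
import io

import pandas as pd

# Hypothetical two-row, space-delimited file without a header line.
text = "1.0 2.0\n3.0 4.0\n"

# Default header="infer": the first data row is consumed as column names.
df_default = pd.read_csv(io.StringIO(text), delimiter=" ")

# header=None: every line is treated as data, as np.genfromtxt does by default.
df_no_header = pd.read_csv(io.StringIO(text), delimiter=" ", header=None)

print(df_default.shape)    # one data row is lost to the header
print(df_no_header.shape)  # both rows are kept
```

Without header=None the resulting array is one row short, which would show up as a shape mismatch in the tests.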

josef-pkt · 2025-10-22T16:58:51Z

statsmodels/stats/tests/test_sandwich.py

    cur_dir = os.path.abspath(os.path.dirname(__file__))
    fpath = os.path.join(cur_dir, "test_data.txt")
-    pet = np.genfromtxt(fpath)
+    pet = pd.read_csv(fpath, header=None).values


It looks like delimiter=" " is missing.
The file is space delimited.

josef-pkt · 2025-10-22T17:01:25Z

Thank you for the PR

Still several test failures.
I checked two cases and those are missing the correct option in pandas read_csv, AFAICS.
Based on reading code and files, not verified.

Aniketsy · 2025-10-24T08:59:14Z

Thanks! I’m working on fixing these test failures and will update soon.

Aniketsy · 2025-10-25T07:42:03Z

@josef-pkt There are isort issues in several other files as well. Could you please confirm if I should run isort on all files and include those changes in this PR, or only fix the files directly related to this PR?

examples/python/ets.py

examples/python/linear_regression_diagnostics_plots.py

examples/python/metaanalysis1.py

examples/python/statespace_chandrasekhar.py

examples/python/statespace_structural_harvey_jaeger.py

examples/python/statespace_varmax.py

This reverts commit 69bab5c.

Aniketsy · 2025-11-17T11:21:51Z

@josef-pkt I’m currently getting three test failures. Could you please review my changes and guide me on how to fix them? I’m stuck at this point and would appreciate your help so I can proceed.

FAILED ../venv-test/lib/python3.12/site-packages/statsmodels/nonparametric/tests/test_kde.py::TestKDEWGauss::test_evaluate - ValueError: Length of values (1) does not match length of index (60)
  FAILED ../venv-test/lib/python3.12/site-packages/statsmodels/nonparametric/tests/test_kde.py::TestKDEWGauss::test_compare - ValueError: Length of values (1) does not match length of index (60)
  FAILED ../venv-test/lib/python3.12/site-packages/statsmodels/regression/tests/test_glsar_gretl.py::TestGLSARGretl::test_all - AssertionError: 
  Arrays are not almost equal to 3 decimals
  
  (shapes (203,), (202,) mismatch)
   ACTUAL: array(['-2.0742', '-16.829', '17.807', '14.214', '-24.269', '5.7992',
         '-20.886', '9.8798', '7.4863', '14.11', '-18.151', '2.6465',
         '-13.329', '2.0981', '-8.0123', '9.6415', '-5.0585', '-10.143',...
   DESIRED: array([ -2.074, -16.829,  17.807,  14.214, -24.269,   5.799, -20.886,
           9.88 ,   7.486,  14.11 , -18.151,   2.646, -13.329,   2.098,
          -8.012,   9.642,  -5.059, -10.143,   2.719, -12.585, -10.268,...

josef-pkt · 2025-11-17T14:43:11Z

statsmodels/regression/tests/test_glsar_gretl.py

        cur_dir = os.path.abspath(os.path.dirname(__file__))
-        fpath = os.path.join(cur_dir, "results/leverage_influence_ols_nostars.txt")
-        lev = np.genfromtxt(fpath, skip_header=3, skip_footer=1,
-                            converters={0: lambda s: s})


It looks like this needs a skip_footer of either 3 or 4 (4 if we count the empty line).

lev = pd.read_csv(fpath, skiprows=3, skipfooter=3, engine="python", sep=r"\s+", header=None, names=names)

Then the rest here is not needed anymore,
i.e. no pd.to_numeric; the dtypes are already float.

And the np.isnan part was numpy version compat, which is also not needed anymore

(I'm just checking the read_csv part without running the unit tests)
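
A rough sketch of the skiprows/skipfooter pattern suggested above, assuming a made-up report layout with three leading lines and two trailing lines (the text, the column names and the footer count are hypothetical; the real file needs its own values):

```python
import io

import pandas as pd

# Hypothetical report: 3 leading lines, 3 data rows, 2 trailing lines.
text = (
    "Leverage report\n"
    "obs residual\n"
    "--- --------\n"
    "1 -2.0742\n"
    "2 -16.829\n"
    "3 17.807\n"
    "--- --------\n"
    "end report\n"
)

names = ["obs", "residual"]  # hypothetical column names

# skipfooter is only supported by the python engine.
lev = pd.read_csv(
    io.StringIO(text),
    skiprows=3,
    skipfooter=2,
    engine="python",
    sep=r"\s+",
    header=None,
    names=names,
)

print(lev.dtypes)  # numeric dtypes, so no extra conversion step is needed
```

If the footer count is too small, a non-numeric trailing line stays in the frame and the column is read as strings, which matches the quoted string values in the failing test output.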

josef-pkt · 2025-11-17T15:22:41Z

statsmodels/nonparametric/tests/test_kde.py

        cls.res1 = res1
        fname = os.path.join(curdir, "results", "results_kde_weights.csv")
-        cls.res_density = np.genfromtxt(open(fname, "rb"), skip_header=1)
+        cls.res_density = pd.read_csv(fname, header=None, dtype=float).to_numpy().ravel()


in my older pandas version and at least one failing test machine, this breaks if header=None.

header=0 works for me

Otherwise, I was not yet able to figure out where the shape mismatch comes from.
(there are too many ravel to read this quickly)
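
For reference, a small sketch of the header=0 variant, assuming a single-column file with one header line (the sample text is invented and may not match the layout of results_kde_weights.csv):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical single-column results file with one header line.
text = "density\n0.25\n0.5\n0.25\n"

# Old loader: skip the header line explicitly.
old = np.genfromtxt(io.StringIO(text), skip_header=1)

# New loader: header=0 consumes the first line as the column name.
new = pd.read_csv(io.StringIO(text), header=0).to_numpy().squeeze()

print(old.shape, new.shape)
print(np.array_equal(old, new))
```

Here squeeze only drops the length-1 column axis, so both loaders give a 1-d array of the same length.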

Thanks for correcting me! One test case is fixed now. I’ll work on the remaining two that are still failing.

josef-pkt · 2025-11-17T16:29:19Z

in general, squeeze should not be replaced by ravel.
squeeze only removes extra dimensions, while ravel converts a 2-d array to 1-d even if neither axis has shape=1.
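
A short illustration of the difference, using two small made-up arrays:

```python
import numpy as np

col = np.arange(3.0).reshape(3, 1)  # one axis has length 1
mat = np.arange(6.0).reshape(3, 2)  # no axis has length 1

# For a single column both give the same 1-d result.
print(col.squeeze().shape, col.ravel().shape)

# For a genuine 2-d array squeeze keeps the shape, ravel flattens it.
print(mat.squeeze().shape, mat.ravel().shape)
```

With squeeze, an unexpectedly 2-d result still fails a shape check in the test instead of being silently flattened.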

Aniketsy · 2025-11-17T16:36:53Z

in general, squeeze should not be replaced by ravel. squeeze only removes extra dimensions, while ravel converts a 2-d array to 1-d even if neither axis has shape=1.

@josef-pkt Should I revert all the changes where I replaced squeeze with ravel?

josef-pkt · 2025-11-17T17:02:31Z

yes, switch back to squeeze.
At least in the kde tests. I did not look at other changes.

josef-pkt · 2025-11-17T17:12:31Z

statsmodels/tsa/vector_ar/tests/test_coint.py

 dta_path = os.path.join(current_path, "Matlab_results", "test_coint.csv")
-with open(dta_path, "rb") as fd:
-    dta = np.genfromtxt(fd)
+dta = pd.read_csv(dta_path, header=None).values


this csv file is space delimited
this works for me:
dta = pd.read_csv(dta_path, header=None, delimiter=r"\s+").values
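
A minimal sketch of what goes wrong with the default separator, assuming a small whitespace-delimited sample (the text is invented, not the content of test_coint.csv):

```python
import io

import pandas as pd

# Hypothetical whitespace-delimited data with irregular spacing.
text = "1.0  2.0 3.0\n4.0 5.0  6.0\n"

# Default sep=",": each line ends up as a single string column.
wrong = pd.read_csv(io.StringIO(text), header=None)

# A whitespace regex splits on runs of spaces, like np.genfromtxt's default.
right = pd.read_csv(io.StringIO(text), header=None, sep=r"\s+")

print(wrong.shape, right.shape)
```

The raw-string prefix avoids the invalid escape sequence warning that newer Python versions emit for a plain "\s+".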

josef-pkt · 2025-11-17T17:33:01Z

statsmodels/nonparametric/tests/test_kde.py

    def test_density(self):
-        npt.assert_almost_equal(self.res1.density, self.res_density,
+        npt.assert_almost_equal(self.res1.density,
+                                np.asarray(self.res_density).ravel(),


AFAICS, in these cases self.res_density should already have the correct type and shape.
Check that the attribute is correctly set in setup_class.
Then, we can avoid having to do asarray and squeeze/ravel each time res_density is used.

josef-pkt · 2025-11-17T17:43:43Z

statsmodels/sandbox/regression/ols_anova_original.py


 # read data set and drop rows with missing data
-dta = np.genfromtxt("dftest3.data", dt_b, missing=".", usemask=True)
+dta = pd.read_csv("dftest3.data", header=None, na_values=".").values


this will not work. I guess the code will raise an exception.
This is old sandbox code, I guess, without unit tests.

We could leave this for after this PR.
My guess is that the code up to and including line 337 can be replaced by dta.dropna()

But it is likely not worth the effort to rescue this module.

josef-pkt · 2025-11-17T17:45:07Z

statsmodels/sandbox/regression/try_ols_anova.py

            ("y", float),
        ]
    )
-    dta = np.genfromtxt("dftest3.data", dt_b, missing=".", usemask=True)


similar case of using masked array as in ols_anova_original.py

josef-pkt · 2025-11-17T17:51:00Z

statsmodels/tsa/tests/results/results_arima.py

            # from stata
-            # forecast = genfromtxt(open(cur_dir+"/arima111_forecasts.csv"),
-            #                delimiter=",", skip_header=1, usecols=[1,2,3,4,5])
+            # forecast = pd.read_csv(open(cur_dir+"/arima111_forecasts.csv"))


this is also likely wrong, but it's commented out code

I think this was added to show how the reference results can be loaded.

for the next part to be correct it needs additionally
forecast = forecast.iloc[:, 1:].to_numpy()
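
A small sketch of the equivalence, assuming a made-up comma-separated file with a header line and an index-like first column (the column names and values are hypothetical, not the Stata output):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical forecast file: header line, then a step column plus 3 values.
text = "step,forecast,lower,upper\n1,10.0,9.0,11.0\n2,10.5,9.2,11.8\n"

# Old loader: skip the header and select columns 1..3.
old = np.genfromtxt(
    io.StringIO(text), delimiter=",", skip_header=1, usecols=[1, 2, 3]
)

# New loader: read everything, then drop the first column.
forecast = pd.read_csv(io.StringIO(text))
new = forecast.iloc[:, 1:].to_numpy()

print(np.array_equal(old, new))
```

An alternative would be passing usecols to read_csv directly; the iloc form is shown because it is the one suggested in the comment.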

josef-pkt · 2025-11-17T19:12:02Z

Regarding the sandbox:

try_ols_anova.py and ols_anova_original.py are already broken in their data handling:
the data file is missing, and a numpy method they rely on is missing as well.

So, we can ignore any changes there.

aside:
import of ols_anova_original fails with missing datafile
import of statsmodels.sandbox.regression.try_ols_anova still works and has helper functions that predate formulas and pandas categoricals, i.e. support for dummy variables using only numpy.
one function of it is used in anova_nistcertified.
(I can run the module anova_nistcertified.py after replacing local imports by global, absolute imports)

Aniketsy · 2025-11-17T19:49:53Z

@josef-pkt Thank you! I have made the suggested updates and reverted the changes in try_ols_anova.py and ols_anova_original.py. Please let me know if any further improvements are needed.

josef-pkt · 2025-11-17T20:22:36Z

commit 0dcf9f1 is the main commit for the import cleanup
more import cleanup is in commit 4604ded together with genfromtxt changes

josef-pkt · 2025-11-17T20:27:50Z

do you know how to use interactive rebase to squash some commits together?

It looks like it's almost ready to merge.
Before merge I would like to squash it into something like 3 commits, keeping the main import cleanup in a separate commit

Thanks for all this work.

josef-pkt · 2025-11-17T20:31:19Z

test run complains about 3 style violations

statsmodels/nonparametric/tests/test_lowess.py:146:45: E127 continuation line over-indented for visual indent
statsmodels/regression/tests/test_glsar_gretl.py:319:1: W293 blank line contains whitespace
statsmodels/tsa/tests/results/results_arma.py:22:1: E302 expected 2 blank lines, found 1

Aniketsy · 2025-11-17T20:31:41Z

I haven’t used interactive rebase for squashing commits before, but I can give it a try.

josef-pkt · 2025-11-17T20:39:49Z

Make a copy of the branch to experiment with squashing. It's not too difficult but rebase is always a bit dangerous.
It's useful to learn how to do the interactive rebase. However, I can also do it myself, if you prefer.

I will briefly skim the changes again, but I don't expect that there is anything to change once CI is green

Aniketsy · 2025-11-17T20:48:08Z

test run complains about 3 style violations

statsmodels/nonparametric/tests/test_lowess.py:146:45: E127 continuation line over-indented for visual indent
statsmodels/regression/tests/test_glsar_gretl.py:319:1: W293 blank line contains whitespace
statsmodels/tsa/tests/results/results_arma.py:22:1: E302 expected 2 blank lines, found 1

I’ve fixed these issues. Should I go ahead and squash the commits now, or wait until all checks pass first?

josef-pkt · 2025-11-17T20:53:08Z

wait until the checks pass, just in case there is something left to change.

Aniketsy · 2025-11-17T21:30:16Z

It's already 3 AM here, and I'm feeling a bit sleepy. I'll squash the commits in the morning. Hope that's okay.

josef-pkt · 2025-11-17T21:34:32Z

The pre testing run, with development versions of some dependencies, fails with

        if weights is not None:
>           self.kernel.weights /= weights.sum()
E           ValueError: output array is read-only

I don't know where that comes from.
weights array or series is read-only.
Maybe we need to make a copy of weights, given that we make changes to it.
I have not looked at the details yet. This could also be a bug in the actual KDE code that just shows up now.

update
Yes, it's a bug in KDEUnivariate.fit
The code changes the user provided weights in-place.
This needs to use a copy of the array and not /=

The code in fit should be:

        if weights is not None:
            self.kernel.weights = weights / weights.sum()

statsmodels.nonparametric.kde.KDEUnivariate

Can you add this as a separate commit, "BUG: kde fit, avoid inplace modification of weights", not to be squashed with your genfromtxt commits?
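
A minimal sketch of the failure mode and the fix, assuming a plain numpy array that is marked read-only by hand (how the weights become read-only in the failing run is not established here):

```python
import numpy as np

weights = np.array([1.0, 2.0, 1.0])
weights.flags.writeable = False  # simulate a read-only user input

# In-place division writes into the caller's array and fails if read-only.
in_place_failed = False
try:
    weights /= weights.sum()
except ValueError as exc:
    in_place_failed = True
    print(exc)

# Out-of-place division creates a new array and leaves the input untouched.
normalized = weights / weights.sum()
print(normalized, weights)
```

Even when the input is writeable, the in-place version would change the array the user passed in, which is the bug described above.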

josef-pkt · 2025-11-17T21:59:06Z

It's already 3 AM here

no problem at all.
Instant responses are now exceptional events for statsmodels.
:)

All green except for the bug in pre testing.

Aniketsy · 2025-11-18T07:00:28Z

The code in fit should be:
        if weights is not None:
            self.kernel.weights = weights / weights.sum()
statsmodels.nonparametric.kde.KDEUnivariate

Can you add this as a BUG: kde fit, avoid inplace modification of weights commit, not to be squashed with your genfromtxt commits.

Done with this fix. Now I will squash the commits together.

Aniketsy · 2025-11-18T09:53:40Z

C:\Users\Aniket.DESKTOP-074O80J\statsmodel\statsmodels>git rebase --continue
[detached HEAD 09542e308] MAINT/TST: consolidate read_csv fixes and small cleanups
 Date: Tue Nov 18 12:26:51 2025 +0530
 12 files changed, 41 insertions(+), 35 deletions(-)
[detached HEAD 61adf2805] MAINT/TST: consolidate read_csv fixes and small cleanups
 Date: Tue Nov 18 12:26:51 2025 +0530
 12 files changed, 40 insertions(+), 35 deletions(-)
Successfully rebased and updated refs/heads/replace-9566.

C:\Users\Aniket.DESKTOP-074O80J\statsmodel\statsmodels>git log --oneline --decorate -n 10
61adf2805 (HEAD -> replace-9566) MAINT/TST: consolidate read_csv fixes and small cleanups
d72a21bf0 Fix pandas read_csv options: add header=None and delimiter
b1948e107 MAINT: main import cleanup
523ab7d44 Remove venv from version control and update .gitignore
a3de41a83 (upstream/main, upstream/HEAD, support/9384, main) Merge pull request #9668 from statsmodels/dependabot/github_actions/github/codeql-action-4
aa3559921 Merge pull request #9669 from statsmodels/dependabot/github_actions/pypa/cibuildwheel-3.2.1
f32b179b9 Bump pypa/cibuildwheel from 3.2.0 to 3.2.1
04f9fbc9f Bump github/codeql-action from 3 to 4
b9a1323a4 Merge pull request #9656 from bashtage/py-314-gh-actions
afccae6d2 CI: Add 3.14 in GH actions

Hi @josef-pkt
I tried squashing the commits, but I think I messed up the history during the interactive rebase. I haven’t pushed anything, but my local branch isn’t in the correct state anymore.
If it’s okay with you, I’d like to hand this over to you and would really appreciate it if you could take it from here.
Sorry for the inconvenience, and thank you for your guidance.

josef-pkt · 2025-11-18T12:01:20Z

no problem
I will do it in a few hours.

Thanks for the PR and going through this

Aniketsy · 2025-11-26T15:55:54Z

Hi @josef-pkt just checking in on this PR. Absolutely no hurry, just wanted to make sure it's still on your radar.
Please let me know if I can help with any updates.

Remove venv from version control and update .gitignore

523ab7d

josef-pkt reviewed Oct 22, 2025


Fix pandas read_csv options: add header=None and delimiter

a1af191

Fix pandas read_csv options: add header=None and delimiter

0dcf9f1

github-advanced-security bot found potential problems Oct 25, 2025


Aniketsy added 6 commits October 25, 2025 17:04

Fix pandas read_csv options: add header=None and delimiter

4604ded

Fix pandas read_csv options: add header=None and delimiter

6b4a27e

Fix pandas read_csv options: add header=None and delimiter

d8abefe

Fix pandas read_csv options: add header=None and delimiter

064c15e

Fix pandas read_csv options: add header=None and delimiter

69bab5c

Revert "Fix pandas read_csv options: add header=None and delimiter"

d950511

This reverts commit 69bab5c.

josef-pkt reviewed Nov 17, 2025


Fix pandas read_csv options: add header=None and delimiter

b14c610

Fix pandas read_csv options: add header=None and delimiter

6f21465

josef-pkt reviewed Nov 17, 2025


Fix pandas read_csv options: add header=None and delimiter

b5b8f4d

Fix pandas read_csv options: add header=None and delimiter

e9c7b61

Fix pandas read_csv options: add header=None and delimiter

14ec740

Fix pandas read_csv options: add header=None and delimiter

3c42876

BUG: kde fit, avoid inplace modification of weights

ed2cfc5
