[MRG] DOC, FIX: Support Python 2 and 3 in gen_rst.py #3777

nmayorov · 2014-10-15T22:37:19Z

The encoding used by open in Python, if not specified, is platform dependent. On my Windows machine it is cp1251, so I had trouble building the docs with the example gallery (some examples contain non-ASCII characters).

I think setting it explicitly to utf-8 is a good thing.
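A minimal sketch of the idea, assuming Python 3: the default encoding comes from the locale, while an explicit encoding='utf-8' reads the same bytes the same way everywhere. The scratch file and sample text are made up for illustration and are not taken from gen_rst.py.

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is passed; differs per platform
# (cp1251 on the Windows machine mentioned above, often UTF-8 on Linux).
default_encoding = locale.getpreferredencoding(False)

# Sample text with non-ASCII characters, similar to comments in the examples.
text = "na\u00efve \u201cquotes\u201d\n"

# Hypothetical scratch file, only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".py")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # An explicit encoding makes the read independent of the locale.
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()
finally:
    os.remove(path)

print(default_encoding, round_tripped == text)
```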

agramfort · 2014-10-16T07:16:11Z

on my mac locale.getpreferredencoding(False) is US-ASCII and I can build the doc (unless something changed very recently).

can you point to a file that contains non-ascii characters?

nmayorov · 2014-10-16T13:33:05Z

There are plenty in comment lines. For example: the letter in the name and weird quotes.

The source of the confusion is the Python version. I tried to build the docs in Python 3, and there is indeed a decoding problem as it tries to access the lines of a file. In Python 2 it never occurs, because it doesn't try to decode anything at this point. (Is my explanation correct?)

I would suggest replacing open with codecs.open in Python 2 and using open(fname, encoding='utf-8') in both Python 3 and Python 2. Do you approve of this approach?

agramfort · 2014-10-16T20:02:03Z

@larsmans our encoding expert what do you think?

larsmans · 2014-10-16T20:34:57Z

Better than codecs.open is io.open, which should behave the same on Py2 and Py3 (it's a backport of Py3's open).
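As a rough illustration (the scratch file and its content are invented here, not part of the patch), the same io.open call returns decoded text on both versions: unicode on Python 2 and str on Python 3.

```python
import io
import os
import tempfile

# UTF-8 bytes of a comment line containing a non-ASCII character.
raw = u"# caf\u00e9\n".encode("utf-8")

# Hypothetical scratch file, written as raw bytes.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, raw)
finally:
    os.close(fd)

try:
    # io.open accepts encoding= on Python 2 as well as Python 3.
    with io.open(path, encoding="utf-8") as f:
        lines = f.readlines()
finally:
    os.remove(path)

# Each line is a text object, already decoded from UTF-8.
print(lines)
```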

larsmans · 2014-10-16T20:36:22Z

Your patch, as-is, would break Sphinx on Python 2.

nmayorov · 2014-10-16T20:57:29Z

@larsmans I realized that. I will change to io.open.

larsmans · 2014-10-16T21:26:06Z

Cool, ping me when you're done.

nmayorov · 2014-10-17T00:16:10Z

Still struggling to get it working. I came to the conclusion that in Python 2 everything should be str (mixing unicode there seems like a bad idea.) So the initial plan should be abandoned probably. I'll keep looking into that.

nmayorov · 2014-10-17T12:31:26Z

I just used conditions with six.PY2. It builds on both versions of Python.

But there are some problems in Python 2 unrelated to this patch (these errors appear on a clean build from master too.) They look like this (just several of them in total):

Traceback (most recent call last):
  File "c:\scikit-learn-python2.7\doc\sphinxext\gen_rst.py", line 869, in generate_file_rst
    execfile(os.path.basename(src_file), my_globals)
  File "plot_species_distribution_modeling.py", line 207, in <module>
    plot_species_distribution()
  File "plot_species_distribution_modeling.py", line 102, in plot_species_distribution
    data = fetch_species_distributions()
  File "C:\scikit-learn-python2.7\sklearn\datasets\species_distributions.py", line 250, in fetch_species_distributions
    bunch = joblib.load(join(data_home, DATA_ARCHIVE_NAME))
  File "C:\scikit-learn-python2.7\sklearn\externals\joblib\numpy_pickle.py", line 419, in load
    unpickler = ZipNumpyUnpickler(filename, file_handle=file_handle)
  File "C:\scikit-learn-python2.7\sklearn\externals\joblib\numpy_pickle.py", line 308, in __init__
    mmap_mode=None)
  File "C:\scikit-learn-python2.7\sklearn\externals\joblib\numpy_pickle.py", line 266, in __init__
    self.file_handle = self._open_pickle(file_handle)
  File "C:\scikit-learn-python2.7\sklearn\externals\joblib\numpy_pickle.py", line 311, in _open_pickle
    return BytesIO(read_zfile(file_handle))
  File "C:\scikit-learn-python2.7\sklearn\externals\joblib\numpy_pickle.py", line 65, in read_zfile
    length = int(length, 16)
ValueError: invalid literal for int() with base 16: '0x339698f          x'

The problem was caused by Python 2 reusing data files fetched by Python 3.

So everything is all right, I think it can be merged.

Ping @larsmans

coveralls · 2014-10-17T12:37:20Z

Coverage increased (+0.02%) when pulling 1d19e7b on nmayorov:doc_explicit_utf8 into 8d82d2a on scikit-learn:master.

nmayorov · 2014-10-20T22:33:10Z

Hey, @larsmans I think it can be merged, please take a look.

amueller · 2014-11-06T16:53:33Z

So the problem with io.open is PEP 263, right?

If a Unicode string with a coding declaration is passed to compile(),
a SyntaxError will be raised

That is slightly annoying.

nmayorov · 2014-11-06T21:59:58Z

I don't think the link is relevant.

The problem is that the default string types differ between PY2 and PY3 (str <-> bytes, unicode <-> str). In Python 3 a file's lines are read and decoded (so we have to provide an encoding), whereas in Python 2 they are just read as bytes.

The best strategy (as I figured) is to work with the default str type in both versions (though they are actually different types); that's why I added conditional opens.

amueller · 2014-11-06T22:28:03Z

Interesting, then you ran into a different error on python 2 than I did. For me the reading worked fine using io.open(fname, encoding='utf-8'), just evaling unicode with a coding declaration gave an error.

nmayorov · 2014-11-07T00:41:23Z

Initially I ran into problems using Python 3. So I modified open(file) to open(file, encoding='utf-8'), but of course it broke Python 2. I tried to use io.open in Python 2, but then unicode and str started conflicting all over the place. So I decided to simply use different open statements in different versions.
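Roughly, the conditional open looks like the sketch below. The helper name read_lines and the scratch file are hypothetical, and the PY2 flag is computed from sys.version_info as a stand-in for six.PY2 so the snippet has no third-party dependency; the actual patch may be organised differently.

```python
import os
import sys
import tempfile

# Stand-in for six.PY2 (assumption: same meaning, no dependency on six).
PY2 = sys.version_info[0] == 2


def read_lines(path):
    """Hypothetical helper: return the lines of a file as the native str type."""
    if PY2:
        # Python 2: lines stay byte strings (str), nothing is decoded.
        with open(path) as f:
            return f.readlines()
    # Python 3: lines are decoded, so the encoding is stated explicitly.
    with open(path, encoding="utf-8") as f:
        return f.readlines()


# Demonstration on a scratch file holding UTF-8 bytes.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, u"# caf\u00e9\n".encode("utf-8"))
finally:
    os.close(fd)
try:
    lines = read_lines(path)
finally:
    os.remove(path)

print(type(lines[0]))
```

Either branch returns plain str objects for its interpreter, so the rest of the code keeps working with the default string type.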

amueller · 2014-11-07T15:28:04Z

The current solution is fine with me, but I'm not the expert ;)

ogrisel · 2014-11-21T11:55:42Z

@nmayorov what about never decoding and just using byte strings (str under Python 2 and bytes under Python 3) with open(filename, 'rb')? Would that work?

GaelVaroquaux · 2014-11-21T12:32:28Z

@Titan-C, you want to follow that, for
https://github.com/sphinx-gallery/sphinx-gallery

nmayorov · 2014-11-21T14:51:26Z

@ogrisel:

In theory you could do that, but it's going to cause a lot of conflicts in all sources interacting with gen_rst.py (for starters you'll have to add b prefix to every literal string in code and so on.)

I see it as follows: everything was working well except when the encoding is wrong, so let's fix that and leave the rest intact.
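A small sketch of that friction, assuming Python 3 (the scratch file is made up for illustration): lines read with 'rb' are bytes, and any str literal they meet has to become a bytes literal.

```python
import os
import tempfile

# Hypothetical scratch file with one comment line.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"# a comment line\n")
finally:
    os.close(fd)

try:
    with open(path, "rb") as f:
        first_line = f.readline()  # bytes on Python 3, str on Python 2
finally:
    os.remove(path)

# Works only because the literal carries the b prefix.
is_comment = first_line.startswith(b"#")

# The same check with a plain str literal raises TypeError on Python 3.
try:
    first_line.startswith("#")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(is_comment, mixed_ok)
```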

nmayorov · 2015-01-10T18:48:39Z

@ogrisel could you consider merging this?

It's a very small patch, I think it's totally fine. Perhaps it doesn't affect many people, but still a bug. And the project definitely doesn't need another forever hanging pull request (there are already too many imho).

jnothman · 2015-01-11T01:43:41Z

I think it's fine to merge, especially given that it's likely to change once we adopt sphinx-gallery.

lesteve · 2015-03-03T12:04:25Z

I rebased on master, fixed the minor merge conflict and removed a few trailing spaces in this branch.

I regenerated the doc from scratch locally for both python2 and python3 and checked the examples gallery visually and everything seems to work fine AFAICT.

For completeness here is a quick way to reproduce the original problem (only fails inside a python3 environment):

mv examples{,_bak}
mkdir examples
cp examples_bak/{plot_digits_pipe.py,README.txt} examples
cd doc && make clean
LANG=fr_FR LC_CTYPE=fr_FR LC_ALL=fr_FR make html

Output:

~/dev/scikit-learn/doc $ LANG=fr_FR LC_CTYPE=fr_FR LC_ALL=fr_FR make html
# These two lines make the build a bit more lengthy, and the
# the embedding of images more robust
rm -rf _build/html/_images
#rm -rf _build/doctrees/
sphinx-build -b html -d _build/doctrees   . _build/html/stable
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... failed: [Errno 2] No such file or directory: '/home/lesteve/dev/scikit-learn/doc/_build/doctrees/environment.pickle'

Encoding error:
'ascii' codec can't decode byte 0xc3 in position 422: ordinal not in range(128)
The full traceback has been saved in /tmp/sphinx-err-wrv58ba5.log, if you want to report the issue to the developers.
make: *** [html] Error 1
Command exited with non-zero status 2
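The same kind of error can be shown in isolation, without Sphinx. In this sketch the sample string is invented; the point is only that 0xc3 is the first byte of a two-byte UTF-8 sequence and cannot be decoded as ASCII.

```python
# UTF-8 encoding of a string with one non-ASCII character: b'caf\xc3\xa9'
raw = u"caf\u00e9".encode("utf-8")

message = None
try:
    # What effectively happens when the locale encoding is ASCII.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the right codec succeeds.
decoded = raw.decode("utf-8")

print(message)
```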

ogrisel · 2015-03-03T21:17:22Z

Thanks for testing, @lesteve. I will merge this branch and your fix.

ogrisel · 2015-03-03T21:20:14Z

Done! Thanks again @nmayorov and @lesteve!

Explicit encoding for opened files in gen_rst.py

22b10b8

Support Python 2 and 3

1d19e7b

nmayorov changed the title ~~DOC, FIX: Explicit encoding for opened files in gen_rst.py~~ [MRG] DOC, FIX: Support Python 2 and 3 in gen_rst.py Oct 17, 2014

MechCoder force-pushed the master branch from 6deaea0 to 3f49cee Compare November 3, 2014 12:36

amueller added the Bug label Jan 16, 2015

amueller added this to the 0.16 milestone Jan 16, 2015

Titan-C mentioned this pull request Feb 16, 2015

Unicode support sphinx-gallery/sphinx-gallery#18

Closed

ogrisel closed this Mar 3, 2015

nmayorov deleted the doc_explicit_utf8 branch September 1, 2015 04:46
