Using Dir.pwd with Cyrillic user-name etc

dir-encoding

#1

I am having issues with a few Windows users who have non-ASCII user names.
This is affecting several scripts.
I can replicate it by creating temporary folders etc using such characters.
Particularly Cyrillic and Korean characters.

Oddly when I use Sketchup.temp_dir it can return a folder-path that looks at first sight to be wrong:
e.g. …/Антон/… >>> …/3EC2~1/…
However, when used in File.exist?(…) etc it is taken to be a valid path !
So that works despite its odd appearance…
I suspect that it’s the old MSDOS version of the file name ?

My main issue is when I try to use Dir processes.
I need to remember the current working directory - using pwd=Dir.pwd
Then use Dir.chdir(…) to change to a new temporary directory, when some file downloading and processing is done.
Then use Dir.chdir(pwd) to set things as they were.
It fails because the pwd reference substitutes characters as it’s made.
…/Антон/… >>> …/?????/…
And it’s always reported as being UTF-8 encoded in every case.

The first chdir is more than likely to work, because of the valid, but oddly referenced, folder-path to the user’s temp folder.
BUT the second chdir to reset things fails - because the pwd string is invalid - the specified folder …/?????/… can never exist, a ? is an invalid character in a file-path !
Clearly I can trap for that error, but then the user’s current working-directory is foobar !

So my question is, what other way could I establish a reference to a current directory for the reset code ?
Dir seems to be a dead-end because the returned string from pwd is already processed with the ?????.
Interestingly if I use Dir.entries… on the folder’s parent folder path it is listed as “?????”, BUT if I use Dir.glob… it gets listed properly using its Cyrillic name Антон !
So it is possible to see it, but not to get its currentness.

Using IO.read on a temporary text file into which I’ve previously written the current folder-path as a string, also returns the string with ?????, so that doesn’t work either !

Using a cmd shell in Windows to cd to its parent directory and dir to make a list shows it correctly named, but of course its currentness is not accessible.

All ideas welcomed…


#2

Ruby under Windows and encodings are something that really has only been well tested in the last year. Many of the fixes have been backported, so current versions (2.3, 2.4, & 2.5) should work much better.

You mention Dir.chdir(…). Can you use the block form? Once the block is finished, you’re returned to the previous directory. It’s very common in code…
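
A minimal sketch of the block form, assuming a throwaway folder from Dir.mktmpdir stands in for the real temp folder (the file name download.txt is purely illustrative). Whether this avoids the ????? problem on the affected SketchUp builds is not confirmed by this sketch:

```ruby
require "tmpdir"

original = Dir.pwd

Dir.mktmpdir do |tmp|            # hypothetical stand-in for the real temp folder
  Dir.chdir(tmp) do
    # Inside the block, relative paths resolve against tmp.
    File.write("download.txt", "data")
  end
  # Ruby has already switched back to the previous directory here.
end

restored = Dir.pwd
puts restored == original
```

The point of the block form is that the script never has to store the result of Dir.pwd and pass it back to Dir.chdir itself.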

Otherwise, is the code below what shows your issue (assuming Dir.pwd contains ‘non standard’ characters)?

pwd = Dir.pwd
Dir.chdir <any valid path>
Dir.chdir pwd

Greg


#3

Thanks for the reply

Your closing example code is exactly what I do at the moment.

However, to find the current directory I need to use Dir.pwd.
That substitutes ????? for the Cyrillic name.
So it’s then useless to use in Dir.chdir

I’ll have to consider using the temporary change in the ‘block’ version of Dir.chdir…
It’s still using a path in the user’s folder-tree - therefore including the Cyrillic characters… !
But hopefully the substituted string in the TEMP path will still work as a valid path…

I’ll try it… unless there are any other simpler workarounds - but it will involve not inconsiderable recoding to fix it that way !


#4

I’ll work on it later today. The issue is also painful because there are path strings that Ruby will process correctly but the console (either SU’s or a Windows console) will not display correctly.

I’ve got some folders with non standard characters (for testing encoding issues), and if I copy the path from the Windows Explorer GUI, I can paste it into a console statement, and the statement executes correctly, but the characters aren’t displayed correctly…


#5

Reminder that the Console is bugged for v18 initial release …


#6

It’s not just the console ?
If I write some non-ASCII string output to a text file it still substitutes ???
???


#7

Are you using the magic comment at the top of file ?

# encoding: UTF-8

Does this workaround work ?

dir = Dir::getwd.encode( Encoding::find("filesystem"), "UTF-8")

… or …

dir = Dir::getwd.force_encoding("UTF-8")

SketchUp has never set the default internal encoding (which I believe is not correct.)

Read this section of the Encoding class doc …
http://ruby-doc.org/core-2.0.0/Encoding.html#class-Encoding-label-Internal+encoding

Without it set many automatic encoding conversions cannot be done, ie …
http://ruby-doc.org/core-2.0.0/String.html#method-i-encode
… and File.open’s defaults for the mode string …
http://ruby-doc.org/core-2.0.0/IO.html#method-c-new


Are you specifying output encoding with File::open ?

File::open(path,"w:UTF-8:UTF-8") {
  # code
}
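
As a rough illustration of the question above, here is a sketch that writes a non-ASCII string with an explicit external encoding and reads it back. The temp folder and the file name name.txt are assumptions for the example, and it uses the shorter "w:UTF-8" mode rather than "w:UTF-8:UTF-8":

```ruby
# encoding: UTF-8
require "tmpdir"

text = "Антон"
read_back = nil

Dir.mktmpdir do |tmp|
  path = File.join(tmp, "name.txt")          # hypothetical file name
  # Write with an explicit external encoding instead of the platform default.
  File.open(path, "w:UTF-8") { |f| f.write(text) }
  # Read with the same explicit encoding so the string is tagged UTF-8.
  read_back = File.open(path, "r:UTF-8") { |f| f.read }
end

puts read_back == text
```

This only shows that the bytes survive a round trip when the encoding is stated on both sides; it says nothing about how a console will display them.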

#8

default_internal is normally nil.

Windows builds I checked locally, *nix builds I checked on Travis. Didn’t check MacOS, but they probably match *nix. rb file used to check did not have a magic comment.

*nix

__ENCODING__     UTF-8
default_internal 
internal         
default_external UTF-8
locale           UTF-8
filesystem       UTF-8

Windows (standard US)

__ENCODING__     UTF-8
default_internal
internal
default_external IBM437
locale           IBM437
filesystem       Windows-1252
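
A listing like the ones above could be produced with something along these lines. This is a sketch, not the script that was actually run; it relies on Encoding.find accepting the special names "internal", "locale" and "filesystem":

```ruby
rows = {
  "__ENCODING__"     => __ENCODING__,
  "default_internal" => Encoding.default_internal,
  "internal"         => Encoding.find("internal"),
  "default_external" => Encoding.default_external,
  "locale"           => Encoding.find("locale"),
  "filesystem"       => Encoding.find("filesystem"),
}

# A nil value (an unset internal encoding) prints as an empty column.
report = rows.map { |name, enc| format("%-16s %s", name, enc) }
puts report
```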


#9

“Out of the box”, perhaps … but the docs do not indicate it should remain so.
They say it should be done when Ruby loads, which means SketchUp should do it when it loads its Ruby process.

(If the docs are wrong then they should be changed.)

There are also other places in the docs that mention functionality dependent upon the internal encoding setting.


But, I do not maintain that the version of Ruby that Trimble compiled does not have encoding bugs, nor that what TIG describes is normal. (Encoding has been a recurring issue with the Windows editions.)


#10

I do have the ‘magic comment’ at the top of the RB file.

Note how setting an existing non-ASCII folder path thus:

t = 'C:/Users/TIG/Desktop/Антон'

I can get a short RB script to load and change the working directory to that folder.

Dir.chdir(t)

This successfully changes it to the specified folder - with no forced encoding needed; although,

tt = t.force_encoding(Encoding::find("filesystem")).force_encoding("UTF-8")
Dir.chdir(tt)

also works !

But then when I try to get the current folder’s path using pwd or getwd, e.g. …

dir = Dir.getwd

It is incorrect - because it always gives the path containing ?????.

The attempted forced encoding stumbles…

dir = Dir::getwd.encode( Encoding::find("filesystem"), "UTF-8")

It fails with this error:

Error: #<Encoding::UndefinedConversionError: U+0410 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252>

AND note how the returned string is already encoded as UTF-8 anyway.

My alternative code sort of works without an error:

dir = Dir.getwd.force_encoding(Encoding::find("filesystem")).force_encoding("UTF-8")

But it is pointless because the string returned is not changed, and so it is still NOT correct:

"C:/Users/TIG/Desktop/?????"

So that result cannot be reused to change back to a remembered previous working directory…
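
The difference between the two attempts can be shown outside SketchUp. In this sketch the literal strings are stand-ins for what Dir.getwd returned: encode really transcodes (and raises when the target encoding has no Cyrillic), while force_encoding only relabels the same bytes, so question marks that are already in the string stay question marks:

```ruby
# encoding: UTF-8
name = "Антон"

# String#encode converts bytes; Windows-1252 has no mapping for Cyrillic.
error = begin
  name.encode("Windows-1252")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end
puts error.class

# force_encoding changes only the encoding label, never the bytes.
damaged    = "?????"
relabelled = damaged.dup.force_encoding("Windows-1252").force_encoding("UTF-8")
puts relabelled == damaged
```

Once the substitution has happened, no amount of re-encoding can recover the original name, because the original bytes are no longer in the string.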

Note that:

Dir.entries(t)

Correctly lists the folder’s contents when the known path is passed.
BUT of course I am trying to remember an initially unknown path, which might contain non-ASCII characters, and needs to be got with pwd or similar…
But the call:

Dir.entries(Dir.pwd)

fails, because the folder path that pwd returns contains ????? - and therefore it is not found.
BUT of course I can’t get a proper string referring to that folder from pwd or getwd - any alternatives ?

It seems that the Dir.pwd immediately gives a badly encoded string with the ?????, without recourse to re-encoding at all, listing it as UTF-8 from the outset…

Currently my ‘fix’ is to side step the current directory by using:

Dir.chdir(a_temp_folder){
  # process stuff here
}

which should leave the existing working directory unchanged by that process…


#11

Additional observations…
Dir.entries(File.dirname(t))
returns a list of contents with ????? substituted
Dir.entries(File.dirname(t), encoding:"UTF-8")
returns a list of contents with Антон named properly.
So setting the entries’ encoding gives the correct name, BUT sadly:
Dir.pwd has no additional arguments available to encode its result properly !
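
For anyone wanting to try the encoding: option in isolation, here is a sketch that builds a throwaway parent folder (an assumption for the example, not the Desktop path above) and lists it. On a system that already reports names as UTF-8 both forms give the same answer; the difference only shows up on setups like the one described:

```ruby
# encoding: UTF-8
require "tmpdir"

names = nil

Dir.mktmpdir do |parent|                       # hypothetical parent folder
  Dir.mkdir(File.join(parent, "Антон"))
  # Ask for the entry names to be returned as UTF-8 strings.
  names = Dir.entries(parent, encoding: "UTF-8") - %w[. ..]
end

puts names.inspect
```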


#12

I think I have a solution which sidesteps the Dir.pwd mess…

puts dir = Dir.pwd # incorrect Антон >>> ?????
Dir.open('.'){ dir = File.expand_path('.') }
puts dir # correct Антон 

This gets the current working directory and returns its path in UTF-8 without needing any forced encoding etc…
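
Wrapped up as a helper, the idea might look like the sketch below. The method name current_dir_utf8 is made up for illustration, and Dir.tmpdir stands in for whatever folder the script needs to visit. On platforms where Dir.pwd already behaves, the helper simply returns the same path:

```ruby
require "tmpdir"

# Hypothetical helper: gets the working directory via File.expand_path
# instead of Dir.pwd, following the workaround described above.
def current_dir_utf8
  dir = nil
  Dir.open(".") { dir = File.expand_path(".") }
  dir
end

saved = current_dir_utf8
Dir.chdir(Dir.tmpdir)      # go somewhere else to do the work
Dir.chdir(saved)           # then return using the remembered path
puts current_dir_utf8 == saved
```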


#13

In the code I’ve been investigating with, I’ve used three strings for directories, as follows:

dir1 = 'E:/r_misc/Антон'
dir2 = 'E:/r_misc/テスト'
dir3 = 'E:/r_misc/АнтонŠ'

I’m also calling Dir.pwd, and using it in Dir.chdir.

Running the code in SU 2018, with standard Ruby 2.2.4 and also Ruby 2.3.6+, both fail.

Not that it helps this situation, but stand-alone Ruby trunk handles all the paths correctly, with no encoding or force_encoding calls. I suspect that the current version of Ruby 2.5 will also work, but I haven’t updated to the last week’s releases yet…

Greg


#14

Make this text into a RB file and load it…

# encoding: UTF-8
puts
puts t='C:/Users/TIG/Desktop/Антон'
puts Dir.chdir(t)
puts Dir.pwd
dir=''
Dir.open('.'){ dir=File.expand_path('.') }
puts dir
dir2=''
Dir.chdir(File.dirname(t))
Dir.open('.'){ dir2=File.expand_path('.') }
puts dir2
Dir.chdir(dir)
puts Dir.pwd
Dir.open('.'){ dir=File.expand_path('.') }
puts dir
puts

Obviously you need to change the t= path to suit your folder structure…
The simple rule is whenever you need to get the correctly encoded working-directory do not use Dir.pwd
Use Dir.open('.'){ dir=File.expand_path('.') } to preserve the non-ASCII characters correctly…


#15

I believe the reason the docs state that doing so when loading is important is that one doesn’t want strings initialized before a command in script changes the encoding.

Regardless of your interpretation of the docs, I have been thru many ruby repositories and looked at their Appveyor & Travis scripts, and I can’t recall any that set encoding(s). Hence, I don’t believe ‘SketchUp should do it when it loads its Ruby process’ is correct.

I haven’t gone crazy checking, but everything I’ve seen indicates that Trimble either used files from an existing RubyInstaller build, or they used the open source build code to compile Ruby themselves. I can see no changes that are done by Trimble, except possibly a few initialization settings. Some of these are shown in the Ruby tests I added to my SUMT test suite. FYI, my custom Ruby builds in SU 17 & 18 pass all the tests.

Finally, I may be somewhat sensitive about Ruby, Windows, Trimble/SU, and people assigning blame to one or more of them.

Ruby is open source, and the people contributing to Ruby can only fix what they’re aware of. Since SU has a history of using outdated Ruby versions, not much can be done, unless bugs found by plugin authors are removed and isolated from SU, then tested in current (supported) versions of stand-alone Ruby. At that point, they could be reported to Ruby.

Thru the work of only a handful of people, Ruby started testing its mswin build with the 07-June-2017 commit here. MinGW trunk testing is done by others.

I certainly hope that Trimble considers using current Ruby versions (preferably the most recent release) in the future.

Greg


#16

Cool, I’m glad you found it. I was kind of never fully focused on this issue, and I was using absolute_path, but that has issues in older Ruby versions. I kept thinking there was one other method that might work, and you found it…

I’ll try it with АнтонŠ later, as the last character seems to cause issues…

As mentioned, newer Ruby windows versions work much better in terms of encoding, but that isn’t of use when maintaining compatibility with older SU versions is required…

Greg


#17

Thomas logged several (at least 3 if memory serves) Ruby core issues that had to do with Windows and encoding fixes. Last I remember, 1 was done and backported, but not to v2.0.0. The others had either not yet been done, or were not accepted (ie, they disagreed or whatever and the issues were eventually closed.)
Perhaps one of the issues was about SSL ? (Do a search you should find them.)

:+1:

(1) Because other people do not know enough to do it, or need it themselves, is not a good reason.

(2) The Encoding class documentation’s introduction indicates repeatedly that setting default internal and external encoding should be done at Ruby startup (with the -E parameter.) There is no other way to interpret what they say.

(3) It makes no sense to create an internal core control variable used throughout the core classes, give it a getter method (Encoding::default_internal) and setter method (Encoding::default_internal=) and say it should be left uninitialized (as nil.) A smart person interprets the methods as either used quickly after startup where the -E parameter cannot be used, or used temporarily in code where a block needs a certain default encoding set whilst it runs. (It seemed like in my previous tests, it had to be set before the current script ran in order for it to be used. Setting it in the same script did not seem to work.)
We do need to realize that the core Ruby devs are targeting a Ruby process that runs a single script then exits. SketchUp is a shared process, where we need things set up properly at startup.

(4) If we read the documentation for the Encoding::default_internal getter method, it explains why the internal encoding should be set and not left set as nil.
quote but emphasis by me :

Returns default internal encoding.

Strings will be transcoded to the default internal encoding
in the following places if the default internal encoding is not nil:

Additionally String#encode and String#encode! use the default internal encoding if no encoding is given.

The locale encoding (__ENCODING__), not ::default_internal, is used as the encoding of created strings.

::default_internal is initialized by the source file’s internal_encoding or -E option.
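
One part of the quoted text, that String#encode with no argument uses the default internal encoding, can be checked with a small sketch. The sample string is an assumption for the example, and the setting is restored afterwards. This shows only that one documented behaviour, not that setting it would fix the Dir.pwd problem:

```ruby
# encoding: UTF-8
# "café" as Windows-1252 bytes (hypothetical legacy data).
legacy = "caf\xE9".dup.force_encoding("Windows-1252")

previous = Encoding.default_internal
begin
  Encoding.default_internal = Encoding::UTF_8
  # With no argument, encode targets Encoding.default_internal.
  converted = legacy.encode
ensure
  Encoding.default_internal = previous   # put the global setting back
end

puts converted.encoding
```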

So seeing as how @TIG is having issues with “File names from Dir” (which is on the above list,) and “String#inspect” (which is on the above list,) of those strings in the console: … and that SketchUp coders also repeatedly struggle with the encodings of paths from ENV (which is also on the list,) …

Because of what I’ve shown above … I (and others long involved with the SketchUp API) have been justified in the past in logging formal API issues and bitchin’ about the unset default internal encoding ever since SketchUp stepped up to using Ruby v2.0.0.

We complain repeatedly each year about it not being set to no avail, each time testing reveals issues with pathstring encodings.

It might not solve all issues to set the default internal encoding, but it is clear that some of Ruby’s core and library classes are designed to use the setting. It seems that it should do more good than harm, unless we have goofy workarounds in common extensions that would be broken by correctly setting the internal encoding.
But things cannot be left broken forever.

At some point it’s time to “bite the bullet” and make the breaking change, and go forward correctly.


#18

Interesting that all the repos I’ve looked at don’t set it, including many that are in daily use by lots of people. Probably a lot more than are using SU. So, please let me know if you find any.

You seem to think reading the docs is the best way to form your knowledge of Ruby. I tend to think that writing code and looking at others’ code is also beneficial.

You seem pretty sure of yourself. Have you ever set the internal encoding used in SU?

If so, please enlighten us and let us know what the correct setting is. From my testing, it makes no difference. It’s simply an issue with older Ruby versions, and internal encoding has nothing to do with it.

So, this issue isn’t caused by Trimble’s Ruby configuration.


#19

Once again you are attacking ME. This is against the forum rules !
I have a right to express an opinion without being attacked.

Find a way to express your opinion or disagreement without attacking others.

I am not the only one who has read the docs this way or has complained and made the same arguments.

Other people’s scripts are not evidence. IMO, the majority do not read docs much.
Nothing has been said about the points 2, 3, or 4, above which express the opinion and intent of the actual core Ruby programmers.


#20

This all started with your attacking Trimble.

As stated prev, have you ever set the internal encoding used in SU? If so, what encoding corrects this issue?