Change encoding of a text file programmatically

Hi

I have a question regarding the encoding of text files to be read by a SketchUp plug-in I am working on.
It turns out that the files I need to read are not UTF-8 encoded, and the plug-in fails to read them. If I open the text files in Notepad++ and change the encoding to UTF-8, the plug-in reads them just fine (even UTF-8-BOM seems to work). So the question is as follows:

Is it possible to change the encoding of a file programmatically from the plug-in, given a string with the file name? E.g. filename = 'C:\Users\Carlos\test\myfilename.txt'

regards, Carlos

:thinking: https://ruby-doc.org/core-2.7.1/Encoding.html ?

Hi dezmo,

thanks for the link, but I am not sure I can get what I want there. Basically, I need to change the encoding of the whole file, so that when it is opened again in Notepad++ the encoding shows as UTF-8. It looks like the link is about changing the encoding of a string rather than of a whole file.

It is possible to change the encoding using Windows PowerShell from the plug-in, as follows:

filename               = "C:\Users\Carlos\test\myfile may contain spaces.txt"
filenameInSimpleQuotes = "'" + path + "'" 
auxfilename            = "'" + path + "xxx" +"'"    

require 'base64'
cmd = %{Get-Content #{filenameInSimpleQuotes} | Set-Content -Encoding utf8 #{auxfilename}; Move-Item #{auxfilename} -Destination #{filenameInSimpleQuotes} -Force}

encoded_cmd = base64.strict_encode64(cmd.encode('utf-16le'))

find = `powershell.exe -encodedcommand #{encoded_cmd}`

In the above code, cmd is a double command for Windows PowerShell (the two parts are separated by the semicolon). The first part re-encodes the content of my file as UTF-8 and sends it to an auxiliary file. The second part moves the auxiliary file to the original filename, overwriting it. For some reason Ruby fails when creating the encoded_cmd string. Do you know why?

Thanks for the help
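
A note on the failing line, assuming the snippet is run exactly as posted: the standard-library module is the constant Base64 (capitalised), so the lowercase base64.strict_encode64 raises a NameError, and the variable path is never defined (the string is called filename). Below is a minimal sketch of building the encoded command. The helper name encoded_powershell_command is hypothetical, the PowerShell call is left commented out, and a file name containing a single quote would still break the quoting.

```ruby
require 'base64'

# Hypothetical helper: builds the Base64 text that powershell.exe expects
# after -EncodedCommand (the command string encoded as UTF-16LE).
def encoded_powershell_command(filename)
  quoted     = "'" + filename + "'"
  aux_quoted = "'" + filename + "xxx" + "'"
  cmd = "Get-Content #{quoted} | Set-Content -Encoding utf8 #{aux_quoted}; " \
        "Move-Item #{aux_quoted} -Destination #{quoted} -Force"
  Base64.strict_encode64(cmd.encode('UTF-16LE'))
end

# Single quotes keep the backslashes literal.
filename    = 'C:\Users\Carlos\test\myfile may contain spaces.txt'
encoded_cmd = encoded_powershell_command(filename)
# result = `powershell.exe -EncodedCommand #{encoded_cmd}`
```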

Do not use backslashes. They denote escaping in strings. The "\t" means a TAB character.
You can see a list of the various escape characters here:
File: literals.rdoc [Ruby 2.7.1] - Strings

Either use:

  • forward slashes:
    filename = "C:/Users/Carlos/test/myfile may contain spaces.txt"

  • single quotes:
    filename = 'C:\Users\Carlos\test\myfile may contain spaces.txt'
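
The difference between the quoting styles can be checked quickly in the Ruby console. This is only an illustrative sketch with a made-up short path:

```ruby
double  = "C:\test"   # "\t" inside double quotes becomes one TAB character
single  = 'C:\test'   # single quotes keep the backslash and the "t"
forward = "C:/test"   # forward slashes need no escaping at all

puts double.include?("\t")   # true: the path now contains a TAB
puts double.length           # 6
puts single.length           # 7
puts forward.length          # 7
```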

The link dezmo gave shows how to open a file with a specific encoding, and have IO.open transcode it into another encoding. Your Ruby code can then write the file back out again using the new encoding.

file = File.open(filename, "r:UTF-16LE:UTF-8")
transcoded = file.read # string is now UTF-8
file.close

newfile = File.open(somepath,"w:UTF-8")
newfile.write(transcoded)
newfile.close
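
The same idea can be wrapped in a small method. This is only a sketch: transcode_file is a hypothetical name, the source encoding still has to be known in advance, and binary mode ("rb"/"wb") is an assumption made here so that line endings pass through untouched.

```ruby
# Hypothetical helper: reads src_path as source_encoding, writes UTF-8 to dst_path.
def transcode_file(src_path, dst_path, source_encoding)
  # The block form closes each file even if an exception is raised.
  text = File.open(src_path, "rb:#{source_encoding}:UTF-8") { |f| f.read }
  File.open(dst_path, "wb:UTF-8") { |f| f.write(text) }
  text # the transcoded UTF-8 string
end
```

Usage would look like transcode_file(filename, somepath, "UTF-16LE"). If the source file starts with a byte-order mark, the mark is transcoded along with the text.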

Hi DanRathbun

Thank you very much for your reply.

I will take your comments on backslashes in strings into account. However, I don’t think that is the problem in my plug-in, because the string I am using is just the first input argument of the method load_file(path, _status) from a custom Importer class. I created my own string here only to keep the example simple and avoid explaining all of that. The workaround fails when executing

cmd.encode('utf-16le')

I would prefer to change the encoding directly in Ruby, though all this encoding stuff is a bit tricky for me. The code you posted certainly changes the encoding of the file, as can be seen when it is opened in Notepad++, but the result is still not the same as changing the encoding directly in Notepad++.

In the figure below, the top left panel is the original file with the wrong encoding, and the top right panel is how it looks when I just change the encoding using Notepad++ (the plug-in reads this file just fine). The bottom panel is the resulting file after changing the encoding using your code. As you can see, the encoding is UTF-8, but it looks very different and the plug-in doesn’t identify the text inside. Do you know a way to achieve the top right result using Ruby?

Another thing that would be useful for my plug-in is figuring out the encoding of the file, so that I only do the conversion if it is not UTF-8 (or UTF-8-BOM, which also worked fine).

regards, Carlos

Hi again,

I think I figured out the problem using the workaround that calls Windows PowerShell to do the encoding conversion. In case this is useful to anyone, I leave the code here. Let path be a string containing the filename of my text file (e.g. C:/Users/name surname/test/my file.txt)

# preparing powerShell commands									
folderpath  = path.reverse
position    = folderpath.index('\\')
filename    = folderpath[0..position-1].reverse
auxfilename = 'aux' + filename
folderpath  = folderpath[position..-1].reverse
auxpath     = folderpath + auxfilename 
	  
folderpathInQuotes  = "'" + folderpath + "'"
filenameInQuotes    = "'" + filename + "'"
auxfilenameInQuotes = "'" + auxfilename + "'"
	  
cmd1        = %{cd #{folderpathInQuotes}; Get-Content #{filenameInQuotes} | Set-Content -Encoding utf8 #{auxfilenameInQuotes}}
cmd2        = %{cd #{folderpathInQuotes}; Move-Item #{auxfilenameInQuotes} -Destination #{filenameInQuotes} -Force}

# launch first command (creates auxfile with UTF-8 encoding, actually UTF-8-BOM will appear in Notepad++)
find = `powershell.exe  #{cmd1}`

# wait until auxfile is there
until File.exist?(auxpath)
   sleep 0.1
end
	  
# wait until auxfile reaches a constant size
auxpathSize        = File.size(auxpath)
currentauxpathSize = 0
until (auxpathSize == currentauxpathSize)
   sleep 0.1
   # update
   auxpathSize        = currentauxpathSize
   currentauxpathSize = File.size(auxpath)			
end  

# then launch the second PowerShell command (overwrites the original file with auxfile)
find = `powershell.exe  #{cmd2}`

regards, Carlos
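
A possible simplification of the path handling above, offered as a sketch: File.dirname, File.basename and File.join can replace the reverse/index steps. The helper name aux_path_for is hypothetical. On Windows, Ruby's File methods also accept backslashes as separators. In addition, backticks normally return only after the child process has exited, so the wait loops may not be needed.

```ruby
# Hypothetical helper: returns [folder, filename, auxiliary path] for a file.
def aux_path_for(path, prefix = 'aux')
  folder   = File.dirname(path)    # everything before the last separator
  filename = File.basename(path)   # the file name itself
  [folder, filename, File.join(folder, prefix + filename)]
end

folder, filename, auxpath = aux_path_for('C:/Users/name surname/test/my file.txt')
puts folder
puts filename
puts auxpath
```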

Your screenshots show the file as UTF-16BE encoded, not UTF-16LE.

Hi DanRathbun

thanks a lot, that was it!

file = File.open(filename, "r:UTF-16BE:UTF-8")
transcoded = file.read # string is now UTF-8
file.close

newfile = File.open(somepath,"w:UTF-8")
newfile.write(transcoded)
newfile.close

makes the change of encoding I need for the file :slight_smile:
But it turns out that I need to know the original encoding of the file in order to write the correct second argument in the first command

file = File.open(filename, "r: ORIGINAL_FILE_ENCODING :UTF-8")

otherwise the plug-in fails when the file is read.

regards, Carlos

Try something like …

      def transcode_to_utf8(oldfilepath, newfilepath)
        # Read 20 bytes of the file and get the encoding:
        encoding = IO.read(oldfilepath, 20).encoding
        # Open the file and transcode it (the block form closes the file):
        transcoded = File.open(oldfilepath, "r:#{encoding.name}:UTF-8") do |file|
          file.read # string is now UTF-8
        end
      rescue => err
        # encoding is still nil if the first read failed
        puts "Error reading(#{encoding ? encoding.name : 'unknown'}): #{oldfilepath}"
        puts err.inspect
      else # no error so far:
        begin
          # Write out the transcoded file (the block form closes the file):
          File.open(newfilepath, "w:UTF-8") do |file|
            file.write(transcoded)
          end
        rescue => err
          puts "Error creating: #{newfilepath}"
          puts err.inspect
        end
      end

Well, I did not have any luck with that. Doing a search, I found that others have tried the same thing before.

See https://stackoverflow.com/a/54794730 where a responder makes the case that any attempt to “sniff” a text file encoding is unreliable.

Another question (https://stackoverflow.com/a/25871717) had a response about a Ruby gem, “rchardet19”, with which I also did not get very good results. To install this gem from SketchUp’s console, … type:

Gem::install "rchardet19"

After you see all the gem specification spit out, type:

require "rchardet19"

… and lastly:

defined? CharDet

If it is defined the response will be "constant".

You can read the gem’s README to know how it works:

Hi DanRathbun

thank you very much for your replies. Indeed, I tried your first suggestion and saw that, for all encodings available in Notepad++, the result of

encoding = IO.read(oldfilepath, 20).encoding
encoding.name

was ASCII-8BIT. I wanted to research this myself before posting here, but thank you very much for trying to find a solution; I will start with your suggestions. In case I don’t find a better way, the workaround I posted above seems to work fine for me.

best regards, Carlos
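
A note on the ASCII-8BIT result: it is expected. When IO.read is given a length it returns the raw bytes tagged as ASCII-8BIT (binary), and String#encoding only reports that tag; it does not inspect the content. One partial alternative, assuming the files start with a byte-order mark (BOM), is to look at the first bytes directly. The sketch below uses a hypothetical helper named bom_encoding. It returns nil when no BOM is found, so it cannot identify BOM-less files, and it deliberately ignores UTF-32.

```ruby
# Hypothetical helper: returns an encoding name based on the file's BOM, or nil.
def bom_encoding(path)
  boms = {
    "\xEF\xBB\xBF".b => 'UTF-8',
    "\xFF\xFE".b     => 'UTF-16LE',
    "\xFE\xFF".b     => 'UTF-16BE'
  }
  # Read at most 3 raw bytes; fall back to an empty string for an empty file.
  head = (File.binread(path, 3) || '').b
  boms.each { |bom, name| return name if head.start_with?(bom) }
  nil # no BOM found: the encoding remains unknown
end
```

The returned name could then be used as the source encoding in the File.open mode string, e.g. "r:#{name}:UTF-8", with the conversion skipped when the result is 'UTF-8'.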