Javascript regex to validate component and group name characters

componentname
javascript
regex

#1

I’m developing a Web Dialog to use as an input form, with different field types - numeric, integer, length, and string for example.

But if I want to limit the valid characters in a string to be used for a component or group name, I can’t find what characters are valid.

I’ve done a general Google search, searched Sketchup Help, and searched this forum, and haven’t yet found an answer.

And having found it, what regex should I use in a javascript function to validate the string to limit input to valid characters only?

PS. And the regex should allow a blank field, null, or zero length string


#2

I’d limit them to just ASCII characters minus those that are forbidden in file-names - because they might get saved externally…
The typically banned ones are those related to file-separators and searches etc…

  • . " / \ [ ] : ; | =
    Also @ is probably best avoided, and a # is filtered out during export…
    Files starting with . are hidden on Macs, and ~ is also a special character in Mac file naming.
    Files starting with a space might also cause issues.

There are many snippets in javascript on the www to limit characters etc…
e.g. string.replace( /[<>:"\/\\|?*]+/g, '' );
Or more fully


#3

Thank you, TIG. That’s just what I wanted.

I found a regex for valid filenames from your reference, and put it in a validity checking function, as below. Since components can be Saved As… with a filename, it’s best if their names only allow valid filename characters, although other characters are allowed in component names in the Component browser.

function isFilename(str) {
  // alert("isFilename called");
  var RE5 = /^[0-9a-zA-Z ... ]+$/;
  return RE5.test(str);
}

After I got a few syntax gremlins out of the way in the code that calls this function, this works a treat.


#4

There’s nothing stopping users from naming components with characters such as å, ä and ö. I do this all the time when making models in Swedish at school and at work. I don’t know what character encoding is used, but it covers a good deal more than the characters used in English.


#5

I agree with eneroth3 and it makes sense for non-English versions particularly.

If I knew how, I’d extend the regex above for different languages. At the moment, I’m working on a Ruby script that has only an English version and a partially translated French one.

Do Windows and Mac filenames in Swedish allow accented and special characters? For the most part, I’m concerned (selfishly, I admit) with getting an English language version to work at all!

But I would love to learn how to extend it to cope with non-English differences. I know the language handler in SU can translate displayed text in plugins, but I don’t know about regexes!


#6

It is called Unicode character encoding.

Prior to Ruby 1.9, code in Ruby was all ANSI, and had a lot of issues with Unicode characters.

But this is no longer true for most regional encodings. (Vietnamese remains an exception.)

At the top of your Ruby code files you should have a “magic comment” to set the file’s internal encoding to UTF-8. See: Encoding class: Script Encoding

On the 1st line of your code (or 2nd line if the first is a shebang [#!], which is only valid for system Ruby scripts,) place a comment like this:

# encoding: UTF-8

You can transcode strings from / to other encodings with new instance methods for class String.
See also more explanation at the top of the Encoding class page.

To transcode a Ruby string in your file from UTF-8 into the user’s locale encoding:

localized_text = "some string".encode("locale")
puts localized_text.encoding

… which on my machines returns “Windows-1252” (aka CP1252, the “CP” stands for “Code Page”.)


So for future Unicode support, I would suggest following TIG’s example of disallowing specific special characters (which is a small set,) rather than repeatedly revising your code to expand an ever increasing allowable set of characters in the huge Unicode character set.
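
As a rough illustration of that blacklist approach on the JavaScript side, here is a minimal sketch. The helper name isAllowedName and the forbidden set are assumptions for illustration only: the set is taken from the filename characters listed earlier in the thread, not from any documented SketchUp rule. Everything outside the set, including å, ä and ö, passes, and so does an empty field.

```javascript
// Hypothetical blacklist; the character set is an assumption taken from
// the filename characters mentioned earlier in the thread.
var FORBIDDEN_CHARS = /[<>:"\/\\|?*]/;

// Hypothetical helper: true when str contains no forbidden character.
function isAllowedName(str) {
  // A missing or empty value is accepted, as asked for in the first post.
  if (str === null || str === undefined || str === "") {
    return true;
  }
  return !FORBIDDEN_CHARS.test(str);
}
```

Because the pattern has no g flag, test() keeps no state between calls, so the same RegExp object can safely be reused for every field.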


Now, character transcoding is not the same as language translation. But this does not really matter, because the API’s LanguageHandler class does not do translation. It simply does string replacement, by looking up the string as a key in a replacement hash. It relies upon UTF-8 encoded Ruby code, and UTF-8 encoded .strings files. (For more on how this is done, see the API documents. Personally, I have found it easier to just create Ruby hash files for each language, and load the one needed when the plugin is loaded. The compiled Ruby interpreter can parse .rb files faster than the “langhandler.rb” script can.)


#7

Rather than just be critical, I’ll throw in my 2cent example:

function hasBadChars(str) {
  // alert("hasBadChars() called with arg: '"+str+"'");
  var re = /(\W|\D|\S)+/;
  return re.test(str);
}

EDIT: The above expression is actually incorrect. See the correction in a post further down.
Should be: /[^\w\d\s]/i

The original expression matches one or more non-word characters, or non-digit characters, or non-whitespace characters.

You don’t need to worry about the space leading strings as you can always strip them in ruby thus:

rubystr.lstrip!

JavaScript Regular Expression Cheatsheet - Debuggex


#8

Thank you, Dan. I’ll use that approach, and see if I can at least get the code ready for adaptation easily to multi language use.


#9

Dan Rathbun suggested this regex earlier to validate file names

Doesn’t seem to work - allows (for example) file.,>name to pass.

I’ve googled fairly extensively for how to blacklist illegal filename characters using regex, but most posts seem to turn the question saying ‘blacklists are very hard to get to work’ and you are better using a whitelist.

And I can’t get the ones suggested to replace illegal characters to work when I’m trying to exclude a range of characters.

Or maybe just program a function that explicitly loops through the characters and hiccups when it finds a ‘bad’ one.
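
A minimal sketch of that explicit loop, for comparison with the regex versions. The function name firstBadChar and the forbidden set are assumptions for illustration. It returns the offending character rather than a plain true/false, which can be handy for an error message in the dialog.

```javascript
// Hypothetical helper: returns the first forbidden character found in str,
// or null when there is none. The forbidden set is an assumption.
function firstBadChar(str) {
  var forbidden = '<>:"/\\|?*';
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    if (forbidden.indexOf(ch) !== -1) {
      return ch; // stop at the first 'bad' character
    }
  }
  return null; // empty strings and clean names end up here
}
```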

And of course Windows in particular has a lot of other rules about illegal filenames using perfectly normal characters - like PRN, LPT and so on. Or not having file names with ONLY spaces or full stops (periods).
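
Those extra Windows rules can be checked separately from the character test. The sketch below is only an approximation: the helper name isRiskyWindowsName is made up, and the reserved list (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is the commonly documented set and may not be complete.

```javascript
// Reserved device names; assumed list, possibly incomplete.
var RESERVED_NAME = /^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])$/i;

// Hypothetical helper: true when str looks like a name Windows may reject.
function isRiskyWindowsName(str) {
  // Names consisting only of spaces and full stops.
  if (/^[ .]+$/.test(str)) {
    return true;
  }
  // Check the part before the first full stop, because a reserved name
  // followed by an extension (for example "PRN.skp") is best avoided too.
  var base = str.split(".")[0];
  return RESERVED_NAME.test(base);
}
```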

HMMM. Not worth too much more time and effort. I may stick with [0-9a-zA-Z _.] - alphanumerics plus space , underscore and full stop.
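
A minimal sketch of that whitelist as a validation function (the name isSimpleName is hypothetical). Using * instead of + lets a blank field through, which the first post asked for.

```javascript
// Hypothetical helper: true when str contains only letters, digits,
// space, underscore and full stop, or is empty or missing.
function isSimpleName(str) {
  if (str === null || str === undefined) {
    return true; // treat a missing value like an empty field
  }
  // * (zero or more) rather than + (one or more) allows the empty string.
  return /^[0-9a-zA-Z _.]*$/.test(str);
}
```

Note that this rejects accented letters such as ö, so it only suits the English-only case discussed above.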


#10

Yes, sorry, I knew that, but got busy today and could not catch up.

It is an AND scenario, but that expression uses OR.

This regex works better to find (ie “match”) the positions of non-word, and non-numeric and non-whitespace characters.

regex = /[^\w\d\s]/i
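
The difference between the two expressions can be seen with a few test strings. This is just an illustrative sketch, and the variable names are made up.

```javascript
var orPattern = /(\W|\D|\S)+/;  // the earlier expression (OR)
var andPattern = /[^\w\d\s]/;   // the corrected expression (AND)

// Every character is a non-digit or a non-whitespace character (or both),
// so the OR version matches any non-empty string at all.
var orOnPlain = orPattern.test("abc123");        // true
var orOnDigit = orPattern.test("5");             // true

// The negated class matches only characters that are none of the three.
var andOnPlain = andPattern.test("abc 123");     // false
var andOnBad = andPattern.test("file.,>name");   // true

// Caveat: in JavaScript \w covers only A-Z, a-z, 0-9 and underscore,
// so accented letters are reported as "bad" as well.
var andOnAccent = andPattern.test("å");          // true
```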

In Ruby, regular expressions are filter patterns that some string methods use to return the positions of the matches within the string.

We need to use some other function/method or loop to go thru these matches, and process them (in your case replacing the matched characters.)

In Ruby, they have some nifty iterator methods that do this using the regex.
String#scan, String#gsub, String#tr, etc.

So the ticket is to find one that works in JavaScript, and replace() (with the global search g and ignore-case i flags) looks to be the one. (It is equivalent to Ruby’s gsub() method.):

var regex = /[^\w\d\s]/gi;
var new_str = str.replace(regex, '_');
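
To see why the g flag matters with replace(), compare the two calls in this small sketch (variable names are illustrative only):

```javascript
var sample = "file.,>name";

// Without the g flag only the first match is replaced.
var firstOnly = sample.replace(/[^\w\d\s]/i, "_");    // "file_,>name"

// With the g flag every match is replaced.
var allReplaced = sample.replace(/[^\w\d\s]/gi, "_"); // "file___name"
```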

Now let’s look back at what TIG wrote:

Which is a stripper RegExp and replaces special characters with nothing (just strips them out.)

I tested the following RegExp as a stripper in Ruby using gsub() (which doesn’t need a g flag because it is the global substitution method). JavaScript has no exact equivalent, so the replace() method must be used with a RegExp that has the global g flag.

rex = /[<>:"\/\\|?*]+/

(The slash and the backslash need escaping in Ruby, but the other characters like the pipe do not. But it doesn’t hurt to escape characters that don’t really need escaping.)