Larry Ullman's Book Forums

Dimitri Vorontzov

Members
  • Posts

    72

Posts posted by Dimitri Vorontzov

  1. Thanks, HartleySan, and you're absolutely right: now I'm attempting to define a set of requirements that would validate a reasonably large percentage of whatever is likely to come before the @ in an email address, and I'm right in the middle of research on that. Obviously, I'm not just sitting around waiting for someone to solve my problems: for the last few days I've been doing mostly research on regex.

     

    Thanks for the resource on lookarounds, it's indeed very valuable!  

  2. Antonio, point taken, thank you. 

     

    Ditto HartleySan, but to answer your question about what else I want to know, it's this:

     

    I'm well aware I'm caught in the newbie trap of trying to invent the perfect regex to match an email address. But my goal isn't to match every possible form of an email address; I just want to figure out a regex that works with the majority of normal email addresses. The purpose of that "quixotic pursuit" is not to validate emails in any practical application, but rather to master the PHP flavor of regex, using email validation as an example, for lack of a better one – and I'm stuck with that goal, because the chapter on regex in Larry's book is where that book suddenly became challenging for me. So if Larry wants to bail out, it's fine with me, and it's definitely his right, but I can't.

     

    Does this make any sense to you?   

     

    So back to "what do I still want to know" question:

     

    I'm actually satisfied with the part of the most recent version of that regex that comes after the @. I looked far and wide and for the life of me I can't come up with the kind of valid domain name that wouldn't be matched by that part, so I'm okay with it. 

     

    What comes before @, however, I don't like at all. It sucks. It would validate anything: _______@somewebsite.com, .@somewebsite.com, and so on. As I said, I want it to validate a reasonable majority of normal email addresses. 
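The complaint above can be checked with a short script. This is only a sketch: `$loose` is the local part from the pattern discussed in this thread, while `$strict` is a hypothetical alternative (an assumption, not something from the book) that requires runs of letters or digits joined by single separators. It would also reject some unusual but valid addresses.

```php
<?php
// Local part taken from the pattern discussed in this thread.
$loose = '/^[\w.-]+@/';

// Hypothetical stricter local part (an assumption, not from the book):
// runs of letters/digits, optionally joined by a single ".", "_" or "-".
$strict = '/^[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@/';

// Sample addresses, made up for this sketch.
$samples = [
    '_______@somewebsite.com',
    '.@somewebsite.com',
    'some.name@somewebsite.com',
    'some_name-2@somewebsite.com',
];

foreach ($samples as $email) {
    printf(
        "%-30s loose: %-8s strict: %s\n",
        $email,
        preg_match($loose, $email) ? 'match' : 'no match',
        preg_match($strict, $email) ? 'match' : 'no match'
    );
}
```

With these samples, the loose version accepts all four addresses, while the stricter one accepts only the last two.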

     

    So I want to figure out ways to improve it, and that's why I ask questions about it. And I could, of course, be doing that on some other guy's forum, but since it's Larry's book that I'm studying, it's only logical to ask the questions here. 

     

    Help from someone knowledgeable would be appreciated, even though I obviously can't insist.

  3. Sure, Larry, I understand – and after all, it's your forum, isn't it? You can even ban me from it at the click of a button whenever you wish.

     

    But try to look at it from a different perspective. What if someone is searching Google for "regex to validate email" and finds this already popular thread in your forum? They will be attracted by the discussion, and then they may become curious: "who is this guy, Larry Ullman? Oh, interesting, he wrote quite a few books on PHP! And on other languages! Why don't I check them out!"

     

    And then that person will buy your books, read them, learn from them, and will become a better web developer and programmer, and at the end there will be one more person in the world liking and respecting you for your teaching and writing abilities.

     

    Is this a quixotic pursuit? Most likely, yes. It may even be herculean. But I think it's worth it. 

     

    On the other hand – learning to forge effective regular expressions – is that a quixotic pursuit? I don't think so, I think it's actually a very reasonable and rational pursuit. 

  4. Great.

    Now let's see if we can further improve it.

    The pattern I've arrived at (thanks to everyone's help!) is this:

    ^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$

    It will validate email addresses like these:

    somename@somewebsite.com, somename@some-website.com, or somename@somewebsite.co.uk --

    -- however, it will also validate these ugly things:

    somename@some......website.com, somename@--somewebsite..com -- and similar things.

    To fix that problem, we can try to replace the pattern after @ that includes some number of letters, numbers, periods and dashes:

    [A-Za-z0-9.-]+

    -- with a pattern that matches (at least one letter or number, followed by either one period or at least one dash) repeated zero or more times, followed by at least one letter or number:

     

    ([a-zA-Z0-9]+(\.|-+))*[a-zA-Z0-9]+

    Which results in the following regex:

    ^[\w.-]+@([a-zA-Z0-9]+(\.|-+))*[a-zA-Z0-9]+\.[A-Za-z]{2,6}$

    Now I'd be curious to know what other possible vulnerabilities in it could be fixed. Would appreciate some input from the esteemed members of this forum!
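As a rough check of the description above, the sketch below runs the "label plus separator" form of the domain part against the sample addresses from this post. Note the inner parentheses around `\.|-+`. Without them, the `|` would split the whole group into "letters and a period" or "dashes alone", which is not what the description says. The sample list is taken from the post, and nothing here is meant as a complete validator.

```php
<?php
// Domain part as described: zero or more "label + separator" groups, where
// the separator is one period or a run of dashes, then a final label.
$pattern = '/^[\w.-]+@([a-zA-Z0-9]+(\.|-+))*[a-zA-Z0-9]+\.[A-Za-z]{2,6}$/';

// Sample addresses from the post.
$samples = [
    'somename@somewebsite.com',
    'somename@some-website.com',
    'somename@somewebsite.co.uk',
    'somename@some......website.com',
    'somename@--somewebsite..com',
];

foreach ($samples as $email) {
    $result = preg_match($pattern, $email) === 1 ? 'match' : 'no match';
    printf("%-32s %s\n", $email, $result);
}
```

The first three addresses match and the last two do not. The part before the @ is unchanged, so it still accepts things like a lone period.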

  5. Oh, Larry, when one day you finally meet me in person, as I hope, you will realize that wrath and yours truly are completely incompatible. 

     

    I got the point about filter_var() being the preferred method, thank you very much for reinforcing it, Larry.

     

    Still, what I meant by regular expression that's designed to match an email address "doing the job" is this: validate only email addresses that may include letters, dots, dashes, numbers or underscores before @, letters, numbers, dashes and dots but NO underscores after @, and a certain reasonable number of letters after the last dot. No more, no less. 

     

    Which is another way of saying, "validate about 99.999% of all email addresses used in countries familiar with the Latin alphabet". Imperfectly, perhaps, but well enough.

     

    (I'm actually astounded by the fact that I somehow managed to express my desire to understand how to create that pattern in such a confusing manner that apparently everyone reading this thread found the complex subject matter of regular expressions easier to understand than the simple thing that I was asking. But then again, English isn't my first language... )

     

    Also, I hope no one would deny that this discussion, even though quite thrilling at times, has added a bit of useful content to this already content-rich forum, thanks to everyone participating in it.

  6. Thank you, Larry! This is an important clarification. 

     

    I wasn't just idly waiting for your reply, and did a bit of research. Turns out, there are at least two more answers to my initial question that appear to be correct. Instead of this: 

     

     

     

    ^[\w.-]+@[A-Za-z0-9\.\-]+\.[A-Za-z]{2,6}$
     

     

     

    -- or this (without escape characters):

     

     

     

    ^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$
     

     

     

    -- the same "perfect" email validation pattern apparently can be expressed a bit shorter, like this:

     

     

     

    ^[\w.-]+@[^\W_]+\.[A-Za-z]{2,6}$
     

     

     

    -- or even like this:

     

     

     

    ^[\w.-]+@(?!_)\w+\.[A-Za-z]{2,6}$
     

     

     

    I'm wondering if you, Larry, or other guys on the Forum, would agree to take a look at these two examples and confirm that they would do the job in a PHP context.
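One way to check whether the shorter variants behave like the explicit character class is to run all three against a few addresses. The sketch below assumes PCRE via `preg_match()`. The labels and sample addresses are made up for illustration.

```php
<?php
// The three domain-part variants from the post, keyed by a short label.
$candidates = [
    'explicit'  => '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/',
    'negated'   => '/^[\w.-]+@[^\W_]+\.[A-Za-z]{2,6}$/',
    'lookahead' => '/^[\w.-]+@(?!_)\w+\.[A-Za-z]{2,6}$/',
];

// Sample addresses, made up for this sketch.
$samples = [
    'somename@somewebsite.com',
    'somename@some-website.com',
    'somename@somewebsite.co.uk',
    'somename@some_website.com',
];

foreach ($samples as $email) {
    echo $email, "\n";
    foreach ($candidates as $label => $pattern) {
        $result = preg_match($pattern, $email) === 1 ? 'match' : 'no match';
        echo "  {$label}: {$result}\n";
    }
}
```

With these samples, the three patterns are not interchangeable:

- `[^\W_]` matches a single letter or digit, so it also excludes periods and dashes. The "negated" variant therefore rejects `some-website.com` and `somewebsite.co.uk`.
- `(?!_)` only inspects the character directly after the @, so the "lookahead" variant still accepts `some_website.com`.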

     

    (I would understand if no one replies though: people seem to get annoyed by children when they ask endless questions, and I've never really dropped that childhood habit. It's not an empty curiosity in my case, however: I rather think of it as slightly extended "Review and Pursue" section.)

  7. Not berating at all: I respect everyone on this forum, and personally you, Larry. I just thought my question was clear from the beginning, namely:

     


     

     

     

    I have a question about email matching with regular expressions. 

     

    Chapter 14, Pg. 445, contains the following email matching pattern:

     

    ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$

     

    I may be wrong, but wouldn't it match something like this?

     

    somename@some_website.com

     

     

    What would be the best way to improve it? 

     

    So I was a little taken aback by what I perceived as all the guys ignoring or misinterpreting the actual question, and persistently offering a workaround instead. I should have blamed myself instead and rephrased the question right away, but I truly thought what I was asking about was clear from the start. 

     

    I also honestly thought that you were joking, when you posted the huge chunk of code above. I thought you implied the absurdity of my "quest for perfection" by posting an absurdly long (and not real) regex. As I mentioned before, I'm new to regex. 

     

    One way or another, your follow-up answer is exactly what I was looking for, Larry, and I thank you very much for it. Also, you're right: it does lead to a new question, and not one, but two. And here they are:

     

    First Question: 

     

    The initial regex, as offered in the book, is:

     

    ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$

     

    My problem with it was that it would consider some_website.com as valid.

     

    The modified version, which would not validate some_website.com, looks like this:

     

    ^[\w.-]+@[A-Za-z0-9\.\-]+\.[A-Za-z]{2,6}$

     

    This seems to cover pretty much everything, namely:

     

    At least one upper- or lower-case letter, digit, underscore, period, or dash, followed by one and only one @, followed by at least one upper- or lower-case letter, digit, period, or dash, followed by one and only one period, followed by anywhere from two to six upper- or lower-case letters.

     

    I don't see what else could be in an email address, so the second regex above seems perfect. 

     

    And yet, Larry, since you're saying that the huge regex you posted is the actual email validation code, I'm baffled again. What are the vulnerabilities of the second regex above that would make it not work? 

     

    Second Question:

     

    Is escaping the period sign and the dash inside the character class in PHP version of regex necessary? 

     

    Or, to rephrase that: can it be [A-Za-z0-9.-] instead of [A-Za-z0-9\.\-]  ?
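The two spellings can be compared directly. In PCRE, a period inside a character class is a literal period, and a dash placed last (or first) in the class is a literal dash, so the backslashes should be optional here. A minimal sketch, with test strings made up for illustration:

```php
<?php
// Same character class, written with and without escapes.
$escaped   = '/^[A-Za-z0-9\.\-]+$/';
$unescaped = '/^[A-Za-z0-9.-]+$/';

// Test strings, made up for this sketch.
$samples = ['some-website.co', 'a.b-c', 'some_website', 'some website'];

foreach ($samples as $text) {
    $a = preg_match($escaped, $text);
    $b = preg_match($unescaped, $text);
    printf("%-18s escaped: %d  unescaped: %d\n", $text, $a, $b);
}
```

Both versions give the same result for every string. The dash needs care only when it sits between two other characters in the class, where it would be read as a range.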

     

     

    With greatest respect and admiration, would be grateful for further discussion. 

  8. Hm... I think I can see how this may work, but something seems missing here. Aren't you supposed to have * instead of + after  [000-\031]?

     

    Seriously though, Larry, I appreciate your humorous (even if slightly extreme) input very much, but... let's think calmly. The chapter on regular expressions in your book is teaching how to use regular expressions, not filter_var() function.

     

    My question is about regular expressions, not filter_var() function. I think it's fair of me to treat this forum with respect as a trustworthy source of information, ask questions about things that interest me, related to the book I'm trying to learn from, and expect a straight answer from knowledgeable people.

     

    I'm sure it doesn't have to be so hard. I'm not asking for a good alternative for a regex from that chapter. I'm asking about a minor problem, related to regex specifically, a solution to which I'm sure is clear to you, experienced coders, but is not clear to me. 

     

    I am NOT asking a question about how to validate emails, in general.

     

    I'm asking this: 

     

    "How to create a regex pattern that would cover things that happen between @ and . in an email address?"

     

    Or, to rephrase it: "what is the proper regex pattern that includes: letters, numbers, period sign, and a dash"? 

     

    You guys know your regex, I'm new to it. Help would be much appreciated.

  9. Thanks, HartleySan!

     

    Again, I'm fully with you and Antonio and Edward, as far as using filter_var, and not having to be perfect in most practical situations.

     

    But purely theoretically – this is a forum, dedicated to honing one's programming skills, and we can allow ourselves to play a little, can't we?  

     

    So, my question to you, HartleySan. The solution that you proposed: [A-Za-z0-9] - am I correct in thinking that it would eliminate the following perfectly valid email address as invalid? 

     

    somename@some-website.com

     

    And if yes, as it seems to be, then what would be a better solution?
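A small test of the domain label on its own can confirm this. The sketch below compares `[A-Za-z0-9]` with the same class plus a trailing dash. The variable names and sample strings are made up for illustration.

```php
<?php
// Letters and digits only, versus letters, digits and a dash.
$lettersDigits = '/^[A-Za-z0-9]+$/';
$withDash      = '/^[A-Za-z0-9-]+$/';

foreach (['somewebsite', 'some-website', 'some_website'] as $label) {
    printf(
        "%-14s letters/digits: %d  with dash: %d\n",
        $label,
        preg_match($lettersDigits, $label),
        preg_match($withDash, $label)
    );
}
```

Only the second class accepts `some-website`, and neither accepts `some_website`.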

  10. You have a point here, Antonio.

     

    Still: perfection, in regex, I guess, is when it does exactly what it is meant to. If regex is written to check for a valid email address, it's supposed to check for a valid email address, no more, no less. This one doesn't do the job. I want the one that does.

     

    Yes, a user can submit something that is not an actual email address. But I don't want that user to submit something that is not a valid email address. There's a difference. When things don't do what they are supposed to be doing, next thing that usually happens, the entire civilization crumbles. We mustn't let that happen. 

     

    Curiosity won't let me sleep this night: how can we remove the damn _ from \w?
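One common idiom for "`\w` without the underscore" is the negated class `[^\W_]`: it excludes every non-word character and the underscore, which leaves letters and digits. A minimal sketch with made-up test strings:

```php
<?php
// [^\W_] = any character that is neither a non-word character nor "_".
$wordNoUnderscore = '/^[^\W_]+$/';

foreach (['abc123', 'ABC', 'abc_123', '_'] as $text) {
    $result = preg_match($wordNoUnderscore, $text) === 1 ? 'match' : 'no match';
    printf("%-8s %s\n", $text, $result);
}
```

The first two strings match. The two containing an underscore do not. Periods and dashes are not part of `\w`, so they have to be added back separately where they are wanted.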

  11. I have a question, inspired by the "Random Quotes" mini-application, used as an example of writing to a file and reading from a file, in Chapter 11 of this book.

     

    The "read from file" part of this application picks a random quote from an array, and outputs it to a web browser.

     

    I want to figure out the simplest way to improve this script, so that the random quotes that it outputs are not repeated, as long as there are quotes left in the file.

     

    For example, the script has output the array element # 0. I want it to remember that the element # 0 has been output, and the next time to output any other element of the array, but not 0 - and so on, until the script runs out of quotes, in which case it can start over again.

     

    I'm not asking this for any important or practical reason, but I'm just curious, if a simple and elegant solution could be found.

     

    Would appreciate a creative suggestion!
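One possible approach, sketched below, is to keep a shuffled list of the indexes that have not been shown yet and take one off the end on each request. The function name `nextQuote()` and the sample quotes are hypothetical. In the real script, the quotes would come from the file. The `$remaining` list would also have to be stored between page requests, for example in the session or in a second file, which is an assumption not covered here.

```php
<?php
/**
 * Hypothetical helper: returns [quote, updated list of unused indexes].
 * When no unused indexes are left, the list is refilled and reshuffled.
 */
function nextQuote(array $quotes, array $remaining)
{
    if (count($quotes) === 0) {
        return [null, []];               // nothing to show
    }
    if (count($remaining) === 0) {
        $remaining = array_keys($quotes); // start over with every index
        shuffle($remaining);              // random order for this round
    }
    $index = array_pop($remaining);       // take one unused index
    return [$quotes[$index], $remaining];
}

// Stand-in data; the real script would read these from the quotes file.
$quotes = ['Quote A', 'Quote B', 'Quote C'];
$remaining = [];

for ($i = 0; $i < 3; $i++) {
    list($quote, $remaining) = nextQuote($quotes, $remaining);
    echo $quote, "\n";
}
```

Each round shows every quote once in random order. A repeat is still possible across the boundary between two rounds, when the last quote of one round happens to be the first of the next.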

  12. Chapter 11, pg. 301, contains the following text:

     

     

    If you are running PHP 5.1 or greater, you can add the LOCK_EX constant as the third argument to file_put_contents():

     

    file_put_contents($file, $data, LOCK_EX);

     

    To use both the LOCK_EX and FILE_APPEND constants, separate them with the binary OR operator (|):

     

    file_put_contents($file, $data, FILE_APPEND | LOCK_EX);

     

    It doesn’t matter in which order you list the two constants.

     

    This is a really interesting solution, and I'm surprised my mind didn't register it when I was reading that book for the first time.

     

    Could someone please explain to me why the use of | operator is possible in this situation?

     

    Is this a common use of such operator, allowing two arguments to be placed where only one is expected? Are there any other situations where | could be used in similar manner?
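A likely explanation is that each of these constants is an integer with its own bit set, so `|` merges them into a single integer and the function still receives one argument. The sketch below illustrates that idea. The `json_encode()` call is an extra example of another built-in function that accepts OR-combined flags.

```php
<?php
// Bitwise OR merges the two flag constants into one integer value.
$flags = FILE_APPEND | LOCK_EX;

var_dump(is_int($flags));                          // still a single integer
var_dump($flags === (LOCK_EX | FILE_APPEND));      // order does not matter
var_dump(($flags & FILE_APPEND) === FILE_APPEND);  // each flag can be tested with &
var_dump(($flags & LOCK_EX) === LOCK_EX);

// Another built-in function that takes OR-combined flags.
echo json_encode(
    ['url' => 'http://example.com/'],
    JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES
), "\n";
```

Inside the function, a bitwise AND (`&`) against each constant shows which options were requested, which is why the order of the constants makes no difference.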

     

    Thank you in advance for your wisdom!

  13. Chapter 11, pg. 293 includes the following text:

     

     

    "When appending data to a file, you normally want each piece of data to be written on its own line, so each submission should conclude with the appropriate line break for the operating system of the computer running PHP. This would be

     

    ■ \n on Unix and Mac OS X

    ■ \r\n on Windows"

     

    I'm pretty sure this is the first time the distinction between \n for Unix/Mac and \r\n for Windows is mentioned in this book.

     

    Does this mean that \n wouldn't work on a Windows-based server?

     

    Could someone please elaborate on this?
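As far as the line ending itself goes, PHP provides the `PHP_EOL` constant, which holds the line break of the platform PHP is running on, so a script does not have to hard-code either form. The sketch below appends two lines to a temporary file and reads them back. The file and its contents are made up for illustration.

```php
<?php
// PHP_EOL is the line ending of the platform PHP is running on.
echo bin2hex(PHP_EOL), "\n"; // 0a on Unix-like systems, 0d0a on Windows

// Temporary file used only for this sketch.
$file = tempnam(sys_get_temp_dir(), 'eol');

file_put_contents($file, 'first entry' . PHP_EOL, FILE_APPEND | LOCK_EX);
file_put_contents($file, 'second entry' . PHP_EOL, FILE_APPEND | LOCK_EX);

// rtrim() strips either style of line ending (and other trailing whitespace).
$lines = array_map('rtrim', file($file));
print_r($lines);

unlink($file);
```

To my understanding, a plain "\n" is still written to the file unchanged on a Windows server. The difference mostly shows up when the file is opened in other Windows programs that expect "\r\n".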
