Larry Ullman's Book Forums

Dimitri Vorontzov

Members
  • Content Count

    72

Community Reputation

0 Neutral

About Dimitri Vorontzov

  • Rank
    Advanced Member
  1. Thanks, HartleySan, and you're absolutely right: I'm now trying to define a set of requirements that would validate a reasonably large share of the most likely forms of whatever comes before the @ in an email address, and I'm right in the middle of researching that. Obviously, I'm not just sitting around waiting for someone to solve my problems: for the last few days I've mostly been researching regex. Thanks for the resource on lookarounds, it's indeed very valuable!
  2. Antonio, point taken, thank you. Ditto HartleySan, but to answer your question about what else I want to know, it's this: I'm well aware I'm caught in the newbie trap of trying to invent the perfect regex to match an email address. But my goal isn't to match all possible versions of an email address; I just want to figure out a regex that would work with the majority of normal email addresses. The purpose of that "quixotic pursuit" is not to validate emails in any practical application, but rather to master the PHP flavor of regex, using email validation as an example, for lack of a better one. And I'm stuck with that goal, because the chapter on regex in Larry's book is where that book suddenly became challenging for me. So if Larry wants to bail out, it's fine with me, and it's definitely his right, but I can't. Does this make any sense to you? So, back to the "what do I still want to know" question: I'm actually satisfied with the part of the most recent version of that regex that comes after the @. I looked far and wide and for the life of me I can't come up with a valid domain name that wouldn't be matched by that part, so I'm okay with it. What comes before the @, however, I don't like at all. It sucks. It would validate any run of word characters, periods and dashes: _______@somewebsite.com, .@somewebsite.com, and so on. As I said, I want it to validate a reasonable majority of normal email addresses. So I want to figure out ways to improve it, and that's why I ask questions about it. And I could, of course, be doing that on some other guy's forum, but since it's Larry's book that I'm studying, it's only logical to ask the questions here. Help from someone knowledgeable would be appreciated, even though I obviously can't insist.
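To make the complaint about the part before the @ concrete, here is a minimal PHP sketch. It runs the thread's current pattern against the two addresses mentioned in the post and compares it with one possible tightening of the local part. The stricter pattern, the constant names and the helper `localPartCheck()` are assumptions made for illustration; they are not proposed in the thread or in the book, and the stricter pattern will reject some addresses that are technically legal.

```php
<?php
// Pattern from the thread: any run of word characters, periods and dashes before the @.
const LOCAL_LOOSE = '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// Hypothetical tightening (an assumption, not from the thread): the local part is
// made of runs of letters or digits, separated by single periods, underscores or dashes.
const LOCAL_STRICT = '/^[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// True when the whole address matches the given pattern.
function localPartCheck(string $pattern, string $email): bool
{
    return preg_match($pattern, $email) === 1;
}

$samples = [
    '_______@somewebsite.com',
    '.@somewebsite.com',
    'some.name@somewebsite.com',
    'some..name@somewebsite.com',
];

foreach ($samples as $email) {
    printf(
        "%-28s loose=%s strict=%s\n",
        $email,
        var_export(localPartCheck(LOCAL_LOOSE, $email), true),
        var_export(localPartCheck(LOCAL_STRICT, $email), true)
    );
}
```

The design choice in the stricter version is to require a letter or digit at both ends of the local part and between any two separators, which is what rules out addresses made only of underscores or periods.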
  3. Thanks, Jonathon! I wasn't aware of regexlib.com – it's an excellent resource, and I appreciate your posting it. It's Larry who deserves full credit for "quixotic".
  4. Sure, Larry, I understand. And after all, it's your forum, isn't it? You can even ban me from it at the click of a button whenever you wish. But try to look at it from a different perspective. What if someone is searching Google for "regex to validate email" and finds this already popular thread in your forum? They will be attracted by the discussion, and then they may become curious: "Who is this guy, Larry Ullman? Oh, interesting, he wrote quite a few books on PHP! And on other languages! Why don't I check them out!" And then that person will buy your books, read them, learn from them, and become a better web developer and programmer, and in the end there will be one more person in the world liking and respecting you for your teaching and writing abilities. Is this a quixotic pursuit? Most likely, yes. It may even be herculean. But I think it's worth it. On the other hand, is learning to forge effective regular expressions a quixotic pursuit? I don't think so. I think it's actually a very reasonable and rational pursuit.
  5. Great. Now let's see if we can further improve it. The pattern I've arrived at (thanks to everyone's help!) is this:

     ^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$

     It will validate email addresses like these: somename@somewebsite.com, somename@some-website.com, or somename@somewebsite.co.uk. However, it will also validate these ugly things: somename@some......website.com, somename@--somewebsite..com, and similar things.

     To fix that problem, we can try to replace the pattern after the @ that allows any number of letters, numbers, periods and dashes:

     [A-Za-z0-9.-]+

     with a pattern that requires the string to start with (at least one letter or number, followed by either one period sign or at least one dash) used zero or more times, followed by at least one letter or number:

     ([a-zA-Z0-9]+\.|-+)*[a-zA-Z0-9]+

     Which results in the following regex:

     ^[\w.-]+@([a-zA-Z0-9]+\.|-+)*[a-zA-Z0-9]+\.[A-Za-z]{2,6}$

     Now I'd be curious to know what other possible vulnerabilities it has that could be fixed. Would appreciate some input from the esteemed members of this forum!
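One thing worth checking in the pattern from post 5 is how the alternation groups. In PCRE, `|` binds more loosely than concatenation, so `([a-zA-Z0-9]+\.|-+)` reads as "letters or digits followed by a period" OR "one or more dashes", which is not quite what the prose describes. The sketch below compares three versions on a few sample addresses. `DOMAIN_REGROUPED` is one hypothetical way to write the prose description, and all constant and function names are assumptions made for illustration.

```php
<?php
// The earlier pattern from the post: any mix of letters, digits, periods and dashes after the @.
const DOMAIN_LOOSE = '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// The revised pattern exactly as written in the post.
const DOMAIN_AS_WRITTEN = '/^[\w.-]+@([a-zA-Z0-9]+\.|-+)*[a-zA-Z0-9]+\.[A-Za-z]{2,6}$/';

// Hypothetical regrouping that follows the prose: (letters/digits, then a period
// or a run of dashes), repeated, then letters/digits. Not taken from the thread.
const DOMAIN_REGROUPED = '/^[\w.-]+@(?:[a-zA-Z0-9]+(?:\.|-+))*[a-zA-Z0-9]+\.[A-Za-z]{2,6}$/';

// True when the whole address matches the given pattern.
function domainCheck(string $pattern, string $email): bool
{
    return preg_match($pattern, $email) === 1;
}

$samples = [
    'somename@somewebsite.co.uk',
    'somename@some-website.com',
    'somename@--somewebsite.com',
    'somename@some......website.com',
];

foreach ($samples as $email) {
    printf(
        "%-32s loose=%s as_written=%s regrouped=%s\n",
        $email,
        var_export(domainCheck(DOMAIN_LOOSE, $email), true),
        var_export(domainCheck(DOMAIN_AS_WRITTEN, $email), true),
        var_export(domainCheck(DOMAIN_REGROUPED, $email), true)
    );
}
```

Running it suggests that the pattern as written rejects somename@some-website.com while still accepting a domain that starts with dashes. The regrouped version behaves the other way around on those two inputs, though it still allows a run of several dashes between labels.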
  6. Oh, Larry, when one day you finally meet me in person, as I hope, you will realize that wrath and yours truly are completely incompatible. I got the point about filter_var() being the preferred method; thank you very much for reinforcing it, Larry. Still, what I meant by a regular expression designed to match an email address "doing the job" is this: validate only email addresses that may include letters, dots, dashes, numbers or underscores before the @; letters, numbers, dashes and dots but NO underscores after the @; and a certain reasonable number of letters after the last dot. No more, no less. Which is another way of saying, "validate about 99.999% of all email addresses used in countries familiar with the Latin alphabet". Imperfectly, perhaps, but well enough. (I'm actually astounded that I somehow managed to express my desire to understand how to create that pattern in such a confusing manner that apparently everyone reading this thread found the complex subject matter of regular expressions easier to understand than the simple thing I was asking. But then again, English isn't my first language...) Also, I hope no one would deny that this discussion, quite thrilling at times, has, thanks to everyone participating in it, added a bit of useful content to this already content-rich forum.
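The requirements spelled out in post 6 can be assembled into a pattern piece by piece and compared with `filter_var()`, which the thread calls the preferred method. This is only a sketch: the helper names `specPattern()` and `describeEmailChecks()` are hypothetical, and the two- to six-letter ending is the limit used by the thread's patterns.

```php
<?php
// Build the pattern from the three requirements listed in the post.
function specPattern(): string
{
    $beforeAt = '[\w.-]+';         // letters, digits, underscores, periods, dashes
    $afterAt  = '[A-Za-z0-9.-]+';  // letters, digits, periods, dashes, but no underscore
    $ending   = '[A-Za-z]{2,6}';   // two to six letters after the last period

    return '/^' . $beforeAt . '@' . $afterAt . '\.' . $ending . '$/';
}

// Run the same address through the regex and through PHP's built-in filter.
function describeEmailChecks(string $email): array
{
    return [
        'regex'      => preg_match(specPattern(), $email) === 1,
        // filter_var() returns the address on success and false on failure.
        'filter_var' => filter_var($email, FILTER_VALIDATE_EMAIL) !== false,
    ];
}

foreach (['somename@some-website.com', 'somename-at-somewebsite.com'] as $email) {
    echo $email, ' ', json_encode(describeEmailChecks($email)), "\n";
}
```

The two checks follow different rule sets, so they can disagree on edge cases; the sketch only makes it easy to try an address against both.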
  7. Thank you, Larry! This is an important clarification. I wasn't just idly waiting for your reply, and did a bit of research. It turns out there are at least two more answers to my initial question that appear to be correct. Instead of this:

     ^[\w.-]+@[A-Za-z0-9\.\-]+\.[A-Za-z]{2,6}$

     or this (without escape characters):

     ^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$

     the same "perfect" email validation pattern can apparently be expressed a bit shorter, like this:

     ^[\w.-]+@[^\W_]+\.[A-Za-z]{2,6}$

     or even like this:

     ^[\w.-]+@(?!_)\w+\.[A-Za-z]{2,6}$

     I'm wondering if you, Larry, or other guys on the forum, would agree to take a look at these two examples and confirm that they would do the job in a PHP context. (I would understand if no one replies, though: people seem to get annoyed by children when they ask endless questions, and I've never really dropped that childhood habit. It's not empty curiosity in my case, however: I rather think of it as a slightly extended "Review and Pursue" section.)
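Whether the two shorter patterns in post 7 really do the same job can be checked directly. The sketch below runs them next to the `[A-Za-z0-9.-]+` version on a handful of addresses; the constant names and the helper `altCheck()` are assumptions made for illustration. Note that `[^\W_]` is a class of letters and digits only, so it has no room for periods or dashes, and `(?!_)` is a lookahead that inspects only the single character right after the @.

```php
<?php
// The pattern the two alternatives are meant to replace.
const ORIGINAL_DOMAIN = '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// Alternative 1 from the post: "any word character except the underscore".
const ALT_NEGATED_CLASS = '/^[\w.-]+@[^\W_]+\.[A-Za-z]{2,6}$/';

// Alternative 2 from the post: a negative lookahead in front of \w+.
const ALT_LOOKAHEAD = '/^[\w.-]+@(?!_)\w+\.[A-Za-z]{2,6}$/';

// True when the whole address matches the given pattern.
function altCheck(string $pattern, string $email): bool
{
    return preg_match($pattern, $email) === 1;
}

$samples = [
    'somename@somewebsite.com',
    'somename@some-website.com',
    'somename@somewebsite.co.uk',
    'somename@some_website.com',
    'somename@_website.com',
];

foreach ($samples as $email) {
    printf(
        "%-28s original=%s negated=%s lookahead=%s\n",
        $email,
        var_export(altCheck(ORIGINAL_DOMAIN, $email), true),
        var_export(altCheck(ALT_NEGATED_CLASS, $email), true),
        var_export(altCheck(ALT_LOOKAHEAD, $email), true)
    );
}
```

On these samples the three patterns do not agree: the shorter versions lose the dash and the extra period in the domain, and the lookahead version still lets an underscore through when it is not the first character after the @.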
  8. Aha! That's new and interesting, thank you Larry. Also, I added a second question to the previous post, could you please take a look and comment? I'd be grateful for your input. I'm asking my questions because I really want to know, not just to be an ^a(ss|rse)hole$.
  9. Not berating at all: I respect everyone on this forum, and personally you, Larry. I just thought my question was clear from the beginning, namely:

     So I was a little taken aback by what I perceived as all the guys ignoring or misinterpreting the actual question, and persistently offering a workaround instead. I should have blamed myself instead and rephrased the question right away, but I truly thought what I was asking about was clear from the start. I also honestly thought that you were joking when you posted the huge chunk of code above. I thought you implied the absurdity of my "quest for perfection" by posting an absurdly long (and not real) regex. As I mentioned before, I'm new to regex. One way or another, your follow-up answer is exactly what I was looking for, Larry, and I thank you very much for it. Also, you're right, it does lead to a new question, and not one, but two. And here they are:

     First Question: The initial regex, as offered in the book, is:

     ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$

     My problem with it was that it would consider some_website.com as valid. The modified version, which would not validate some_website.com, looks like this:

     ^[\w.-]+@[A-Za-z0-9\.\-]+\.[A-Za-z]{2,6}$

     This seems to cover pretty much everything, namely: at least one upper- or lower-case letter, digit, underscore, period or dash; followed by one and only one @; followed by at least one upper- or lower-case letter, digit, period or dash; followed by a period; followed by anywhere from two to six upper- or lower-case letters. I don't see what else could be in an email address, so the second regex above seems perfect. And yet, Larry, since you're saying that the huge regex you posted is the actual email validation code, I'm baffled again. What are the vulnerabilities of the second regex above that would make it not work?

     Second Question: Is escaping the period sign and the dash inside the character class necessary in the PHP version of regex? Or, to rephrase that: can it be [A-Za-z0-9.-] instead of [A-Za-z0-9\.\-] ? With greatest respect and admiration, I would be grateful for further discussion.
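Both questions in post 9 can be explored with a short PHP sketch. Inside a character class a period is always literal, and a dash is literal when it is the last (or first) character of the class, so the escaped and unescaped classes should accept the same characters; the loop below checks that byte by byte. The constant and function names are assumptions made for illustration.

```php
<?php
// The book's pattern and the modified pattern from the post.
const BOOK_PATTERN          = '/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$/';
const NO_UNDERSCORE_ESCAPED = '/^[\w.-]+@[A-Za-z0-9\.\-]+\.[A-Za-z]{2,6}$/';
const NO_UNDERSCORE_PLAIN   = '/^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}$/';

// Compare the escaped and unescaped classes on every single-byte character.
function escapedAndPlainClassesAgree(): bool
{
    for ($code = 0; $code < 256; $code++) {
        $char    = chr($code);
        $escaped = preg_match('/[A-Za-z0-9\.\-]/', $char);
        $plain   = preg_match('/[A-Za-z0-9.-]/', $char);
        if ($escaped !== $plain) {
            return false;
        }
    }
    return true;
}

// A dash between two characters is a range: [.-9] runs from "." to "9",
// which includes "/" but not the dash itself.
function dashInMiddleIsRange(): bool
{
    return preg_match('/[.-9]/', '/') === 1 && preg_match('/[.-9]/', '-') === 0;
}

var_dump(escapedAndPlainClassesAgree());
var_dump(dashInMiddleIsRange());
var_dump(preg_match(BOOK_PATTERN, 'somename@some_website.com'));
var_dump(preg_match(NO_UNDERSCORE_PLAIN, 'somename@some_website.com'));
```

Keeping the dash at the end of the class, as the unescaped version does, is what makes the backslash unnecessary; moving it between two other characters would silently turn it into a range.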
  10. Hm... I think I can see how this may work, but something seems to be missing here. Aren't you supposed to have * instead of + after [000-\031]? Seriously though, Larry, I appreciate your humorous (even if slightly extreme) input very much, but... let's think calmly. The chapter on regular expressions in your book teaches how to use regular expressions, not the filter_var() function. My question is about regular expressions, not the filter_var() function. I think it's fair of me to treat this forum with respect as a trustworthy source of information, ask questions about things that interest me, related to the book I'm trying to learn from, and expect a straight answer from knowledgeable people. I'm sure it doesn't have to be so hard. I'm not asking for a good alternative to a regex from that chapter. I'm asking about a minor problem, related to regex specifically, a solution to which I'm sure is clear to you, experienced coders, but is not clear to me. I am NOT asking a question about how to validate emails in general. I'm asking this: "How to create a regex pattern that would cover things that happen between @ and . in an email address?" Or, to rephrase it: "What is the proper regex pattern that includes letters, numbers, the period sign, and a dash?" You guys know your regex; I'm new to it. Help would be much appreciated.
  11. Thanks, HartleySan! Again, I'm fully with you and Antonio and Edward as far as using filter_var goes, and not having to be perfect in most practical situations. But purely theoretically: this is a forum dedicated to honing one's programming skills, and we can allow ourselves to play a little, can't we? So, my question to you, HartleySan. The solution that you proposed: [A-Za-z0-9] - am I correct in thinking that it would reject the following perfectly valid email address as invalid? somename@some-website.com And if yes, as seems to be the case, then what would be a better solution?
  12. You have a point here, Antonio. Still: perfection, in regex, I guess, is when it does exactly what it is meant to do. If a regex is written to check for a valid email address, it's supposed to check for a valid email address, no more, no less. This one doesn't do the job. I want the one that does. Yes, a user can submit something that is not an actual email address. But I don't want that user to submit something that is not a valid email address. There's a difference. When things don't do what they are supposed to do, the next thing that usually happens is that the entire civilization crumbles. We mustn't let that happen. Curiosity won't let me sleep tonight: how can we remove the damn _ from \w?
  13. From a purely practical point of view, Antonio, you're quite right, but I'm in pursuit of perfection with this one. How can we say "any character in \w, except _"?
  14. Thank you, Edward (cool avatar!). This is, of course, a perfect solution, but I'm curious about how to improve this particular regex.