Larry Ullman's Book Forums

Ajax Cross-Origin Request Question



I have an interesting problem.

 

I use Ajax to make cross-origin requests a lot. More specifically, I use a local JS Ajax script and a PHP script that uses cURL to send a request to an external domain, get the HTML for a specific page, and then parse that HTML for whatever information I want. On the whole, I don't think this sort of thing is that unusual (and in fact, Larry briefly talks about how to do this in his JS book).
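For anyone unfamiliar with the setup, the PHP side of it can be sketched roughly as follows. This is only a minimal sketch: the function name fetch_remote_html, the timeout value, and the example URL in the comment are assumptions for illustration, not the actual script in use.

```php
<?php
// Hypothetical helper (name and timeout are assumptions for illustration):
// fetch a remote page's HTML with cURL and return it as a string.
function fetch_remote_html(string $url): string
{
    $ch = curl_init($url);

    // Return the response body instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow HTTP redirects, if the remote site issues any.
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Give up after 10 seconds rather than hanging indefinitely.
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $html = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($html === false) {
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    return $html;
}

// Example call (placeholder URL), left commented out so nothing runs on include:
// echo fetch_remote_html('http://www.example.com/');
```

The local JS Ajax script then requests this PHP script on its own domain, so the browser itself never makes a cross-origin request; the server does the fetching.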

 

I have never had any problems with this method until yesterday, when I ran into an interesting problem with one site in particular. The site is a popular Japanese-English dictionary. When I try to run a cross-origin search using their dictionary search feature, I always get the same page returned, which is a basic page stating that the server is busy and couldn't process my request.

 

I looked into things a bit more, and if I use the same JS Ajax script and PHP script to grab another random page on the domain not related to the search feature (e.g., the home page), then I can get the HTML fine, so the issue doesn't seem to occur across the entire domain.

 

I also tested things out on IE6 (because IE6 has no cross-origin request (COR) security), and I can get the contents of search results pages fine by simply sending an Ajax request to the desired search results page and parsing the returned HTML, which leads me to believe that this is some issue with modern browsers.

 

The only thing I can conclude is that the domain I'm trying to get search results on is somehow detecting that I'm making a cross-origin request, and intentionally blocking it by always returning the server busy page.

 

This is really unusual and the first time I've ever encountered this problem. Has anyone else ever encountered this problem, and if so, do you have any sort of solution?

 

Thank you very much.


Interesting. I don't think I have personally seen this, but your suspicions make sense. Perhaps this site knows its search results are a valuable resource and is blocking them from being accessed remotely. This, of course, would bring into question whether it's legal to use that content, regardless of whether it's technically possible.


The legality of the issue is definitely something I need to look into. I just assumed that if it was possible to get the HTML of pages through Ajax CO requests and cURL, then there is nothing illegal about it.

 

Anyway, this is truly a puzzling issue. I have used CORs for lots of things, but this is the first time I've ever encountered this, and like I said, it's only with the search-results pages on the site, and not the entire domain.

 

Also, due to the crappiness of IE6, I can easily go right through all the site's defenses by making a direct Ajax request from an HTML page and skipping the use of PHP and cURL altogether. As such, by using IE6, I can easily obtain the search results pages as I desire.

 

Recently, I have read a bit about the new way of blocking CORs by changing config files on a server, but as far as I know, even if a site blocks CORs, cURL can still get around that.

 

What I'm really wondering is: Is it possible for a domain/site to detect a COR coming from cURL, and block that as well? Honestly, I didn't even know it was possible.

 

Thanks.


Purely by coincidence, while playing around with this issue, I was able to find a similar example of the problem I am experiencing.

Please execute the following code with cURL enabled:

 

 

<?php
 // Request the page; without CURLOPT_RETURNTRANSFER, curl_exec() prints the response directly.
 $ch = curl_init('http://www.rottentomatoes.com/');
 curl_exec($ch);
 curl_close($ch);
?>

 

As you will see, the URL is completely valid and shows a page in a browser, but when you try to access it via cURL, as in the code above, it returns an error page of sorts.

Larry, what do you think the cause of this is? An intentional blocking of CORS requests? And more importantly, if the request is blocked, is there any way around it?


First, just to be clear for everyone out there, it's best to always assume that taking content from another Web site is illegal. If a site wants/allows you to use its content, it would likely provide a service for that purpose.

 

I don't think these sites are blocking CORS requests; they're blocking non-browser, non-robot requests. I suspect you could work around it by providing a fake user agent, although this again brings the legality into question.
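To illustrate what setting a user agent looks like with cURL, here is a minimal sketch. The function names build_request_options and fetch_with_user_agent, the timeout, and the user agent string in the comment are all assumptions for illustration; whether any given site accepts the request is not something this sketch can promise, and the legal question above still applies.

```php
<?php
// Hypothetical helper: build the cURL options, including an explicit User-Agent.
function build_request_options(string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the body as a string
        CURLOPT_USERAGENT      => $userAgent, // sent as the User-Agent request header
        CURLOPT_TIMEOUT        => 10,         // assumed timeout, in seconds
    ];
}

// Hypothetical helper: fetch a URL using the options above.
function fetch_with_user_agent(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, build_request_options($userAgent));

    $body = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    return $body;
}

// Example call (placeholder values), left commented out so nothing runs on include:
// echo fetch_with_user_agent('http://www.example.com/', 'ExampleClient/1.0');
```

By default, PHP's cURL extension sends no User-Agent header at all unless one is set, which is one way a server can tell such a request apart from a browser's.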


You really don't understand why that'd be illegal? Seriously? Just because it's made publicly available does not mean that anyone can redistribute it. On a practical level, understand that when users go to that site to view that content, that site gets the traffic (the hits) and possibly the advertising revenue. If you take the same content from their site and use it on your site in any way, you're depriving that site of the numbers, users, advertising, etc. that it deserves, not to mention driving up its bandwidth for your gain. I could go on about what the original site loses, but hopefully you can see many of the ramifications. As a specific example, my series introducing Yii is very popular and drives a lot of traffic to my site. Anyone can view that content for free, but my hope is that people who come to the site and view that content might sign up for my newsletter or view other blog posts or even buy a book. If you were to reproduce that content on your own site, through whatever means, then you'd have stolen those opportunities.

 

And if you're not buying the "harm" angle, then I can suggest a simpler explanation of how it's illegal. When you create an original work, you own the copyright to that work (without formally doing anything at all). Using someone else's work as if it were your own without attribution is a violation of copyright law. Even IF you were to use someone's work WITH attribution, it could still be a violation of copyright law, depending upon the use.

 

It's not a question of whether one has to pay to access the content in the first place; it's an issue of who gets credit and who gets the benefits. When you use someone else's content without their permission, the credit and the benefits are being stolen.
