Scrapoxy starts a pool of proxies to relay your requests.
Now you can crawl without worrying about blacklisting!
It is written in JavaScript (ES6) with Node.js and AngularJS, and it is open source!
How does Scrapoxy work?
When Scrapoxy starts, it creates and manages a pool of proxies.
Your scraper uses Scrapoxy as a normal proxy.
Scrapoxy routes all requests through a pool of proxies.
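For example, a scraper can point at Scrapoxy exactly as it would at any HTTP proxy. Here is a minimal sketch using Python's requests library, assuming a local Scrapoxy instance listening on its default proxy port 8888 (adjust host and port to your configuration):

import requests

# 127.0.0.1:8888 is assumed to be the local Scrapoxy proxy endpoint.
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}

response = requests.get('http://example.com', proxies=proxies)
print(response.status_code)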
What does Scrapoxy do?
Create your own proxies
Use multiple cloud providers (AWS, DigitalOcean, OVH, Vscale)
Rotate IP addresses
Impersonate known browsers
Exclude blacklisted instances
Monitor the requests
Detect bottlenecks
Optimize scraping
Why doesn't Scrapoxy support anti-blacklisting?
Anti-blacklisting is a job for the scraper.
When the scraper detects blacklisting, it asks Scrapoxy to remove the proxy from the pool (through a REST API).
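For example, a scraper might report a blacklisted instance with a call along these lines. This is only a sketch: the /api/instances/stop endpoint, the 8889 commander port, the instance name, and the Authorization header format are assumptions drawn from Scrapoxy's API documentation, so check the docs for your version.

import requests

# Ask Scrapoxy to stop a blacklisted instance so a fresh one replaces it.
# Endpoint, port, and auth scheme are assumptions; see the Scrapoxy API docs.
response = requests.post(
    'http://127.0.0.1:8889/api/instances/stop',
    json={'name': 'awsec2/instance-42'},  # hypothetical instance name
    headers={'Authorization': '<base64-encoded password>'},
)
print(response.status_code)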
What is the best scraper framework to use with Scrapoxy?
You could use the open source Scrapy framework (Python).
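For instance, a Scrapy spider can route its requests through Scrapoxy by setting the proxy in each request's meta. A minimal sketch, again assuming Scrapoxy's proxy endpoint at 127.0.0.1:8888:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        # Route every request through the local Scrapoxy endpoint.
        yield scrapy.Request(
            'http://quotes.toscrape.com/',
            meta={'proxy': 'http://127.0.0.1:8888'},
        )

    def parse(self, response):
        for text in response.css('span.text::text').getall():
            yield {'text': text}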
Does Scrapoxy have a SaaS mode or a support plan?
Scrapoxy is an open source tool. The source code is actively maintained. You are very welcome to open an issue for features or bugs.
If you are looking for a commercial product in SaaS mode or with a support plan, we recommend checking out the ScrapingHub products (ScrapingHub is the company that maintains the Scrapy framework).
Contribute
You can open an issue on this repository for any feedback (bug, question, feature request, pull request, etc.).
License
See the License.
And don’t forget to be POLITE when you write your scrapers!
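The remainder of this page concerns scrapy-proxy-pool, a Scrapy extension with a similar rotation goal. You enable it in settings.py by turning it on and registering its downloader middlewares; the paths and priorities below follow the project's README:

# settings.py
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
    # ...
}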
After this, all requests will be routed through proxies from the pool.
Requests with "proxy" set in their meta are not handled by scrapy-proxy-pool. To disable proxying for a request, set request.meta['proxy'] = None; to set a proxy explicitly, use request.meta['proxy'] = '<proxy-address>'.
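For example (a minimal sketch; the proxy address is a placeholder):

# Disable proxying for a single request:
yield scrapy.Request(url, meta={'proxy': None})

# Use an explicit proxy for a single request, bypassing the pool:
yield scrapy.Request(url, meta={'proxy': 'http://1.2.3.4:8000'})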
Concurrency
By default, all of Scrapy's concurrency options (DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become per-proxy for proxied requests when ProxyPoolMiddleware is enabled. For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN = 2, the spider will make at most 2 concurrent connections to each proxy, regardless of the request URL's domain.
Customization
scrapy-proxy-pool keeps track of working and non-working proxies, and re-checks the non-working ones from time to time.
Detection of a non-working proxy is site-specific. By default, scrapy-proxy-pool uses a simple heuristic: if a response status code is not 200, 301, 302, 404, or 500, if the response body is empty, or if there was an exception, then the proxy is considered dead.
You can override the ban detection method by passing a path to a custom BanDetectionPolicy in the PROXY_POOL_BAN_POLICY option, e.g. (myproject.policy.MyPolicy below is a placeholder for your own class):
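# settings.py
PROXY_POOL_BAN_POLICY = 'myproject.policy.MyPolicy'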
The policy must be a class with response_is_ban and exception_is_ban methods. These methods can return True (ban detected), False (not a ban), or None (unknown). It can be convenient to subclass and modify the default BanDetectionPolicy:
# myproject/policy.py
from scrapy_proxy_pool.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # Use the default rules, but also treat a response whose body
        # contains 'captcha' as a ban.
        ban = super(MyPolicy, self).response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # Override the method completely: don't take exceptions into account.
        return None
Instead of creating a policy, you can also implement response_is_ban and exception_is_ban as spider methods, for example:
class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None
It is important to get these rules right, because the action for a failed request and for a bad proxy should be different: if the proxy is to blame, it makes sense to retry the request with a different proxy.
Settings
- PROXY_POOL_ENABLED - whether to enable ProxyPoolMiddleware;
- PROXY_POOL_FILTER_ANONYMOUS - whether to use anonymous proxies only; False by default;
- PROXY_POOL_FILTER_TYPES - which proxy types to use; only 'http' and 'https' are available; ['http', 'https'] by default;
- PROXY_POOL_FILTER_CODE - which proxy country code to use; 'us' by default;
- PROXY_POOL_REFRESH_INTERVAL - proxy refresh interval in seconds; 900 by default;
- PROXY_POOL_LOGSTATS_INTERVAL - stats logging interval in seconds; 30 by default;
- PROXY_POOL_CLOSE_SPIDER - when True, the spider is stopped if there are no alive proxies. If False (default), all dead proxies are re-checked when there are no alive proxies;
- PROXY_POOL_FORCE_REFRESH - when True, the spider will force a proxy refresh if there are no alive proxies. If False (default), requests are sent with the host IP when there are no alive proxies;
- PROXY_POOL_PAGE_RETRY_TIMES - the number of times to retry downloading a page using a different proxy. After this number of retries, the failure is considered a page failure, not a proxy failure. Think of it this way: every improperly detected ban costs you PROXY_POOL_PAGE_RETRY_TIMES alive proxies. Default: 5.
It is possible to change this option per-request using the max_proxies_to_try request.meta key - for example, you can use a higher value for certain pages if you're sure they should work.
- PROXY_POOL_TRY_WITH_HOST - when True, the spider will still try requests that exceed PROXY_POOL_PAGE_RETRY_TIMES.
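To illustrate, a settings.py fragment combining a few of these options could look like the following (the values are arbitrary examples, not recommendations):

# settings.py
PROXY_POOL_ENABLED = True            # turn ProxyPoolMiddleware on
PROXY_POOL_FILTER_ANONYMOUS = True   # only pick anonymous proxies
PROXY_POOL_FILTER_TYPES = ['https']  # restrict the pool to HTTPS proxies
PROXY_POOL_FILTER_CODE = 'us'        # restrict the pool to US proxies
PROXY_POOL_REFRESH_INTERVAL = 900    # refresh the proxy list every 15 minutes
PROXY_POOL_PAGE_RETRY_TIMES = 5      # retries before a failure counts as a page failure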
I am quite new to Scrapy (and my background is not in informatics). There is a website that I can't visit with my local IP, since I am banned, but I can visit it using a VPN service in the browser. So that my spider can crawl it, I set up a pool of proxies that I found at http://proxylist.hidemyass.com/. With that, my spider is able to crawl and scrape items, but my doubt is: do I have to change the proxy pool list every day? Sorry if my question is a dumb one...
Here is my settings.py:
Here is my middlewares.py:
Another question: if the website is HTTPS, should I have a proxy pool list for HTTPS only, and then another middleware class HTTPSProxyMiddleware(object) that receives a list HTTPS_PROXIES?
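For reference, a scheme-aware middleware like the one described might be sketched as follows. This is hypothetical: HTTP_PROXIES and HTTPS_PROXIES are assumed to be custom lists defined in settings.py, not built-in Scrapy options.

import random

class SchemeAwareProxyMiddleware(object):
    """Hypothetical middleware that picks a proxy matching the request scheme."""

    def __init__(self, http_proxies, https_proxies):
        self.http_proxies = http_proxies
        self.https_proxies = https_proxies

    @classmethod
    def from_crawler(cls, crawler):
        # HTTP_PROXIES / HTTPS_PROXIES are assumed custom settings.
        return cls(
            crawler.settings.getlist('HTTP_PROXIES'),
            crawler.settings.getlist('HTTPS_PROXIES'),
        )

    def process_request(self, request, spider):
        # Pick a random proxy from the pool matching the request scheme.
        pool = self.https_proxies if request.url.startswith('https') else self.http_proxies
        if pool:
            request.meta['proxy'] = random.choice(pool)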
Here is my rotate_useragent.py:
One more question, and the last (sorry again if it is a stupid one): in settings.py there is a commented-out default section: # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'reviews (+http://www.yourdomain.com)'. Should I uncomment it and put my personal information, or just leave it as it is? I want to crawl efficiently while following the good policies and habits that avoid possible ban issues...
I am asking all this because, with these settings, my spiders started to throw errors.