Sunday 27 September 2020

Avoiding censorship while web scraping: Testing, Buying and Setting up web proxies on AWS EC2 ft. proxy6.net PART 2

 

This post discusses how to set up a proxy on your AWS EC2 instance.

First, if you are completely new to EC2 and want to learn how to set up and connect to an instance, I would highly recommend this video. AWS provides a free micro instance for 1 year, so go ahead and try it.



I assume you have now set up your EC2 instance and connected to it using PuTTY.

 

Know your current public IP:

To find your public IP, you can log in to your AWS console <https://aws.amazon.com/console/> and check it there.



1. We are more interested in how the outside world sees us, so let’s check this from the AWS instance itself. Run this on your EC2 instance:

$ curl checkip.dyndns.org

This would return something like:

<html><head><title>Current IP Check</title></head><body>Current IP Address: 245.902.34.171</body></html>

When there is no proxy, this address and the public IP shown in the AWS console should match.



2. At this point, if you run curl www.nseindia.com you will get a “blocked” message in return.
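
If you prefer to script the check from step 1, here is a minimal Python sketch. It assumes the service keeps answering in the "Current IP Address: a.b.c.d" format shown above; the function names extract_ip and current_public_ip are my own, not part of any library.

```python
import re
import urllib.request

# Assumption: the response keeps the "Current IP Address: a.b.c.d" wording.
IP_PATTERN = re.compile(r"Current IP Address:\s*(\d{1,3}(?:\.\d{1,3}){3})")


def extract_ip(html):
    """Return the IPv4-looking address found in the response, or None."""
    match = IP_PATTERN.search(html)
    return match.group(1) if match else None


def current_public_ip(url="http://checkip.dyndns.org", timeout=10):
    """Fetch the page and return the address it reports."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_ip(response.read().decode("utf-8", errors="replace"))
```

Calling print(current_public_ip()) on the instance should print the same address that curl shows. urllib reads the http_proxy and https_proxy environment variables by default, so the same call can be reused later to see whether a proxy is in effect.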

Ref:

1. https://www.shellhacks.com/linux-proxy-server-settings-set-proxy-command-line/

2. http://www.linuxandubuntu.com/home/how-to-use-proxy-on-linux-command-line

 

Now our goal is to change this IP.

 

Setting up proxies:

1. This one is very simple: you just need to substitute your proxy IP, PORT, username and password in the commands below and run them on EC2.

 

export http_proxy=http://username:password@hostIP:PORT/

export https_proxy=http://username:password@hostIP:PORT/

export no_proxy="localhost,127.0.0.1,*.my.lan"

example:

export http_proxy=http://DFER5:qyR7t@238.60.64.145:7700/
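
One thing to watch for: if your username or password contains characters such as @ or :, they can be mistaken for URL delimiters. A small sketch of one way to build the value safely is below. The helper build_proxy_url is hypothetical (my own name), and it relies on the proxy client decoding percent-encoded credentials, which curl does as far as I know.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Assemble scheme://user:password@host:port/ with escaped credentials."""
    # quote(..., safe="") percent-encodes characters such as '@', ':' and '/'
    # that would otherwise be read as URL delimiters.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}/"
```

For the example values above, build_proxy_url("DFER5", "qyR7t", "238.60.64.145", 7700) gives back the same string used in the export line.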

 


2. Read back these variables:

env | grep -i proxy
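
The grep above prints the proxy password in clear text. If you want the same check from Python with the password hidden, something like this rough sketch may do. proxy_variables is a hypothetical helper of mine, and unlike grep it only looks at variable names, not values.

```python
import os
import re


def proxy_variables(environ=None):
    """Return variables whose name contains 'proxy', with passwords masked."""
    environ = os.environ if environ is None else environ
    found = {}
    for name, value in environ.items():
        if "proxy" in name.lower():
            # Replace the password in scheme://user:password@host with asterisks.
            found[name] = re.sub(r"(://[^:/@]+:)[^@]+@", r"\1****@", value)
    return found
```

Calling proxy_variables() with no arguments reads the current environment; passing a plain dict is handy for trying it out.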

 

3. Check whether your proxy is working.

Run the same curl command again and look at the returned IP. It should now be your proxy IP.

$ curl checkip.dyndns.org

If you run into any difficulty, read the two reference links above.

 

4. If you wish to check the proxy without setting these variables, you can test it using:

curl -x http://IP:PORT/ --proxy-user USER:PASSWORD -L checkip.dyndns.org
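
The same idea as curl -x, passing the proxy explicitly instead of through environment variables, can be sketched in Python with the standard library. The names make_proxy_opener and fetch_via_proxy are my own, and I am assuming a plain HTTP proxy with the credentials embedded in the URL.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch url through the proxy and return the body as text."""
    opener = make_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, fetch_via_proxy("http://checkip.dyndns.org", "http://USER:PASSWORD@IP:PORT/") should return the page with the proxy IP in it, provided the proxy details are valid.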

 

5. Now let us try to access our blocked website, www.nseindia.com.

Try: curl www.nseindia.com

This time it will not return the earlier block message, but it will hang, and maybe after a long time it will return curl: (52) Empty reply from server

6. This is because we need to send proper headers with the curl request. If you try a Python module such as nsepy (pip install nsepy), this will work.

$ python3

Python 3.8.2 (default, Jul 16 2020, 14:00:26)

[GCC 9.3.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> import nsepy

>>> from datetime import date

>>> print(nsepy.get_history(symbol='SBIN',

...                     start=date(2015,1,1),

...                     end=date(2015,1,2)))

   

Output :

 

      Symbol Series  Prev Close    Open   High     Low   Last   Close    VWAP   Volume      Turnover  Trades  Deliverable Volume  %Deliverble

Date

2015-01-01   SBIN     EQ      311.85  312.45  315.0  310.70  314.0  314.00  313.67  6138488  1.925489e+14   58688             1877677       0.3059

2015-01-02   SBIN     EQ      314.00  314.35  318.3  314.35  315.6  315.25  316.80  9935094  3.147389e+14   79553             4221685       0.4249
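
To illustrate the point in step 6 about headers without relying on nsepy, here is a rough sketch of attaching browser-like headers to a request with the standard library. The header values are illustrative assumptions on my part. I do not know exactly which headers the site checks, so this may or may not be enough on its own.

```python
import urllib.request

# Assumption: illustrative browser-like headers; the values the site
# actually requires are not known here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Return a Request carrying browser-like headers; extras override defaults."""
    merged = dict(BROWSER_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url, headers=merged)
```

You would then pass the result to urllib.request.urlopen(build_request("https://www.nseindia.com"), timeout=10), which also picks up the proxy variables exported earlier.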

 

Note: if you want to use the proxy permanently, it is better to add these variables to your /etc/environment file as well, so that you do not have to set them every time. You can refer to the two reference posts above for that.
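
As a small aid for the note above, this sketch renders the lines you might append to /etc/environment. It assumes the file takes plain KEY="value" lines without the export keyword, which is my understanding of the format; environment_lines is a hypothetical helper, and it only builds the text, it does not write the file.

```python
def environment_lines(proxy_url, no_proxy="localhost,127.0.0.1"):
    """Render KEY="value" lines for the proxy variables used in this post."""
    settings = {
        "http_proxy": proxy_url,
        "https_proxy": proxy_url,
        "no_proxy": no_proxy,
    }
    return [f'{name}="{value}"' for name, value in settings.items()]
```

Printing "\n".join(environment_lines("http://username:password@hostIP:PORT/")) gives three lines that you can review before adding them to the file with root privileges.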

 

If you have any difficulty following this, let me know. I have some topics in mind for the next post, such as using POSTMAN for effective web scraping.


