Never Ending Security

It starts all here

How To Webscraping With Linux


Webscraping means extracting specific pieces of information from web pages. There are many frameworks and libraries for this, such as Nokogiri and Scrapy, but you’ll be glad to know that the same can be achieved with the command-line tools available on our Linux distro.

Linux offers some very powerful tools to support web scraping.

Some tools that can be used are:

  1. html-xml-utils
  2. grep
  3. sed
  4. awk
  5. wget
  6. curl, etc.

Used in combination, these tools give satisfying results.
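
As a small illustration of combining them, here is a minimal sketch that pulls the unique link targets out of a piece of HTML with grep, sed and awk. The sample markup and the variable names (sample_html, links) are made up for this example; on a real page the input would come from curl or wget instead.

```shell
#!/bin/bash
# Hypothetical sample page; a real run would fetch it with curl or wget.
sample_html='<p><a href="https://example.com/a">A</a>
<a href="https://example.com/b">B</a> and <a href="https://example.com/a">A again</a></p>'

links=$(printf '%s\n' "$sample_html" \
  | grep -o 'href="[^"]*"' \
  | sed -e 's/^href="//' -e 's/"$//' \
  | awk '!seen[$0]++')
# grep -o : print every href="..." attribute on its own line
# sed     : strip the leading href=" and the trailing quote
# awk     : drop duplicates while keeping the first-seen order

printf '%s\n' "$links"
```

Each tool does one small job, which makes it easy to swap a stage out when the page layout changes.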

Example:

Let us illustrate this by scraping some IP information, such as the city and ISP, from the site http://www.ip-adress.com/ip_tracer.
We’ll write a bash script that takes an IP address from the user as its first argument and returns the IP information found on the website. To follow along, you should be comfortable with the tools listed above and the arguments they are given in the script.

The Code:

#!/bin/bash
clear
 echo -e "\n\n$(tput setaf 2) $(tput bold)\n\n $(tput sgr0)"
echo ".___ ___________ "
 echo "| |_____ \__ ___/___________ ____ ___________ "
 echo "| \____ \ | | \_ __ \__ \ _/ ___\/ __ \_ __ \\"
 echo "| | |_> > | | | | \// __ \\ \__\ ___/| | \/"
 echo "|___| __/ |____| |__| (____ /\___ >___ >__| "
 echo " |__| \/ \/ \/ "
 echo -e "\t\t\tBY:: H0TMAGN3T"
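if [ -z "$1" ]
then
 # No argument given: print the hint, then reset the blink attribute
 echo -e "$(tput blink)Enter VALID IP$(tput sgr0)\n\n\n"
else
 echo " $(tput setaf 3)"
 echo -e "FETCHING INFO FOR $1, PLEASE WAIT......\n\n\n"
 # NOTE: the two grep patterns and the tail of the awk program were lost in this copy of the post.
 # '<th>', '<td>' and the closing ',""); print $2}' are assumptions, not the author's confirmed code.
 curl -s --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" "http://www.ip-adress.com/ip_tracer/$1" 2>/dev/null \
 | hxnormalize 2>/dev/null \
 | grep -e '<th>' -e '<td>' 2>/dev/null \
 | awk -F">" 'NR!=1 && NR!=2 && NR!=3{gsub(/<a|href|=|"|\/|\[|isp|img|alt|ip|address|src/,""); print $2}' 2>/dev/null \
 | paste -s -d ":\n" -
 echo "$(tput sgr0)"
fi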
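
The script only checks that an argument was supplied, even though its message says "Enter VALID IP". A stricter check could look roughly like the sketch below. The helper name is_valid_ipv4 is hypothetical and not part of the original script, and it only covers dotted-quad IPv4 addresses.

```shell
#!/bin/bash
# is_valid_ipv4 is a hypothetical helper, not part of the original script.
is_valid_ipv4() {
  local ip=$1 octet
  # Four dot-separated groups of one to three digits each.
  [[ $ip =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
  # Every group must be in the range 0-255; 10# forces base 10 so "08" is not read as octal.
  for octet in "${BASH_REMATCH[@]:1}"; do
    (( 10#$octet <= 255 )) || return 1
  done
  return 0
}

if is_valid_ipv4 "8.8.8.8"; then
  echo "8.8.8.8 looks like an IPv4 address"
fi
```

In the script, the test on "$1" could then call this function before curl is ever run.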

Explanation:

curl fetches the web page with the IP information. The page is then piped to hxnormalize, which lays out the HTML tags of the fetched page in a consistent way. Next, grep keeps only the lines with the tags that hold the IP information, awk strips the extra words and markup, and finally paste joins the remaining lines in order, alternating a colon and a newline between them so that each label ends up next to its value.
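
To see the filter-and-paste idea without touching the network, here is a minimal offline sketch. The sample table is made up, the choice of th and td cells is an assumption about how such a page might be laid out, and grep -o stands in for hxnormalize so that the example runs without html-xml-utils installed.

```shell
#!/bin/bash
# Hypothetical sample of a label/value table; not the real page markup.
sample_html='<table>
<tr><th>City</th><td>Amsterdam</td></tr>
<tr><th>ISP</th><td>Example Net</td></tr>
</table>'

result=$(printf '%s\n' "$sample_html" \
  | grep -o -e '<th>[^<]*</th>' -e '<td>[^<]*</td>' \
  | sed -e 's/<[^>]*>//g' \
  | paste -s -d ':\n' -)
# grep -o : put every header or data cell on its own line
# sed     : remove the tags, leaving only the text
# paste   : join the lines, alternating ":" and a newline as separators

printf '%s\n' "$result"
```

Because paste cycles through its delimiter list, the first line is joined to the second with a colon and the second to the third with a newline, which pairs every label with its value.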
