So, I wanted to build a quick tool that would let me track how some articles rank on Google.
For that, I need something that opens a Google window, runs the search, and writes down the link that I’m interested in. So far, so good.
Searching on Google for “python google scraper” gives back LOTS of results, each one more useless than the last.
First of all, Google’s ToS, just like those of any other big company, forbid scraping its results. HOWEVER, I couldn’t find any record of Google suing someone because of this.
You might end up with your IP blocked for a while, with a couple of warnings, or with your account disabled. I can’t guarantee anything beyond that, so I’ll continue with this at my own risk.
For the purpose of this experiment, I will not be using any Google account.
So I spent the whole day trying to put together a script that goes through the Google results for a specific query and collects the URLs.
Here’s what I got so far:
When you try this with Beautiful Soup, you stumble upon the page asking you to agree to Google’s ToS, which is not ideal.
The workaround for that is to actually use a web driver. My solution:
I implemented a Python + Selenium script that goes to Google, agrees to their ToS, searches for a query, and curates the list of results. Then it goes to the next page and does the same for the results there.
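Here is a minimal sketch of what such a flow could look like, not the actual script. The consent button selector is a hypothetical placeholder (the real page markup changes and has to be inspected by hand), and the helper names, the `start` offset of 10 results per page, and the pause between pages are all assumptions made for illustration.

```python
import time
from urllib.parse import urlencode, urlparse

RESULTS_PER_PAGE = 10  # assumption: default number of results per page
# Hypothetical placeholder: inspect the consent page to find the real selector.
CONSENT_BUTTON_SELECTOR = "button#consent-accept"


def build_search_url(query, page=0):
    """Return the search URL for a zero-based results page."""
    params = {"q": query, "start": page * RESULTS_PER_PAGE}
    return "https://www.google.com/search?" + urlencode(params)


def is_external_link(href):
    """Keep only http(s) links that point away from google.com."""
    if not href:
        return False
    parsed = urlparse(href)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.netloc.lower()
    return not (host == "google.com" or host.endswith(".google.com"))


def scrape(query, pages=2, delay=5.0):
    """Open a browser, walk the result pages, and return the external links."""
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    links = []
    try:
        for page in range(pages):
            driver.get(build_search_url(query, page))
            # Click the consent button if it is shown (usually only once).
            buttons = driver.find_elements(By.CSS_SELECTOR, CONSENT_BUTTON_SELECTOR)
            if buttons:
                buttons[0].click()
            # Collect every outbound link on the page, skipping duplicates.
            for anchor in driver.find_elements(By.CSS_SELECTOR, "a"):
                href = anchor.get_attribute("href")
                if is_external_link(href) and href not in links:
                    links.append(href)
            time.sleep(delay)  # pause between pages to keep the request rate low
    finally:
        driver.quit()
    return links
```

Calling `scrape("python google scraper")` would open a Chrome window and return the outbound links in page order. Grabbing every `a` element is deliberately crude: a real script would narrow this down to the result containers.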
So far, I can crawl the first two pages, and then I get an error saying that there are way too many requests coming from the same computer. (I could try logging in and searching that way.)
By the end of the day I had fixed all the errors. I managed to get the whole set of Google results, with a run time of around 30 seconds. I bet I could make that faster, but for the time being that’s enough, and all the results are saved to a local database.
What’s next with this? For now, not much. This might become available as a free API service at some point. Until then, it’s a tool that I’m using on my own.
So, I did some refactoring, and I can now get the first 10 results from the first Google page and write them to the DB in about 3.5 seconds.
For messing around purposes, I have a couple of ideas on what to do with this:
-use it as a custom incognito Google results searcher (I can customize this as I wish; however, I do like how Google does it)
-use it for storing data. I should also store the result position in Google and the timestamp. Adding this as a todo.
(Adding the timestamp might help me find out which position a specific keyword had in Google at a given time.)
-make it run in less than 2 seconds (it currently takes about 4.7 seconds: roughly 1.7 seconds for the response from Google plus about 3 seconds for my own run)
-make the results available through an API method and a public URL
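The "store the position and the timestamp" todo could be sketched like this. It assumes SQLite as the local database, and the table layout and function names are made up for illustration, not taken from the actual tool.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create the results table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS results (
            query      TEXT NOT NULL,
            position   INTEGER NOT NULL,
            url        TEXT NOT NULL,
            fetched_at TEXT NOT NULL
        )
        """
    )


def save_results(conn, query, urls, fetched_at=None):
    """Store one run: each URL gets its 1-based position and a shared timestamp."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc).isoformat()
    rows = [(query, pos, url, fetched_at) for pos, url in enumerate(urls, start=1)]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def position_history(conn, query, url):
    """Return (timestamp, position) pairs for one URL, oldest first."""
    cursor = conn.execute(
        "SELECT fetched_at, position FROM results "
        "WHERE query = ? AND url = ? ORDER BY fetched_at",
        (query, url),
    )
    return cursor.fetchall()
```

With `conn = sqlite3.connect("results.db")` and one `save_results` call per run, `position_history` shows how a link moved up or down for a query over time. ISO timestamps in UTC sort correctly as plain text, which is why they are stored as `TEXT`.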
Ideas on what to do with it:
-a Google API
-a keyword ranking tracker (that would be linked to my WordPress, maybe?)
-a backend replacement for Google, so I can track what I search for and maybe analyze it later.
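For the keyword ranking tracker idea, the core step is finding where a given site shows up in the list of result URLs. A rough sketch, assuming the results are already available as an ordered list of URLs (the `find_rank` name and the domain matching rule are illustrative choices):

```python
from urllib.parse import urlparse


def find_rank(urls, target_domain):
    """Return the 1-based position of the first URL on target_domain, or None."""
    target = target_domain.lower()
    for position, url in enumerate(urls, start=1):
        host = urlparse(url).netloc.lower()
        # Match the domain itself and any subdomain (e.g. blog.example.com).
        if host == target or host.endswith("." + target):
            return position
    return None
```

Running this after every scrape and saving the number next to a timestamp would be enough to chart how an article ranks for a keyword over time.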
I do have a couple of ideas, and I’m seriously considering making this available for others to use. I’m not sure how a Raspberry Pi would handle it, but for the time being, my computer works AMAZINGLY.
Do you have any questions? Make sure you hit the comment section. Thanks!