My Experiences with Databases

Oracle,MySQL,SQL SERVER,Python,Azure,AWS,Oracle Cloud,GCP Etc

  • Enter your email address to follow this blog and receive notifications of new posts by email.

  • Total Views

    • 315,905 hits
  • $riram $anka


    The experiences, Test cases, views, and opinions etc expressed in this website are my own and does not reflect the views or opinions of my employer. This site is independent of and does not represent Oracle Corporation in any way. Oracle does not officially sponsor, approve, or endorse this site or its content.Product and company names mentioned in this website may be the trademarks of their respective owners.

Web-Scraping 🐍 – Part 2 – Download scripts from code.activestate.com with Python -Pagination

Posted by Sriram Sanka on October 22, 2022


In my Previous post, we tried to get the blog entries as a file into a directory using web-scraping., Now lets read a web Page Entries and save the links(and content πŸ™‚ ) as files. One can extract the Content from a web Page by reading/validating the tags as needed. In this post we are going to observe the URL Pattern for Reading and downloading the files from code.activestate.com.

code.activestate.com is one of best source to learn Python. It has around 4K+ Scripts available. lets take a look at the source.

Lets Invoke the URL https://code.activestate.com/recipes/langs/python/ in the browser & Jupiter to get the source of the webpage.

We have around 4500+ Scripts from 230 Pages, when you navigate through Pages you can see the URL gets appended with Page id as “/?page=1” at the end.

import requests
from bs4 import BeautifulSoup
import string
url = 'https://code.activestate.com/recipes/langs/python/?page=1'
reqs = requests.get(url)<br>soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup)

If you are not sure how to generate Python Sample Code ,Try with postman as below to get the code Snippet.

You can see the Pattern in the Output.

Take a look at the first link , It reads as https://code.activestate.com/recipes/580811-uno-text-based/?in=lang-python and the Download link reads as https://code.activestate.com/recipes/580811-uno-text-based/download/1/

To Read all the scripts from all the Pages, we can pass the Page number at the end using a simple for loop and we also need to replace /?in=lang-python with /download/1/ in the URL and Append https://code.activestate.com/ as a prefix to the resulted.

for x in range(1, 250, 1):
    try:
        reqs = requests.get("https://code.activestate.com/recipes/langs/python/?page="+str(x))
        soup = BeautifulSoup(reqs.text, 'html.parser')
        for link2 in soup.select(" a[href]"):
            if "lang-python" in link2["href"]:
                src=link2["href"].replace("/recipes","https://code.activestate.com/recipes").replace("/?in=lang-python","/download/1/")
                tit =link2.get_text().replace(string.punctuation, " ").translate(str.maketrans('', '', string.punctuation))
                print(tit.replace(" ","_"),src)
                Download_active_state_files("c:\\Users\\Dell\\Downloads\\blogs\\",src,'UTF-8',tit.replace(" ","_"))
    except:
        pass

here is the complete Code to download all the scripts as .py in the given Directory.

import requests
from bs4 import BeautifulSoup
import string
import os
import urllib.request, urllib.error, urllib.parse
import sys
 

def Download_active_state_files(path,url,enc,title):
    try:                
        response = urllib.request.urlopen(url)
        webContent = response.read().decode(enc)
        os.makedirs(path+'\\'+ 'Code_Active_state', exist_ok=True)
        n=os.path.join(path+'\\'+ 'Code_Active_state',title +'.py')
        f = open(n, 'w',encoding=enc)
        f.write(webContent)
        f.close
    except:
        n1=os.path.join(path+'\\'+  'Code_Active_state_'+'Download_Error.log')
        f1 = open(n1, 'w',encoding=enc) 
        f1.write(url)
        f1.close
for x in range(1, 250, 1):
    try:
        reqs = requests.get("https://code.activestate.com/recipes/langs/python/?page="+str(x))
        soup = BeautifulSoup(reqs.text, 'html.parser')
        for link2 in soup.select(" a[href]"):
            if "lang-python" in link2["href"]:
                src=link2["href"].replace("/recipes","https://code.activestate.com/recipes").replace("/?in=lang-python","/download/1/")
                tit =link2.get_text().replace(string.punctuation, " ").translate(str.maketrans('', '', string.punctuation))
                print(tit.replace(" ","_"),src)
                Download_active_state_files("c:\\Users\\Dell\\Downloads\\blogs\\",src,'UTF-8',tit.replace(" ","_"))
    except:
        pass

You can Compare the files downloaded with the Web Page version.

Hope you like it. πŸ™‚

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

 
%d bloggers like this: