Posts Tagged ‘web-scraping’

Download Youtube ( Channel ) videos using Python Module – Pytube ,scrapetube.

Posted by Sriram Sanka on April 25, 2023

In this post, I will present the code to download Youtube Individual/Channel Videos step-by-step.

You can download the Videos using pytube module by passing the video URL as an argument. Just Install pytube and run as below.

python -m pip install pytube

This will download the given video URL to your current Directory, what about if you want to store your local copy for your educational purpose when you are offline.? The following will download the all the videos from the given URL to your desired location in the PC.

import requests
import re
from bs4 import BeautifulSoup
from pytube.cli import on_progress
from pytube import YouTube
from pytube import Playlist
from pytube import Channel


def get_youtube(i):
    try:
        yt = YouTube(i,on_progress_callback=on_progress)
        yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first().download('C:\\Users\\Dell\\Videos\\downs\\Sriram_Channel\\')
    except:
        print(f'\nError in downloading:  {yt.title} from -->' + i)
        pass
l = Channel('https://www.youtube.com/channel/UCw43xCtkl26vGIGtzBoaWEw')

for video in l.video_urls:
    get_youtube(video)

You may get regular Expression error in fetching the video information due to changes implemented in the channel URL using symbols. or any other restrictions which are unsupported by the pytube module, either you have to download and install the latest or identify and fix it on your own.

To avoid this, you can use another module called scrapetube

pip install scrapetube

The following code will give you the Video id from the URL provided as input. you need to append ‘https://www.youtube.com/watch?v=’ to mark that as a complete URL.

import scrapetube

videos = scrapetube.get_channel("........t7XvGJ3AGpSon.......")

for video in videos:
    print('https://www.youtube.com/watch?v='+ video['videoId'])

as you can see channel id is the key here, what if you are unaware of channel id.? You can get the channel id either from browser view page source or you can also use “requests” and “BeautifulSoup” module to get the channel ID

import requests
from bs4 import BeautifulSoup

url = input("Please Enter youtube channel URL ")
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
print(soup.find("meta", itemprop="channelId")['content'])

Instead of using Channel from pytube module, to make it easier lets make changes to the original code snippet posted above, using module “scrapetube.get_channel “

import requests
import re
from bs4 import BeautifulSoup
from pytube.cli import on_progress
from pytube import YouTube
from pytube import Playlist
from pytube import Channel
import scrapetube
import json

def get_youtube(i):
    try:
        print('Downloading ' +i)
        yt = YouTube(i,on_progress_callback=on_progress)
        #print(f'\nStarted downloading:  {yt} from -->' + i)
        yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first().download('C:\\Users\\Dell\\Videos\\downs\\Sriram_Channel\\')
    except:
        print('Error Downloading' +i)
        pass
l = scrapetube.get_channel("UCw43xCtkl26vGIGtzBoaWEw")
for video in l:
    #print('Downloading https://www.youtube.com/watch?v='+ video['videoId'])
    get_youtube('https://www.youtube.com/watch?v='+ video['videoId'])

Hope you like it ! Happy Reading.

Posted in Python, pytube, scrapetube, WebScraping, youtube | Tagged: DOWNLOAD, Python, pytube, scrapetube, web-scraping, youtube, youtube channel | Leave a Comment »

Hacking – FATDBA.COM ¯\_(ツ)_/¯ `ॐ`

Posted by Sriram Sanka on October 4, 2022

#Python #WebScraping

Just Kidding !!! Its not Hacking, this is known as WEB-SCRAPING using the Powerful Python.

What is web scraping?

Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database.

You just need a Browser, Simple & a small Python Code to get the content from the Web. First Lets see the Parts of the Code and Verify.

Step 1 : Install & Load the Python Modules

import time
import os
import pandas as pd
import requests
from lxml import etree
import random
import urllib.request, urllib.error, urllib.parse
import urllib.parse
import sys
import urllib.request
import string

Step 2: Define function to get the Name of the Site/Blog to Make it as a Folder.

def get_host(url,delim):
    parsed_url = urllib.parse.urlparse(url)
    return(parsed_url.netloc.replace(delim, "_"))

Step 3: Define a Function to Get the Blog/Page Title

def findTitle(url,delim):
    webpage = urllib.request.urlopen(url).read()
    title = str(webpage).split('<title>')[1].split('</title>')[0]
    return title.replace(delim, "_").translate(str.maketrans('', '', string.punctuation))

Step 4: Define a Function to Generate a Unique string of a given length

def unq_str(len):
    N = len
    res = ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))
    return(str(res))

Step 5: Write the Main Block to Download the Content from the Site/Blog

def Download_blog(path,url,enc):
    try:   
        response = urllib.request.urlopen(url)
        webContent = response.read().decode(enc)
        os.makedirs(path+'\\'+ str(get_host(url,".")), exist_ok=True)
        n=os.path.join(path+'\\'+ str(get_host(url,".")),findTitle(url," ") +'.html')
        f = open(n, 'w',encoding=enc)
        f.write(webContent)
        f.close
    except:
        n1=os.path.join(path+'\\'+  str(get_host(url,"."))+'Download_Error.log')
        f1 = open(n1, 'w',encoding=enc) 
        f1.write(url)
        f1.close

Step 6: Define Another Function to save the Blog posts into a file & Invoke the Main block to get the Blog Content.

def write_post_url_to_file(blog,path):        
    main_sitemap = blog+'/sitemap.xml'
    r = requests.get(main_sitemap)
    root = etree.fromstring(r.content)
    for sitemap in root:
        children = sitemap.getchildren()
        with open(str(path+'\\'+get_host(blog,".")) +'_blog_links.txt', 'a') as f:
            f.write( f'\n{children[0].text}')
    col_name = ['url']
    df_url = pd.read_csv(str(path+'\\'+get_host(blog,".")) +'_blog_links.txt', names=col_name)
    for row in df_url.url:
        print(row)
        Download_blog(path,row,'UTF-8')
        
write_post_url_to_file("https://fatdba.com","c:\\Users\\Dell\\Downloads\\blogs\\")

This will create a file with links and folder with blog name to store all the content/Posts Data.

Sample Output as follows

BOOM !!!

For more interesting posts you can follow me @ Twitter – TheRamOraDBA & linkedin-ramoradba

Posted in download_blogs, Linux, Python, WebScraping, Windows | Tagged: blog, HTML, Python, web-scraping | Leave a Comment »

Not Just Databases

Moderator at

Member at

Subscribe

Categories

Archives

Visits n Views

Blogs I Follow

Follow Blog via Email

Total Views

$riram $anka

Posts Tagged ‘web-scraping’

Download Youtube ( Channel ) videos using Python Module – Pytube ,scrapetube.

Hacking – FATDBA.COM ¯\_(ツ)_/¯ `ॐ`

Not Just Databases

Moderator at

Member at

Subscribe

Categories

Archives

Visits n Views

Blogs I Follow

Follow Blog via Email

Total Views

$riram $anka

Posts Tagged ‘web-scraping’

Download Youtube ( Channel ) videos using Python Module – Pytube ,scrapetube.

Share this

Hacking – FATDBA.COM ¯\_(ツ)_/¯ ॐ

Share this

Hacking – FATDBA.COM ¯\_(ツ)_/¯ `ॐ`