[python] How to get all the links from a web page

유키공 2017. 12. 15. 15:52

Getting all the links from a web page (1)

... using the magic of regular expressions.

import re, urllib

# Python 2: fetch the page (at most 200000 bytes) and capture every href value
htmlSource = urllib.urlopen("http://sebsauvage.net/index.html").read(200000)
linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print link
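
The same idea can be sketched in Python 3 without touching the network. This is only an illustration, not the original snippet: the helper name `find_links_regex`, the `LINK_RE` pattern, and the sample HTML are all assumptions, and the pattern is loosened so that it accepts quoted `href` values and other attributes before `href`.

```python
import re

# Hypothetical pattern: an <a ...> tag whose href value is wrapped in
# single or double quotes. Group 1 is the quote, group 2 is the URL.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1',
                     re.IGNORECASE | re.DOTALL)

def find_links_regex(html):
    """Return the href values of all quoted <a href=...> tags in html."""
    return [match.group(2) for match in LINK_RE.finditer(html)]

sample = ('<p><a href="http://example.com/">one</a> '
          '<A class="x" HREF=\'/two\'>two</A></p>')
print(find_links_regex(sample))  # ['http://example.com/', '/two']
```

Like any regular expression over HTML, this will miss unquoted values and can be fooled by unusual markup, so treat it as a quick tool rather than a parser.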

Getting all the links from a web page (2)

You can also use the HTMLParser module.

import HTMLParser, urllib

class linkParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.links = []
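    def handle_starttag(self, tag, attrs):
        # Skip anchors without an href (e.g. <a name="top">) instead of raising KeyError
        if tag == 'a' and 'href' in dict(attrs):
            self.links.append(dict(attrs)['href'])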

htmlSource = urllib.urlopen("http://sebsauvage.net/index.html").read(200000)
p = linkParser()
p.feed(htmlSource)
for link in p.links:
    print link

Every time the parser meets an HTML start tag, the handle_starttag() method is called.
For example, <a href="http://google.com"> triggers handle_starttag(self, 'a', [('href', 'http://google.com')]); the tag name is passed in lowercase.
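
To see exactly what arrives in that callback, here is a small Python 3 sketch. In Python 3 the module lives at `html.parser`; the class name `StartTagLogger` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

class StartTagLogger(HTMLParser):
    """Hypothetical parser that records every handle_starttag() call."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name; attrs is a list of
        # (name, value) pairs with lowercased attribute names.
        self.calls.append((tag, attrs))

logger = StartTagLogger()
logger.feed('<A HREF="http://google.com">Google</A>')
print(logger.calls)  # [('a', [('href', 'http://google.com')])]
```

Because `attrs` is a list of pairs, wrapping it in `dict()` is a convenient way to look up a single attribute.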

See the Python manual for the other handle_*() methods.
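
As one possible use of those other handlers, the Python 3 sketch below pairs each link with its visible text by combining handle_starttag(), handle_data() and handle_endtag(). The class name `LinkTextParser` and the sample markup are assumptions, and nested or unclosed `<a>` tags are not handled.

```python
from html.parser import HTMLParser

class LinkTextParser(HTMLParser):
    """Hypothetical parser that pairs each href with its link text."""

    def __init__(self):
        super().__init__()
        self.links = []     # collected (href, text) tuples
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = LinkTextParser()
parser.feed('<p>See <a href="/docs">the <b>docs</b></a> '
            'and <a name="top">top</a>.</p>')
print(parser.links)  # [('/docs', 'the docs')]
```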

(Note that HTMLParser is not bullet-proof: it chokes on ill-formed HTML. In that case, use the sgmllib module, fall back to regular expressions, or use BeautifulSoup.)

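Getting all the links from a web page (3)

Still not enough?

Beautiful Soup is a Python module that is good at extracting data from HTML.
Its main page shows off both its ability to cope with very bad HTML and its simplicity. Its drawback is that it is slow.

You can get it from http://www.crummy.com/software/BeautifulSoup/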

import urllib
import BeautifulSoup

htmlSource = urllib.urlopen("http://sebsauvage.net/index.html").read(200000)
soup = BeautifulSoup.BeautifulSoup(htmlSource)
for item in soup.fetch('a'):
    print item['href']

Getting all the links from a web page (4)

Still not enough?
Fine. Here is yet another way:

Look! No parser and no regular expressions.

import urllib

htmlSource = urllib.urlopen("http://sebsauvage.net/index.html").read(200000)
for chunk in htmlSource.lower().split('href=')[1:]:
    indexes = [i for i in [chunk.find('"',1),chunk.find('>'),chunk.find(' ')] if i>-1]
    print chunk[:min(indexes)]

I admit it: this is about as crude as it gets.
But it works!
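
For comparison, here is a Python 3 sketch of the same split-based trick wrapped in a function. The name `find_links_split` and the sample HTML are assumptions. Unlike the snippet above, it strips the surrounding quotes and skips chunks where no terminator is found.

```python
def find_links_split(html):
    """Hypothetical helper: crude href extraction using str.split only."""
    links = []
    for chunk in html.lower().split('href=')[1:]:
        # The value ends at the closing quote, the end of the tag, or a
        # space, whichever comes first. Searching for '"' from index 1
        # skips the opening quote.
        ends = [i for i in (chunk.find('"', 1), chunk.find('>'), chunk.find(' '))
                if i > -1]
        if ends:
            links.append(chunk[:min(ends)].strip('"'))
    return links

sample = '<a href="http://example.com/A">one</a> <a href=/two>two</a>'
print(find_links_split(sample))  # ['http://example.com/a', '/two']
```

Note the lowercase `a` at the end of the first result: because the whole page is lowercased before splitting, the case of every URL is lost, which is one more reason to call this method crude.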
