[Web Crawling] Using Beautiful Soup

1. Beautiful Soup
(1) What is Beautiful Soup?
- Beautiful Soup is one of the most important components of a web crawler.
- With Beautiful Soup, finding the tags we want anywhere in an HTML document
takes a single line of code.
- It is a library you need to know for web-related work.
- To build a web crawler, you need to know basic Python syntax and HTML.
- Beautiful Soup must be installed before it can be used.
- Create a notebook file named bs4 in Jupyter Notebook.
- Run the commands below from cmd or Jupyter Notebook to install
  Beautiful Soup 4.

  jupyter notebook : !pip list   (lists the installed libraries)
  jupyter notebook : !pip install bs4
  cmd : pip list
  cmd : pip install bs4
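
One way to confirm that the installation worked is to import the package and print its version. This is only a minimal sketch; it assumes the install command has already been run in the same Python environment. The bs4 package on PyPI is a thin wrapper that pulls in the real package, beautifulsoup4. The variable name installed_version is just for illustration.

```python
import bs4

# __version__ holds the version string of the installed beautifulsoup4 package
installed_version = bs4.__version__
print(installed_version)
```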

!pip list

alabaster 0.7.12
argh 0.26.2
astroid 2.4.0
atomicwrites 1.4.0
attrs 19.3.0
autopep8 1.4.4
Babel 2.8.0
backcall 0.1.0
bcrypt 3.1.7
bleach 3.1.4
certifi 2020.4.5.1

!pip install bs4

Collecting bs4
Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting beautifulsoup4
Downloading beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Building wheels for collected packages: bs4
Building wheel for bs4 (setup.py): started
Building wheel for bs4 (setup.py): finished with status 'done'
Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1279 sha256=69ea1a0321262b3949984a7548e9221f1de3bb6cb63bcc5d20eca1d0bf88917d
Stored in directory: c:\users\family\appdata\local\pip\cache\wheels\0a\9e\ba\20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.9.1 bs4-0.0.1 soupsieve-2.0.1

!pip list

Package Version
--------------------------------------------
alabaster 0.7.12
argh 0.26.2
astroid 2.4.0
atomicwrites 1.4.0
attrs 19.3.0
autopep8 1.4.4
Babel 2.8.0
backcall 0.1.0
bcrypt 3.1.7
beautifulsoup4 4.9.1
bleach 3.1.4
bs4 0.0.1

(2) Using Beautiful Soup

- Link to the Beautiful Soup documentation:
https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
- Fetching web data with Beautiful Soup means fetching the tags of a web page.
- Tags are retrieved with the find(), find_all(), and select() methods.

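1) find() : returns only the first tag that satisfies the condition.
- A BeautifulSoup object has a find() method.
- It retrieves the tag you want from the HTML code.
- If the tag does not exist, find() returns None, so the notebook cell displays nothing.
- If the same tag appears several times, only the first one is returned.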

from bs4 import BeautifulSoup

ex1 = '''
<html>
<head>
</head>
<body>
<p>text 1</p>
</body>
</html> '''

soup = BeautifulSoup(ex1 , 'html.parser')

soup.find('title')   # ex1 has no title tag, so None is returned and nothing is displayed

soup.find('p')

<p>text 1</p>
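
Because find() returns None when nothing matches, calling a method on the result without checking can raise AttributeError. The following is a minimal sketch of a guard, using a small stand-in HTML string; the variable names are illustrative, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = "<html><head></head><body><p>text 1</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")   # no <title> tag in this HTML, so this is None
first_p = soup.find("p")     # the first (and only) <p> tag

# Check for None before touching the result
title_text = title.get_text() if title is not None else ""
p_text = first_p.get_text() if first_p is not None else ""

print(repr(title_text))
print(repr(p_text))
```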

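- Attribute values can also be used to extract the tag you want.

from bs4 import BeautifulSoup

ex1 = '''
<html>
<head>
</head>
<body>
<p align="center">text 1</p>
<p align="right">text 2</p>
<p align="left">text 3</p>
</body>
</html> '''

soup = BeautifulSoup(ex1 , 'html.parser')

soup.find('p', align='center')

<p align="center">text 1</p>

soup.find('p', align="right")

<p align="right">text 2</p>

soup.find('p', align="left")

<p align="left">text 3</p>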
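
Keyword arguments are not the only way to filter by attribute. As a rough sketch, find() also accepts an attrs dictionary, which helps when the attribute name is not a valid Python identifier, and a class_ keyword for the class attribute. The HTML below, including the data-kind attribute and the note class, is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p align="center" class="note">text 1</p>
<p align="left" data-kind="memo">text 2</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword argument: attribute name used directly
by_keyword = soup.find("p", align="center")

# attrs dictionary: needed for names such as data-kind that contain a hyphen
by_attrs = soup.find("p", attrs={"data-kind": "memo"})

# class is a reserved word in Python, so Beautiful Soup uses class_
by_class = soup.find("p", class_="note")

print(by_keyword.get_text(), by_attrs.get_text(), by_class.get_text())
```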

2) find_all() : returns every matching tag at once.
- A web page usually contains the same tag many times, so find_all() is used often.
- The results are returned in a list.

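from bs4 import BeautifulSoup

ex1 = '''
<html>
<head>
</head>
<body>
<p align="center">text 1</p>
<p align="center">text 2</p>
<p align="center">text 3</p>
<img src="c:\\temp\\image\\솔개.png">
</body>
</html> '''

soup = BeautifulSoup(ex1 , 'html.parser')

soup.find_all('p')

[<p align="center">text 1</p>,
 <p align="center">text 2</p>,
 <p align="center">text 3</p>]

soup.find_all(['p','img'])   # pass a list to search for several tags at once

[<p align="center">text 1</p>,
 <p align="center">text 2</p>,
 <p align="center">text 3</p>,
 <img src="c:\temp\image\솔개.png"/>]

soup.find_all(align='center')   # find several tags by attribute

[<p align="center">text 1</p>,
 <p align="center">text 2</p>,
 <p align="center">text 3</p>]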
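
The object returned by find_all() behaves like a regular list, so it can be indexed and measured with len(). The sketch below also shows the limit argument, which stops the search after a given number of matches, and the empty result returned when nothing matches. The HTML string is a stand-in for illustration.

```python
from bs4 import BeautifulSoup

html = "<p>text 1</p><p>text 2</p><p>text 3</p>"
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")               # every <p> tag
first_two = soup.find_all("p", limit=2)  # stop after two matches
missing = soup.find_all("table")         # no match gives an empty result, not None

print(len(all_p))
print(all_p[0].get_text())
print(len(first_two))
print(len(missing))
```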

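3) Getting text
- You can retrieve the content shown on the page (text, images, and so on).
- The code below uses find() to get the first p tag and then reads the content that the p tag wraps.
txt = soup.find('p')   # get the first p tag
txt.string   # get the content of the p tag

'text 1'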

- The code below uses find_all() to get the content of several tags.

txt2 = soup.find_all('p')

for i in txt2 :
    print(i.string)

text 1
text 2
text 3

- get_text() can also be used to get the text inside a tag.

txt3 = soup.find_all('p')

for i in txt3 :
    print(i.get_text())

text 1
text 2
text 3
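
The two approaches give the same result for the simple tags above, but they differ once a tag contains other tags. As a small sketch with made-up HTML: .string is None when a tag has more than one child, while get_text() joins all the text inside the tag.

```python
from bs4 import BeautifulSoup

html = "<p>text 1</p><p>price: <span>3000</span> won</p>"
soup = BeautifulSoup(html, "html.parser")

simple, nested = soup.find_all("p")

# A tag with a single text child: both approaches agree
print(simple.string)
print(simple.get_text())

# A tag with text plus a nested <span>: .string gives None,
# get_text() concatenates every piece of text inside the tag
print(nested.string)
print(nested.get_text())
```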

4) Extracting data with select()
- Another way to find the tags you want is to use a CSS selector.

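※ CSS selector
- CSS (Cascading Style Sheets) is a language that defines how a document written
in a markup language such as HTML is actually presented on a website.

- A CSS selector locates specific elements (tags) in a document written in a
markup language such as HTML. Three kinds are covered here.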

Selector by tag name : finds p tags and applies the style.
p {
  text-align: center;
  color: red;
}

Selector by id : finds the tag with the attribute id='para1' and applies the style.
#para1 {
  text-align: center;
  color: red;
}

Selector by class : finds tags with the attribute class='center' and applies the style.
.center {
  text-align: center;
  color: red;
}

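- select('tag_name') : extracts by tag name.

ex2 = '''
<html>
<head>
</head>
<body>
<div>
<p class='name1' id='fruits1'>바나나
<span>3000원</span>
<span>10개</span>
<span class='store'>바나나가게</span>
<a href = 'https://www.fruit1.com'> banana</a>
</p>
</div>
<div>
<p>체리
<span>100원</span>
<span>50개</span>
<span class='store'>체리가게</span>
<a href = 'https://www.fruit2.com'> cherry </a>
</p>
</div>
<div>
<p>오렌지
<span>500원</span>
<span>20개</span>
<span class='store'>오렌지가게</span>
<a href = 'https://www.fruit3.com'> orange </a>
</p>
</div>
</body>
</html> '''

soup2 = BeautifulSoup(ex2 , 'html.parser')

soup2.select('p')   # finds every p tag

[<p class="name1" id="fruits1">바나나
<span>3000원</span>
<span>10개</span>
<span class="store">바나나가게</span>
<a href="https://www.fruit1.com"> banana</a>
</p>,
 <p>체리
<span>100원</span>
<span>50개</span>
<span class="store">체리가게</span>
<a href="https://www.fruit2.com"> cherry </a>
</p>,
 <p>오렌지
<span>500원</span>
<span>20개</span>
<span class="store">오렌지가게</span>
<a href="https://www.fruit3.com"> orange </a>
</p>]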

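- select('.class_name') : extracts by the value of the class attribute.

soup2.select('.name1')

[<p class="name1" id="fruits1">바나나
<span>3000원</span>
<span>10개</span>
<span class="store">바나나가게</span>
<a href="https://www.fruit1.com"> banana</a>
</p>]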

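- select('#id_name') : extracts by the value of the id attribute.

soup2.select('#fruits1')

[<p class="name1" id="fruits1">바나나
<span>3000원</span>
<span>10개</span>
<span class="store">바나나가게</span>
<a href="https://www.fruit1.com"> banana</a>
</p>]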
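
select() always returns a list, even when an id can match only one tag. When a single tag is expected, select_one() is a convenient alternative: it returns the first match, or None if nothing matches. This is a minimal sketch with stand-in HTML; the id fruits9 is a made-up value that does not exist in it.

```python
from bs4 import BeautifulSoup

html = "<p id='fruits1' class='name1'>banana</p><p class='name2'>cherry</p>"
soup = BeautifulSoup(html, "html.parser")

as_list = soup.select("#fruits1")      # list containing one tag
single = soup.select_one("#fruits1")   # the tag itself
missing = soup.select_one("#fruits9")  # no such id, so None

print(len(as_list))
print(single.get_text())
print(missing)
```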

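- select('parent_tag > child_tag > child_tag') : uses '>' to walk down the tags one level at a time.
Put a space on each side of '>'; older Beautiful Soup releases required it.

soup2.select('div > p > span')

[<span>3000원</span>,
 <span>10개</span>,
 <span class="store">바나나가게</span>,
 <span>100원</span>,
 <span>50개</span>,
 <span class="store">체리가게</span>,
 <span>500원</span>,
 <span>20개</span>,
 <span class="store">오렌지가게</span>]

soup2.select('div > p > span')[0]

<span>3000원</span>

soup2.select('div > p > span')[1]

<span>10개</span>

soup2.select('div > p > span')[2]

<span class="store">바나나가게</span>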
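
The '>' combinator matches direct children only. A selector with a plain space between tag names matches descendants at any depth. The sketch below contrasts the two on a small made-up document in which one span sits one level deeper, inside a section tag.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p><span>direct</span></p>
  <section><p><span>nested</span></p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# '>' requires each tag to be the immediate child of the previous one,
# so the span under <section> is skipped (its <p> is not a child of <div>)
child_only = [s.get_text() for s in soup.select("div > p > span")]

# A space matches any descendant, however deeply nested
any_depth = [s.get_text() for s in soup.select("div span")]

print(child_only)
print(any_depth)
```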

- select('parent_tag.class_name > child_tag.class_name')

soup2.select('p.name1 > span.store')

[<span class="store">바나나가게</span>]

- select('#id_name > tag_name.class_name')

soup2.select('#fruits1 > span.store')

[<span class="store">바나나가게</span>]

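- select('tag_name[attribute]') : extracts tags that have the attribute. Writing 'tag_name[attribute="value"]' also matches on the value.

soup2.select('a[href]')

[<a href="https://www.fruit1.com"> banana</a>,
 <a href="https://www.fruit2.com"> cherry </a>,
 <a href="https://www.fruit3.com"> orange </a>]

soup2.select('a[href]')[0]

<a href="https://www.fruit1.com"> banana</a>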
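
Once the a tags are selected, the usual next step in a crawler is to read the link address itself. As a rough sketch with stand-in HTML: tag['href'] reads an attribute value, tag.get('href') does the same but returns None when the attribute is missing, and an attribute selector with a value narrows the match to one link. The anchor with name="top" is added only to show the missing-attribute case.

```python
from bs4 import BeautifulSoup

html = """
<a href="https://www.fruit1.com"> banana</a>
<a name="top">no link</a>
<a href="https://www.fruit2.com"> cherry </a>
"""
soup = BeautifulSoup(html, "html.parser")

# [href] keeps only the <a> tags that have an href attribute
links = soup.select("a[href]")
urls = [a["href"] for a in links]

# [href="..."] matches on the attribute value as well
exact = soup.select('a[href="https://www.fruit2.com"]')

# get() is the safe way to read an attribute that may be absent
no_href = soup.select_one('a[name="top"]')

print(urls)
print(exact[0].get_text(strip=True))
print(no_href.get("href"))
```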
