package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
	"strconv"
	"strings"
	"sync"
)
// httpGet fetches the page at url (e.g. https://www.gushiwen.org/default_2.aspx)
// and returns the response body as a string.
func httpGet(url string) (result string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("http.Get() err:", err)
		return
	}
	defer resp.Body.Close()

	buf := make([]byte, 4*1024)
	for {
		// Append the data before checking the error, so the final chunk
		// that can accompany io.EOF is not dropped.
		n, readErr := resp.Body.Read(buf)
		if n > 0 {
			result += string(buf[:n])
		}
		if readErr != nil {
			break
		}
	}
	return
}
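// An alternative sketch: io.ReadAll expresses the same read loop in a single
// call. httpGetReadAll is a hypothetical helper added for illustration, not
// part of the original program.
func httpGetReadAll(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body) // reads until EOF in one call
	return string(body), err
}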
// spiderContent crawls one detail page and extracts its title and body text.
func spiderContent(url string) (title, content string) {
	result, err := httpGet(url)
	if err != nil {
		fmt.Println("httpGet() err:", err)
		return
	}

	// Match the title. regexp.MustCompile panics on an invalid pattern
	// rather than returning nil, so no nil check is needed afterwards.
	re := regexp.MustCompile(`<h1 style="font-size:20px; line-height:22px; height:22px; margin-bottom:10px;">(?s:(.*?))</h1>`)
	// FindAllStringSubmatch(s, n) returns at most n matches; -1 means all.
	reTitle := re.FindAllStringSubmatch(result, 1)
	for _, v := range reTitle {
		title = strings.Replace(v[1], "\t", "", -1)
		title = strings.Replace(title, "\n", "", -1)
		title = strings.Replace(title, "\r", "", -1)
		title = strings.Replace(title, " ", "", -1)
	}

	// Match the content. The id embedded in the div's id attribute is taken
	// from a fixed position in the URL, which assumes detail links of the
	// form .../shiwenv_XXXXXXXXXXXX.aspx.
	contsonID := url[32:44]
	re = regexp.MustCompile(`<div class="contson" id="contson` + contsonID + `">(?s:(.*?))</div>`)
	reContent := re.FindAllStringSubmatch(result, 1)
	for _, v := range reContent {
		content = strings.Replace(v[1], "\t", "", -1)
		content = strings.Replace(content, "\n", "", -1)
		content = strings.Replace(content, "\r", "", -1)
		content = strings.Replace(content, " ", "", -1)
		content = strings.Replace(content, "<br/>", "", -1)
		content = strings.Replace(content, "<p>", "", -1)
		content = strings.Replace(content, "</p>", "", -1)
	}
	return
}
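// Two illustrative alternatives, not part of the original program: the slice
// url[32:44] breaks as soon as the URL prefix changes length, so the id could
// be matched with a regexp instead, and the chains of strings.Replace calls
// could be collapsed into one strings.NewReplacer. Both names below are
// hypothetical helpers.
var contsonIDRe = regexp.MustCompile(`shiwenv_([0-9A-Za-z]+)\.aspx`)

// contsonIDFromURL returns the poem id, or "" if the URL does not match.
func contsonIDFromURL(url string) string {
	if m := contsonIDRe.FindStringSubmatch(url); m != nil {
		return m[1]
	}
	return ""
}

// cleanText is equivalent to the Replace chains in spiderContent above.
var cleanText = strings.NewReplacer(
	"\t", "", "\n", "", "\r", "", " ", "",
	"<br/>", "", "<p>", "", "</p>", "",
).Replace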
// save writes the scraped titles and contents of one page to <index>.txt.
func save(index int, titleArr, contentArr []string) {
	f, err := os.Create(strconv.Itoa(index) + ".txt")
	if err != nil {
		fmt.Println("create file err:", err)
		return
	}
	defer f.Close()

	n := len(titleArr)
	for i := 0; i < n; i++ {
		f.WriteString(titleArr[i] + "\n")
		f.WriteString(contentArr[i] + "\n")
		f.WriteString("\n")
	}
}
// spiderPage crawls the links on one listing page, follows every poem link,
// and reports its page index on isExit when done.
func spiderPage(index int, isExit chan int) {
	// Signal completion even on an early error return; otherwise doWork
	// would block forever waiting on the channel.
	defer func() { isExit <- index }()

	url := "https://www.gushiwen.org/default_" + strconv.Itoa(index) + ".aspx"
	result, err := httpGet(url)
	if err != nil {
		fmt.Println("httpGet() err:", err)
		return
	}

	// Match the detail-page links on the listing page.
	re := regexp.MustCompile(`<p><a style="font-size:18px; line-height:22px; height:22px;" href="(?s:(.*?))"`)
	reUrls := re.FindAllStringSubmatch(result, -1)

	titleArr := make([]string, 0)
	contentArr := make([]string, 0)
	for _, v := range reUrls {
		title, content := spiderContent(v[1])
		titleArr = append(titleArr, title)
		contentArr = append(contentArr, content)
	}
	// Save this page's results to a file.
	save(index, titleArr, contentArr)
}
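// Launching one goroutine per page (as doWork does below) is fine for a small
// range, but a large range would hit the site with that many simultaneous
// requests. A minimal sketch of capping concurrency with a buffered channel
// used as a semaphore; the limit of 5 is an arbitrary assumption and
// spiderPageLimited is a hypothetical variant, not part of the original.
var sem = make(chan struct{}, 5)

func spiderPageLimited(index int, isExit chan int) {
	sem <- struct{}{}        // acquire one of the 5 slots
	defer func() { <-sem }() // release the slot when done
	spiderPage(index, isExit)
}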
// doWork fans out one goroutine per page, then waits for all of them.
func doWork(start, end int) {
	fmt.Printf("Crawling pages %d through %d\n", start, end)
	isExit := make(chan int)
	for i := start; i <= end; i++ {
		go spiderPage(i, isExit)
	}
	// Pages finish in arbitrary order, so report the index received from
	// the channel rather than the loop counter.
	for i := start; i <= end; i++ {
		page := <-isExit
		fmt.Printf("Saved page %d\n", page)
	}
}
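// The counting channel above doubles as a completion signal. The more common
// idiom for waiting on a batch of goroutines is sync.WaitGroup; a sketch under
// that assumption (doWorkWG is a hypothetical variant, not the original API):
func doWorkWG(start, end int) {
	var wg sync.WaitGroup
	done := make(chan int, end-start+1) // buffered so sends never block
	for i := start; i <= end; i++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done()
			spiderPage(page, done)
		}(i)
	}
	wg.Wait()   // all pages have sent their index by now
	close(done) // safe to close: no more senders
	for page := range done {
		fmt.Printf("Saved page %d\n", page)
	}
}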
func main() {
	var start, end int
	fmt.Println("Enter the start page:")
	fmt.Scan(&start)
	fmt.Println("Enter the end page:")
	fmt.Scan(&end)

	doWork(start, end)
}