Scraping a website for government information with R

votes
2

I am scraping a Canadian federal website for a research project on online petitions. This is the whole site: http://www.oag-bvg.gc.ca/internet/English/pet_lp_e_940.html

I need to get the following information for each petition: the petition's hyperlink, the petition number, the title, issue(s), petitioner(s), date received, status, and summary.

For example, under Aboriginal affairs [ http://www.oag-bvg.gc.ca/internet/English/pet_lpf_e_38167.html ], I started with the following code, but I am stuck after finding the title with //h1.

 library(rvest)
 library(tm)
 # tm -> making a corpus and saving it
 library(lubridate)

 BASE <- 'http://www.oag-bvg.gc.ca/internet/English/pet_lp_e_940.html'
 url <- paste0(BASE, 'http://www.oag-bvg.gc.ca/internet/English/pet_lpf_e_38167.html')
 page <- html(url)
 paras <- html_text(html_nodes(page, xpath='//p'))

 text <- paste(paras, collapse =' ')

 getdata <- function(url){ 
 page <- html(url)
 title <- html_text(html_node(page, xpath='//h1'))

 # The following code is just a copy-paste of a code someone gave me.

 list(title=tit, 
   date=parse_date_time(date, '%B %d, %Y'), 
   text=paste(text, collapse=' '))
 }


 index <- html(paste0(BASE, 'index.html'))
 links <- html_nodes(index, xpath='//ul/li/a')

 texts <- c() 
 authors <- c()
 dates <- c()
 for (s in slinks){
 page <- paste0(BASE, s)
 cat('.') ## progress
 d <- getdata(page)
 texts <- append(texts, d$text)
 authors <- append(authors, d$author)
 dates <- append(dates, d$date)
 }
Asked 19/05/2015 at 01:39


1 answer

votes
1

library(XML)
library(rvest)
#please use this code only if the website allows you to scrape it
#get all HTML links on the home page related to online petitions
kk<-getHTMLLinks("http://www.oag-bvg.gc.ca/internet/English/pet_lp_e_940.html") 
#iterate over each title petition with the pattern pet_lpf_e and get all associated petitions under that title
dd<-lapply(grep("pet_lpf_e",kk,value=TRUE),function(x){
  paste0("http://www.oag-bvg.gc.ca",x) %>%
    getHTMLLinks
})
#get all the weblinks
 ee<-do.call(rbind,lapply(dd,function(x)grep("pet_[0-9]{3}_e",x,value=TRUE)))
#iterate over ee and get the details for each petition
ff<-lapply(ee,function(y){
      paste0("http://www.oag-bvg.gc.ca",y) %>%
    html%>%
    html_nodes(c("p","h1"))%>% #h1 is title and p is paragraph
    html_text() %>%
    .[1:7] %>%
    cbind(.,link=paste0("http://www.oag-bvg.gc.ca",y))
})
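
The link filtering in the code above relies on two `grep` patterns. Here is a minimal base-R sketch of how they separate the issue listing pages from the individual petition pages. The two petition paths come from this thread; the third href and the variable names `host`, `hrefs`, `listing_pages`, `petition_pages` and `petition_urls` are assumptions made for illustration.

```r
# Sample hrefs: the first two paths appear in the thread, the third is a
# made-up unrelated link included only to show the filters rejecting it.
host  <- "http://www.oag-bvg.gc.ca"
hrefs <- c(
  "/internet/English/pet_lpf_e_38167.html",  # issue listing page
  "/internet/English/pet_362_e_39682.html",  # single petition page
  "/internet/English/admin_e_41.html"        # hypothetical unrelated link
)

# Keep only the links whose path matches each pattern
listing_pages  <- grep("pet_lpf_e", hrefs, value = TRUE)
petition_pages <- grep("pet_[0-9]{3}_e", hrefs, value = TRUE)

# Build absolute URLs the same way the answer does
petition_urls <- paste0(host, petition_pages)
print(petition_urls)
```

Note that `pet_[0-9]{3}_e` only matches petition numbers with exactly three digits, so it would miss any petition pages whose number has a different length, if the site has such pages.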

e.g., 

    > ff[[1]]

    [1,] "Federal role and action in response to the Obed Mountain Mine coal slurry spill into the Athabasca River watershed"
    [2,] "Petition: 362 "
    [3,] "Issue(s): Aboriginal affairs, compliance and enforcement, human/environmental health, toxic substances, water"
    [4,] "Petitioner(s): Keepers of the Athabasca Watershed Society and Ecojustice"
    [5,] "Date Received: 24 March 2014"
    [6,] "Status: Completed"
    [7,] "Summary: The petition raises concerns about the federal government’s role and actions in response to the October 2013 Obed Mountain Mine coal slurry spill into the Athabasca River watershed. The petition summarizes the events surrounding the spill, and includes information about the toxic substances that may have been contained in the slurry, such as polycyclic aromatic hydrocarbons, arsenic, cadmium, lead, and mercury. According to the petition, about 670 million litres of slurry were released into the environment; the spill had an impact on fish habitat in nearby streams; and the plume may have travelled far downstream and had a potential impact on municipal drinking water. The petitioners ask the government about its approvals and inspections prior to the spill, as well as its response to the spill, including investigations, future monitoring, and habitat remediation. "
         link
    [1,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [2,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [3,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [4,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [5,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [6,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
    [7,] "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html"
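
Each scraped string after the title carries its own label ("Petition:", "Status:", and so on), so the seven values can be split into the named fields the question asks for. The following is a minimal base-R sketch, not part of the original answer: `parse_petition` is a hypothetical helper, and it assumes every page returns the title first, followed by "Label: value" strings with the same labels as in the output above. The summary is shortened to its first sentence here.

```r
# Hypothetical helper: turn the title plus "Label: value" strings into one row.
parse_petition <- function(fields, link) {
  title <- trimws(fields[1])                       # first element is the <h1> title
  rest  <- fields[-1]
  # Split each remaining string at its first colon into label and value
  labels <- trimws(sub(":.*$", "", rest))
  values <- trimws(sub("^[^:]*:\\s*", "", rest))
  names(values) <- labels
  data.frame(
    link        = link,
    number      = as.integer(values[["Petition"]]),
    title       = title,
    issues      = values[["Issue(s)"]],
    petitioners = values[["Petitioner(s)"]],
    received    = values[["Date Received"]],       # kept as text, e.g. "24 March 2014"
    status      = values[["Status"]],
    summary     = values[["Summary"]],
    stringsAsFactors = FALSE
  )
}

# Values copied from the output above (summary truncated to one sentence)
fields <- c(
  "Federal role and action in response to the Obed Mountain Mine coal slurry spill into the Athabasca River watershed",
  "Petition: 362 ",
  "Issue(s): Aboriginal affairs, compliance and enforcement, human/environmental health, toxic substances, water",
  "Petitioner(s): Keepers of the Athabasca Watershed Society and Ecojustice",
  "Date Received: 24 March 2014",
  "Status: Completed",
  "Summary: The petition raises concerns about the federal government’s role and actions in response to the October 2013 Obed Mountain Mine coal slurry spill into the Athabasca River watershed."
)
rec <- parse_petition(fields, "http://www.oag-bvg.gc.ca/internet/English/pet_362_e_39682.html")
str(rec)
```

Applied to every element of the scraped list, rows like `rec` could be stacked with `do.call(rbind, ...)` into a single data frame. The date is left as text because parsing month names with `as.Date(..., format = "%d %B %Y")` depends on the session locale.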
Answered 19/05/2015 at 02:42