Crawling with no budget? Yes, with RCrawler!
Crawling and scraping data has become an essential practice for SEOs over the past several years. Paid solutions exist, such as Screaming Frog, Oncrawl, Botify or Seolyzer. For those who don't have the budget for such platforms, there are solutions built on programming languages like Python or R.
In this article, I will explain how to crawl for free using the RCrawler package. We will see how to configure the information to scrape and how to organize the data so that it can be exploited afterwards. RCrawler is a very interesting package because it natively ships with many features, such as storing the HTML files on disk (you won't have to re-crawl if you forgot to collect some information) or crawling in headless browser mode, which is particularly appreciated for sites built on frameworks like Angular or React.
RCrawler, here we go!
# define your working directory
setwd("/path/")

# install: to be run once
install.packages("Rcrawler")

# and loading
library(Rcrawler)
# dplyr is needed below for the pipes, joins and renaming
library(dplyr)

# what we want to extract
CustomLabels <- c("title",
                  "Meta_description",
                  "h1",
                  "h2",
                  "h3",
                  "Hreflang",
                  "canonical_tag",
                  "meta_robots")

# How to grab it: do not hesitate to add other stuff you want to grab by adding XPath expressions
CustomXPaths <- c("//title",
                  "//meta[@name='description']/@content",
                  "//h1",
                  "//h2",
                  "//h3",
                  "//link[@rel='alternate']/@hreflang",
                  "//link[@rel='canonical']/@href",
                  "//meta[@name='robots']/@content")

# create a proxy configuration if you need it. In this example we do not need it
# proxy <- httr::use_proxy("190.90.100.205", 41000)

# Crawler settings: I add many options but they are not all compulsory
Rcrawler(Website = "https://www.v6protect.fr",
         #Obeyrobots = TRUE,
         #RequestsDelay = 10,
         #dataUrlfilter = "/path",
         #crawlUrlfilter = "/path/",
         #MaxDepth = 1,
         ExtractXpathPat = CustomXPaths,
         PatternsNames = CustomLabels,
         #Useragent = "Mozilla 3.11",
         NetworkData = TRUE,  # inlinks
         NetwExtLinks = TRUE, # outlinks
         statslinks = TRUE
         #use_proxy = proxy,
         #ignoreAllUrlParams = TRUE
         )

# I combine data
crawl <- data.frame(do.call("rbind", DATA))
crawl_complete <- cbind(INDEX, crawl)
Idurl <- as.numeric(crawl_complete$Id)
crawl_complete <- cbind(Idurl, crawl_complete)

# I count inlinks
count_to <- NetwEdges[, 1:2] %>%
  distinct() %>%
  group_by(To) %>%
  summarise(n = n())

# I rename columns
count_to <- count_to %>% rename(Idurl = To, Inlinks = n)

# I join inlinks data with my crawl data
df_final <- left_join(count_to, crawl_complete, by = "Idurl")

# I remove columns that I do not need
df_final <- select(df_final, -Idurl, -Id, -IN)

# I rename columns
df_final <- df_final %>% rename(Outlinks = OUT, Depth = Level)

## PAGERANK calculation
links <- NetwEdges[, 1:2] %>% # grabbing the first two columns
  distinct()

# loading the igraph package
library(igraph)

# Loading website internal links inside a graph object
g <- graph.data.frame(links)

# this is the main function, don't ask how it works
pr <- page.rank(g, algo = "prpack", vids = V(g), directed = TRUE, damping = 0.85)

# I grab results inside a dedicated data frame
values <- data.frame(pr$vector)
values$names <- rownames(values)

# deleting row names
row.names(values) <- NULL

# reordering columns
values <- values[c(2, 1)]

# renaming columns
names(values)[1] <- "PageID"
names(values)[2] <- "pagerank"

# replacing id with url (PageID is the position of the url in NetwIndex)
values$url <- NetwIndex[as.numeric(values$PageID)]
names(values)[3] <- "Url"

# out of 10
values$Pagerank <- round(values$pagerank / max(values$pagerank) * 10)

# I join my crawl with Pagerank information
crawl <- left_join(values, df_final, by = "Url")

# I clean my dataframe by removing the columns I no longer need
crawl <- select(crawl, -PageID, -pagerank)

# HERE WE ARE!
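At this point the crawl data frame already answers a lot of classic SEO questions. Below is a quick, optional sketch of how you might query it with dplyr; it assumes the column names produced by the script above (Url, title, Inlinks, Pagerank, Depth) and simply coerces title to character in case the rbind step produced list columns.

# pages whose title tag is missing or empty
missing_titles <- crawl %>%
  mutate(title = as.character(title)) %>%
  filter(is.na(title) | title == "") %>%
  select(Url, Depth, Inlinks, Pagerank)

# the 20 pages with the strongest internal PageRank
top_pages <- crawl %>%
  arrange(desc(Pagerank)) %>%
  select(Url, Pagerank, Inlinks, Depth) %>%
  head(20)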
You can export your crawl data to a CSV file:

# write.csv2 uses a semicolon separator, handy for opening the file in Excel
write.csv2(crawl, "my_crawl_data.csv", row.names = FALSE)

### BONUS ###

# FIND MY LAST CRAWL: the HTML files are stored on disk by Rcrawler
ListProjects()
LastHTMLDATA <- LoadHTMLFiles("xxxxxxx", type = "vector")
# or, to simply grab the last one:
LastHTMLDATA <- LoadHTMLFiles(ListProjects()[1], type = "vector")

# LoadHTMLFiles() returns a character vector of HTML pages,
# so we put it in a data frame before adding one column per scraped element
LastHTMLDATA <- data.frame(html = LastHTMLDATA, stringsAsFactors = FALSE)

for (i in 1:nrow(LastHTMLDATA)) {
  LastHTMLDATA$title[i] <- ContentScraper(HTmlText = LastHTMLDATA$html[i], XpathPatterns = "//title")
  LastHTMLDATA$h1[i] <- ContentScraper(HTmlText = LastHTMLDATA$html[i], XpathPatterns = "//h1")
  LastHTMLDATA$h2[i] <- ContentScraper(HTmlText = LastHTMLDATA$html[i], XpathPatterns = "//h2")
  LastHTMLDATA$h3[i] <- ContentScraper(HTmlText = LastHTMLDATA$html[i], XpathPatterns = "//h3")
}

## REACT OR ANGULAR CRAWLING SETTINGS ##
# RCrawler handily includes PhantomJS, the classic headless browser.

# Download and install the PhantomJS headless browser
install_browser()

# start the browser process
br <- run_browser()

Rcrawler(Website = "https://www.example.com/", Browser = br)

# don't forget to stop the browser afterwards
stop_browser(br)
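One last optional idea: since NetworkData = TRUE also gave us NetwEdges and NetwIndex, the internal linking can be visualised with igraph (already loaded for the PageRank calculation). The snippet below is only a rough sketch; the label and size choices are arbitrary, and on a large site you will probably want to filter the graph before plotting it.

# build the graph again from the deduplicated internal links
g <- graph.data.frame(NetwEdges[, 1:2] %>% distinct())

# show URLs instead of numeric ids, and size nodes by their number of inlinks
V(g)$label <- NetwIndex[as.numeric(V(g)$name)]
V(g)$size  <- 3 + degree(g, mode = "in")

plot(g,
     edge.arrow.size = 0.2,
     vertex.label.cex = 0.5,
     layout = layout_with_fr(g))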