-
15th Jun 2010, 12:53 PM #1 OP Respected Developer
[F#][SNIPPET] Basic web crawler example (multi-core)
Code:
open System
open System.Net
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net

let getData url =
    url
    |> Http.Prepare
    |> Http.Request
    |> fun result ->
        if result.HasError then result.Data
        else Http.HandleRedirects(result, false).Data

let getUrls html sourceUrl =
    let baseUrl =
        new Uri(sourceUrl)
        |> fun u -> u.Scheme + "://" + u.Host
    new HtmlDocument()
    |> fun doc -> doc.LoadHtml(html); doc
    |> fun doc -> doc.DocumentNode.SelectNodes("//a")
    |> Seq.map (fun node -> node.GetAttributeValue("href", ""))
    |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim())
    |> Seq.map (fun url ->
        if url.StartsWith("http://") then url
        elif url.StartsWith("https://") then url
        elif url.StartsWith("/") then baseUrl + url
        elif url.StartsWith("#") then ""
        else baseUrl + "/" + url)
    |> Seq.filter (fun url -> url.Length > 0)

let rec crawl url crawled =
    Async.Start(async {
        let data = getData url
        let urls =
            getUrls data url
            |> Seq.filter (fun u -> not(List.exists (fun itm -> itm = u) (crawled)))
        do printfn "Crawling: %s\nFound: %i URL's" url (Seq.length urls)
        for u in urls do crawl u (crawled @ [u])
    })

(* ================================================ *)
(* START CRAWLING                                   *)
(* ================================================ *)

let url = "http://thepiratebay.org/"

let rec memCleaner() =
    (* Clean memory every 10 seconds *)
    System.Threading.Thread.Sleep(10000)
    GC.Collect()
    memCleaner()

ServicePointManager.DefaultConnectionLimit <- 10
Http.MaxRedirects <- 2
Http.Timeout <- 10000
Http.KeepAlive <- true
Http.UseCompression <- true
Console.BufferWidth <- 256
Console.BufferHeight <- 768
Console.Title <- "F# Web Crawler"

(* Start the crawler and mem cleaner *)
Async.Start(async{memCleaner()})
crawl url [url]
stdin.Read() |> ignore
Video showing a slightly modified version:
Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol.
-
15th Jun 2010, 01:01 PM #2
-
15th Jun 2010, 01:03 PM #3 OP Respected Developer
-
16th Jun 2010, 11:23 PM #4 Member
Website's: litewarez.net litewarez.com triniwarez.com
:/ makes me wonder how fast the googlebot system is..
Great tut Hyperz, any chance of a C# version for a speed comparison?
Join Litewarez.net today and become a part of the community.
Unique | Clean | Advanced (All with you in mind)
Downloads | Webmasters
Notifications,Forum,Chat,Community all at Litewarez Webmasters
-
16th Jun 2010, 11:28 PM #5 ლ(ಠ益ಠლ)
Website's: extremecoderz.com
lool! We were talking on TeamSpeak about this. Personally, and I don't care what you say Hyp, my method pwns urs! Let the battle begin! I'll post some speed tests using my little engine.
EDIT: chomps ur CPU a little - why so? What part needs so much calculation?
-
17th Jun 2010, 12:11 AM #6 OP Respected Developer
A C# version would have at least 10 times more code to do the same thing.
This doesn't show my method. But sure, bring it on. It's time to kick ass and chew bubblegum!
It has a queue of 500.000 URLs. Each time it crawls a page it extracts 50 URLs on average. Each of those 50 URLs has to be checked against the entire queue to filter out dupes. And at times this happens more than 5 times a second.
500.000 * 50 * 5 = 125.000.000 string matches per second. Yes, that will indeed keep a quad core busy. I'm surprised it only eats 50%. Just goes to show how well it scales. Keep in mind that filtering dupes is not the only thing that happens a few times a second here.
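For anyone who wants to play with the numbers: below is a rough sketch of that arithmetic, plus one possible way to avoid scanning the whole queue for every URL. It is not the method used in the video. `filterNew` and the example.com URLs are made up for illustration, and the `HashSet` swap is an assumption about what would help, not something measured here.

```fsharp
open System
open System.Collections.Generic

// Worst-case comparisons per second when every URL is checked
// against the whole queue (figures taken from the post above).
let queueSize = 500000L
let urlsPerPage = 50L
let pagesPerSecond = 5L
let comparisonsPerSecond = queueSize * urlsPerPage * pagesPerSecond
printfn "Comparisons per second: %d" comparisonsPerSecond

// Hypothetical helper: keep only URLs that were not seen before.
// HashSet.Add returns false for an item that is already present, so each
// URL costs roughly one hash lookup instead of a scan over the queue.
// The lock is there because the crawler runs on several threads and
// HashSet itself is not thread-safe.
let filterNew (seen: HashSet<string>) (urls: seq<string>) =
    urls
    |> Seq.filter (fun u -> lock seen (fun () -> seen.Add(u)))
    |> Seq.toList

let seen = HashSet<string>(StringComparer.Ordinal)
seen.Add("http://example.com/") |> ignore

let found =
    [ "http://example.com/"
      "http://example.com/a"
      "http://example.com/a"
      "http://example.com/b" ]

// The start URL and the repeated "/a" are dropped.
let fresh = filterNew seen found
printfn "%A" fresh
```

The trade-off is memory: the set keeps every URL it has ever seen, which the list version does as well, so it should not be worse on that front.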
-
17th Jun 2010, 09:24 AM #7 Member
Website's: litewarez.net litewarez.com triniwarez.com
I don't think the amount of code matters, I'm looking for a speed comparison between F# and C#.
But those stats are pretty nice, Hyperz!
Actually, you're using libraries as well here:
Code:
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net
-
17th Jun 2010, 10:47 AM #8 OP Respected Developer
It does matter because I'm not gonna bother with a C# port. But the answer to your question is that F# will be faster. Not because it's a better language per se, they are both .NET anyway, but because it's a functional language. Some stuff will perform better when written in C#, some stuff will run better with F#. This is one of the things that will run better with a functional language because of the asynchronous stuff that's involved.
F# is gonna be a better choice over C# for:
- stuff that needs to run in parallel
- complex maths
- AI / neural networks
- situations where a lot of data needs to be handled
- writing a compiler
- ray-tracing
- DSLs
- a few other things
For other stuff you really want to use C#. Did you know they wrote the F# 2.0 compiler in F# itself in under 10.000 lines of code?
And yes it uses the SL library. As I stated in the 1st post this was made to test the Http part of it.
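To give an idea of what "the asynchronous stuff" looks like in F#, here is a minimal sketch using only the standard `Async` module. `fakeFetch` is a made-up stand-in for a page download (it just waits 100 ms and returns the URL length) and the example.com URLs are placeholders; the SL library calls from the snippet are left out because their signatures aren't shown in this thread.

```fsharp
open System

// Hypothetical stand-in for downloading a page: waits without
// blocking a thread, then returns the length of the URL.
let fakeFetch (url: string) =
    async {
        do! Async.Sleep 100
        return url.Length
    }

let urls =
    [ "http://example.com/a"
      "http://example.com/bb"
      "http://example.com/ccc" ]

// Async.Parallel runs the workflows concurrently and returns the
// results as an array in the same order as the inputs.
let lengths =
    urls
    |> List.map fakeFetch
    |> Async.Parallel
    |> Async.RunSynchronously

printfn "%A" lengths
```

The snippet in the first post uses `Async.Start` instead, which fires each crawl off without waiting for a result. `Async.Parallel` is the variant to reach for when the results are needed back in one place.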
-
17th Jun 2010, 10:50 AM #9 ლ(ಠ益ಠლ)
Website's: extremecoderz.com
I'm not even gonna entertain async. I'm going to throw it all in an array and threadPool it like I showed you on TeamSpeak.