- But it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range (pdf), which means they could easily be handled on a single computer—even a laptop.
The story is similar at Yahoo, where the median task handed to its cluster appears to be 12.5 gigabytes (pdf). That's bigger than what the average desktop PC could handle, but it's no problem for a single powerful server.
Unless you deal with large datasets, I think it's hard to appreciate just how difficult designing meaningful analysis becomes. In the end, when it comes to market analysis, you'd probably do just as well going with the 'gut feeling' of someone who has demonstrated a keen understanding of the market. Otherwise, mining data like user behavior is probably only going to let you tweak things around the edges. The problem is, you might learn two important things, and in reacting to them, work counter to a third and possibly more important thing that you didn't even know to look for. That's why I think A/B testing is most often bunk.

As an aside, I just started changing the way tags are stored, and have been trying to anticipate what information might be best to store with them so that we can use them more effectively. I can't anticipate using a separate machine to crunch these data any time soon.
I've been wondering: how is data persistently stored on hubski?
Magnetically. ;) Hubski has no database. Instead, data is stored in flat files as s-expressions (lists; it's Lisp!). All the data you see here is actually in memory: loaded on startup, plus newly added data.

Tags have existed as elements within post and user data, but I am currently creating an independent directory for them, so they can have their own associated elements that can be updated, sorted more quickly, etc. There's definitely some work to be done with all this as things progress. It should be fun.
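To make the general idea concrete, here is a minimal sketch of that kind of flat-file persistence, written in Common Lisp purely for illustration (Hubski's own Lisp dialect and code differ): a tag record kept as an s-expression in its own file under a hypothetical tags/ directory, written out whenever it changes and read back into memory at startup. The directory layout, file names, and record fields here are all assumptions, not Hubski's actual format.

    ;; Write one tag record to its own flat file as a readable s-expression.
    ;; Directory, file name, and fields are invented for this example.
    (defun save-tag (tag)
      (with-open-file (out (format nil "tags/~a.sexp" (getf tag :name))
                           :direction :output
                           :if-exists :supersede
                           :if-does-not-exist :create)
        (print tag out)))

    ;; Read every tag file back into a list, e.g. at startup,
    ;; so all tag data lives in memory afterwards.
    (defun load-tags ()
      (loop for path in (directory "tags/*.sexp")
            collect (with-open-file (in path)
                      (read in))))

    ;; Example record: a plist holding whatever associated elements
    ;; we decide to track per tag.
    (save-tag '(:name "lisp" :count 42))

Because each tag lives in its own file, an individual tag can be updated or re-sorted without rewriting the post and user data it used to be embedded in, which seems to be the point of giving tags their own directory.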
Ugh, filled with so many buzzwords and pointless conjecture from an author who has obviously never actually worked with the technologies he is writing about. It's so painful to read articles about "big data" or "the cloud". The devil is in the details when it comes to any company's needs, so trying to put some stupid buzzword on every company is just pointless. But I get a certain kick out of reading articles from bloggers and "technology" enthusiasts who haven't actually worked with the things they write about, though it's not the same entertainment value they were hoping to provide. :)

And then at the bottom of the article: "Read more about our obsession with The Cloud." Yeah... kind of knew that was coming. Also, you know, the author of that article. Hah.
But buying into something as faddish as the supposed importance of the size of one’s data is the kind of thing only pointy-haired Dilbert bosses would do.