One of the great many things I love about Bristol is that there are a lot of big companies here, and strangely, a lot of them are congregated around a tiny little place called Filton, which, funnily enough, is where I work…
HP is one such company, and for me the fact that it’s HP Research is even better, since R&D is one of the pathways I think would be interesting to pursue when I’m all done at university.
Anyway, this evening, for the February Girl Geek Dinner, I headed over to HP Bristol for a talk on Big Data, Security and Twitterbots with Miranda Mowbray, who’s worked at HP for most of her career.
First off, she talked about what Big Data actually is, and said three things define it:
Volume: Well, it’s big. I can’t remember the actual amount, but once you have a few thousand terabytes you have a big data set. Congratulations!
Velocity: data you want to process within a few hours, not a few weeks. This obviously causes a problem considering the sheer size of it…
Variety: data that’s in more than one format…which, err, doesn’t seem like a strict requirement to me? If your data is in more than one format, you’re just going to convert it to a single format before you start processing to make things easier, so what does it matter if it doesn’t all start out in one format…
Miranda mentioned how a lot of recent Google searches about Big Data included “Hadoop”, an open-source framework for distributed storage and parallel processing, which would solve a few problems. However, a framework is not the be-all and end-all of fixing a problem: two things stand in the way.
1. If your data set is getting exponentially bigger (in 2012, the amount of data on the internet doubled every 18 months, and now it’s growing a heck of a lot faster than that), the algorithms need to scale. Parallelism is all well and good, but if it’s only designed to handle today’s amount of data, tomorrow’s will make it tip over…
2. Ultimately, if you don’t know the right questions to ask, all you really have is…a big data set. Analysing the data is an even bigger problem than the purely technical one of “how the hell do we work with this”. Kind of like asking Deep Thought for the answer to the meaning of life…
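To put point 1 into numbers, here is a minimal Python sketch of what “doubling every 18 months” does to a data set over time. The function name and the time spans are just picked for illustration, and the 18-month figure is the 2012 one quoted above.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """How many times larger the data is after `months`,
    if it doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


# Illustrative time spans only; none of these come from the talk.
for years in (1.5, 3, 6, 9):
    factor = growth_factor(years * 12)
    print(f"after {years} years: {factor:.0f} times today's data")
```

So a system sized for today’s data would be facing 16 times as much after six years, which is why the algorithms have to scale and not just the hardware.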
Miranda then went on to talk about the security-related data processing she does at HP, particularly detecting attacks from DNS events. Some of the statistics: HP collects 120,000 DNS events per second (per second rather than per day, going by the per-minute figure below) and holds the data for 90 days, which comes to around 2 petabytes. An automated system throws away 98% of this, since a lot of DNS data isn’t related to an attack or useful for working out whether there was one, but what’s left is still quite a lot. On average there’s 1 bad event in every 1 million DNS events, so for HP that’s…7 bad events per minute. The security team therefore needs a way of suppressing alerts or automating the clean-up: either they get notified about a bad event and it goes into the backlog for them to deal with later, or the user in question gets an automated email suggesting they get rid of the bad stuff on their machine.
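A quick sanity check of those numbers, as a rough sketch. One thing to flag: the 7-per-minute figure only follows if 120,000 is a rate per second rather than per day, so the code below works out both readings. The function and variable names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BAD_ONE_IN = 1_000_000  # "1 bad event in 1 million", as quoted from the talk


def bad_events_per_minute(events_per_second: float) -> float:
    """Expected number of bad events per minute at a given event rate."""
    return events_per_second * 60 / BAD_ONE_IN


# Reading A (assumption): 120,000 DNS events per second
per_second_reading = bad_events_per_minute(120_000)

# Reading B: 120,000 DNS events per day
per_day_reading = bad_events_per_minute(120_000 / SECONDS_PER_DAY)

print(f"120,000 per second -> {per_second_reading:.1f} bad events per minute")
print(f"120,000 per day    -> {per_day_reading:.6f} bad events per minute")
```

Reading A gives about 7 bad events per minute, matching the figure from the talk, while reading B gives far less than one bad event per day.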
In the break Miranda asked for questions, and the usual topics came up: the NSA, GCHQ, various reports about how much data Facebook gives to other websites, and how HP, and Miranda in particular, are hoping to make sure their practices are ethical. This brought up the “Netflix Report”, in which Netflix published data on the movies people had rented. From this collated data, analysts found they could work out a person’s name from their movie habits, which is, erm, quite scary.
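How can a list of movies give away a name? This is not necessarily the method used in that case, but as a toy sketch of the general idea: if the “anonymous” records can be lined up against some public source where people post under their real names, the biggest overlap points at the likely match. All the data, names and the function below are invented for illustration.

```python
from typing import Dict, Set

# Made-up "anonymised" rental histories (IDs instead of names).
anonymised_rentals: Dict[str, Set[str]] = {
    "user_001": {"Film A", "Film B", "Film C", "Film D"},
    "user_002": {"Film A", "Film E", "Film F"},
    "user_003": {"Film G", "Film H"},
}

# Made-up public reviews posted under real names elsewhere.
public_reviews: Dict[str, Set[str]] = {
    "Alice Example": {"Film B", "Film C", "Film D"},
    "Bob Example": {"Film E", "Film F"},
}


def most_similar_user(films: Set[str], candidates: Dict[str, Set[str]]) -> str:
    """Return the anonymised ID whose history shares the most films."""
    return max(candidates, key=lambda uid: len(candidates[uid] & films))


for name, films in public_reviews.items():
    print(name, "looks like", most_similar_user(films, anonymised_rentals))
```

Real data would be far messier than this, but the more unusual someone’s movie habits are, the fewer overlapping titles it takes to single them out.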
The next section of the talk was about Twitterbots. A few statistics:
11% of Twitter users in 2011 were bots.
8% of links on Twitter are bad…I don’t know if this is current.
40% of bots get followed back on Twitter, but only 20% of friend requests from bots on Facebook get accepted. If the bot has mutual Facebook friends, it’s 50%.
Miranda spoke about the types of Twitter bots (scandal, marketing, even the Internet of Toaster bots) and detailed cases where they’d been banned, such as one of the “theRealDonaldTrump” accounts. Interestingly, speaking of the scandal bots, which pick a female celebrity and say “dead/pregnant/naked!!!” with a link, Miranda mentioned she’d asked whether it was possible to sue the owner of a bot for this kind of malicious tweeting. Apparently it’s not illegal, which is kind of strange.
Trying not to creep us all out, Miranda also mentioned bot accounts such as ISSAbove and the Internet of Things’ use of Twitter. Other people brought up bots they’d made themselves for service reasons, such as ones that scrape Twitter and the web for any mention of their favourite basketball teams or scores, and flight bots which let you request the next flight time and cost for various journeys.
Overall very interesting, and a great excuse to have a nosey at HP! I apologise for how lengthy this is, since I haven’t really summarised…but I really enjoyed the whole talk, and I might even put Hadoop next on my “things to try” list.