Why Big Data matters

Big Data and Analytics Sydney 2013

I attended the Big Data and Analytics Innovation Summit 2013 to look into best practice in the field and what other organisations were doing that might be applicable to our work in the Library. This blog post starts with an introduction to the field of Big Data and finishes with thoughts on how we might apply Big Data techniques for the UQ Library community.

What is Big Data?

In a world where all of us interact with digital systems ever more frequently, there is the potential for these systems to monitor and report on our behaviours in ways that would previously have been unfeasible. This has recently become worldwide news with the Snowden revelations; however, governments monitoring the electronic communications of their citizens is not new, and the Echelon project, which monitored private and commercial satellite communications, predates it by decades. There are numerous other applications of this kind of knowledge as well, particularly in healthcare, physics and astronomy. Searches on the Internet for symptoms of diseases enable us to monitor the outbreak of viruses; Google's flu trend reporting, for instance, can indicate the spread of influenza well before traditional surveillance methods.

The data that governments and corporations are using to make predictions about disease, neutrons, terrorism or civil dissent is vast, on a scale that would have made traditional techniques untenable; the Large Hadron Collider alone has 150 million sensors pushing out data 40 million times per second. The phrase used to describe the unique capture, storage and utilisation requirements of this new age of information is Big Data.

Capturing data is the first step in the process, and wherever there is a digital device there is a potential source of data. To some extent, the more data you capture the better, as it all represents information about your organisation and is thus almost inevitably useful. This can include webpage requests, web searches, emails, blogs, machine accesses, social media, video, audio and a wide range of custom data-capturing systems.

Once the data is stored, a process called "mining" begins, whereby you inspect the data to see if any useful trends become evident. Visualisations of the data, particularly non-static ones enabled by modern web applications, can then be used to present the information in a variety of ways.
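
As a concrete (and deliberately simplified) illustration, the sketch below "mines" a hypothetical search log by counting searches per day and flagging unusually busy days. The file name and log format are assumptions for the example, not an existing Library system.

<?php
// Minimal "mining" sketch: count searches per day from a hypothetical
// tab-separated log (timestamp, then search terms) and flag busy days.
$countsPerDay = array();

foreach (file('searches.log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode("\t", $line, 2);
    $day = date('Y-m-d', strtotime($parts[0]));
    if (!isset($countsPerDay[$day])) {
        $countsPerDay[$day] = 0;
    }
    $countsPerDay[$day]++;
}

// Flag any day whose search volume is well above average, as a crude
// indicator of a trend worth visualising or investigating further.
$average = array_sum($countsPerDay) / max(count($countsPerDay), 1);

foreach ($countsPerDay as $day => $count) {
    if ($count > 1.5 * $average) {
        echo "$day: $count searches (daily average " . round($average) . ")\n";
    }
}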

How is Big Data applicable to us?

Few of the Library's current systems are on a scale that would fall under the remit of Big Data techniques. Modern drives, even on desktop machines, easily handle terabytes of information, and 64-bit computers provide vast memory address spaces. With hardware continually improving, the target of what can be done with "traditional" hardware is always on the move, and where once a terabyte of information might have needed specialised systems, today that is no longer the case.

The sorts of information that Big Data systems are storing and parsing do have direct relevance to the UQ Library in understanding and responding to the needs of the University community. One of the less positive parts of the Big Data and Analytics Conference I attended was seeing how many of the best implementations were in well-funded areas such as betting and casinos. They have virtually unlimited resources and the monetary self-interest to implement cutting-edge solutions. Take, for instance, this story of a successfully prosecuted case of fraud:

A lady at a betting table starts winning an inordinate amount of the time; this is being monitored digitally and a flag is raised. Security use camera footage of her face to build a facial ID, essentially a set of uniquely identifying measurements, and use this to track her journey back through the casino to the entrance she used. As it is a carpark entrance, they match her parking ticket and soon identify her car. Taking her licence plate, they discover it is a rental car and are able to access her rental details and thus her identity. They then search for her known associates, finding that 18 years previously she had shared a room at college with an employee of the casino. Matching this employee, they realise it is the person sitting across from her dealing the cards. On inspecting the table footage, available from multiple angles, it was clear they were using a system of hand signals to cheat, and both of them ended up in prison.

Whilst we may not want to track people's faces digitally, we might be interested in monitoring the flow of people in other ways. We can monitor how many students are using our Internet terminals via their logins, but we don't know about the resources where a login isn't required, or about the students who can't log in because there aren't any free resources. Students who enter the campus with a mobile phone with Wi-Fi activated can be identified by their MAC addresses. If we were to monitor this, which casinos do whilst hashing the MAC address for privacy reasons, we could follow the flow of their movements around the University to identify potential problems in the levels of service we are providing to them. For instance, tracking how many students enter library spaces and leave within a short space of time would enable us to identify unfulfilled demand. This might also lead to an improved system for booking resources within our spaces, such as reserving a PC from your mobile phone, or simply displaying usage levels to our users so they can self-organise around times when the spaces are likely to be free.
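
A minimal sketch of that idea, assuming the wireless infrastructure can report when a device is first and last seen in a space; the sightings array, salt and threshold below are all illustrative, not data from any real system.

<?php
// Hash MAC addresses (with a secret salt) before storage, then flag
// visits shorter than a threshold as possible unmet demand.
$salt = 'a-long-random-secret';   // kept private so hashes can't be reversed by lookup
$shortVisitThreshold = 10 * 60;   // ten minutes, in seconds

// Hypothetical sightings: MAC address, and first/last time seen in a library space.
$sightings = array(
    array('mac' => 'a4:5e:60:c1:22:10', 'first_seen' => 1385000000, 'last_seen' => 1385000300),
    array('mac' => '10:9a:dd:4f:8b:02', 'first_seen' => 1385000100, 'last_seen' => 1385007200),
);

$shortVisits = 0;
foreach ($sightings as $s) {
    // Store only the salted hash, never the raw MAC address.
    $visitor = hash('sha256', $salt . strtolower($s['mac']));
    $duration = $s['last_seen'] - $s['first_seen'];
    if ($duration < $shortVisitThreshold) {
        $shortVisits++;
        echo "Visitor $visitor stayed only " . round($duration / 60) . " minutes\n";
    }
}

echo "$shortVisits short visits detected - possible unmet demand for space or PCs\n";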

A vast amount of information is present in the behaviour of students in relation to their study. A student looking for resources on a topic usually uses search, but that search is handled purely by an external service. This ignores key data such as resources their lecturer might recommend, or that fellow students from current and previous semesters have used. By monitoring the behaviour of other users who match their course profile, we can recommend resources to them that previously only a person with intimate knowledge of the field could. People are spending more and more time searching and parsing data, and anything we can do to improve and streamline this process will have positive implications for UQ services and general academic success. Information is becoming ever more pervasive, and we can do more to help our users navigate it and avoid information overload.
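
One simple way to start would be a co-occurrence query: recommend the resources most used by other students in the same course that the current student hasn't touched yet. The sketch below assumes a hypothetical resource_usage table; it illustrates the approach rather than an existing Library schema.

<?php
// Course-based recommendation sketch: most-used resources among students
// of the same course, excluding anything this student has already accessed.
$pdo = new PDO('mysql:host=localhost;dbname=library', 'user', 'password');

function recommendForCourse(PDO $pdo, $courseCode, $studentId, $limit = 5)
{
    $sql = "SELECT resource_id, COUNT(DISTINCT student_id) AS users
            FROM resource_usage
            WHERE course_code = :course
              AND resource_id NOT IN (
                  SELECT resource_id FROM resource_usage WHERE student_id = :student
              )
            GROUP BY resource_id
            ORDER BY users DESC
            LIMIT " . (int) $limit;

    $stmt = $pdo->prepare($sql);
    $stmt->execute(array(':course' => $courseCode, ':student' => $studentId));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

print_r(recommendForCourse($pdo, 'INFS1200', 's1234567'));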

Social media is becoming ever more utilised in the academic space. Whilst much of the data discussed previously is structured and relatively easy to analyse, social media provides data in an unstructured form, and thus more advanced techniques are required to parse meaning from it.

eSpace Altmetric ordered search
Example Altmetric page

In the Library's technology unit we are already using an external service, Altmetric, to add article-level information about social media attention. Currently this doesn't enable us to precisely understand the sentiments of the users: we know that people are talking about an article, but not what they are saying. Efforts to parse sentiment do yield generally useful results, but understanding things like sarcasm and context is difficult for an automated system. Facebook is leading the attempts to deal with this by letting users give their status updates an emotional context, but until that gets wider uptake we will only be able to tell our users that something is being discussed, not whether it is being discussed for good or bad. There is also the question of providing services ourselves: as we move further into the social media space to engage with the library community, we would want to monitor that space to see how our services are being received. With a sufficiently advanced system, we could do this in near real time.
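
To make that limitation concrete, a naive keyword-based scorer like the sketch below is roughly the level of analysis simple tooling gives us; the word lists are invented for illustration, and sarcasm or context will happily defeat it.

<?php
// Naive keyword-based sentiment scoring for social media mentions.
$positive = array('great', 'useful', 'helpful', 'excellent', 'love');
$negative = array('broken', 'slow', 'useless', 'terrible', 'hate');

function scoreSentiment($text, array $positive, array $negative)
{
    $score = 0;
    foreach (preg_split('/\W+/', strtolower($text)) as $word) {
        if (in_array($word, $positive)) {
            $score++;
        } elseif (in_array($word, $negative)) {
            $score--;
        }
    }
    return $score; // > 0 roughly positive, < 0 roughly negative, 0 neutral/unknown
}

echo scoreSentiment('The new library search is really useful and fast', $positive, $negative), "\n"; // 1
echo scoreSentiment('Library wifi is broken again, so slow', $positive, $negative), "\n";             // -2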

These are just a few ways we might capture more data about our users and apply it to respond to their needs. Traditionally we have tried to service the needs of our users from our own understanding and perhaps a feedback form or periodic survey. Whilst there will always be a place for human judgement in understanding our users, there is a realm of data we can collect which means the application of the art of customer service can be moved further towards a science.

The Scopus Custom Dataset Project

UQ Organisation Information Tool

The Scopus Custom Dataset Project was initially created to assist with the University's ERA submission process. It provides custom metrics and benchmarking on citation rates and other data for a wide range of areas, useful in helping UQ understand and present its position amongst other Australian universities.

To enable this, the Scopus Custom Dataset was purchased and a developer (me) was hired to work with it. The dataset is an ever-expanding cluster of files, each one representing information about academic documents published since 2005. Currently this takes the form of over 15 million complex XML documents, which we import into a MySQL database on a medium-sized server.

Initially the importing has been done using Zend Framework 1 based PHP scripts, parsing each document individually before inserting the various rows into the MySQL database. Running this on one machine takes roughly a day and a half per year of data. During the first import this was parallelised onto multiple servers to speed up the process, which required assigning each server its own primary key space so that each server's database could be dumped and then imported into the centralised server. Though this was much faster, it didn't achieve its potential because many of these servers were virtual machines (VMs) living on the same physical machine. If it could be run on multiple machines with their own resources, the whole process could be reduced to 2-3 days.
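
In outline, each per-document import step looks something like the sketch below. The element names and documents table are simplified stand-ins rather than the real Scopus schema, and the key-space offset shows how the parallel servers avoid primary key collisions when their databases are later merged.

<?php
// Simplified per-document import: parse one XML file, insert one row.
$pdo = new PDO('mysql:host=localhost;dbname=scopus', 'user', 'password');

// Each parallel server gets its own primary key space so the resulting
// databases can be dumped and merged without collisions.
$serverOffset = 2 * 1000 * 1000 * 1000;  // e.g. server 2 starts at 2 billion
$nextId = $serverOffset;

$stmt = $pdo->prepare(
    'INSERT INTO documents (id, scopus_id, title, pub_year) VALUES (?, ?, ?, ?)'
);

foreach (glob('/data/scopus/*.xml') as $file) {
    $xml = simplexml_load_file($file);
    if ($xml === false) {
        continue; // skip unparseable documents; a real run would log them
    }
    $stmt->execute(array(
        $nextId++,
        (string) $xml->{'scopus-id'},
        (string) $xml->title,
        (int) $xml->{'publication-year'},
    ));
}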

Part of my attendance at the Big Data conference was to look at ways to reduce this even further. In discussion with other attendees, I found that a number of them were using Hadoop to parallelise their processing. Whilst I was hoping to learn some technical details about implementing Hadoop, and perhaps even Amazon's Elastic MapReduce, the preparation I had done to understand the technologies before attending covered the material that was presented. Essentially this confirmed what we'd already understood about the ability to use Hadoop to reformat the XML into a more MySQL-friendly format, speeding up the import process. A Hadoop cluster has been developed at UQ, and we have asked for time on this when it becomes available. Of interest was that the main presenter on Hadoop used Elastic MapReduce for development but their own Hadoop cluster for production. I presume this was pricing related, but I didn't get a chance to ask for further detail.
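
One way this reformatting could work is a Hadoop Streaming job whose mapper, sketched below in PHP, turns each XML record into a tab-separated line ready for MySQL's LOAD DATA INFILE. It assumes the job is configured to deliver one XML record per input line, and the field names are illustrative rather than the real Scopus structure.

#!/usr/bin/php
<?php
// Hadoop Streaming mapper sketch: XML record in, tab-separated line out.
while (($line = fgets(STDIN)) !== false) {
    $xml = @simplexml_load_string(trim($line));
    if ($xml === false) {
        continue; // ignore malformed records; a real job would count them
    }

    $fields = array(
        (string) $xml->{'scopus-id'},
        str_replace(array("\t", "\n"), ' ', (string) $xml->title),
        (int) $xml->{'publication-year'},
    );

    // One tab-separated line per document; a straight reformatting job
    // needs no reducer, so the output can be bulk-loaded directly.
    echo implode("\t", $fields), "\n";
}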

With respect to storing and querying the data, the most interesting phrase, mentioned only in passing at the conference, was NewSQL. These systems attempt to provide some of the scalability of NoSQL solutions whilst maintaining ACID-compliant databases. Whilst it sounds like a potential Holy Grail of database systems, it will be interesting to see what their limitations and infrastructure requirements are, and, importantly, what performance levels they can achieve.