Consuming Huge Amounts Of Unstructured Data


The recent presentation about Oracle Data Storage started me thinking about data. I’ve worked with the Oracle database since 1995: lots of structured data in fields, tables and databases. A good part of my job has been turning these large amounts of data into usable information. The result may itself be unstructured, such as “the trend is …”.

But when I think of it, so much of what I actually consume and interact with is, in fact, very unstructured. I started making a list of everything.

So much Content on the Web!

Google
Blogs and feedback
Webpages, articles, including animated GIF files that help explain things
News
LinkedIn
Pictures such as on Flickr
Newsgroups, Forums
YouTube/videos
eBay, Amazon
Online radio stations
Twitter
Job websites
Facebook, MySpace
Social media

Communication:

Email
Text messages
Spam email that needs to be ignored or deleted

Filetypes I’ll View or Download:

PDF
PowerPoint (PPT)
Word Documents
Software or patches to install
MP3 sound files
MP4 or QuickTime files

In Electronic Job Searches, there are:
Job descriptions, Resumes, Job Applications

Traditional Media, Paper-Based (treeware):

Advertising
Newspapers
Magazines
Books
Flyers
Postcards
Junk/Direct mail
Business cards
Brochures
Point Of Purchase displays
Graphics and text on products
Billboards
Bills:  Phone, etc.
Bank statements

Traditional Media, Not Paper-Based:

TV shows, news, advertisements
Radio
Movies:  includes product placement, and “the message”
Music
DVDs
CDs

Person to Person:

Face to face conversations
Phone calls
Meetings
Speaking engagements, listening, or presenting
Appointments with professionals like doctors

Electronic Storage Media I use:
Hard drives, USB, CD, DVD, camera

It makes me think of a few things.

No kidding, we are bombarded by hundreds or thousands of messages a day! And that estimate only covers the advertising messages.

When I research issues, it can become rather overwhelming. There is so much on the web. It’s easy to keep going from one website to another. No single site may have the particular answer, but each may hold a piece of it. I can end up with dozens of Firefox windows open, each with multiple tabs.

Google has managed to index all this content really well. Most of the time, a search gives a pretty meaningful result. But I’m often looking for very specific things, or solutions to obscure problems, and that can take a while. Sometimes I’ve concluded that I’m the first to come up with a solution, because I couldn’t find one on the web up to that point.

In some senses, things were easier before the web, when we read books. Partly, that’s because you could only read so many in a day. Books were also more thoroughly thought out then. It cost money to publish, so you wanted to be sure the book was accurate and saleable. There was no flaming and ranting online to wade through, or engage in. Letters to the editor were, well, edited.

Earlier this year, I started scanning the collection of paper technical binders I’d created into PDF files. In this way, I keep the information, but lose the heavy and bulky paper. With PDF files, I can also search them using cygwin and grep to look for keywords, such as “tablespace”.
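
A keyword search like that can also be sketched in Python. This is only a rough sketch, not the cygwin and grep approach itself. It assumes the text has already been pulled out of the PDFs into plain .txt files (for example by OCR or a PDF-to-text tool), since a scanned page is just an image until then. The folder name notes_text and the function find_keyword are made up for illustration.

```python
from pathlib import Path


def find_keyword(folder, keyword):
    """Return (file name, line number, line) for each line containing keyword.

    The match is case-insensitive, so "tablespace" also finds "TABLESPACE".
    """
    needle = keyword.lower()
    hits = []
    # Assumption: each PDF has a matching .txt file of extracted text.
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    hits.append((path.name, number, line.strip()))
    return hits


if __name__ == "__main__":
    # "notes_text" is a hypothetical folder name, used only for illustration.
    for name, number, line in find_keyword("notes_text", "tablespace"):
        print(f"{name}:{number}: {line}")
```

Each hit lists the file, the line number and the matching line, which is roughly what grep with line numbers turned on would show.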

The irony for me is that my job is dealing with structured data.  But so much of what I consume is in fact unstructured.   Hmm.

Just a quick post today.  Please add anything I’ve missed.

2 Responses to Consuming Huge Amounts Of Unstructured Data

  1. Steve says:

    Rodger —
    You might be interested in what’s going on with DIY Book Scanning. Many active fora on hardware, software development, book scanning projects, legal & philosophical issues of turning hard copy into digital formats with scanning, OCR, and indexing technologies. Not only PDF but new formats such as DjVu are being considered.

    Visit http://diybookscanner.org/

    • rodgersnotes says:

      Thanks. As for my own notes, I scanned them into PDF files, as my other post (For Your Eyes Only) mentioned. When you think of it, it’s almost library science to keep track of everything these days.
