OLAP, Darwin and Evolution

Edited 21st Sept 05 to include picture of herbarium sheet.


 


In my article ‘Business Intelligence (BI): The way it is without the blah blah’, I mention that I was aware of some exiting research using the SQL Server OLAP engine, the results of which where hopefully about to hit Nature magazine.  I couldn’t say anymore at the time for fear of jeopardising the conclusions’ debut in Nature (If Nature isn’t first to publish, it’s not in Nature).  I claimed this research would demonstrate OLAP, as a technology, can be used as a useful tool in fields well beyond those fenced in by the concept of Business Intelligence.  Well this research has indeed made it to the hallowed pages of this erudite organ; if you are prepared to pay the fee, you can find it here ‘What Henslow taught Darwin’ @ http://www.nature.com/nature/journal/v436/n7051/index.html#Feature.


 


The article doesn’t single out SQL Server per se, but I can assure you, as someone who is in touch with one of the authors, Analysis Services OLAP engine made a significant contribution to the research mentioned.  Hopefully follow up material from the authors will document how OLAP was used.  Apparently there’s so much interest in this story there’s even talk of a film!


 


Let me give you an abstract to show how SQL Server Analysis Services OLAP has contributed to mankind’s understanding of its own history and the study of biology therein.


 


Darwin was a student (1829-31) of Professor John S. Henslow of Cambridge.  Henslow is well known for arranging Charles Darwin’s berth on the good ship HMS Beagle, the ship that took Darwin to the Galapagos islands.  The Galapagos islands are where Darwin discovered ‘Darwin’s Finches’, birds that had ‘evolved’, by appearance and behaviour, to such an extent that Darwin had to question whether they were different varieties of the same species, or different species altogether.  In pursuit of the answer to this question, Darwin formed his seminal concepts that became the backbone to On the Origin of Species.


 


In the shadow of the afore mentioned article, we must now understand that it was Henslow who had first recognised variation in species, and that it was this recognition and the desire to seek its verification that put Darwin on the Beagle.  Furthermore, it was the fidelity of Darwin to fulfil this ambition, with its incumbent rigours in procedure, that ultimately lead him to break free of the creationist shackles that had hindered his mentor, and from which, sadly, his mentor would never be freed.


 


The research in this article covers 10,172 plants collected by Henslow.  By creating an OLAP cube to analyse the sheets of paper holding these plant samples, the researchers show that Henslow organised his documentation by none other than the variation he observed to be within the limits of species variation.  Nobody is known to have attempted this before.  It is of the stuff that makes a good film that we can now visit his herbarium at Cambridge and see, with newly enlightened eyes, how the plants within it have been carefully arranged to illuminate this most potent observation; a place where, hitherto, this brilliant enlightenment has been so shrouded in the dark silence of ignorance.


 


Below, courtesy of the authors, is an image of one of Henslow’s herbarium sheets: Phleum arenarium. Eight numbered individuals are arranged in order of increasing height.  Plants 1-5 were collected 3rd June 1829 at Mildenhall, Suffolk by J.S. Henslow.  Plants 6-8 were collected in June 1822 at Liverpool by W. Wilson.



SQL Server Integration Services Samples - Including Fuzzy Logic

At a recent Technet presentation, someone came up to me and asked me if I could post my Fuzzy Logic sample package to my Blog.  I confess I haven’t got round to wrapping it all up, but now it looks like I’m off the hook as one has been supplied with the samples that come with the product.


Here’s the complete list of samples (as it is today)


Installing Sample Integration Services Packages - Provides instructions on installing sample packages from the stand-alone Microsoft Windows installer.
 
Calculated Columns Package Sample - Demonstrates how to use the Aggregate and Derived Column transformations to compute values and write the results to a file.
 
Capture Data Lineage Package Sample - Demonstrates how to get lineage information about data, add it to the data flow, and write the source data and the lineage information to a table.
 
Process XML Data Package Sample - Demonstrates how to extract data from an XML data file by using an XPath operation, convert the data to a flat-file format, and insert the data into a text file.
 
Data Cleaning Package Sample - Demonstrates how to clean data by applying exact and fuzzy matching of new customers to existing customers, and how to identify duplicate customers by using fuzzy grouping.
 
Execute SQL Statements in a Loop Package Sample - Demonstrates how to run SQL CREATE TABLE statements in a loop, evaluate and clean data values, and insert data into tables.
 
Execute Process Package Sample - Demonstrates how to use the Execute Process task to run an executable that expands a CAB file, read the expanded data, and insert the data into a table.
 
SMO Tables DBCC Package Sample -  Demonstrates how to enumerate the user tables in the AdventureWorks database by using SMO, use a Script task to retrieve the schema and table names, put the names into two variables, and use an Execute SQL task that executes a DBCC CHECKCONSTRAINTS command using the two variables.
 
AWDataWarehouseRefresh Package Sample - Demonstrates how to refresh the AdventureWorksDW data warehouse from the AdventureWorks OLTP database.
 
Synchronizing Adventure Works DW Partitions Package Sample - Demonstrates how to automate the synchronization of partitions between relational and Analysis Services databases by using an Integration Services package. The sample package uses the Script task and the Analysis Services Execute DDL task, and illustrates the use of the ASSL and AMO technologies.”

Bill Gates - A short chat with Microsoft’s Chief Software Architect

Its not really SQL Server, but this short 16min video comes from the guy who ultimately takes resposibility for SQL.  Bill is asked a number of interesting questions that put his contimued relationship with Microsoft under the spotlight  - like why does he bother working?  If you haven’t seen Bill speak much - this is a very good cameo.


http://channel9.msdn.com/Showpost.aspx?postid=111598


PS its also a good intro to Channel9

SQL 2005: Making sense of the new Optimistic Concurrency Controls and the readers that block writers

For a long time now Oracle enthusiasts have bashed SQL Server for not having Optimistic Concurrency Controls.  They would demonstrate a SQL Server reader blocking a writer and possibly a writer blocking a reader.  Thus, having appeared to demonstrate that SQL can’t handle these two operations concurrently, they would therefore conclude that SQL can’t scale - certainly not to enterprise level workloads; workloads that involve lots of people writing to a database at the same time as lots of people wanting to read from it.


 


This, of course, is a rather dirty trick; you only have to look at http://www.microsoft.com/sql/evaluation/compare/benchmarks.mspx to see SQL Server 2000 performance against various typical 3rd party LOB application workloads that involve reading and writing.   So how did/do they perform these mischievous tricks - here’s how:


 


Open connection 1


 


USE AdventureWorks


GO


BEGIN TRANSACTION


            UPDATE Production.Product


            SET StandardCost = 5.0,


                        ListPrice = 11.50


            WHERE Name LIKE ‘%sock%’


 


Open connection 2


 


USE AdventureWorks


GO


            SELECT  Name,


                        StandardCost,


                        ListPrice


            FROM Production.Product


            WHERE Name LIKE ‘%Sock%’


 


You will now see that connection 2 will not return anything - instead it just sits there doing nothing.  Here the writer on connection 1 blocks the reader on connection 2.  This blocking will continue until connection 1 either commits or rollbacks the transaction it started with BEGIN TRANSACTION.


 


Well it would be very easy, at this stage, to get into a religious slanging match - and you know I don’t do religion very easily.  Luckily, with SQL 2005, I don’t even have to think about get religious about this situation, because SQL Server 2005 now supports the necessary isolation levels to stop this happening.


 


At this point I’m reminded of an old joke told by Tommy Cooper:  Bloke goes to see a Doctor, he lifts his arm in the air and says, “Doctor, every time I do this it hurts.”  And the Doctor says, “Well don’t do it then.”


 


I’m reminded of this joke because if you begin a transaction, update a value and then you find it causes other people problems - I would suggest you don’t do it.  Why begin a transaction, do some work and then walk off without committing it?


 


Okay - so you might have a ‘really big update’ that takes ages to run and it can’t be batched up into smaller transactions and you need to run some very long running reports that have to have a transactionally consistant view of the data.  Indeed; you might also be very unlucky and get struck by lightening!


 


In SQL 2000 there are various ways to obviate the encumbrance of this ‘really big update’ and nasty report – but I don’t want to go down that road, like I said I want to avoid a religious debate.  Now, in SQL 2005, we have exactly the same capability as Oracle to make such an operation very easy for the developer.  Like Oracle, this capability employs the concept of ‘versioning’, which means each connection in the above scenario will work with its own version of the data, thus avoiding contention. 


 


However as with Oracle, versioning takes up both processing and i/o at the server, resources that are potentially very expensive and not to be wasted if you don’t have to.  Does Oracle do any benchmarks with this feature turned on?  I don’t think so, and nor would SQL for that matter.


 


I will argue that versioning is frequently used as a way to make developers jobs easier and not because it’s actually necessary.  By that I also mean it makes it easier for developers to write sloppy code - and by that you can happily infer I believe it makes it easier to employ cheaper developers.


 


So how do we go about employing these cheaper developers, sorry, reducing contention on the database?  To do this we now have two new isolation levels:


 


1. SNAPSHOT ISOLATION (Transaction-level Snapshot)


 


Administrators must set the new ALLOW_SNAPSHOT_ISOLATION  database option to allow Snapshot Isolation.


 


In order to start a transaction that uses Snapshot Isolation a developer must SET TRANSACTION ISOLATION LEVEL SNAPSHOT


 


RESULT: Every statement within a Snapshot Isolation Transaction sees the same version of data comprised only of committed changes which occurred before the start             



of the transaction.  Other statements inside other transactions do not see the changes made inside this Snapshot Isolation Transaction.


 


2. READ COMMITTED WITH SNAPSHOT ISOLATION (Statement-level Snapshot)


 


Administrators must set the new READ_COMMITTED_SNAPSHOT database option to allow Read Committed Snapshot Isolation


 


No further application level changes are required to have statements use Read-Committed Isolation.


 


RESULT: Each statement sees a version of the data that was committed just before the statement began, instead of when the resource is read.  This is merely a new implementation of read committed that is non-locking and non-blocking; the data is accurate only as at the start of the statement.


 


 


 

 
© 2007 - MyoKyawHtun.com