Saturday, February 4, 2017

How to Process and Deploy an SSAS Cube

Summary

This is an example of deploying and processing a cube in SSAS.


Problem

In our situation, deploying and processing a cube failed with the following errors.

Errors and Warnings from Response

Internal error: The operation terminated unsuccessfully.

OLE DB error: OLE DB or ODBC error: Login failed for user 'NT AUTHORITY\NETWORK SERVICE'.; 28000; Cannot open database "target" requested by the login. The login failed.; 42000.

Errors in the high-level relational engine. A connection could not be made to the data source with the DataSourceID of 'Target', Name of 'Target'.

Errors in the OLAP storage engine: An error occurred while the dimension, with the ID of 'Dim Hos Code', Name of 'Dim Hos Code' was being processed.

Errors in the OLAP storage engine: An error occurred while the 'Hosp Code' attribute of the 'Dim Hos Code' dimension from the 'SSAStask' database was being processed.

Server: The operation has been cancelled.


Solution

Deploying a cube is straightforward. To resolve this error, follow the steps below.

First, select the database (or project) name in Solution Explorer, right-click it, and click Process.

After clicking Process, you are first asked “Would you like to build and deploy the project first?”. If this is the first time you are deploying or processing the cube, click Yes; otherwise click No.

I click Yes and the deployment starts. It reports that deployment completed successfully. This means the database structure has been created on the Analysis Services server, but you cannot browse the cube yet because it has not been processed. The Process dialog appears next.

Now click Run. Here it fails with the error “Process failed”:

To fix this error, go to Solution Explorer and double-click the .ds (data source) file. The failure happens because SSAS connects to the source relational database using the data source’s impersonation setting, and by default it uses the service account (NT AUTHORITY\NETWORK SERVICE), which cannot log in to the “target” database.

Go to the Impersonation Information tab and select “Use a specific Windows user name and password”.

Enter your username and password here, then click OK.
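If you want to check beforehand that the account you enter here can actually open the source database, a quick connection test helps. This is only a minimal sketch, assuming Python with the pyodbc package and a SQL Server ODBC driver installed; the server name, driver name, and credentials are placeholders, and the database name "target" is taken from the error message above. It uses SQL authentication for simplicity; for a Windows account you would log in as that account and use Trusted_Connection=yes instead.

```python
import pyodbc

# Placeholder connection details - replace with your own server, driver, and account.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;"
    "DATABASE=target;"   # the source database named in the error message
    "UID=your_user;"     # the account you plan to use for impersonation
    "PWD=your_password;"
)

try:
    conn = pyodbc.connect(conn_str, timeout=5)
    cursor = conn.cursor()
    cursor.execute("SELECT 1")  # trivial query just to prove the login works
    print("Login succeeded - this account can open the 'target' database.")
    conn.close()
except pyodbc.Error as exc:
    print("Login failed - fix the credentials or database permissions first:", exc)
```

Alternatively, a DBA could grant the original service account (NT AUTHORITY\NETWORK SERVICE) access to the source database, in which case the default impersonation setting would also work.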

Again select the database (or project) name in Solution Explorer, right-click, and click Process. Deployment once more completes successfully. Now click Run; this time it shows “Process succeeded”.

The cube is now ready; you can access it from either BIDS or SSMS by connecting to Analysis Services.

Here we can drag and drop columns based on our requirements.

 

Tuesday, November 22, 2016

10 Online Big Data Courses

1. Udacity
Udacity is a MOOC (Massive Open Online Course), so-called because it aims to be “audacious for you, the student”. Dumb name aside, Udacity has dozens of courses tailored to skill levels, from people who are entirely new to tech to proficient computer scientists. They start with Intros to Computer Science, Descriptive Statistics and Inferential Statistics, then progress to more in-depth, tech-specific tutorials focused around R, MongoDB, Machine Learning and more. The courses aren’t free (the “Intro to Computer Science” course is $150 a month), but you do get a 14-day free trial to decide if the course is right for you.

2. EMC
Big data “technology has evolved faster than the workforce skills to make sense of it and organizations across sectors must adapt to this new reality or perish”, reads the ominous course blurb for EMC’s Big Science and Big Data Analytics course. Luckily, EMC are here to help you adapt to the rapidly-evolving big data environment, walking you through basic and advanced data analysis methods, as well as the basic tools of the trade and the end-to-end analytics lifecycle. However, EMC isn’t cheap: the starter kit is $600, and the full course will set you back $5,000.

3. Coursera
Coursera is also a MOOC, and it’s completely free. All of the courses are taught in conjunction with a leading university, so you can learn Data Science with the University of Washington’s Bill Howe, Machine Learning with Stanford’s Andrew Ng, or Statistics with Alison Gibbs & Jeffrey Rosenthal of the University of Toronto. Course durations vary wildly between programmes (Data Science is 8 weeks, Statistics is 47), and stipulate recommended working hours per week (typically between 5 and 10). Coursera is a fantastic free resource for those looking to take the first steps into data science exploration.

4. CalTech’s Learning from Data
CalTech offer a free introductory course to machine learning online, with video recordings of lectures by Professor Yaser Abu-Mostafa. The course covers the basic theory and algorithms related to machine learning, as well as a variety of commercial, financial and medical applications. Complete with 8 homework sets and a final exam, this is a great taster of machine learning for the self-motivated.

5. MIT Open Courseware
While not a course in itself, OCW is an initiative by the Massachusetts Institute of Technology to publish all of their course materials online and make them accessible to all. Whilst this has the obvious disadvantage of no hands-on teaching, it is a great opportunity to explore the course materials (including exam papers as well as recommended reading) of one of the best STEM schools in the world. Check out Data Mining and Advanced Data Structures for a taste.

6. Jigsaw Academy
Jigsaw Academy is an online school based out of India, specialising in Analytics. They offer courses for beginner, intermediate and advanced levels, ranging from a broad overview of analytics for total newcomers, to in-depth investigations of analytics in finance and retail. They’re currently offering their Beginner’s course at a discounted price of Rs. 8,000 for students in India, and $149 for international students.

7. Stanford’s OpenClassroom
OpenClassroom‘s tagline is: “Full courses. Short Videos. Free for everyone.” Which pretty much sums up everything you need to know about the initiative. A particular highlight is the machine learning course, devised by Andrew Ng (whose course also appears on Coursera, of which he is the co-founder). The course takes you through everything from Linear Regression to Naive Bayes algorithms, and the course image is a dog in a wizard hat. If that’s not enough to convince you to take a look, I don’t know what is.

8. Big Data University
Big Data University offer courses across the big data and data science ecosystem, including database-specific training, real-time analytics, 11 different courses on Hadoop and even relational management systems for beginners. The courses are self-paced, and mostly free (although you do sometimes have to pay to access the specific technologies, such as IBM SmartCloud Enterprise).

9. Code School
Code School offer many courses on specific programming languages, such as R, Java, and a course on mastering Github. Not all of their courses are free, but some of them (including Try R, helpfully) are. Their courses are also broken up into “Levels”, and Code School is by far the best-looking website on this list, if that swings it for you.

10. Udemy
Udemy is a MOOC based out of Silicon Valley, with the simple mission of allowing anyone to learn anything. With 4 million students, 10 thousand instructors and 10 million course enrollments, they’re well on their way. If you search for “Big Data” in their courses, there are over 160 results, helping you to learn anything from mastering Hadoop to developing a big data strategy for your business. Since Udemy is home to so many different instructors with different levels of qualification and experience, price and quality vary considerably, but it’s still worth having a look at what’s available for pretty modest costs.

 

Saturday, November 12, 2016

The K-means Clustering Algorithm

 

K-means is one of the simplest clustering algorithms, but the ideas behind it are far from trivial. I first used and implemented it while studying Han's data mining textbook, which is fairly application-oriented. Only after reading Andrew Ng's lecture notes did I start to understand the EM idea behind K-means.

Clustering is unsupervised learning. The methods discussed earlier (regression, naive Bayes, SVM, and so on) all have class labels y, i.e., each training example comes with its class. In clustering, the samples have no given y, only features x; for example, imagine the stars in the universe represented as a set of points in three-dimensional space. The goal of clustering is to find the latent class y of each sample x and group together the samples x that share the same y. In the star example, the result of clustering is a set of star clusters: points inside a cluster are close to each other, while stars in different clusters are far apart.

In the clustering problem, we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ where each $x^{(i)} \in \mathbb{R}^n$, and there is no label $y$.

The K-means algorithm groups the samples into k clusters. The algorithm is as follows:

1. Randomly choose k cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.

2. Repeat the following until convergence: {

For each example $i$, assign it to the class with the closest centroid:

$$c^{(i)} := \arg\min_{j} \left\|x^{(i)} - \mu_j\right\|^2$$

For each class $j$, recompute its centroid:

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$

}

Here $k$ is the number of clusters we specify in advance; $c^{(i)}$ denotes the class, among the $k$ classes, whose centroid is closest to example $i$, so $c^{(i)}$ takes a value in $\{1, \ldots, k\}$. The centroid $\mu_j$ is our guess for the center of the samples belonging to the same class. Using the star-cluster analogy: to group all the stars into $k$ star clusters, first randomly pick $k$ points in the universe (or $k$ stars) as the centroids of the $k$ clusters. In the first step, for each star compute its distance to each of the $k$ centroids and take the nearest cluster as its $c^{(i)}$, so after this step every star belongs to some cluster; in the second step, recompute the centroid of each cluster (by averaging the coordinates of all the stars inside it). Iterate the two steps until the centroids no longer change, or change only very little.
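A minimal Python/NumPy sketch of these two steps may make the iteration concrete. The function name, the random initialization from the data, and the convergence tolerance are my own choices, not part of the original post.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-means: X is an (m, n) array of samples, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k samples as the initial centroids mu_1..mu_k.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2a: assign each sample to its nearest centroid (the c^(i) update).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: recompute each centroid as the mean of its samples (the mu_j update).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `labels, centroids = kmeans(X, k=2)` gives `labels[i]` in the role of $c^{(i)}$ and `centroids[j]` in the role of $\mu_j$.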

(Figure: the result of K-means clustering on n sample points, with k = 2.)

The first question K-means faces is how to guarantee convergence. The algorithm above states that the stopping condition is convergence, and it can be shown that K-means is indeed guaranteed to converge. Below is a qualitative argument. Define the distortion function as

$$J(c, \mu) = \sum_{i=1}^{m} \left\|x^{(i)} - \mu_{c^{(i)}}\right\|^2$$

J is the sum of squared distances from each sample to its centroid, and K-means tries to drive J to a minimum. Suppose J has not yet reached its minimum. We can first fix the centroids $\mu_j$ and adjust the class $c^{(i)}$ of each example to decrease J; likewise, fixing the $c^{(i)}$ and adjusting each class's centroid $\mu_j$ also decreases J. These two steps are exactly the inner loop, and they make J monotonically non-increasing. When J reaches a minimum, $c$ and $\mu$ have also converged. (In theory several different combinations of $c$ and $\mu$ could attain the same minimum of J, but this rarely happens in practice.)

Because the distortion function J is non-convex, there is no guarantee that the minimum reached is the global minimum; in other words, K-means is sensitive to the initial placement of the centroids. In most cases, though, the local optimum K-means reaches is already good enough. If you are worried about getting stuck in a bad local optimum, you can run K-means several times with different random initializations and keep the clustering $c$ whose J is smallest.
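For the multiple-restart idea, scikit-learn's KMeans already does this through its n_init parameter, keeping the run with the smallest inertia (which is exactly the distortion J). A small illustration, assuming scikit-learn is installed; the data here is synthetic and purely for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with two blobs, only for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# n_init=10 runs K-means with 10 different random initializations
# and keeps the solution with the smallest distortion J (inertia_).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", km.labels_[:10])
print("final distortion J:", km.inertia_)
```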

Let us now look at the relationship between K-means and EM. Going back to the original problem, our goal is to split the samples into k classes; in essence, we want to find the latent class y of each example x and use it to group the x. Since we do not know y in advance, we can start by assuming a y for each example. But how do we know whether the assumed y is any good? We measure it with the maximum likelihood of the samples, here the joint distribution P(x, y). If the y we find maximizes P(x, y), then it is the best class for example x, and x gets clustered along the way. However, the y we assign the first time does not necessarily maximize P(x, y), and P(x, y) also depends on other unknown parameters; of course, with y fixed, we can adjust those parameters to make P(x, y) as large as possible. After adjusting the parameters, we may find that a better y can be assigned, so we re-assign y, recompute the parameters that maximize P(x, y), and iterate until no better y can be found.

There are several difficulties in this process. First, how do we assume y? Do we hard-assign one y to each example, or give different y different probabilities, and how do we measure those probabilities? Second, how do we estimate P(x, y)? P(x, y) may depend on many other parameters; how do we adjust them to maximize P(x, y)? These questions will be answered in later posts.

Here we only point out the idea of EM: the E-step estimates the expected value of the latent class y, and the M-step adjusts the other parameters so that, given y, the likelihood P(x, y) is maximized. Then, with the other parameters fixed, y is re-estimated, and the cycle repeats until convergence.

The discussion above may sound abstract. In K-means terms: at the start we do not know each example's latent variable, i.e., its best class $c^{(i)}$. We can assign one arbitrarily; then, to maximize P(x, y) (here, to minimize J), we compute, for the given $c$, the centroids $\mu_j$ (the "other unknown parameters" mentioned above) that minimize J. At that point we may find that a better $c^{(i)}$ (the class whose centroid is closest to the example) can be assigned, so $c^{(i)}$ is updated, and the process repeats until no better assignment exists. From this we can see that K-means is really an instance of EM: the E-step determines the latent class variables $c^{(i)}$, and the M-step updates the other parameters $\mu$ to minimize J. The way the latent class variable is assigned here is special: it is a hard assignment, picking exactly one of the k classes for each example rather than giving each class a probability. The overall idea is still an iterative optimization: there is an objective function and there are parameters, plus a latent variable; we fix the other parameters to estimate the latent variable, then fix the latent variable to estimate the other parameters, until the objective is optimal.
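Written out, this hard-EM reading of K-means is just the two updates from the algorithm above, relabelled as E- and M-steps (this restatement is mine, not part of the original post):

$$\text{E-step (hard assignment):}\quad c^{(i)} := \arg\min_{j} \left\|x^{(i)} - \mu_j\right\|^2$$

$$\text{M-step:}\quad \mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$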