• [Data Analysis] Detecting Outliers in Multivariate Datasets with Matlab


    1 Introduction

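    This post approaches the detection and analysis of outliers in statistical data from the perspective of regression model diagnostics. Using Matlab and a concrete example, it shows how regression diagnostics can be applied to outlier detection in statistical data.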
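
    As a rough illustration of the regression-diagnostic idea, the sketch below fits a straight line by least squares and flags observations whose internally studentized residual is large. The data, the variable names, and the threshold of 2 are assumptions made for this example; they are not taken from the post's code.

    ```matlab
    % Hypothetical data: points on the line y=2x+1 with one perturbed observation
    x=(1:10)';
    y=2*x+1;
    y(5)=y(5)+10;                    % inject a single outlier

    A=[ones(size(x)) x];             % design matrix (intercept and slope)
    b=A\y;                           % least-squares coefficients
    e=y-A*b;                         % residuals
    h=sum((A/(A'*A)).*A,2);          % leverages = diagonal of the hat matrix
    s=sqrt(sum(e.^2)/(numel(y)-2));  % residual standard error (2 fitted parameters)
    r=e./(s*sqrt(1-h));              % internally studentized residuals
    flag=abs(r)>2;                   % assumed rule-of-thumb threshold
    ```

    With this data, flag is true only for the fifth observation, whose studentized residual is about 2.83.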

    2 Partial Code

    function outliers_demo(N,f)
    % Demo of 'DetectMultVarOutliers' for 2D normally distributed data with
    % outliers.
    %
    % OPTIONS:
    %   - N     : number of random samples. N=1E3 is the default.
    %   - f     : real number between 0 and 0.5 indicating the fraction of
    %             outliers. f=0.05 is the default setting.

    if nargin<1 || isempty(N), N=1E3; end
    if nargin<2 || isempty(f), f=0.05; end

    % Generate a random normally distributed dataset with outliers
    % -------------------------------------------------------------------------

    % Random covariance matrix: random rotation of a random diagonal matrix
    t=rand(1)*pi;
    U=[cos(t) -sin(t);sin(t) cos(t)];
    Co=U*diag(rand(1,2))*U';

    % Random samples (at least 10)
    N=round(max(N,10));
    X=mvnrnd([0 0],repmat(Co,[1 1 N]));

    % Replace round(f*N) samples with uniformly distributed outliers
    r=1.5*norm(max(X));      % radius of the disk the outliers are drawn from
    f=min(0.5,f);
    M=round(f*N);
    chi_crt=chi2inv(0.99,2); % 99% critical value of chi^2 with 2 degrees of freedom
    cnt=0; k=0;
    X_out=cell(1);
    while cnt<M

        % random sample: N points uniformly distributed in a disk of radius r
        x=randn(N,2);
        x=bsxfun(@rdivide,x,sqrt(sum(x.^2,2)));
        x=bsxfun(@times,r*sqrt(rand(N,1)),x);

        % check for outlyingness
        MD=sum(x'.*(Co\x')); % Mahalanobis distance (squared)
        chk=MD>chi_crt;
        if sum(chk)==0, continue; end

        k=k+1;
        cnt=cnt+sum(chk);
        X_out{k,1}=x(chk,:); % column cell, so that cell2mat stacks the rows
    end
    X_out=cell2mat(X_out);
    X_out=X_out(1:M,:);
    X(1:M,:)=X_out;

    % Note that the outliers were placed at the beginning of the data matrix so
    % that they are easy to keep track of. This prior knowledge is not supplied
    % to the 'DetectMultVarOutliers' function, so it does not know which of the
    % samples are outliers.

    % Now attempt to find the outliers using the 'DetectMultVarOutliers' function
    % -------------------------------------------------------------------------
    [mu,S,RD,chi_crt]=DetectMultVarOutliers(X);

    % Note that the function automatically plots robust Mahalanobis distance (RD)
    % vs sample index. In this plot, "basis subset" denotes the samples in X
    % that were used to obtain robust estimates of covariance (S) and mean (mu).
    % In this simulation, outliers are the samples with RD greater than critical
    % values of the chi^2 distribution. However, there may be situations where
    % this is not the case. For this reason, tests for outlyingness based on RD
    % should always be confirmed visually. See the documentation of
    % 'DetectMultVarOutliers' for additional information.

    % To improve detection accuracy you can supply an estimate of the number of
    % outliers in X (n_out) as [...]=DetectMultVarOutliers(X,n_out)

    % Plot simulated data and estimated parameters
    % -------------------------------------------------------------------------
    hf1=gcf; % figure created by 'DetectMultVarOutliers'
    hf2=figure('color','w');
    pause(0.1)

    % Maximize the new figure window. This relies on the obsolete 'JavaFrame'
    % property, so the corresponding warning is muted temporarily.
    warning('off','MATLAB:HandleGraphics:ObsoletedProperty:JavaFrame');
    j_hf=get(hf2,'JavaFrame');
    j_hf.setMaximized(true);
    warning('on','MATLAB:HandleGraphics:ObsoletedProperty:JavaFrame');
    drawnow
    pause(0.1)

    axis equal
    hold on

    % Axis limits with a 4% margin around the data
    x_min=min(X);
    x_max=max(X);
    d=x_max-x_min;
    XLim=[x_min(1)-0.04*d(1) x_max(1)+0.04*d(1)];
    YLim=[x_min(2)-0.04*d(2) x_max(2)+0.04*d(2)];
    set(gca,'box','on','XLim',XLim,'YLim',YLim)

    % N-M samples designed as inliers
    h1=scatter(X((M+1):end,1),X((M+1):end,2),'b','MarkerFaceColor','b','SizeData',30);

    % M simulated outliers
    h2=scatter(X_out(:,1),X_out(:,2),'r','MarkerFaceColor','r','SizeData',30);

    % Samples with RD greater than critical value of chi^2 for alpha=0.01/2
    id_out=RD>chi_crt(4);
    h3=scatter(X(id_out,1),X(id_out,2),'w','MarkerFaceColor','none','MarkerEdgeColor','k','SizeData',100);

    h=legend([h1 h2 h3],{'normally distributed samples' 'simulated outliers' 'potential outliers'});
    set(h,'FontSize',20,'Location','EastOutside')

    pause(0.5)
    figure(hf1) % bring the RD plot back to the front
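
    The outlyingness check inside the loop compares squared Mahalanobis distances with a chi-square critical value. A minimal standalone sketch of that test follows, assuming zero-mean 2D data and a known covariance matrix. The names Sigma, pts, crit, and is_out are hypothetical, and the closed form -2*log(1-p) stands in for chi2inv(p,2); it is valid only for 2 degrees of freedom.

    ```matlab
    % Sketch: flag 2D points whose squared Mahalanobis distance exceeds the
    % 99% chi-square critical value (2 degrees of freedom).
    Sigma=[2 0.5;0.5 1];            % hypothetical covariance matrix
    pts=[0 0;1 1;6 -4];             % hypothetical points, one per row
    p=0.99;
    crit=-2*log(1-p);               % equals chi2inv(p,2) for 2 degrees of freedom
    md2=sum(pts'.*(Sigma\pts'),1);  % squared Mahalanobis distances (1-by-n)
    is_out=md2>crit;                % logical flags, one per point
    ```

    Here only the third point is flagged: its squared distance is about 52.6, against a critical value of about 9.21.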

    3 Results

  • Original article: https://blog.csdn.net/matlab_dingdang/article/details/126850328