复盘导读:在分析北京二手房数据时,我想探究一个常识问题:“带电梯到底能让房子每平米贵多少钱?”。然而,初次 SQL 计算的结果令人大跌眼镜:没电梯的房子竟然比有电梯的更贵!是数据错了,还是常识错了?本文复盘了我是如何通过排查“时空错位”与“隐藏维度”,并结合经纬度空间计算,最终剥离出真实的资产溢价密码。

遭遇悖论:没电梯的房子凭什么更贵?

起初,为了计算电梯带来的溢价,我直接用房源的 price (成交单价) 减去 communityAverage (小区均价),然后按是否有电梯进行 GROUP BY

初版 SQL 结果出来后,我惊呆了:
计算出的溢价竟然是负值,甚至无电梯的均价高于有电梯的均价!

1
2
3
4
5
6
7
with new_biao as (
select id,price,communityAverage,(price - communityAverage) as premium_per_sqm,elevator,ladderRatio
from real_estate_db.beijing_house_sales
where communityAverage >0 and communityAverage is not null and price > 0)
select elevator as `是否有电梯`,count(*) as `成交样本量`, avg(premium_per_sqm) as avg_price
from new_biao
group by elevator

M

经过深入的业务反思,我发现了两个致命的“降维打击”陷阱:

  1. 时空错位(Time-Space Misalignment)
    数据集中的 communityAverage 通常是爬虫抓取那一刻(如2018年)的静态均价,但 price 却是房源在 2012-2017 年不同时间点的历史成交价!用几年前的成交价去减去 2018 年的小区均价,必然出现大面积倒挂。
  2. 隐藏维度的掩盖
    北京大量无电梯的“老破小”往往占据着绝佳的核心地段或顶级学区,而远郊虽然全是带电梯的高楼大厦,但底价极低。简单的二元对比,掩盖了地段和密度的本质差异。

破局:用窗口函数重塑时间标尺

为了解决“时空错位”,必须放弃静态的 communityAverage,改为利用历史订单数据,动态计算每个小区、每年的真实均价。
这里,SQL 的高级窗口函数派上了巨大用场。

1
2
3
4
5
6
7
8
9
10
with new_biao as (
select id,Cid,price,tradeTime,buildingStructure,elevator,ladderRatio,year(tradeTime) as tradeYear,
avg(price) over (partition by Cid,year(tradeTime)) as avg_price_that_yaer
from real_estate_db.beijing_house_sales
where price > 0 )
select tradeYear as `交易年份`,elevator as `是否有电梯(1/0无)`,`buildingStructure` as `建筑结构`,multiIf(ladderRatio<=0.33,'优质低密',ladderRatio<=0.5,'常规居住','高密塔楼') as ladder_type,count(*) as `成交样本量`, round(avg(price-avg_price_that_yaer),2) as `平均每平米溢价`
from new_biao
group by tradeYear,elevator , `buildingStructure` ,ladder_type
having count(*) >= 50
order by tradeYear,`buildingStructure`,elevator

N