statsbomb 有一些有趣的公开数据,比如梅西的职业生涯,阿森纳的不败赛季,最近在上足球数据可视化的课程,刚好拿来玩一玩。
通过 competitions 找到英超联赛 0304赛季的 id,其中只有阿森纳的比赛,这次随便获取一场比赛来试手,2004 年 4 月 25 日北伦敦德比,阿森纳在白鹿巷球场 2:2 战平热刺,高举起英格兰超级联赛冠军奖杯,就它了。
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mplsoccer import Pitch, Sbopen
match_id = 3749068
parser = Sbopen()
events, related, freeze, tactics = parser.event(match_id)
# 检查一下数据集
events.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3156 entries, 0 to 3155
Data columns (total 77 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3156 non-null object
1 index 3156 non-null int64
2 period 3156 non-null int64
3 timestamp 3156 non-null object
4 minute 3156 non-null int64
5 second 3156 non-null int64
6 possession 3156 non-null int64
7 duration 2355 non-null float64
8 match_id 3156 non-null int64
9 type_id 3156 non-null int64
10 type_name 3156 non-null object
11 possession_team_id 3156 non-null int64
12 possession_team_name 3156 non-null object
13 play_pattern_id 3156 non-null int64
14 play_pattern_name 3156 non-null object
15 team_id 3156 non-null int64
16 team_name 3156 non-null object
17 tactics_formation 3 non-null float64
18 player_id 3145 non-null float64
19 player_name 3145 non-null object
20 position_id 3145 non-null float64
21 position_name 3145 non-null object
22 pass_recipient_id 801 non-null float64
23 pass_recipient_name 801 non-null object
24 pass_length 879 non-null float64
25 pass_angle 879 non-null float64
26 pass_height_id 879 non-null float64
27 pass_height_name 879 non-null object
28 end_x 1571 non-null float64
29 end_y 1571 non-null float64
30 body_part_id 899 non-null float64
31 body_part_name 899 non-null object
32 sub_type_id 322 non-null float64
33 sub_type_name 322 non-null object
34 x 3139 non-null float64
35 y 3139 non-null float64
36 outcome_id 434 non-null float64
37 outcome_name 434 non-null object
38 under_pressure 670 non-null float64
39 counterpress 82 non-null float64
40 off_camera 65 non-null float64
41 aerial_won 38 non-null object
42 out 34 non-null float64
43 ball_recovery_recovery_failure 11 non-null object
44 pass_switch 38 non-null object
45 foul_committed_advantage 5 non-null object
46 foul_won_advantage 5 non-null object
47 technique_id 50 non-null float64
48 technique_name 50 non-null object
49 pass_assisted_shot_id 16 non-null object
50 pass_goal_assist 3 non-null object
51 shot_open_goal 1 non-null object
52 shot_statsbomb_xg 23 non-null float64
53 end_z 19 non-null float64
54 shot_key_pass_id 16 non-null object
55 shot_first_time 8 non-null object
56 goalkeeper_position_id 23 non-null float64
57 goalkeeper_position_name 23 non-null object
58 pass_cross 15 non-null object
59 dribble_overrun 3 non-null object
60 ball_recovery_offensive 2 non-null object
61 pass_shot_assist 13 non-null object
62 foul_won_defensive 9 non-null object
63 pass_deflected 2 non-null object
64 half_start_late_video_start 2 non-null object
65 substitution_replacement_id 5 non-null float64
66 substitution_replacement_name 5 non-null object
67 foul_committed_card_id 2 non-null float64
68 foul_committed_card_name 2 non-null object
69 dribble_nutmeg 1 non-null object
70 shot_one_on_one 1 non-null object
71 pass_cut_back 1 non-null object
72 block_offensive 1 non-null object
73 foul_committed_penalty 1 non-null object
74 foul_won_penalty 1 non-null object
75 bad_behaviour_card_id 1 non-null float64
76 bad_behaviour_card_name 1 non-null object
dtypes: float64(26), int64(10), object(41)
memory usage: 1.9+ MB
# 查看有哪些事件
events['type_name'].unique()
array(['Starting XI', 'Half Start', 'Pass', 'Ball Receipt', 'Carry',
'Block', 'Ball Recovery', 'Pressure', 'Duel', 'Clearance',
'Foul Committed', 'Foul Won', 'Dribbled Past', 'Dribble', 'Shot',
'Goal Keeper', 'Dispossessed', 'Interception', 'Miscontrol',
'50/50', 'Offside', 'Half End', 'Substitution', 'Error',
'Tactical Shift', 'Bad Behaviour'], dtype=object)
由于换人了以后阵容战术就会变动,所以绘制首发传球图需要在第一次换人之前,找到第一次换人的事件
然后过滤掉:
first_sub = events.loc[
events['type_name'] == 'Substitution'].loc[
events['team_name'] == "Arsenal"].iloc[0]
first_sub
id 7900f48c-1308-4aa5-9ec8-37af5c06a6d7
index 2335
period 2
timestamp 00:21:42.293000
minute 66
...
block_offensive NaN
foul_committed_penalty NaN
foul_won_penalty NaN
bad_behaviour_card_id NaN
bad_behaviour_card_name NaN
Name: 2334, Length: 77, dtype: object
_filter = (
events.type_name == 'Pass') & (
events.team_name == "Arsenal") & (
events.index < first_sub['index']) & (
events.outcome_name.isnull())
_filter.head()
0 False
1 False
2 False
3 False
4 True
dtype: bool
# 获取必要的数据
passes = events.loc[_filter, [
'x', 'y', 'end_x', 'end_y',
'player_id', 'player_name', 'pass_recipient_name', 'pass_recipient_id'
]]
passes.head()
x | y | end_x | end_y | player_id | player_name | pass_recipient_name | pass_recipient_id | |
---|---|---|---|---|---|---|---|---|
4 | 60.0 | 40.0 | 62.3 | 43.5 | 15516.0 | Thierry Henry | Dennis Bergkamp | 15042.0 |
6 | 62.0 | 43.5 | 47.2 | 30.7 | 15042.0 | Dennis Bergkamp | Patrick Vieira | 15515.0 |
9 | 47.2 | 29.5 | 38.2 | 12.1 | 15515.0 | Patrick Vieira | Ashley Cole | 12529.0 |
12 | 38.2 | 12.1 | 31.7 | 26.3 | 12529.0 | Ashley Cole | Sulzeer Jeremiah ''Sol' Campbell | 15637.0 |
15 | 28.8 | 29.5 | 48.4 | 38.8 | 15637.0 | Sulzeer Jeremiah ''Sol' Campbell | Gilberto Aparecido da Silva | 40221.0 |
对于每个球员,计算传球和接球的平均位置,然后计算每个球员到接球球员的传球次数,传球线路的粗细与之成正比
scatter = pd.DataFrame()
for i, _id in enumerate(passes['player_id'].unique()):
pass_x = passes.loc[passes['player_id'] == _id]['x'].to_numpy()
pass_y = passes.loc[passes['player_id'] == _id]['y'].to_numpy()
rec_x = passes.loc[passes['pass_recipient_id'] == _id]['end_x'].to_numpy()
rec_y = passes.loc[passes['pass_recipient_id'] == _id]['end_y'].to_numpy()
scatter.at[i, 'player_id'] = _id
# 计算每个点的 x 和 y,位置为传球和接球的平均值
scatter.at[i, 'x'] = np.mean(np.concatenate([pass_x, rec_x]))
scatter.at[i, 'y'] = np.mean(np.concatenate([pass_y, rec_y]))
# 计算传球数
scatter.at[i, 'count'] = passes.loc[
passes['player_id'] == _id].count().iloc[0]
# 位置大小
scatter['marker_size'] = scatter['count'] / scatter['count'].max() * 1500
scatter
player_id | x | y | count | marker_size | |
---|---|---|---|---|---|
0 | 15516.0 | 73.757692 | 33.834615 | 10.0 | 306.122449 |
1 | 15042.0 | 66.135897 | 37.082051 | 15.0 | 459.183673 |
2 | 15515.0 | 54.474194 | 33.134409 | 49.0 | 1500.000000 |
3 | 12529.0 | 52.920313 | 7.568750 | 35.0 | 1071.428571 |
4 | 15637.0 | 36.142857 | 19.846429 | 13.0 | 397.959184 |
5 | 40221.0 | 52.817460 | 43.530159 | 32.0 | 979.591837 |
6 | 38412.0 | 35.815909 | 48.084091 | 23.0 | 704.081633 |
7 | 40222.0 | 49.040351 | 70.389474 | 31.0 | 948.979592 |
8 | 19312.0 | 61.132308 | 18.466154 | 32.0 | 979.591837 |
9 | 20015.0 | 8.210000 | 41.750000 | 8.0 | 244.897959 |
10 | 24972.0 | 68.381081 | 67.370270 | 15.0 | 459.183673 |
由于事件数据集只有简单的球员名字和 id,绘制图片时使用名字全称过长,比如 Laureano Bisan-Etame Mayer 这位大哥估计不少人反应不过来是谁,Lauren 就明白了。
需要从另一个数据集中获取
lineup = parser.lineup(match_id)
arsenal = lineup.loc[lineup['team_name'] == 'Arsenal']
scatter = pd.merge(
scatter,
arsenal[['player_id', 'player_nickname', 'jersey_number']],
on='player_id'
)
计算线路宽度,需要根据传球和接球的组合对传球数据框进行分组,并计算他们之间的传球次数。最后一步设置了忽略传球次数少于 2 次的球员的阈值。可以尝试不同的值,根据可视化背后的信息调整它
# 计算球员之间的传球次数
passes['pair_key'] = passes.apply(
lambda x: '-'.join(sorted([str(x['player_id']),
str(x['pass_recipient_id'])])),
axis=1,
)
lines = passes.groupby(['pair_key']).x.count().reset_index()
lines.rename({'x': 'pass_count'}, axis='columns', inplace=True)
# 设定一个阈值,可以尝试研究它在更改时如何变化
lines = lines[lines['pass_count'] > 2]
lines.head()
pair_key | pass_count | |
---|---|---|
1 | 12529.0-15515.0 | 20 |
2 | 12529.0-15516.0 | 4 |
3 | 12529.0-15637.0 | 8 |
4 | 12529.0-19312.0 | 21 |
7 | 12529.0-40221.0 | 6 |
# 绘制球场
pitch = Pitch(line_color='grey')
fig, ax = pitch.grid(
grid_height=0.9, title_height=0.06, axis=False,
endnote_height=0.04, title_space=0, endnote_space=0,
)
# 球场上的位置
pitch.scatter(
scatter.x, scatter.y, s=scatter.marker_size,
color='red', edgecolors='grey', linewidth=1, alpha=1,
ax=ax['pitch'], zorder=3,
)
# 填充球员名字
for i, row in scatter.iterrows():
pitch.annotate(
row.player_nickname, xy=(row.x, row.y), c='black',
va='center', ha='center', weight="bold",
size=14, ax=ax["pitch"], zorder=4,
)
for i, row in lines.iterrows():
player1 = float(row['pair_key'].split('-')[0])
player2 = float(row['pair_key'].split('-')[1])
# 取球员的平均位置在他们之间画一条线
player1_x = scatter.loc[scatter['player_id'] == player1]['x'].iloc[0]
player1_y = scatter.loc[scatter['player_id'] == player1]['y'].iloc[0]
player2_x = scatter.loc[scatter['player_id'] == player2]['x'].iloc[0]
player2_y = scatter.loc[scatter['player_id'] == player2]['y'].iloc[0]
passes_count = row['pass_count']
# 调整线宽,传球的次数越多,线越宽
line_width = passes_count / lines['pass_count'].max() * 10
pitch.lines(
player1_x, player1_y, player2_x, player2_y,
alpha=1, lw=line_width, zorder=2, color='red', ax=ax['pitch']
)
fig.suptitle('2004-04-25, Tottenham Hotspur vs Arsenal', fontsize=30)
plt.show()