Python大数据(2019-07-13至2019-07-22实习培训)

默认分类 · 2019-07-22 · 296 人浏览
Python大数据(2019-07-13至2019-07-22实习培训)

附python3操作MySQL(增删改查、事务)

导入支持库

In [1020]:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
from pandas import Series, DataFrame


#在GUI中显示
# %matplotlib tk 
#在行内显示(默认)
%matplotlib inline

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

载入数据 合并数据 data_obj

In [226]:

users_df = pd.read_csv('./data/users.dat', sep='::', engine='python', names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'])
movies_df = pd.read_csv('./data/movies.dat', sep='::', engine='python', names=['MovieID', 'Title', 'Genres'])
ratings_df = pd.read_csv('./data/ratings.dat', sep='::', engine='python', names=['UserID', 'MovieID', 'Rating', 'Timestamp'])
# print(users_df)
# print(movies_df)
# print(ratings_df)

combine_movies_ratings_df = pd.merge(movies_df, ratings_df, on='MovieID')
combine_movies_ratings_users_df = pd.merge(combine_movies_ratings_df, users_df, on='UserID')
data_obj = combine_movies_ratings_users_df
data_obj

Out[226]:

MovieIDTitleGenresUserIDRatingTimestampGenderAgeOccupationZip-code
01Toy Story (1995)Animation\Children's\Comedy15978824268F11048067
148Pocahontas (1995)Animation\Children's\Musical\Romance15978824351F11048067
2150Apollo 13 (1995)Drama15978301777F11048067
3260Star Wars: Episode IV - A New Hope (1977)Action\Adventure\Fantasy\Sci-Fi14978300760F11048067
4527Schindler's List (1993)Drama\War15978824195F11048067
5531Secret Garden, The (1993)Children's\Drama14978302149F11048067
6588Aladdin (1992)Animation\Children's\Comedy\Musical14978824268F11048067
7594Snow White and the Seven Dwarfs (1937)Animation\Children's\Musical14978302268F11048067
8595Beauty and the Beast (1991)Animation\Children's\Musical15978824268F11048067
9608Fargo (1996)Crime\Drama\Thriller14978301398F11048067
10661James and the Giant Peach (1996)Animation\Children's\Musical13978302109F11048067
11720Wallace & Gromit: The Best of Aardman Animatio...Animation13978300760F11048067
12745Close Shave, A (1995)Animation\Comedy\Thriller13978824268F11048067
13783Hunchback of Notre Dame, The (1996)Animation\Children's\Musical14978824291F11048067
14914My Fair Lady (1964)Musical\Romance13978301968F11048067
15919Wizard of Oz, The (1939)Adventure\Children's\Drama\Musical14978301368F11048067
16938Gigi (1958)Musical14978301752F11048067
171022Cinderella (1950)Animation\Children's\Musical15978300055F11048067
181028Mary Poppins (1964)Children's\Comedy\Musical15978301777F11048067
191029Dumbo (1941)Animation\Children's\Musical15978302205F11048067
201035Sound of Music, The (1965)Musical15978301753F11048067
211097E.T. the Extra-Terrestrial (1982)Children's\Drama\Fantasy\Sci-Fi14978301953F11048067
221193One Flew Over the Cuckoo's Nest (1975)Drama15978300760F11048067
231197Princess Bride, The (1987)Action\Adventure\Comedy\Romance13978302268F11048067
241207To Kill a Mockingbird (1962)Drama14978300719F11048067
251246Dead Poets Society (1989)Drama14978302091F11048067
261270Back to the Future (1985)Comedy\Sci-Fi15978300055F11048067
271287Ben-Hur (1959)Action\Adventure\Drama15978302039F11048067
281545Ponette (1996)Drama14978824139F11048067
291566Hercules (1997)Adventure\Animation\Children's\Comedy\Musical14978824330F11048067
.................................
10001792763Thomas Crown Affair, The (1999)Action\Thriller57275958492554M25492843
10001802770Bowfinger (1999)Comedy57274958490923M25492843
10001812805Mickey Blue Eyes (1999)Comedy\Romance57274958492371M25492843
10001822827Astronaut's Wife, The (1999)Sci-Fi\Thriller57272958490886M25492843
10001832840Stigmata (1999)Thriller57273958492512M25492843
10001842841Stir of Echoes (1999)Thriller57272958492512M25492843
10001852858American Beauty (1999)Comedy\Drama57275958490857M25492843
10001862881Double Jeopardy (1999)Action\Thriller57273958490981M25492843
10001872907Superstar (1999)Comedy57271958492554M25492843
10001883005Bone Collector, The (1999)Thriller57272958490923M25492843
10001893053Messenger: The Story of Joan of Arc, The (1999)Drama\War57273958492345M25492843
10001903081Sleepy Hollow (1999)Horror\Romance57274958492512M25492843
10001913113End of Days (1999)Action\Thriller57273958490981M25492843
10001923114Toy Story 2 (1999)Animation\Children's\Comedy57275958492554M25492843
10001933176Talented Mr. Ripley, The (1999)Drama\Mystery\Thriller57273958492554M25492843
10001943177Next Friday (1999)Comedy57271958489748M25492843
10001953273Scream 3 (2000)Horror\Mystery\Thriller57274958492479M25492843
10001963301Whole Nine Yards, The (2000)Comedy\Crime57275958490142M25492843
10001973316Reindeer Games (2000)Action\Thriller57273958490699M25492843
10001983317Wonder Boys (2000)Comedy\Drama57274958489879M25492843
10001993408Erin Brockovich (2000)Drama57275958489879M25492843
10002003409Final Destination (2000)Drama\Thriller57274958490143M25492843
10002013481High Fidelity (2000)Comedy57274958489879M25492843
10002023483Road to El Dorado, The (2000)Animation\Children's57273958490143M25492843
10002033484Skulls, The (2000)Thriller57271958489902M25492843
10002043513Rules of Engagement (2000)Drama\Thriller57274958489970M25492843
10002053535American Psycho (2000)Comedy\Horror\Thriller57272958489970M25492843
10002063536Keeping the Faith (2000)Comedy\Romance57275958489902M25492843
10002073555U-571 (2000)Action\Thriller57273958490699M25492843
10002083578Gladiator (2000)Action\Drama57275958490171M25492843

1000209 rows × 10 columns

1.查看每部电影中两类性别的平均分

In [79]:

# a_movie = data_obj.groupby(['MovieID', 'Gender']).mean()
# out = DataFrame(a_movie['Rating'])
# out
out = DataFrame.pivot_table(data_obj, values='Rating', index='MovieID', columns='Gender', aggfunc='mean')
out

Out[79]:

GenderFM
MovieID
14.4444444.285714
23.6666673.500000
32.5000002.000000
4NaN3.000000
5NaN4.000000
63.0000004.000000
72.7500003.500000
9NaN5.000000
105.0000003.714286
114.0000003.000000
133.0000004.000000
145.0000004.000000
15NaN2.000000
164.0000003.777778
174.2857143.900000
18NaN4.000000
19NaN2.000000
214.0000003.166667
22NaN4.000000
233.000000NaN
243.0000003.333333
253.8333334.500000
264.000000NaN
283.0000003.000000
293.0000004.500000
30NaN4.500000
324.0000003.916667
343.6666673.866667
352.0000005.000000
363.7500003.714286
.........
9534.7500004.800000
9544.7500004.000000
9554.0000005.000000
9563.000000NaN
957NaN4.000000
9594.500000NaN
9604.000000NaN
9614.0000004.000000
9635.0000005.000000
9653.5000004.000000
968NaN3.714286
9694.2000004.181818
9714.5000004.333333
9735.000000NaN
9743.000000NaN
9763.0000004.000000
9783.000000NaN
982NaN4.000000
9861.0000005.000000
991NaN3.000000
9945.0000004.000000
996NaN3.000000
999NaN4.000000
1003NaN2.000000
10051.0000002.000000
10073.0000002.000000
10095.0000002.750000
1010NaN3.000000
1011NaN2.000000
10123.0000003.000000

554 rows × 2 columns

2.查看每部电影中两类性别的均分差异最大的前3的电影

In [113]:

noNaN = out.dropna()
noNaN = noNaN.copy()
noNaN['|F-M|'] = abs(noNaN['F'] - noNaN['M'])
noNaN.sort_values("|F-M|",ascending=False)
noNaN.head(3)

Out[113]:

GenderFM\F-M\
MovieID
14.4444444.2857140.158730
23.6666673.5000000.166667
32.5000002.0000000.500000

3.查找出热门电影热度榜单 (计算表中电影出现的次数,取前3)

In [275]:

ratings_df.groupby('MovieID').count().sort_values('UserID', ascending=False).head(3)

Out[275]:

UserIDRatingTimestamp
MovieID
2858342834283428
260299129912991
1196299029902990

4.每部电影的评分(平均分) 获取排名前10的高分榜单

In [141]:

DataFrame(ratings_df.groupby('MovieID').mean().sort_values('Rating', ascending=False)['Rating'].head(10))

Out[141]:

Rating
MovieID
19785.0
16115.0
19635.0
19505.0
19395.0
19375.0
18275.0
18215.0
17695.0
15635.0

5.前十大热门电影__的评分

In [274]:

hot = ratings_df.groupby('MovieID').count().sort_values('UserID', ascending=False)['Rating']
DataFrame(hot.head(10))

Out[274]:

Rating
MovieID
28583428
2602991
11962990
12102883
4802672
20282653
5892649
25712590
12702583
5932578

6.前20名高分电影的__热度 (使用第3.热度表)

In [276]:

height20 = ratings_df.groupby('MovieID').mean().sort_values('Rating', ascending=False)['Rating']
DataFrame(height20.head(20))

Out[276]:

Rating
MovieID
9895.000000
38815.000000
18305.000000
33825.000000
7875.000000
32805.000000
36075.000000
32335.000000
31725.000000
36565.000000
32454.800000
534.750000
25034.666667
29054.608696
20194.560510
3184.554558
8584.524966
7454.520548
504.517106
5274.510417

7.查询出前十好电影

好电影:

1.看的人多(点击在1000以上)

2.分数还高(按Rating降序)

In [277]:

DataFrame(height20[hot[hot>1000].head(20).index].sort_values(ascending=False))

Out[277]:

Rating
MovieID
5274.510417
11984.477725
2604.453694
27624.406263
5934.351823
20284.337354
28584.317386
25714.315830
11974.303710
11964.292977
6084.254676
1104.234957
16174.219406
23964.127480
5894.058513
12104.022893
12703.990321
12653.953029
4803.763847
15803.739953

使用的数据集是美国人口普查的数据

In [286]:

census_df = pd.read_csv('./data/census.csv', engine='python')
census_df

Out[286]:

SUMLEVREGIONDIVISIONSTATECOUNTYSTNAMECTYNAMECENSUS2010POPESTIMATESBASE2010POPESTIMATE2010...RDOMESTICMIG2011RDOMESTICMIG2012RDOMESTICMIG2013RDOMESTICMIG2014RDOMESTICMIG2015RNETMIG2011RNETMIG2012RNETMIG2013RNETMIG2014RNETMIG2015
0403610AlabamaAlabama477973647801274785161...0.002295-0.1931960.3810660.582002-0.4673691.0300150.8266441.3832821.7247180.712594
1503611AlabamaAutauga County545715457154660...7.242091-2.915927-3.0123492.265971-2.5307997.606016-2.626146-2.7220022.592270-2.187333
2503613AlabamaBaldwin County182265182265183193...14.83296017.64729321.84570519.24328717.19787215.84417618.55962722.72762620.31714218.293499
3503615AlabamaBarbour County274572745727341...-4.728132-2.500690-7.056824-3.904217-10.543299-4.874741-2.758113-7.167664-3.978583-10.543299
4503617AlabamaBibb County229152291922861...-5.527043-5.068871-6.201001-0.1775370.177258-5.088389-4.363636-5.4037290.7545331.107861
5503619AlabamaBlount County573225732257373...1.807375-1.177622-1.748766-2.062535-1.3699701.859511-0.848580-1.402476-1.577232-0.884411
65036111AlabamaBullock County109141091510887...-30.953709-5.180127-1.13026314.354290-16.167247-29.001673-2.8255241.50701717.243790-13.193961
75036113AlabamaButler County209472094620944...-14.032727-11.684234-5.6554131.085428-6.529805-13.936612-11.586865-5.5570581.184103-6.430868
85036115AlabamaCalhoun County118572118586118437...-6.155670-4.611706-5.524649-4.463211-3.376322-5.791579-4.092677-5.062836-3.912834-2.806406
95036117AlabamaChambers County342153417034098...-2.7316393.8490922.872721-2.2872221.349468-1.8210924.7011813.781439-1.2902282.346901
105036119AlabamaCherokee County259892598625976...6.3393271.1131805.488706-0.076806-3.2398666.4161671.4202645.7573840.230419-2.931307
115036121AlabamaChilton County436434363143665...-1.372935-2.6533690.4800440.456017-2.253483-0.823761-2.4475040.8686510.957636-1.752709
125036123AlabamaChoctaw County138591385813841...-15.455274-0.737028-8.766391-1.274984-5.291205-15.528177-0.737028-8.766391-1.274984-5.291205
135036125AlabamaClarke County258332584025767...-6.194363-17.667705-0.318345-8.686428-5.613667-6.077488-17.509958-0.159172-8.486280-5.411736
145036127AlabamaClay County139321393213880...-10.744102-13.3451304.9028715.7026483.912450-10.816697-13.3451304.9771575.7767083.986270
155036129AlabamaCleburne County149721497214973...-3.673524-5.1518807.3458213.654485-3.123961-3.673524-5.1518807.3458213.654485-3.123961
165036131AlabamaCoffee County499484994850177...0.3776407.675579-13.146535-3.6028592.2147742.16646011.513368-10.438741-0.7678225.350738
175036133AlabamaColbert County544285442854514...-0.0734231.0650511.7623901.835688-0.1102600.5139641.4690352.2764202.5332490.588052
185036135AlabamaConecuh County132281322813208...-4.861559-7.504690-6.107224-14.6454162.684140-4.861559-7.504690-6.107224-14.6454162.684140
195036137AlabamaCoosa County115391175811758...-33.930581-10.291443-4.313831-22.958017-5.387581-34.017138-10.380162-4.403703-23.049483-5.387581
205036139AlabamaCovington County377653776537796...6.696899-4.6126680.7402713.697932-0.3169456.881460-4.5599520.7931473.750759-0.264121
215036141AlabamaCrenshaw County139061390613853...1.7297923.950156-1.8649363.0846483.4395042.6667635.099293-0.5020984.7345775.087600
225036143AlabamaCullman County804068041080473...-1.404233-1.0196284.0712475.0871427.915406-1.031427-0.6341594.5429165.5933878.417777
235036145AlabamaDale County502515025150358...-10.749798-5.277150-15.236079-11.979785-5.107706-9.575283-0.776637-12.640155-9.503292-1.998668
245036147AlabamaDallas County438204382043803...-15.635599-11.308243-16.745678-9.344789-14.687232-15.727573-11.378047-16.792849-9.368689-14.711389
255036149AlabamaDeKalb County711097111571142...0.294677-9.302391-1.7488070.2678300.0281411.375159-8.656001-1.0295391.1981870.956790
265036151AlabamaElmore County793037929679465...3.2355760.8227171.760531-1.5070572.0678203.6745111.5581762.306047-0.9511752.757093
275036153AlabamaEscambia County383193831938309...-3.449988-3.855889-4.822706-1.1898311.190902-3.397716-3.803428-4.769999-1.1369501.243830
285036155AlabamaEtowah County104430104427104442...-1.0159192.062637-1.931884-1.726932-2.082234-0.6325542.446383-1.518596-1.234901-1.588308
295036157AlabamaFayette County172411724117231...-5.015601-0.646640-3.7259370.296745-2.797536-5.132243-0.705426-3.7850790.237396-2.857058
..................................................................
3163502355131WisconsinWashington County131887131885131967...-0.7948760.785279-2.2154651.601149-0.434498-0.4315041.162817-1.7633302.1047960.059931
3164502355133WisconsinWaukesha County389891389938390076...-0.7657992.1288600.0381320.760109-0.7198580.1024483.1805271.1897272.0776330.593567
3165502355135WisconsinWaupaca County524105241052422...3.111756-2.2418736.292687-0.441031-0.4806173.359933-2.0119376.561277-0.134227-0.173022
3166502355137WisconsinWaushara County244962449624506...4.930022-2.404973-4.097017-4.906711-4.3977935.174486-2.160399-3.810226-4.535615-4.024395
3167502355139WisconsinWinnebago County166994166994167059...0.3167122.8898730.833819-2.406192-4.5579850.8425733.5023351.531624-1.545153-3.685304
3168502355141WisconsinWood County747497474974807...-4.081523-5.019090-6.901200-5.596471-3.958322-3.733590-4.562809-6.442917-5.040889-3.414223
31694048560WyomingWyoming563626563767564516...-0.3815309.6362144.487115-4.788275-3.2210910.28968010.6948705.440390-3.727831-2.091573
31705048561WyomingAlbany County362993629936428...3.7089562.637812-3.544634-3.334877-9.9111696.7361196.4330320.7195871.429233-5.166460
31715048563WyomingBig Horn County116681166811672...4.8682582.80493016.815908-8.0264205.0958614.8682583.14492117.236306-7.6083785.513554
31725048565WyomingCampbell County461334613346244...-2.84347915.601020-5.895711-8.55091110.916963-2.64960615.558684-5.916543-8.50940210.978525
31735048567WyomingCarbon County158851588515837...-7.581980-13.0814413.178134-2.970641-23.300971-7.392431-12.6369263.623073-2.338590-22.600668
31745048569WyomingConverse County138331383313826...-12.84749915.49382019.035533-20.550587-0.070403-12.77491516.50272020.093063-19.3582331.126443
317550485611WyomingCrook County708370837114...-1.544618-4.2025641.3978196.37825818.629317-0.982939-3.6422222.0967297.07154719.309219
317650485613WyomingFremont County401234012340222...2.7470837.782673-4.990688-12.331633-13.6736103.0935628.027411-4.747240-12.013555-13.352750
317750485615WyomingGoshen County132491324713408...14.2936493.961413-8.079028-7.017803-11.89945014.8861324.841727-6.903896-5.761986-10.635133
317850485617WyomingHot Springs County481248124813...3.3226046.2086093.095336-6.017222-5.4541645.1915696.0016562.888981-6.224712-5.663940
317950485619WyomingJohnson County856985698581...4.995063-4.058912-0.812583-10.7157420.9336525.227392-4.058912-0.812583-10.7157420.933652
318050485621WyomingLaramie County917389188192271...-1.20042815.5472744.787847-1.2261330.278940-0.97332017.9145546.003143-0.2078191.673640
318150485623WyomingLincoln County181061810618091...-9.802564-11.56680113.5645566.1259891.555544-9.691801-11.56680113.6196966.2344141.662823
318250485625WyomingNatrona County754507545075472...7.18931923.06616224.322042-0.958472-0.0610577.68967423.74950825.085233-0.1105930.793743
318350485627WyomingNiobrara County248424842492...-0.4018490.80645229.066295-12.6033877.492114-0.4018490.80645229.066295-12.6033877.492114
318450485629WyomingPark County282052820528259...4.5829518.0577657.641997-9.252437-2.8789806.48663911.12738910.877797-5.5857310.856839
318550485631WyomingPlatte County866786678678...4.3730945.3920732.6345936.0557594.6622704.3730944.9331732.1764035.5987204.207414
318650485633WyomingSheridan County291162911629146...0.9585598.4254874.5463733.678069-3.2984062.1225249.3427785.5230014.781489-2.198937
318750485635WyomingSublette County102471024710244...-23.74178415.272374-40.870074-16.596273-22.870900-21.09290716.828794-39.211861-14.409938-20.664059
318850485637WyomingSweetwater County438064380643593...1.07264316.243199-5.339774-14.252889-14.2488641.25522116.243199-5.295460-14.075283-14.070195
318950485639WyomingTeton County212942129421297...-1.5895650.97269519.52592914.143021-0.5648490.6545272.40857821.16065816.3086711.520747
319050485641WyomingUinta County211182111821102...-17.755986-4.916350-6.902954-14.215862-12.127022-18.136812-5.536861-7.521840-14.740608-12.606351
319150485643WyomingWashakie County853385338545...-11.637475-0.827815-2.013502-17.7814911.682288-11.990126-1.182592-2.250385-18.0201681.441961
319250485645WyomingWeston County720872087181...-11.752361-8.04005912.3725831.5336356.935294-12.032179-8.04005912.3725831.5336356.935294

3193 rows × 100 columns

2010年中 以每个州(STNAME)人口中 最多的 3 个县的人口总和为这个州人口的衡量标准,问哪 3 个州对应的人口最多?

DataFrame(census_df.sort_values(['STNAME', 'POPESTIMATE2010'],ascending=[True,False]).groupby(['STNAME']).head(3).groupby('STNAME')['POPESTIMATE2010'].sum().sort_values(ascending=False).head(3))

Out[329]:

POPESTIMATE2010
STNAME
California50264474
Texas31726014
New York24148828

各个州、县 在 2010 年至 2012 年间人口平均量是怎么样的?

In [351]:

temp = census_df[["POPESTIMATE2010","POPESTIMATE2011","POPESTIMATE2012"]]
# census_df["three-year-total"] = temp.sum(axis=1) #axis 0为列,1为行
census_df["three-year-avg"] = temp.mean(axis=1)
# census_df[['STNAME', 'CTYNAME', 'three-year-total', 'three-year-avg']]
census_df[['STNAME', 'CTYNAME', 'three-year-avg']].head(10)

Out[351]:

STNAMECTYNAMEthree-year-avg
0AlabamaAlabama4.800786e+06
1AlabamaAutauga County5.502933e+04
2AlabamaBaldwin County1.867493e+05
3AlabamaBarbour County2.724200e+04
4AlabamaBibb County2.274533e+04
5AlabamaBlount County5.762000e+04
6AlabamaBullock County1.070733e+04
7AlabamaButler County2.067500e+04
8AlabamaCalhoun County1.178303e+05
9AlabamaChambers County3.405533e+04

在 2010 年至 2012 年间人口增加数量最多的是哪3个县?

In [352]:

census_df['Δ/➖'] = census_df['POPESTIMATE2012'] - census_df['POPESTIMATE2010']
DataFrame(census_df[['STNAME', 'CTYNAME', 'Δ/➖']].sort_values('Δ/➖', ascending=False).head(3))

Out[352]:

STNAMECTYNAMEΔ/➖
2566TexasTexas845378
190CaliforniaCalifornia721976
329FloridaFlorida502131

yuehui.163.com 数据分析

In [1041]:

girls_df = pd.read_csv('./data/girls', encoding='utf-8', sep='::', engine='python')
girls_df.head(10)

Out[1041]:

会员ID昵称年龄省市市区县星座学历学历转分数工资婚否身高体重职业性别头像
030135985做个小女人47北京朝阳区狮子座本科609000116270其他行业0http://yuehui1.nos.netease.com/images/portrait...
137827353云海35北京西城区摩羯座硕士700116955主管/经理0http://yuehui1.nos.netease.com/images/portrait...
247845164紫色柔情47北京海淀区摩羯座本科606500116355教师0http://yuehui1.nos.netease.com/images/portrait...
3615430496回眸一笑48北京海淀区摩羯座本科606500116354教师0http://yuehui1.nos.netease.com/images/portrait...
4615995687空谷幽兰47北京海淀区摩羯座本科606500116355教师0http://yuehui1.nos.netease.com/images/portrait...
532714149风中百合65北京西城区射手座本科606500116350其他行业0http://yuehui1.nos.netease.com/images/portrait...
615735842小草55北京水瓶座本科602500116558工程师0http://yuehui1.nos.netease.com/images/portrait...
7700172079回眸一笑50北京朝阳区天蝎座专科509000116360传媒/影视0http://yuehui1.nos.netease.com/images/portrait...
819335879雨后彩虹56北京朝阳区摩羯座专科506500116863其他行业0http://yuehui1.nos.netease.com/images/portrait...
930493341烟雨蒙蒙46北京东城区水瓶座本科600116059其他行业0http://yuehui1.nos.netease.com/images/portrait...

-1. 得到男女生数据总和

In [1042]:

sex_count = DataFrame(girls_df.groupby('性别').count()['会员ID'])
sex_count.rename(columns={'会员ID': '人数'}, inplace=True)
sex_count

Out[1042]:

人数
性别
025484
119197

0. 分析每个地区的男女生比例

In [1043]:

temp_copy = girls_df
# 改名
temp_copy['性别'] = girls_df['性别'].replace([0,1],['女','男'])
# 总和
total = temp_copy.groupby('省市')[['会员ID']].count()
# 人数 / 总和
province_sex = temp_copy.groupby(['省市', '性别'])[['会员ID']].count()
evey_province_sex_percent = province_sex / total
evey_province_sex_percent.rename(columns={'会员ID': '人数所占百分比'}, inplace=True)
evey_province_sex_percent

Out[1043]:

人数所占百分比
省市性别
上海0.821717
0.178283
乌鲁木齐0.413987
0.586013
兰州0.473684
0.526316
北京0.819867
0.180133
南京0.571429
0.428571
南宁0.583333
0.416667
南昌0.363636
0.636364
台湾0.615385
0.384615
合肥0.461538
0.538462
呼和浩特0.615385
0.384615
哈尔滨0.552898
0.447102
天津0.872491
0.127509
太原0.600000
0.400000
广州0.600000
0.400000
成都0.388889
0.611111
.........
沈阳0.625578
0.374422
济南0.700000
0.300000
海口0.475000
0.525000
澳门0.333333
0.666667
石家庄0.700000
0.300000
福州0.416667
0.583333
西宁0.526316
0.473684
西安0.546218
0.453782
贵阳0.666667
0.333333
郑州0.413043
0.586957
重庆0.663551
0.336449
银川0.459550
0.540450
长春0.500000
0.500000
长沙0.526316
0.473684
香港0.410324
0.589676

68 rows × 1 columns

1. 查看那5个星座发布的交友信息最多

In [1044]:

# 计数
mostPostStart = DataFrame(girls_df.groupby('星座').count()['会员ID'])
# 改一下列名
mostPostStart.rename(columns={'会员ID': '发部次数'}, inplace=True)
# 输出前5
mostPostStart.sort_values(by='发部次数', ascending=False).head(5)

Out[1044]:

发部次数
星座
摩羯座6053
双子座5331
天秤座3806
金牛座3492
巨蟹座3422

2. 查看薪资排名前10 的省份

薪资为保密,0,空值的不进行计算

薪资在区间内的取其均值

50000以上的视为50000 2000以下的视为2000

In [1045]:

# 过滤掉工资为 -1 的行 不统计
have_income = girls_df[~girls_df['工资'].isin([-1])]

# 省份人数总工资 / 计数
DataFrame(have_income.groupby('省市').sum()['工资'] / girls_df.groupby('省市').count()['会员ID'], columns=['平均薪资']).sort_values(by='平均薪资', ascending=False).head(10)

Out[1045]:

平均薪资
省市
香港14634.017207
澳门12677.777778
台湾10237.891738
杭州9697.530864
成都9047.668038
北京8617.494440
南京8280.864198
福州8011.316872
长沙7850.227420
广州7762.962963

3. 查看哪10个年龄的交友需求比较多 (过滤掉所有无效数据)

In [1046]:

# 计数
mostPostAge = DataFrame(girls_df.groupby('年龄').count()['会员ID'].sort_values(ascending=False))
mostPostAge.rename(columns={'会员ID': '发布次数'}, inplace=True)
mostPostAge.head(10)

Out[1046]:

发布次数
年龄
362134
372054
341925
321913
291898
331855
351852
311846
391839
301793

4. 得到各个年龄段的男女的人数

In [1047]:

# 创建副本
temp_copy = girls_df
temp_copy['性别']=temp_copy['性别'].replace([0,1],['女','男'])
# 按年龄 性别 计算数量
all_age_count = DataFrame(DataFrame(temp_copy.groupby(['年龄', '性别']).count())['会员ID'])
all_age_count.rename(columns={'会员ID': '人数'}, inplace=True)
all_age_count

Out[1047]:

人数
年龄性别
1899
68
19135
88
20132
108
21216
116
22456
257
23326
132
24472
154
25495
179
26824
206
27903
252
281103
358
291344
554
301251
542
311255
591
321201
712
.........
742
2
756
7611
6
772
5
7817
6
7916
15
8015
17
814
12
8217
8314
11
8415
29
8523
17
8611
9
878
6
8859
45
89142
107

142 rows × 1 columns

5. 各个学历的占比

In [1076]:

# 合并 空 和 高中以下学历
# girls_df['学历']=girls_df['学历'].replace(['空'],['高中以下'])
# degree_percent.groupby(['学历']).count()
# temp_copy.groupby(['学历']).count()
# 计算 每个学历 对应的总数
degree_temp = DataFrame(DataFrame(temp_copy.groupby(['学历']).count())['会员ID'])
# degree_temp
# 换列名
degree_temp.rename(columns={'会员ID': '百分比'}, inplace=True)

# degree_temp

# degree_temp['学历']['高中以下'] + degree_temp['学历']['空'] 

# 按列计算每个元素所占百分比
DataFrame(degree_temp['百分比'].apply(lambda x :str((x/degree_temp['百分比'].sum()) * 100) + '%')).sort_values(by='百分比', ascending=False)

Out[1076]:

百分比
学历
高中8.690494841207672%
高中以下8.430876658982566%
6.049551263400551%
硕士5.049126026722768%
本科46.08894160828988%
专科24.03706273360041%
博士1.6539468677961549%

6. 得出前10个平均学历最高的城市

无学历的默认为高中学历

假设学历满分为100

高中以下30分,高中40分,专科50分,本科60分,硕士70分,博士80分

In [1049]:

degree_height = girls_df.groupby(['省市', '市区县']).mean().sort_values(by='学历转分数', ascending=False)[['学历转分数']]
# 换列名
degree_height.rename(columns={'学历转分数': '平均学历分数'}, inplace=True)
degree_height.head(10)

Out[1049]:

平均学历分数
省市市区县
哈尔滨宁安市80.000000
抚远县70.000000
南宁隆安县66.666667
石家庄元氏县60.000000
哈尔滨东宁县60.000000
木兰县60.000000
石家庄栾城县60.000000
南京58.252427
白下区58.000000
哈尔滨让胡路区58.000000

7. 得到身材相对匀称 人数最多的10个省份

按照BMI指数得到结果

按照:70kg÷(1.75×1.75)=22.86)的比例

结果 如果在18.5-23.9之间的 属于标准身材

In [1067]:

body_thin = girls_df
# body_thin.copy()
# 过滤掉 身高 / 体重 为 0 的行 不统计
body_thin = body_thin[~body_thin['身高'].isin([0])]
body_thin = body_thin.copy()

body_thin['BMI'] = body_thin['体重'] / (body_thin['身高'] * body_thin['身高'] / 10000)

thin_filter = body_thin[(body_thin['BMI'] >= 18.5) & (body_thin['BMI'] <= 23.9)]
thin_temp = thin_filter.groupby('省市').count()[['会员ID']].sort_values(by='会员ID', ascending=False).head(10)
thin_temp

Out[1067]:

会员ID
省市
北京1657
上海1492
天津1271
重庆1037
成都915
长沙880
兰州804
西宁778
西安771
香港739

8. 三个图

身高 体重散点图

top10 薪资省份(图任选)

满足bmi指数的人数最多的10个省份(柱状图,类似效果)

In [1051]:

temp = girls_df
temp = temp[(girls_df['身高'] < 240) & (girls_df['体重'] < 400)]
plt.scatter(temp['身高'], temp['体重'])

Out[1051]:

<matplotlib.collections.PathCollection at 0x1d418b075f8>

# 过滤掉工资为 -1 的行 不统计
# 省份人数总工资 / 计数
income_10 = DataFrame(girls_df[~girls_df['工资'].isin([-1])].groupby('省市').sum()['工资'] / girls_df.groupby('省市').count()['会员ID'], columns=['平均薪资']).sort_values(by='平均薪资', ascending=False).head(10)
plt.bar(Series(income_10.index), income_10['平均薪资'], width=0.5)
plt.show()

plt.bar(Series(thin_temp.index), thin_temp['会员ID'], width=0.5)
plt.show()


python 大数据
Theme Jasmine by Kent Liao And Bill

本站由提供云存储服务